Friend: an extensible authentication and authorization library for Clojure Ring webapps and services

Say hello to my little Friend.

There’s plenty of technical stuff in the README to chew on if you like.  In short, I’m hoping this can eventually be a warden/spring-security/everyauth/omniauth for Clojure; that is, a common abstraction for authentication and authorization mechanisms.  Clojure has been around long enough that adding pedestrian things like form and HTTP Basic and $AUTH_METHOD_HERE to a Ring application should be easy.  Right now, it’s not: either you’re pasting together a bunch of different libraries that don’t necessarily compose well, or you get drawn into shaving the authentication and authorization yaks for the fifth time in your life so you can sleep well at night.

Hopefully Friend will make this a solved problem, or at least push things in that direction.  It plays nice with all of the best principles of Ring, and includes support for:

  • form, HTTP Basic, and OpenID authentication
  • role-based authorization (optionally using hierarchical roles via Clojure’s derive and isa?)
  • su capabilities (multiple login support / a.k.a. “log in as”)
  • channel security (i.e. HTTPS-only for certain Ring routes)
  • …and more

Most importantly, it takes a stab at a couple of core abstractions for others to drop in other authentication workflows, e.g. OAuth in all of its incarnations, NTLM, BrowserID, etc. etc. etc.  There are already plenty of Clojure implementations for all sorts of authentication methods; hopefully someone (you?!) will step up and bring one of them to the party, so anyone’s Friend-empowered Clojure webapp can easily offer any or all of them with a minimum of suffering.

Finally: frankly, it’s absurd that I’m writing security-related stuffs.  (I know it hardly ever works out that way, but it seems like some experts somewhere should be taking care of this.)  It would be a great thing if you were to beat on Friend and try to find exploits, general breakage, etc., especially if you have prior experience in this area.

jsdifflib now on GitHub

jsdifflib is a JavaScript library that provides:

  1. a partial reimplementation of Python’s difflib module (specifically, the SequenceMatcher class)
  2. a visual diff view generator that offers side-by-side as well as inline formatting of file data
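
For readers who haven’t met the Python original, here is a minimal sketch of what SequenceMatcher computes. It uses Python’s own difflib rather than jsdifflib (whose API is not shown here), and the sample lines are made up for illustration:

```python
from difflib import SequenceMatcher

# Two made-up "files", already split into lines.
old = ["alpha", "beta", "gamma", "delta"]
new = ["alpha", "gamma", "delta", "epsilon"]

# The first argument is an optional "junk" predicate; None disables it.
matcher = SequenceMatcher(None, old, new)

# Each opcode says how to turn old[i1:i2] into new[j1:j2].
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old[i1:i2], new[j1:j2])

# ratio() is a similarity score between 0.0 and 1.0.
print(matcher.ratio())
```

A diff view, whether side-by-side or inline, is essentially a rendering of that opcode list: "equal" runs are shown unchanged, while "delete", "insert", and "replace" runs are highlighted.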

Some years ago, I needed a good in-browser visual diff tool, and couldn’t find anything suitable.  So, I built jsdifflib in 2007 and open-sourced it soon thereafter.

It’s apparently been used a fair bit by others, even though I cruelly sequestered it on one of Snowtide’s web servers for years (sorry, but GitHub wasn’t around and I had little patience for SourceForge).  I’ve promised to put the source somewhere more useful many times, but have unfortunately only just gotten around to it today: jsdifflib is now on GitHub.

I’ve not used the library in some time, but it still works well enough.  Send in your patches if you have ’em.

Bandalore: a Clojure client library for Amazon’s Simple Queue Service (SQS)

I recently found myself wanting to work with Amazon’s Simple Queue Service (SQS), but I could find no reasonable Clojure library for accessing it.  Of course, AWS’ own Java SDK is the canonical implementation of their APIs (at least in the JVM space), so putting together a Clojure wrapper that adds a few handy extras wasn’t particularly difficult.

You can find Bandalore hosted on GitHub, licensed under the EPL. A proper release will find its way into Maven Central within the next couple of days.  The code isn’t much more than 12 hours old, so consider yourself forewarned. ;-)

I hope people find the library useful.  If you’ve any questions, feel free to ping me on IRC or Twitter.

What follows is an excerpt from the README documentation for Bandalore that describes some of its more interesting functionality:

seqs being the lingua franca of Clojure collections, it would be helpful if we could treat an SQS queue as a seq of messages. While receive does return a seq of messages, each receive call is limited to receiving a maximum of 10 messages, and there is no streaming or push counterpart in the SQS API.

The solution to this is polling-receive, which returns a lazy seq that reaches out to SQS as necessary:

=> (map (sqs/deleting-consumer client :body)
     (sqs/polling-receive client q :limit 10))
("3" "5" "7" "8" ... "81" "90" "91")

polling-receive accepts all of the same optional kwargs as receive does, but adds two more to control its usage of receive API calls:

  • :period – time in ms to wait after an unsuccessful `receive` request (default: 500)
  • :max-wait – maximum time in ms to wait to successfully receive messages before terminating the lazy seq (default: 5000)
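
To make the interplay of those two options concrete, here is a rough, language-neutral sketch of the polling behaviour described above. It is written in Python only so that it can be run standalone; it is not Bandalore’s implementation. The `receive` and `sleep` arguments and the fake queue are hypothetical stand-ins for illustration:

```python
import time
from typing import Callable, Iterator, List


def polling_receive(
    receive: Callable[[], List[str]],
    period_ms: int = 500,
    max_wait_ms: int = 5000,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Lazily yield messages, polling `receive` until it stays empty too long."""
    waited_ms = 0
    while True:
        batch = receive()
        if batch:
            waited_ms = 0                # a successful receive resets the wait
            yield from batch
        elif waited_ms + period_ms > max_wait_ms:
            return                       # waited long enough: end the sequence
        else:
            sleep(period_ms / 1000)      # unsuccessful receive: pause, then retry
            waited_ms += period_ms


# A fake queue: two messages, two empty polls, one more message, then silence.
batches = iter([["3", "5"], [], [], ["7"]])


def fake_receive() -> List[str]:
    return next(batches, [])


naps: List[float] = []                   # record pauses instead of really sleeping
messages = list(
    polling_receive(fake_receive, period_ms=500, max_wait_ms=1000, sleep=naps.append)
)
print(messages, naps)
```

Because the generator only calls `receive` when the consumer asks for more, it mirrors the "reaches out to SQS as necessary" behaviour of a lazy seq, and a very large max-wait makes it effectively endless.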

Often queues are used to direct compute resources, so you’d like to be able to saturate those boxen with as much work as your queue can offer up. The obvious solution is to pmap across a seq of incoming messages, which you can do trivially with the seq provided by polling-receive. Just make sure you tweak the :max-wait time so that, assuming you want to continuously process incoming messages, the seq of messages doesn’t terminate because none have been available for a while.

Here’s an example where one thread sends 1000 messages, one every 100 ms, and another consumes those messages using a lazy seq provided by polling-receive:

=> (defn send-dummy-messages
     [client q count]
     (future (doseq [n (range count)]
               (Thread/sleep 100)
               (sqs/send client q (str n)))))
=> (defn consume-dummy-messages
     [client q]
     (future (dorun (map (sqs/deleting-consumer client (comp println :body))
                      (sqs/polling-receive client q :max-wait Integer/MAX_VALUE :limit 10)))))
=> (consume-dummy-messages client q)               ;; start the consumer
#<core$future_call$reify__5500@a6f00bc: :pending>
=> (send-dummy-messages client q 1000)             ;; start the sender
#<core$future_call$reify__5500@18986032: :pending>

You’d presumably want to set up some ways to control your consumer. Hopefully it’s clear that it would be trivial to parallelize the processing function being wrapped by deleting-consumer using pmap, distribute processing among agents if that’s more appropriate, etc.

Reducing purchase anxiety is a feature

Talk to anyone outside of the software world, and you’ll quickly realize that one of the most gut-wrenching, anxiety-inducing acts is buying software. Even if one has evaluated the product in question top to bottom, past experience of bugs, botched updates, missing features, and outright failures and crashes has tempered any enthusiasm or confidence that might be felt when the time comes to pull out the credit card or write the purchase order.

Of course, the blame for this lies squarely with the software industry itself – the failures in software quality are well known, both in discrete instances and in aggregate. Those of us whose business and livelihood are tied to the sale of software (whether sent out the door or delivered as a service) must do whatever we can to reverse this zeitgeist.

Given that, we’ve decided to adopt a very simple, no-nonsense “Satisfaction Guaranteed” policy for PDFTextStream. Hopefully this will help take the anxiety out of someone’s day, somewhere.

This isn’t a new idea, of course. Lots of software companies have had guarantees of some sort or another for ages, but I think my first encounter with the concept as a business owner was Joel Spolsky’s post from a couple of years ago:

I think that our customers are nice because they’re not worried. They’re not worried because we have a ridiculously liberal return policy: “We don’t want your money if you’re not amazingly happy.”

Joel raised the issue again on a recent StackOverflow podcast, which prompted me to think about our own approach…

What do we do about unhappy customers?

To be honest, our customers are pretty happy. Of course, we occasionally receive a bug report, but we generally knock out patches within a couple of days, and sometimes faster. In the 5 years we’ve been selling PDFTextStream, we’ve never had a single request for a refund. Part of that is offering up a very liberal evaluation version, but I’d like to think it’s because what we sell does the job it’s meant to do very well.

Given that, I’ve never thought to make a big stink about a refund policy – it just never came up. But hearing Joel and Jeff talk about the ire that they felt towards various companies that refused to issue refunds when they weren’t happy with something motivated me to make our de facto policy explicit. Thus, the new “Satisfaction Guaranteed” statement.

Part II: the Open Source Influence

An elephant in the room is the influence of open source software on customers’ attitudes towards buying software, and the assessment of risk that goes along with it. As more and more users of technology (just to spread the net as widely as possible) are exposed and become accustomed to the value associated with open source software (which, in simple terms, is generally high because of its zero or near-zero price), it increases pressure on commercial vendors (like us) to up our game along the same vector.

But, the impact of open source software on pricing is a pretty stale story. The real impact is derivative, in that a zero or near-zero price means that the apparent risk associated with using open source software is zero or near-zero. The promise of proprietary, commercial software is that, if it does what the vendor claims (whatever that is), then that software will deliver benefits far in excess of its cost and far in excess of the aggregate benefit provided by the open source alternatives, even given the price differential.

The problem is that a lot of people only turn towards commercial options as a last resort because of the aforementioned historical failures of the software industry vis-à-vis quality: the apparent risk of commercial options is higher than that associated with open source options, simply because the latter’s super-low price is a psychological antidote to any anxiety about quality issues. So, there’s flight towards low-priced options, rather than a thorough search for optimal solutions. Injecting an explicit guarantee of performance and reliability (like our new “Satisfaction Guaranteed” policy) might be enough to tip the relative apparent risk in favor of the commercial option – or, at the very least, minimize the imbalance so that it’s more likely that price won’t dominate other factors (which are potentially more relevant to overall benefits).

Of course, this can only work if one’s product is actually better than the open source alternatives, and by a good stretch to boot so as to compensate for the price differential. In any case, it’s a win-win for the formerly-anxious software user and buyer: they should feel like they have more choice overall, and therefore have a better chance of discovering and adopting the best solution for any given problem, regardless of software licenses and distribution models.

Working Together: Python and Java, Open Source and Commercial

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype, and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not a problem with PDFTextStream; we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.
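
The difference between copying a buffer and sharing it can be illustrated within Python itself. This is only an analogy, assuming nothing about JPype’s internals: it never touches Java, and the byte string is a made-up stand-in for a large PDF held in memory:

```python
# A mutable buffer standing in for PDF data already loaded in memory.
pdf_bytes = bytearray(b"%PDF-1.4 pretend this is a very large document")

copied = bytes(pdf_bytes)        # allocates a second buffer and copies every byte
shared = memoryview(pdf_bytes)   # a view onto the same memory; nothing is copied

# Change the original buffer and see which of the two notices.
pdf_bytes[0] = ord("!")

print(copied[:4])                # the copy still holds the old first byte
print(bytes(shared[:4]))         # the view reflects the change
print(shared.obj is pdf_bytes)   # the view refers back to the original object
```

For a multi-megabyte document processed millions of times a month, the cost of the copying path is paid on every call, which is why sharing the underlying memory mattered so much here.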

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python.
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new Mac mini, plus whatever else he felt like buying with his hard-earned cash.

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.

Open Source, Positioning, and Execution

In the past month, I’ve read no fewer than 8 articles and blog posts trying to thread a story around what is apparently the “big” question these days: how can software companies make money in an open source world? Well, we are, quite well thank-you-very-much. Here’s how and why.

Our primary product is PDFTextStream. It came on to the market a year ago, entering a market (Java libraries that can extract content from PDF documents) that was dominated by open source (or dual-licensed) offerings that are generally well-liked by the broader community.

OK, so why are we still here, thriving and growing?

  • Positioning: When I decided to enter this market three years ago, I knew we would have a good chance simply because it has characteristics that are uniquely suited to a strong, specialized commercial vendor. While generating PDF documents is generally quite easy (thereby leading to a glut of report-generating libraries), extracting content from PDF documents is not. There are numerous file-format ambiguities to address, as well as the details related to achieving document understanding accuracy that is demanded by corporate and government customers. Anyone not dedicated to serving this market with 100% of their effort will not meet the market’s true demands.
  • Execution: Anyone who strives to innovate eventually experiences some anxiety about sharing ideas with colleagues, with the irrational fear that those ideas might be misappropriated, leading to unnecessary competition. The thing is, dozens or hundreds of other people in the same field are likely having the same ideas simultaneously, so the only thing that will ever ensure business success is superior execution.
    Likewise, there are at least four open source Java libraries that extract content out of PDF documents. It’s not arrogant or smug to say that we’ll out-execute the teams or individuals that work on those libraries. We’re in this for the long haul and this is all we do 14 hours a day.
  • Serving a Niche: Very closely related to product positioning was the decision to enter a very demanding niche. We’re not trying to build yet another HTTP server, EJB container, etc. We’re not working on a commodity, and therefore we are much less likely to see competition from an open source library staffed by developers from IBM (for example). Beyond this market-centric reality is the fact that PDF content extraction is a much more difficult game than writing an HTTP server (again, for example): there are no standards, there are no RFCs, there’s no easy way to tell if you’re doing things the right way. So, if someone wants to go head to head with PDFTextStream, they’ll have to grab their machete and start slicing through the same jungle of PDF specs, mangled documents (which nevertheless open in Acrobat without a hitch), and all of the other fun that goes into building a PDF extraction library.

I’m not saying that this formula we’ve worked out is simple, or that it can be easily replicated with a different product in a different market. However, at least from where I’m sitting, “living in an open source world” is pretty pleasant.