Clojure Atlas (Preview!)

Today, I’m opening up a “preview” site for Clojure Atlas, a new side project of mine that I’m particularly excited about.

Clojure Atlas is an experiment in visualizing a programming language and its standard library.  I’ve long been frustrated with the limitations of text in programming, and this is my attempt to do something about it.  From the site:

While Clojure Atlas has a number of raisons d’être, it fundamentally exists because I’ve consistently thought that typical programming language and API references – being, in general, walls of text and alphabetized links – are really poor at conveying the most important information: not the minutiae of function signatures and class hierarchies, but the stuff that’s “between the lines”, the context and interrelationships between such things that too often are only discovered and internalized by bumping into them in the course of programming. This is especially true if we’re learning a language and its libraries (really, a never-ending process given the march of progress), and what’s standing in our way is not, for example, being able to easily access the documentation or signature for a particular known function, but discovering the mere existence of a previously-unknown function that is perfect for our needs at a given moment.

This is just a preview – all sizzle and no steak, as it were.  I’m working away at the ontology that drives the visualization and user experience, but I want to get some more early (quiet) feedback from a few folks to make sure I’m not committing egregious sins in various ways before throwing open the doors to the world.

In the meantime, if you’re really interested, follow @ClojureAtlas, and/or sign up for email updates on the site.

Bandalore: a Clojure client library for Amazon’s Simple Queue Service (SQS)

I recently found myself wanting to work with Amazon’s Simple Queue Service (SQS), but I could find no reasonable Clojure library for accessing it.  Of course, AWS’ own Java SDK is the canonical implementation of their APIs (at least in the JVM space), so putting together a Clojure wrapper that adds a few handy extras wasn’t particularly difficult.

You can find Bandalore hosted on github, licensed under the EPL. A proper release will find its way into Maven Central within the next couple of days.  The code isn’t much more than 12 hours old, so consider yourself forewarned. ;-)

I hope people find the library useful.  If you’ve any questions, feel free to ping me on IRC or Twitter.

What follows is an excerpt from the README documentation for Bandalore that describes some of its more interesting functionality:

seqs being the lingua franca of Clojure collections, it would be helpful if we could treat an SQS queue as a seq of messages. While receive does return a seq of messages, each receive call is limited to receiving a maximum of 10 messages, and there is no streaming or push counterpart in the SQS API.

The solution to this is polling-receive, which returns a lazy seq that reaches out to SQS as necessary:

=> (map (sqs/deleting-consumer client :body)
     (sqs/polling-receive client q :limit 10))
("3" "5" "7" "8" ... "81" "90" "91")

polling-receive accepts all of the same optional kwargs as receive does, but adds two more to control its usage of receive API calls:

  • :period – time in ms to wait after an unsuccessful `receive` request (default: 500)
  • :max-wait – maximum time in ms to wait to successfully receive messages before terminating the lazy seq (default: 5000)
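
For example, a consumer that checks for new messages every two seconds and gives up after 30 seconds of silence (client and q as above):

=> (map (sqs/deleting-consumer client :body)
     (sqs/polling-receive client q :period 2000 :max-wait 30000 :limit 10))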

Often queues are used to direct compute resources, so you’d like to be able to saturate those boxen with as much work as your queue can offer up. The obvious solution is to pmap across a seq of incoming messages, which you can do trivially with the seq provided by polling-receive. Just make sure you tweak the :max-wait time so that, assuming you want to continuously process incoming messages, the seq of messages doesn’t terminate because none have been available for a while.
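
A sketch of that approach – process-message here is a hypothetical stand-in for whatever work your consumers actually do:

=> (dorun (pmap (sqs/deleting-consumer client (comp process-message :body))
                (sqs/polling-receive client q :max-wait Integer/MAX_VALUE :limit 10)))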

Here’s an example where one thread sends 1,000 messages, one every 100ms, and another consumes those messages using a lazy seq provided by polling-receive:

=> (defn send-dummy-messages
     [client q count]
     (future (doseq [n (range count)]
               (Thread/sleep 100)
               (sqs/send client q (str n)))))
=> (defn consume-dummy-messages
     [client q]
     (future (dorun (map (sqs/deleting-consumer client (comp println :body))
                      (sqs/polling-receive client q :max-wait Integer/MAX_VALUE :limit 10)))))
=> (consume-dummy-messages client q)               ;; start the consumer
#<core$future_call$reify__5500@a6f00bc: :pending>
=> (send-dummy-messages client q 1000)             ;; start the sender
#<core$future_call$reify__5500@18986032: :pending>

You’d presumably want to set up some ways to control your consumer. Hopefully it’s clear that it would be trivial to parallelize the processing function being wrapped by deleting-consumer using pmap, distribute processing among agents if that’s more appropriate, etc.

…wherein I feel the pain of being a generalist

I’ve lately been in a position of offering occasional advice to Lee Spector, a former professor of mine, on various topics related to Clojure, which he’d recently discovered and (as far as I can tell) adopted with some enthusiasm.  I think I’d been of some help to him – that is, until the topic of build tooling came up.

He wanted to “export” a Processing sketch – written in Clojure against the Processing core library and the clj-processing wrapper – to an applet jar, which is the most common deployment path in that sphere.  Helpfully, the Processing “IDE” (I’m not sure what it’s actually called; the app one launches on OS X that provides a Java-ish code editor and an integrated build/run environment) provides a one-button-push export feature that wraps up a sketch into an applet jar and an HTML file one can copy to a web server for easy viewing.

It’s an awesome, targeted solution and clearly hits the sweet spot for people using the Processing kit.

Stepping out of that manicured garden comes with a cost, though; you’ve lost that vertically-integrated user experience, and have to tangle with everything the JVM ecosystem has to throw at you.  There is no big, simple button to push to get your ready-to-deploy artifact out of your development environment.

So, Lee asked for some direction on how to regain that simple deployment process; my response pointing at the various build tooling options in the JVM neighborhood ended up provoking more pain than anything else, due to some fundamental mismatches between our expectations and backgrounds.  You can read the full thread here, but I’ll attempt to distill the useful bits below.

Do I really need to bother with this ‘build’ shit?

Building software for deployment is a damn tricky problem, but it’s far more of a people problem than a technical problem: the diversity and complexity of solutions is such that the only known-good solution is immersive exposure to documentation, examples, and community.

In response, Lee eventually compared the current state of build/deployment affairs in Clojure and on the JVM to asking a novelist to learn how to construct a word processor’s interface before being able to write:

This is, I know, a caricature, but imagine a word processing app that came with no GUI because hey, people have different GUI preferences and a lot of people are going to want things to look different. So here’s a word processing app but if you really want to use it and actually see your document you have to immerse yourself in the documentation, examples, and community of GUI design and libraries. This is not helpful to the novelist who wants a word processor! On the other hand if you provide a word processor with a functioning GUI but also make it customizable, or even easy to swap in entirely different GUI components, then that’s all for the good. I (and many others, including many who are long-term/professional programmers, but just not in this ecosystem) are like the novelists here. We want a system that allows us to write Clojure code and make it go (including producing an executable), and any default way of doing that that works will be great. Requiring immersive exposure to documentation, examples, and community to complete a basic step in “making it go” seems to me to be an unnecessary impediment to a large class of potential users.

My response was probably inevitable, as steeped in the ethos of the JVM as I am; nevertheless, Lee’s perspective ended up allowing me to elucidate more clearly than I ever have why I use tools like Maven rather than far simpler (and yes, more approachable) tools like Leiningen, Cake, et al.:

At one time, there were only a few modes of distribution (essentially: build executable for one, perhaps two platforms, send over the wire, done).  That time is long past though, and software developers cannot afford to be strict domain experts that stick to their knitting and write code: the modes of distribution of one’s software are at least as critical as the software itself. Beyond that, interests of quality and continuity have pushed the development and adoption of practices like continuous integration and deployment, which require a rigor w.r.t. configuration management and build tooling as serious as the attention one pays to one’s “real” domain.

To match up with your analogy, programmers are not simply novelists, but must further concern themselves with printing, binding, and channel distribution.

Within that context, I much prefer tools and practices that can ramp from fairly simple cases (as described in my blog), up to the heights of build automation, automated functional testing, and continuous deployment.  One should not have to switch horses at various stages of the growth of a project just to accommodate changing tooling requirements.  Thus, I encourage the use of maven, which has (IMO) the least uneven character along that spectrum; ant, with the caveat that you’ll strain to beat it into shape for more sophisticated tasks, especially in larger projects; and gradle, which appears to be well on its way to being competitive with maven in most if not all circumstances.

In all honesty, I envy Lee and those with similar sensibilities…

The first step to recovery is realizing you have a problem

The complexity that is visited upon us when writing software is enough; in an ideal world, we shouldn’t have to develop all this extraneous expertise in how to build, package, and deploy that software as well.  There are a few things in software that I know how to do really well that make me slightly unique, and I wish I could concentrate on those rather than becoming a generalist in this, yet another vector, which is fundamentally a means to an end.  History and circumstance seem to be stacked against me at the moment, though.

Especially in comparison with monocultures like the .NET and iOS worlds, which have benevolent stewards that helpfully provide well-paved garden paths for such mundane activities, those of us aligned with more “open” platforms like the JVM, Ruby, Python, etc. are constantly pulled in multiple directions by the allure of the shiniest tech in the world and the dreary reality that our vision consistently outpaces our reach when it comes to harnessing the gnarly underbelly of that snazzy kit in any kind of sensible way.  Along the way, the most pernicious thing happens: like the apocryphal frog in a warming pot, we find ourselves lulled into thinking that the state of affairs is normal and reasonable and perfectly sane.

Of course, there’s nothing sane about it…but I’m afraid that doesn’t mean a real solution is at hand.  Perhaps knowing that I’m the frog is progress enough for now.

Hosting Maven Repos on Github

UPDATE: If you’re using Clojure and Leiningen, read no further. Just use s3-wagon-private to deploy artifacts to S3. (The deployed artifacts can be private or public, depending on the scheme you use to identify the destination bucket, i.e. s3://... vs. s3p://....)


Hosting Maven repos has gotten easier and easier over the years.  We’ve run the free version of Nexus for a couple of years now, which owns all the other options feature-wise as far as I can tell, and is a cinch to set up and maintain.  There’s a raft of other free Maven repository servers, options using plain-Jane FTP, and various recipes on the ’nets for serving up a Hudson instance’s local Maven repository for remote usage.  Finally, Sonatype began offering free Maven repository hosting (via Nexus) for open source projects earlier this year, which comes with automatic syncing/promotion to Maven Central if you meet the attendant requirements.

Despite all these options, I continue to run into people that are intimidated by the notion of running a Maven repo to support their own projects – something that is increasingly necessary in the Clojure community, where all of the build tools at hand (clojure-maven-plugin, Leiningen, Clojuresque [a Gradle plugin], and Ant + Ivy for those brave souls who use it) require Maven-style artifact repositories.  Some recent discussions on this topic reminded me of a technique I came across a few months ago for hosting a Maven repository on Google Code (also available for those that use Kenai).  This approach (ab)uses a Google Code subversion repo as a Maven repo (over either the webdav or svn protocols). At the time, I thought that it would be nice to do the same with Github, since I’ve long since sworn off svn, but I didn’t pursue it any further then.

So, perhaps people might find hosting Maven artifacts on Github more approachable than running a proper Maven repository.  Thankfully, it’s remarkably easy to get set up; there’s no rocket science here – the approach is fundamentally the same as using a subversion repository as a Maven repo – but a walkthrough is warranted nonetheless.  I’ll demonstrate the workflow using clutch, the Clojure CouchDB library that I’ve contributed to a fair bit.

1. Set up/identify your Maven repo project on Github

You need to figure out where you’re going to deploy (and then host) your project artifacts.  I’ve created a new repo, and checked it out at the root of my dev directory.

[catapult:~/dev] chas% git clone git@github.com:cemerick/cemerick-mvn-repo.git
Initialized empty Git repository in ~/dev/cemerick-mvn-repo/.git/
warning: You appear to have cloned an empty repository.

Because Maven namespaces artifacts based on their group and artifact IDs, you should probably have only one Github-hosted Maven repository for all of your projects and other miscellaneous artifact storage.  I can’t see any reason to have a repository-per-project.

2. Set up separate snapshots and releases directories.

Snapshots and releases should be kept separate in Maven repositories.  This isn’t a technical necessity, but will generally be expected by your repo’s consumers, especially if they’re familiar with Maven.  (Repository managers such as Nexus actually require that individual repositories’ types be declared upon creation, as either snapshot or release.)

[catapult:~/dev] chas% cd cemerick-mvn-repo/
[catapult:~/dev/cemerick-mvn-repo] chas% mkdir snapshots
[catapult:~/dev/cemerick-mvn-repo] chas% mkdir releases

3. Deploy your project’s artifacts to your Maven repo

A properly-useful pom.xml contains a <distributionManagement> configuration that specifies the repositories to which one’s project artifacts should be deployed.  If you’re only going to use Github-hosted Maven repositories, then we just need to stub this configuration out (doing this will not be necessary in the future1):
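
<distributionManagement>
    <repository>
        <id>release-repo</id>
        <url>http://github.com/cemerick/cemerick-mvn-repo/raw/master/releases</url>
    </repository>
    <snapshotRepository>
        <id>snapshot-repo</id>
        <url>http://github.com/cemerick/cemerick-mvn-repo/raw/master/snapshots</url>
    </snapshotRepository>
</distributionManagement>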


Usually, URLs provided here would describe a Maven repository server’s API endpoint (e.g. a webdav URL, etc.).  That’s obviously not available if Github is going to be hosting the contents of the Maven repos, so I’m just using the root URLs where my git Maven repos will be hosted; as a side effect, this will cause mvn deploy to fail if I don’t provide a path to my clone of the Github Maven repo.

Now let’s run the clutch build and deploy our artifacts (which handily implies running all of the project’s tests), providing a path to our repo’s clone directory using the altDeploymentRepository system property (heavily edited console output below)2:

[catapult:~/dev/cemerick-mvn-repo/] chas% cd ../vendor/clutch
[catapult:~/dev/vendor/clutch] chas% mvn -DaltDeploymentRepository=snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots clean deploy
[INFO] Building jar: ~/dev/vendor/clutch/target/clutch-0.2.3-SNAPSHOT.jar
[INFO] Using alternate deployment repository snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots
[INFO] Retrieving previous build number from snapshot-repo
Uploading: file:../../cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar
729K uploaded  (clutch-0.2.3-SNAPSHOT.jar)

That looks happy-making.  Let’s take a look:

[catapult:~/dev/vendor/clutch] chas% find ~/dev/cemerick-mvn-repo/snapshots
~/dev/cemerick-mvn-repo/snapshots
~/dev/cemerick-mvn-repo/snapshots/com
~/dev/cemerick-mvn-repo/snapshots/com/ashafa
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.md5
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.sha1
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.md5
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.sha1
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.md5
~/dev/cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.sha1

That there is a Maven repository.  Just to briefly dissect the altDeploymentRepository argument:

snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots

It’s a three-part descriptor of sorts:

  1. snapshot-repo is the ID of the repository we’re defining, and can refer to one of the repositories specified in the <distributionManagement> section of the pom.xml.  This allows one to change a repository’s URL while retaining other <distributionManagement> configuration that might be set.
  2. default is the repository type; unless you’re monkeying with Maven 1-style repositories (hardly anyone is these days), this is required.
  3. file:../../cemerick-mvn-repo/snapshots is the actual repository URL, and has to be relative to the root of your project, or absolute. No ~ here, etc.

4. Push to Github

Remember that your Maven repo is just like any other git repo, so changes need to be committed and pushed up in order to be useful.

[catapult:~/dev/cemerick-mvn-repo] chas% git add *
[catapult:~/dev/cemerick-mvn-repo] chas% git commit -m "clutch 0.2.3-SNAPSHOT"
[master f177c06] clutch 0.2.3-SNAPSHOT
 12 files changed, 164 insertions(+), 2 deletions(-)
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.sha1
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.sha1
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.sha1
[catapult:~/dev/cemerick-mvn-repo] chas% git push origin master
Counting objects: 24, done.
Delta compression using 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (19/19), 669.07 KiB, done.
Total 19 (delta 1), reused 0 (delta 0)
 f57ccba..f177c06  master -> master

5. Use your new Maven repository

Your repository’s root will be at http://github.com/<your-github-username>/<your-github-maven-project>/raw/master/

Just append snapshots or releases to that root URL, as appropriate for your project’s dependencies.

You can use your Github-hosted Maven repository in all the same ways as you would use a “normal” Maven repo – configure projects to depend on artifacts from it, proxy and aggregate it with Maven repository servers like Nexus, etc. The most common case of projects depending upon artifacts in the repo only requires a corresponding <repository> entry in their pom.xml, e.g.:
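
<repository>
    <id>snapshots</id>
    <url>http://github.com/cemerick/cemerick-mvn-repo/raw/master/snapshots</url>
</repository>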


Benefits / Pros

  1. Administrative simplicity: there are no servers to maintain, no additional accounts to obtain (i.e. compared to using Sonatype’s OSS Nexus hosting service), and the workflow is very familiar (at least to those that use git).
  2. Configuration simplicity: compared to (ab)using Google Code, Kenai, or any other subversion host as a Maven repository, the project configuration described above is far simpler.  Subversion options require adding a build extension (for either wagon-svn or wagon-webdav) and specifying svn credentials in one’s ~/.m2/settings.xml.
  3. Tighter alignment with the “real world”: ideally, every artifact would be in Maven Central, and every project would use Hudson and deploy SNAPSHOT artifacts to a Maven repo.  In reality, if you want to depend upon the bleeding edge of most projects – which aren’t regularly built in a continuous integration environment and which don’t regularly throw off artifacts from HEAD – having an artifact repository that is (roughly) co-located with the source repository that you control is very handy.  This is true even if you have your own hosted Maven repo, such as Nexus; especially for SNAPSHOTs of “vendor” artifacts, it’s often easier to simply deploy to a local clone of your git-hosted Maven repo and push that than it is to recall the URL for your Maven server’s third-party snapshots repository, or constantly be adding/modifying a <distributionManagement> element to the projects’ pom.xml.

Caveats / Cons

  1. This practice may be considered harmful by some.  Quoting here from comments on another description of how to host Maven repositories via svn:

    This basically introduces microrepository. Users of these projects require either duplicate entries in their pom: one for the dependency and one for the repository, or put great burden on the maintainer of the local repository to add each microrepository by hand….So please instead of this solution have your Maven build post artifacts to a real repository like Java’s, Maven’s or Sontatype’s.
    – tbee

    Please do NOT use this approach. Sonatype provide free hosting of a Maven Repo for open source projects and BONUS! you get syncing to Maven Central too for free!!!
    – Stephen Connolly

    tbee’s point is that the more “microrepositories” there are, the more work there is for those that maintain their own “proper” Maven repository servers, all of which provide a proxying feature that allows users of such servers (almost always deployed in a corporate environment as a hedge against network issues, artifact rot, and security/provenance issues, among other things) to specify only one source repository in their projects’ pom.xml configurations, even if artifacts are actually originating from dozens or hundreds of upstream repositories.  I don’t really buy that argument, insofar as the cat is definitely out of the bag vis-à-vis a proliferation of Maven repositories. Our Nexus install proxies around 15 Maven repositories in addition to Maven Central, and I suspect that that’s very, very low compared to the typical Nexus site. I’ll bet there are hundreds – perhaps thousands – of moderately-active Maven repositories in the world already.

    I agree with Stephen that if you are willing to get set up with deployment rights to Sonatype’s OSS Maven repo, you should do so.  Not everyone is though – and I’d rather see people using dependency management than not, and sharing SNAPSHOT builds rather than not. In any case, if you use this method, know that you’ve strayed from the Maven pack to some extent.

  2. The notion of having to commit the results of a mvn deploy invocation is very foreign.  This is particularly odd for release artifacts, the deployment of which are ostensibly transactional (yes, the git commit / git push workflow is atomic as far as other users of the git/Maven repository are concerned, but I’m referring to the deployer’s perspective here).  The subversion-based Maven hosting arrangements don’t suffer from this workflow oddity, which is nice.  I suppose one could add a post-commit hook to immediately push results, but that’s just illustrating the law of conservation of strangeness, sloughing off unusual semantics from Maven-land to the Realm of git.  You could fall back to using Github’s svn write support, but then you’re back in subversion-land, with the configuration complexity I noted earlier.
  3. Without a proper Maven repository server receiving deployed artifacts, there will never be any indexes offered by these git-hosted Maven repos.  The same goes for subversion-hosted Maven repos as well. Such indexes are a bit of a niche feature, but are well-liked by those using Maven-capable tooling (such as the Eclipse and NetBeans IDEs).  It would be possible to generate and update those indexes in conjunction with the deployment process, but that would likely require a new Maven plugin – a configuration complication, and perhaps enough additional work to make deploying to Sonatype’s OSS repo or hosting one’s own Maven repository worthwhile.


  1. The fact that Maven currently requires a stubbed-out <distributionManagement> configuration is a known bug, slated to be fixed for Maven 2.5. Specifying the deployment repository location via the altDeploymentRepository will then be sufficient.
  2. An alternative to using the altDeploymentRepository option would be to make the git Maven repository a submodule of the project repository.  This would require that the git Maven repo be the canonical Maven repo for the project, and would imply that all the usual git submodule gymnastics be used when deploying to the git Maven repo.  I’ve not tried this workflow myself, but it might be worth experimenting with.

Securing web services in a world with few options


We’re building a web service for which we aim to charge money. Further, the data being pushed around may be confidential or otherwise of a sensitive nature. We have good reasons to do everything we can to ensure that the service is secured “properly”:

  • We don’t want to have customers charged for work that is requested by a bad actor exploiting a security hole (of course, we’d issue a refund and an apology in such a case, but the impact to our business through unnecessary processing could be sizable).
  • We don’t want our customers’ data exposed; common vectors for this include sniffing, replay attacks, or simply the use of compromised credentials.

Of course, the impact on our relationship with our customers due to any security breach could be significant and devastating – to our business, our reputation, and potentially even to our customers’ affairs completely outside of their use of our web service. So again, we have a lot of reasons to be highly-motivated when it comes to security.

By way of context, let’s set the stage with regard to the moving pieces. The web service in question:

  • is built on a JVM stack (with the application itself built with Clojure, of course, using the Compojure framework)
  • has a user-facing, HTML browser interface as well as a “RESTful” API surface (“RESTful”, as in, pretty darn close to ROA “style”, so the set of URIs involved in delivering the user-facing interface vs. those delivering the REST API is nearly identical).
    • the user-facing interface offers standard form-based authentication, as well as OpenID authentication (which will be recommended only for more casual users and usage).
  • will always, always be delivered over SSL. We assume that every bit of data transferred is confidential, so cleartext is an absolute no-no.

OK, let’s go find an expert

It is with this mindset that I’ve been digging into how to approach web service security. Note that I’m no specialist or expert in this area – I’m merely a practitioner that is usually focused on things far, far away from anything security-related. (It may not surprise you that I’m coming to appreciate that fact more and more as I learn about the “state of the art” in web service security.)

Given this, I set out a few weeks ago to see where things stand on the web service security front. Of course, that realm is just as full of cliques and posturing and strawmen and ad hominem attacks as the broader software development world is, so finding a clear path forward is not easy. First, a bit of literature review, as it were, drawn in particular from a flurry of web service security chatter a few years ago (emphasis here and there is mine; I wish I had noticed and grokked the indicated bits earlier – I’ll explain below):

  • I started by finding Gunnar Peterson’s pair of posts where he compares “REST security” with WS-Security stuffs, where the former (especially approaches like HTTP Basic authentication over SSL) come out sounding like a pretty bad choice:

    people who say REST is simpler than SOAP with WS-Security conveniently ignore things like, oh message level security

    Now if you are at all serious about putting some security mechanisms in to your REST there are some good examples [such as Amazon’s implementation of an HMAC authentication scheme].

    Some people in the REST community are able to see the need for message level security so this is heartening somewhat. If the data is distributed and the security model is point to point (at best), we have a problem.

  • Pete Lacey lays out the counterpoint, saying that SSL works just fine for tons and tons of use cases, thankyouverymuch.

    In summary, RESTful security, that is SSL and HTTP Basic/Digest, provides a stable and mature solution that addresses transport level credential passing, encryption, and integrity. It is ubiquitous, simple, and interoperable. It requires no out-of-band contract negotiation or a priori knowledge of how the resource (okay, service) is secured. It leverages your existing security infrastructure and expertise. And it addresses 99% of the use cases you are likely to encounter. SSL does not support message level security, and if that’s a requirement, then leveraging SOAP and WSS makes sense.

  • Unsurprisingly, Sam Ruby backs up Pete Lacey, but the comments on that post are interesting:
    • From Gunnar Peterson:

      I am no way suggesting there is only way to do this or that WS-Security came down on stone tablets. I am also not suggesting that a NSA level of security is appropriate for Google Maps. There are many shades of gray. “good enough” security is a big challenge, and it isnt about black and white security models, it is about risk management

    • From Bill de hÓra:

      I think this is where quantative analysis comes in and a measured assessement of the risk is taken. What has to be protected and what’s the worthwhile cost of doing so? Being software people, that’s beyond the general state of the art. We do gut feelings, flames and opinions.

  • There’s a variety of “REST security ‘best practices'” posts out there, but a question from StackOverflow links to a variety of additional discussions there that are as good an indication as any that the accepted way of securing REST web services is Basic auth over SSL.

Before moving on, I just want to point out that Bill de hÓra’s comment above is sadly representative of so many corners of software development.  Let’s ponder that for a moment, while realizing that modern society and its continuation absolutely depends upon the software we build (I’m talking collectively, here).

Take a deep breath

Of course, the above is not an exhaustive survey, just the best tidbits I found over the course of a lot of browsing and searching. Here’s the upshot, as I see it:

  1. WS-Security et al. ostensibly provide message-level security that ensures that your service’s messages can be passed along safely by untrusted intermediaries.
  2. Standard HTTP authentication (generally Basic) over SSL transport is the de facto standard for securing REST services, but it does nothing for you if message security is important.
  3. More sophisticated authentication mechanisms are available – in particular HMAC, as exemplified by Amazon’s web services – which allow services to ensure that a message’s author has not been impersonated (see the sketch just below). This would resolve some of the potential holes of #2.
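
To make the HMAC idea concrete, here’s a minimal Clojure sketch of the signing step; the canonical string-to-sign is deliberately simplified relative to what Amazon actually specifies, and the names here are mine, not any particular library’s:

(import '(javax.crypto Mac)
        '(javax.crypto.spec SecretKeySpec))

(defn hmac-sha1
  "Returns the Base64-encoded HMAC-SHA1 signature of string-to-sign,
   keyed by the client's shared secret.  The server recomputes the
   signature from the same canonical string and rejects any request
   whose signature doesn't match."
  [^String secret ^String string-to-sign]
  (let [mac (Mac/getInstance "HmacSHA1")]
    (.init mac (SecretKeySpec. (.getBytes secret "UTF-8") "HmacSHA1"))
    (javax.xml.bind.DatatypeConverter/printBase64Binary
      (.doFinal mac (.getBytes string-to-sign "UTF-8")))))

;; e.g. signing a simplified canonicalization of a GET request
(hmac-sha1 "client-shared-secret" "GET\n/some/resource\n2010-06-01T12:00:00Z")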

Unfortunately, I didn’t grok the whole message vs. transport security issue as quickly as I should have, where SSL provides the latter but the former would only be satisfied by something like WS-Security (again, ostensibly – I certainly can’t vouch for it) or HMAC-SHA1 if one were working in a REST environment. If I had come to grips with that point of tension earlier, I would have arrived at my two conclusions much faster:

  1. In our situation, message security is simply not relevant. As Peterson wrote (and I quoted above) “If the data is distributed and the security model is point to point (at best), [REST has] a problem.” Well, in our case, data is not distributed, it is transmitted point-to-point (between our customers and us, a third-party external web service), so transport security provided by SSL should be sufficient.
  2. Here’s the biggie: assuming we support form-based authentication (of course, over SSL) for browser-based UI interaction, supporting anything more sophisticated than HTTP Basic authentication over SSL for our REST API interactions would be a waste of resources. We could go full-tilt and require HMAC-SHA1 for the REST API or provide only a SOAP API that used WS-Security (and whatever else goes into that), but that would mean nothing if an attacker has the “REST API” provided for browser use available to him. Given this, transport security provided by SSL, and that alone, is simply all we can do.  Put another way: when browser-level security mechanisms improve, then so will our APIs’.

An alternative path would be to host a parallel service, available via a REST API secured via HMAC-SHA1 or a WS-Security-enabled SOAP API, that did not provide any kind of browser-capable entry point. Customers could opt into this if they thought the tradeoff was important. Doing this would be technically trivial (or, perhaps, only moderately difficult w.r.t. the SOAP option), but I’ve no idea whether the additional degree of security provided by such a parallel service would be of any interest to anyone.

By the way, if I’m totally blowing this, and my conclusions are completely broken, do speak up.

Coming soon: Part II of my investigation/thinking on the subject of web service security, related to OpenID and the management of credentials in general…which should give me all sorts of new opportunities to say foolish things!

Working with git submodules recursively

Git submodules are a relatively decent way to compose multiple source trees together, but they definitely fall short in a number of areas (which others have discussed at length elsewhere). One thing that immediately irritated me was that there is no way to recursively update, commit, push, etc., across all of one’s project’s submodules. This is something I ran into immediately upon moving to git from svn some months back, and it almost scared me away from git (we used a lot of svn:externals, and now a lot of git submodules).

Thankfully, the raw materials are there in git to work around this. (I’ve since noticed a bunch of other attempts to do similar things, but they all seem way more complicated than my approach…maybe it’s the perl? ;-))

Here’s the script we use for operating over git submodules recursively, git-submodule-recur.sh:


#!/bin/sh

case "$1" in
        "init") CMD="submodule update --init" ;;
        *) CMD="$*" ;;
esac

git $CMD
git submodule foreach "$0" $CMD

Throw that into your $PATH (I trim the .sh), chmod +x, and git submodules become pretty pleasant to work with. All this is doing is applying whatever arguments you would otherwise provide to git within each submodule, and their submodules, etc., all the way down. The one special invocation, git-submodule-recur init, just executes git submodule update --init in all submodules.

So, want to get the status of your current working directory, and all submodules? git-submodule-recur status. Want to commit all modifications in cwd and all submodules? git-submodule-recur commit -a -m "some comment". Want to push all commits? git-submodule-recur push. You get the picture.

This script has saved me a *ton* of typing over the past months. Hopefully, it finds a good home elsewhere, too.

Note: Starting in git 1.6.5, git submodule will grow a --recursive option for the foreach, update, and status commands. That’s very helpful for the most common actions (and critical for building projects that have submodules in CI containers like Hudson), but git-submodule-recur definitely still has a place IMO, especially for pushing.

Update 2009/09/28: I tweaked the git-submodule-recur script to quote the path to the script ("$0" instead of $0); this became necessary when I dropped the script into C:\Program Files\Git\bin in our Windows-hosted Hudson environment.

Whoa, Peter Norvig used some of my code!

I’m generally not one to be impressed by celebrity — you won’t catch me reading People or US Weekly, for example.  However, this morning I noticed with a shimmer of glee that Peter Norvig used some code that I wrote years ago in one of his recent projects.  So, just for the record, if Dr. Norvig ever shows up in US Weekly, I’ll pick one up!

In case you don’t know, Peter Norvig is the Director of Research at Google.  That’s interesting, but the real reason Dr. Norvig holds sway with me is his classic book, Paradigms of Artificial Intelligence Programming.  If it weren’t for that book, I almost certainly would not be doing what I’m doing today.  Its pages are where I came to understand lisps, and began to imagine what was possible and what I might be able to accomplish in computer science (final results yet to be determined, of course).  For that, I am extraordinarily grateful to him (and others, of course, but I’ll wait to talk about them when they get around to using some of my code! ;-) ).

Back to the story.  This morning, I decided to hop onto Google Analytics for a bit to check up on the traffic stats for our various websites.  Lo and behold, in the “top referrals” listing, I saw ‘norvig.com’; “Well,” I thought to myself, “that’s interesting!”   A quick grep of the server logs (is there a screen in Google Analytics that actually provides you with the full referral URLs?) showed the referral URL to be Dr. Norvig’s “post” from last week, An Exercise in Species Barcoding.

A search of my name on that page shows that he needed a way to calculate the Levenshtein distance (also known as the edit distance) between two large strings — his quick implementation (like most) operated in O(n^2) space, which would have required weeks of processing time in his particular case.  So, he looked around for a more efficient implementation, and found one that I wrote in October of 2003 that operated in linear space bounds (and was, ironically enough, my first-ever contribution to an open source project).  With a couple of tweaks to suit his specific needs, the code I wrote worked out nicely for him.

This story is satisfying and funny (for me, anyway) in a couple of different ways:

First, there’s the fact that (what I would now consider) throwaway work of mine is floating around the nets six years later.  Remember kids, the Internet never forgets!

Second, it reminded me of what I was doing when I wrote that particular code.  I was building what would later become PDFTextStream’s first ground-truthing system1 (although I don’t think I knew of that term at the time). It’s a lot more sophisticated now, but back in 2003, I was simply trying to set up a “ground truthing” system where the full (vetted and known-good) extracted text from each PDF document in our nascent test repository would be saved off somewhere, and later builds of PDFTextStream would compare its extracted PDF text to those saved files.

Of course, it wouldn’t be practical to require that PDFTextStream produce identical output forever — some amount of slop had to be allowable: for example, if an extracted word was outputted with four spaces before it instead of two, that would generally be acceptable.  For that and other reasons, I wanted to test that current PDF text extracts were the same as the known-good extracts within a defined margin of error.  Unfortunately, I was ground-truthing full document extracts at that time, and most Levenshtein implementations, with their quadratic space characteristics, would take a lot of memory to diff the multi-megabyte strings that were involved.

Solution: write my own Levenshtein function (loosely based off of a pedagogical implementation by Mike Gilleland that had been incorporated into the Apache commons-lang project) that operated in linear space bounds.  Thankfully, I opted to offer the improvement back to the Apache commons-lang project and to Dr. Gilleland — had I not, Dr. Norvig would never have found that code, and I wouldn’t be writing this right now.
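
For the record, the linear-space trick is just the observation that each row of the edit-distance table depends only on the row immediately before it.  Here’s a quick Clojure sketch of the idea (not the actual commons-lang code):

(defn levenshtein
  "Edit distance between strings a and b, keeping only two rows of the
   usual dynamic-programming table -- O(n*m) time, but linear space."
  [^String a ^String b]
  (let [n (count a)
        m (count b)]
    (loop [i 0
           prev (vec (range (inc m)))]  ; distances from a[0..i) to every prefix of b
      (if (= i n)
        (peek prev)
        (recur (inc i)
               (reduce (fn [row j]
                         (conj row
                               (min (inc (row j))           ; insertion
                                    (inc (prev (inc j)))    ; deletion
                                    (+ (prev j)             ; substitution (free on a match)
                                       (if (= (.charAt a i) (.charAt b j)) 0 1)))))
                       [(inc i)]
                       (range m)))))))

(levenshtein "kitten" "sitting")  ;= 3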

Third and finally, this story is satisfying because, hell, Peter Norvig used some of my code.  A person I respect and admire has found it convenient to use some minor thing I created years ago, and was thoughtful enough to say so.  I hope I can follow that example as I go along in my travels.

See, Dr. Norvig, I’m still learning from you.


1 Ground truthing is a testing methodology often used in document processing systems where ideal or otherwise known-good output is cataloged, and then actual or current output is compared to it to determine relative accuracy.  PDFTextStream’s current ground-truthing system serves as a semi-rigorous smoke test of its aggregate text extraction accuracy while we’re doing active development, as well as an ironclad regression test for when we’re looking to cut a release.  Thankfully, it’s come a long, long way from the very naive approach I was pursuing in 2003.

Scala Makes Me Think

(…or, “Oh, Dear, Wasn’t I Thinking Before?”)

As my friends will attest, I really enjoy programming languages. I’m one of those language fetishists that talk about “expressiveness” and “concision”, and yes, I’m one of those very strange fellows who blurt out bad Lisp jokes while getting odd looks from innocent bystanders. And while my bread and butter is built in Java, I often find myself yearning for a more expressive language while deploying, customizing, or integrating PDFTextStream (there I go again with the “expressiveness” bit). That yearning can reach almost pathological extremes at times, prompting me to go so far as to sponsor projects that make it possible to use Java libraries (including PDFTextStream) from within Python.

Fortunately, things don’t always have to be so hard. Case in point, I recently dove head-first into Scala, a language that combines object orientation and functional programming into one very tasty stew. Scala has a number of characteristics that make it interesting aside from its merging of OO and FP mechanisms:

  • it is statically-typed, and provides moderately good type inference that enables one to skip most type declarations and annotations
  • it is compiled, which provides a minimum level of performance (sure, it’s actually byte-compiled, but let’s not quibble right now)
  • and the real kicker: it compiles down to Java class files (or .NET IL), thereby enabling it to be hosted on a JVM (or .NET’s CLR), and call (and be called by) other Java (or .NET) libraries

There’s a lot to like here, for programmers from many walks of life, and I could go on and on about how Scala has single-handedly created and filled a great niche of delivering most of the raw power of purely functional languages like Haskell and ML within a JVM-hosted environment with respectable performance. But what has really impressed me has been the way that Scala has improved how I work. In short, it’s made me really think about development again.

I generally have two working styles. In a classic statically-typed environment (say, Java or C#), I tend to generate pretty clean designs, but my level of productivity is very low. I attribute both of these characteristics to the copious amount of actual work (i.e. finger-typing) that has to go into writing Java or C# code, even with the best of tools. See, while I’m typing (and typing, and typing), I’m thinking two, three, four steps ahead, figuring out the design of the next chunk of code. The verbosity of the language gives me time to reason about the next step while my fingers are working off the previous results.

In a dynamically-typed environment (say, Python or Scheme), I tend to be extraordinarily productive, but unless I consciously step back and purposefully engage in design, the code I write is much more complex. In such environments, there’s less finger-typing going on, so I don’t have a natural backlog allowing me to think about the code before it’s already on the screen. Further, I know I can get from point A to point B relatively easily in many circumstances, so I end up skipping the design step, switching into Cowboy Coder mode, and hacking at things until everything works. Oddly enough, in certain circles, this isn’t so much frowned upon as it is recommended.

Scala is statically-typed, so the naive observer might speculate that my working style in Scala would be much the same as in Java. However, I’ve found that working with Scala has prompted (forced?) me to consciously step back and think about everything, at every step along the way: class hierarchies, type relationships in general, testing strategies, eliminating state where possible…the amount of actual thinking I’ve done while working with Scala has far outstripped the amount of reasoning that typically goes into any similar period of coding. Unsurprisingly, this has led to quite the spike in code quality, which translates into productivity through fewer bugs and less rework.

I attribute this to the strong, static typing that Scala enforces, combined with the type inference that Scala provides. The former forces me to reason about what I’m doing (as it does in Java, for instance), but because the latter eliminates so much of the finger-typing associated with static typing in other environments, I’m given the opportunity to realize that a concrete design phase would yield tremendous benefits, regardless of the scope of code in question. I suspect I would find working in Haskell or ML to be a similar experience, but because those languages don’t easily interoperate with the libraries I need to do my work, I’ve never really given them a chance.

Thankfully, I don’t think I’ll have to. Scala is a great environment, and even more important than its technical merits, its design has led me to engage in a more thoughtful, more conscious development process.

Thoughts on Martin Fowler’s Domain Specific Languages Overview

I’m way late in linking to this, but it’s worth it.

Last October, a presentation by Martin Fowler from JAOO 2006 popped up on InfoQ (which does a great job of simulating the actual experience of being at the session with its video/slideshow integration) where he gave a very high-level overview of domain specific languages (DSLs). He really only scratched the surface, but it’s a great introduction for those that haven’t yet thought about DSLs much.

(Of course, that population is getting smaller by the minute thanks to Ruby (and Rails), since it builds in the metaprogramming facilities necessary to implement internal DSLs.)

I recently had occasion to re-watch the presentation. This time around, I took the time to scribble down some thoughts:

  1. I think he played up the potential role of DSLs as “configuration code” too much. Yes, you can tailor a DSL to provide primarily configuration data, and that’s very useful as far as it goes. However, internal DSLs (given an appropriately useful host environment) are able to provide levels of abstraction and composability that go way beyond configuration.
  2. I think that casting the Java example he showed as a DSL is really over the top, and is a result of overemphasizing the potential configuration role DSLs can play. As Mr. Fowler said, the line between an internal DSL and just a bunch of specific goal-driven coding in the host language is fuzzy. However, a big part of that line (and therefore whether an environment can reasonably host a DSL) is how well the host language’s existing constructs can be recast as sensible constructs in the DSL. The Ruby DSL example fits this criterion well, as its range (4..12, etc) and block constructs mapped well to the domain at hand. On the other hand, the Java example is Java, unapologetically so — the explicit object creation, the function calls, return statements, etc., simply do not map to the domain. The fact that the integers and strings being passed in those function calls can be recast as an actual configuration file should not lead us to think that Java configuration code is a functional DSL.
  3. At least in my experience, external DSLs are dead-ends. There’s just too much heavy lifting that needs to be done to consume external DSL “source files” and align their contents with the host language’s environment. True, internal DSLs need to conform to the syntax of their host environment, but the advantages of “symbolic integration” (as Mr. Fowler puts it) and the fact that you get your IDE’s functionality for free are just too compelling to outweigh any nitpicky syntax quibbles that one might have with any DSL-capable language. And, if those syntax quibbles are significant enough, and the problem the DSL is going to solve is significant enough to make you come close to considering building all of the cruft necessary to implement an external DSL, then go find yourself a secondary language/environment that provides a more palatable syntax, and hook everything up with IPC of some kind.

Python, Growth, and Sandboxes

Well, I sure did step in it.

Consider: up until last week, I was simply using this space every now and then for some relatively bland navel-gazing related to selected goings-on at Snowtide. Then, a friend of mine decided to put my most recent post (probably the only potentially inflammatory post I’ve ever written) on reddit, and a variety of people weren’t very happy (in comments to the post itself, on reddit’s comment page, and to a lesser extent on a Joel On Software thread). For someone who can lay only a tenuous claim to being a blogger (never mind the title of A-, B-, C-, or D-list blogger!), it’s been an interesting experience to say the least.

I tried to participate in the discussions that were swirling around, but eventually the comments became too numerous for me to follow in a timely way given the amount of bandwidth I’ve allocated to such things. So, I’m taking the easy/cheap way out with a response post. I know this is frowned upon by many, but c’est la vie.  Here, I will respond in two parts:

  1. Python and the Growth “Problem”
  2. Sandbox Etiquette

Python and the Growth “Problem”

In reading over all of the commentary, there seem to be three types of responses:

Response Type A: Any lack of growth/”innovation” or a slowing of such growth in Python is good — stability makes it easier to concentrate on customer solutions, and encourages robust library development.

Regardless of your language or platform, if stability and operational continuity is an overriding interest of yours, then lock yourself into a particular build, and stay there as long as you want. This is a significant part of the job of IT organizations in large organizations – to standardize on environments and tools so as to shield the organization from unwanted change and cost.

(As an aside, Ruby’s Matz provides a positive spin on the “Python is stable, and that’s good” attitude, which may or may not be cheeky [it’s hard to tell through the translation]: “Perhaps Python has a sense of responsibility.”)

Response Type B: The “significant improvements” I’d like to see in Python are (take your pick): of academic use only; are overhyped genius toys that only make it easier to build overly complex solutions; distractions from other improvements that would be immediately useful to the majority of the Python userbase.

This attitude pops up frequently in any discussion of programming paradigms that are off the beaten track, any technique that is unfamiliar to the commenter, or anything that the commenter has had problems with in the past. Meek typified this kind of response with:

Python is not growing because you want programmable syntax and “esoteric” features? Features that 99% of software developers should never use. Let me guess, you have never maintained a project written in a language that supports programmable syntax where geniuses abuse meta-programming where simpler alternatives achieve the same goal.

This is a particularly disturbing line of thought, and one that I had always considered to be antithetical to a central principle of Python (at least in my eyes), that the programmer should always be trusted. I’ve always associated this with a variety of Python features, including duck typing, the lack of access controls around class members (modulo the slightly perverse double-underscore notation and associated name mangling of “private” attributes), the composability of namespaces, etc.

Lots of programming features are “esoteric”, depending on who you ask. Pointer access is esoteric to a web app developer and should never be used in such a context, but it’s critical to a C-language device driver programmer. Any number of language features can simultaneously be considered esoteric by some and necessary by others. Not recognizing this, and then implying that “simpler alternatives” could readily take the place of those “genius” toys is evidence of a lack of perspective. 28/w in the JOS thread makes my point better than I ever could:

It’s precisely because I want my projects to be on time that I don’t use assembly language for everything. That’s the same reason I don’t use C++ either. I’m about 5x as productive in Ocaml as I am in C++ i.e., at least 80% of my time spent coding C++ is spent dealing with language issues; it’s the equivalent of spending time making all my function calls out of gotos.

Most likely, 80% of my time coding Ocaml is wasted too, and I just don’t know it.

Bottom line: just because you don’t see a use for a particular language feature doesn’t mean that someone else doesn’t find it absolutely, positively necessary.

Response Type C: Python is growing, and if you were to pay attention, you’d notice. We’re just not working on what you want.

This point has been made by a variety of people, but I should give special attribution to Phillip J. Eby, since he’s a significant Python contributor:

Um, so you don’t think the “with” statement and coroutines were new features?

What about the new metaclass hook that’ll be in Python 3.0 (and maybe 2.6)? It’s actually a pretty significant step forward for implementing Ruby-like DSL’s in Python.

I suppose this is the nut of the problem, at least as far as this discussion has related specifically to the technical aspects of Python: I’m not bowled over by the improvements Phillip cites.  They’re very useful and handy to the vast majority of Python programmers, but they’re not game-changers (which I suppose is what I meant by “significant growth”). I think the description of the metaclass hook as “a pretty significant step forward for implementing Ruby-like DSL’s in Python” is very telling. The facilities for building DSLs in Ruby are good in so far as they make it possible to get the job done, but they’re by no means conceptually complete nor functionally clean (as pointed out by jerf in the reddit comments), so taking a “significant step” towards implementing such facilities isn’t the whole ballgame.

Regardless of that detail, the point is that progress is being made in Python — just not in the vector I need. And, that’s OK. Which brings me to…

Sandbox Etiquette

After all has been said and done, my original post was a mistake, in that I exhibited a similar type and degree of technological selfishness as those who replied with Type A responses.  As some of my friends will attest, I’ve personally been unhappy with Python and its direction for a variety of reasons for months now, especially as I’ve sunk further and further into a class of problems for which Python isn’t particularly well-suited at the moment.  While I had settled on that conclusion some time ago, I’ve obviously been suffering from a mental block that caused me to do drive-bys against Python.  This came to a head with my blog post.

The more mature (and zen) thing to do would have been to simply go looking for a different sandbox, and leave well enough alone with regard to Python.  (It is, after all, a fantastic language and will likely remain my favorite for most common tasks [especially web programming] for some time hence.)  This is especially true given the fact that I am essentially a nobody in the Python community – I’ve contributed in my own small ways, but it’s not like I’m a core hacker or important library author.  Instead, I adopted the Response Type A attitude, but flipped it on its head, claiming that my favorite language should advance itself to suit my requirements, and to hell with the priorities of others.

So, let’s make a deal: I’ll stop sniping on Python, and maybe everyone else can stop making clever comments about “esoteric” language features.  Then we can all spend more time building bigger and better sandcastles.