Ashton’s plight, fight or flight

Joel Spolsky is a stellar writer and a fine storyteller. His latest, the apocryphal tale of Ashton, tells of the plight of a software developer stuck in a crummy job, surrounded by sycophants and half-wits, desperate to escape the inevitable resulting ennui.  To top it off, this miserable environment happens to exist in…Michigan.

Of course, there could only be one solution:

…it was fucking 24 degrees in that part of Michigan, and it was gray, and smelly, and his Honda was a piece of crap, and he didn’t have any friends in town, and nothing he did mattered.

[…Ashton] just kept driving until he got to the airport over in Grand Rapids, and he left his crappy old Honda out right in front of the terminal, knowing perfectly well it would be towed, and didn’t even close the car door, and he walked right up to the Frontier Airlines counter and he bought himself a ticket on the very next flight to San Francisco, which was leaving in 20 minutes, and he got on the plane, and he left Michigan forever.

It’s a sad story, but mostly because of the thread of self-righteous metropolitan exceptionalism running through it that is all-too-typical of parts of the programming and software-business circles in which I often circulate.  If Ashton hadn’t bought into that worldview, he might have left his disaster of a job with the cubicle manufacturer for any of the small software companies nearby, filled with talented, engaging people working on challenging and rewarding problems. Or, he could have started his own company, with potential to make a big impact in his town, in Michigan, and in the world.

Instead, he flies to San Francisco, the Lake Wobegon of the software world: where the managers are sane, coworkers are friendly hacker geniuses, and everyone’s exit is above average. Of course, that’s apocryphal too. There are sycophants and half-wits everywhere, and chances are good our friend Ashton will be working on a backwoods project at Google/Facebook/eBay/Yahoo/etc, or at one of those earth-shattering startups you always hear about on TechCrunch (you can hear the pitch now: “it’s like Groupon for Facebook game assets!”). Maybe he’ll be happier, maybe he’ll be more satisfied, maybe he’ll be richer; or, not.

I love large cities dearly, and I wouldn’t mind living, working, or building a business in one. But, metropolitan life is not a salve for what ails you and your career, and not being in one certainly doesn’t doom you or your future prospects. In this age of decentralized work, people in more and more professions (perhaps software development in particular!) are uniquely positioned to work where and when and for whom they want regardless of geography. Opportunity is everywhere – and anywhere you are, if you’ve the talent and determination to thrive.

…wherein I feel the pain of being a generalist

I’ve lately been in a position of offering occasional advice to Lee Spector, a former professor of mine, on various topics related to Clojure, which he’d recently discovered and (as far as I can tell) adopted with some enthusiasm.  I think I’d been of some help to him – that is, until the topic of build tooling came up.

He wanted to “export” a Processing sketch – written in Clojure against the Processing core library and the clj-processing wrapper – to an applet jar, which is the most common deployment path in that sphere.  Helpfully, the Processing “IDE” (I’m not sure what it’s actually called; the application one launches on OS X that provides a Java-ish code editor and an integrated build/run environment) provides a one-button-push export feature that wraps up a sketch into an applet jar and an HTML file one can copy to a web server for easy viewing.

It’s an awesome, targeted solution and clearly hits the sweet spot for people using the Processing kit.

Stepping out of that manicured garden comes with a cost, though; you’ve lost that vertically-integrated user experience, and have to tangle with everything the JVM ecosystem has to throw at you.  There is no big, simple button to push to get your ready-to-deploy artifact out of your development environment.

So, Lee asked for some direction on how to regain that simple deployment process; my response pointing at the various build tooling options in the JVM neighborhood ended up provoking pain more than anything else due to some fundamental mismatches between our expectations and backgrounds.  You can read the full thread here, but I’ll attempt to distill the useful bits below.

Do I really need to bother with this ‘build’ shit?

Building software for deployment is a damn tricky problem, but it’s far more of a people problem than a technical problem: the diversity and complexity of solutions is such that the only known-good solution is immersive exposure to documentation, examples, and community.

In response, Lee eventually compared the current state of affairs with regard to build/deployment issues in Clojure and on the JVM to asking a novelist to learn how to construct a word processor’s interface before being able to write:

This is, I know, a caricature, but imagine a word processing app that came with no GUI because hey, people have different GUI preferences and a lot of people are going to want things to look different. So here’s a word processing app but if you really want to use it and actually see your document you have to immerse yourself in the documentation, examples, and community of GUI design and libraries. This is not helpful to the novelist who wants a word processor! On the other hand if you provide a word processor with a functioning GUI but also make it customizable, or even easy to swap in entirely different GUI components, then that’s all for the good. I (and many others, including many who are long-term/professional programmers, but just not in this ecosystem) are like the novelists here. We want a system that allows us to write Clojure code and make it go (including producing an executable), and any default way of doing that that works will be great. Requiring immersive exposure to documentation, examples, and community to complete a basic step in “making it go” seems to me to be an unnecessary impediment to a large class of potential users.

My response was probably inevitable, as steeped in the ethos of the JVM as I am; nevertheless, Lee’s perspective ended up allowing me to elucidate more clearly than I ever have why I use tools like Maven rather than far simpler (and yes, more approachable) tools like Leiningen, Cake, et al.:

At one time, there were only a few modes of distribution (essentially: build executable for one, perhaps two platforms, send over the wire, done).  That time is long past though, and software developers cannot afford to be strict domain experts that stick to their knitting and write code: the modes of distribution of one’s software are at least as critical as the software itself. Beyond that, interests of quality and continuity have pushed the development and adoption of practices like continuous integration and deployment, which require a rigor w.r.t. configuration management and build tooling as serious as the attention one pays to one’s “real” domain.

To match up with your analogy, programmers are not simply novelists, but must further concern themselves with printing, binding, and channel distribution.

Within that context, I much prefer tools and practices that can ramp from fairly simple cases (as described in my blog), up to the heights of build automation, automated functional testing, and continuous deployment.  One should not have to switch horses at various stages of the growth of a project just to accommodate changing tooling requirements.  Thus, I encourage the use of maven, which has (IMO) the least uneven character along that spectrum; ant, with the caveat that you’ll strain to beat it into shape for more sophisticated tasks, especially in larger projects; and gradle, which appears to be well on its way to being competitive with maven in most if not all circumstances.

In all honesty, I envy Lee and those with similar sensibilities…

The first step to recovery is realizing you have a problem

The complexity that is visited upon us when writing software is enough; in an ideal world, we shouldn’t have to develop all this extraneous expertise in how to build, package, and deploy that software as well.  There are a few things in software that I know how to do really well that make me slightly unique, and I wish I could concentrate on those rather than becoming a generalist in this, yet another vector, which is fundamentally a means to an end.  History and circumstance seem to be stacked against me at the moment, though.

Especially in comparison with monocultures like the .NET and iOS worlds, which have benevolent stewards that helpfully provide well-paved garden paths for such mundane activities, those of us aligned with more “open” platforms like the JVM, Ruby, Python, etc. are constantly pulled in multiple directions by the allure of the shiniest tech in the world and the dreary reality that our vision consistently outpaces our reach when it comes to harnessing the gnarly underbelly of that snazzy kit in any kind of sensible way.  Along the way, the most pernicious thing happens: like the apocryphal frog in a warming pot, we find ourselves lulled into thinking that the state of affairs is normal and reasonable and perfectly sane.

Of course, there’s nothing sane about it…but I’m afraid that doesn’t mean a real solution is at hand.  Perhaps knowing that I’m the frog is progress enough for now.

The placebo effect is what makes the software world go ’round

I’ve been of the opinion for some time now that software development, regardless of the methodology followed or the tools used, is not an engineering discipline (unfortunately), but rather is a craft.  I recently laid out that opinion in some detail, which was quickly followed by many people (both in the tubes and in private communications) mostly disagreeing, some suggesting that I just don’t understand engineering particularly well.

In general, I’ll quickly concede that last point, but I keep running into people who ostensibly do understand engineering, and who also reject the notion of software development being a variety of it.  As an example, if I had known of Terence Parr’s essay “Why writing software is not like engineering” before writing on the same topic with essentially the same premise, I would have simply tweeted a link or something.

A more pointed, and in my opinion, brilliant restatement of this thesis was delivered by Linda Rising in a talk at the Philly ETE 2010 conference in April of this year.  The focus of the talk was far, far broader than the topic of how to classify and characterize software development, and it is very much worth a listen (audio of the talk is available).  But for now, the key relevant quote comes at 25:15:

One of the things we can learn from psychology is that they do real experiments…whereas in our industry, we do not.  We cannot call ourselves “engineering”; [our work] is not based on science.  In fact, last year at the Agile conference I gave a talk1 that said “I think that mostly what runs this industry is what you’d call the ‘placebo effect’.”

Think about [what would happen] if drug testing were done the way we decide to do anything in software development.  We’d bring in a bunch of pills, and we’d say, “Oh, look at this one, it’s blue! I love blue! And it has a nice shape, oh my goodness, it’s hexagonal! Hexagonal things have always had a special place in my heart.  I think these hexagonal blue ones will be very powerful in solving this disease.  Let’s go with the hexagonals.”

I’ll give this hexagonal blue one to all my friends.  I’ll tell them, “This is it, this hexagonal blue one will solve all your problems.  My team tried it, and we really liked it, so your team tried it, you’ll really like it too.”

And that’s how we run projects.  There’s certainly no double-blind controlled experiments.  Is agile any good?  Oh yeah, it is, there’s so much excitement, and  so much buzz, and everybody seems to think so!

Most if not all software developers will recognize the parallels between their profession and that parable.  Why did your team choose C++, or Rails, or F#, or Clojure?  There are often some tangible, technical considerations involved in such choices.  However, as Ms. Rising covers in the talk, human beings are generally not capable of making “rational” choices, but we are stellar when it comes to rationalizing the choices we have made.  So, even our most fundamental decisions – which tools and methods to use – are not made with the rigor that would be demanded of decisions made in an engineering setting.  Of course, this is entirely separate from how we go about actually building things using those tools, which can only be characterized as relative chaos.  How and why we are able to make such abundantly arbitrary choices is something I addressed in my last post on this topic.

And here is where I appeal to a somewhat more respected authority.  Aside from her expertise in the software methodology space, Ms. Rising’s credentials are fairly impeccable, apparently having had some significant involvement in the software development process attached to the design of the Boeing 777 airliner.  So, I’d say she’s in a fairly secure position to be making informed comparisons between engineering, science, and what we do when we build software systems.  Just one more data point, really, but a strong one.


1 A description of this talk is here, but I wasn’t able to find a recording of it.  However, there is an InfoQ interview with Ms. Rising on the topic.

Hosting Maven Repos on Github

UPDATE: If you’re using Clojure and Leiningen, read no further. Just use s3-wagon-private to deploy artifacts to S3. (The deployed artifacts can be private or public, depending on the scheme you use to identify the destination bucket, i.e. s3://... vs. s3p://....)
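
For the curious, here is a minimal sketch of what that looks like in a project.clj.  The plugin version, bucket name, and credential keys below are placeholders of mine, not gospel; consult the s3-wagon-private README for the current incantation.

```clojure
;; Hypothetical fragment; plugin version, bucket name, and credential
;; mechanism are placeholders -- consult the s3-wagon-private README.
(defproject com.example/whatever "0.1.0-SNAPSHOT"
  :plugins [[s3-wagon-private "1.3.4"]]
  ;; s3p:// keeps the bucket private; s3:// serves artifacts publicly
  :repositories [["private" {:url "s3p://example-bucket/releases/"
                             :username :env/aws_access_key_id
                             :passphrase :env/aws_secret_access_key}]])
```

With that in place, lein deploy private pushes your artifacts to the bucket just as it would to any other remote repository.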


Hosting Maven repos has gotten easier and easier over the years.  We’ve run the free version of Nexus for a couple of years now, which owns all the other options feature-wise as far as I can tell, and is a cinch to get set up and maintain.  There’s a raft of other free Maven repository servers, plain-Jane FTP options, and various recipes on the ‘nets for serving up a Hudson instance’s local Maven repository for remote usage.  Finally, Sonatype began offering free Maven repository hosting (via Nexus) for open source projects earlier this year, which comes with automatic syncing/promotion to Maven Central if you meet the attendant requirements.

Despite all these options, I continue to run into people that are intimidated by the notion of running a Maven repo to support their own projects – something that is increasingly necessary in the Clojure community, where all of the build tools at hand (clojure-maven-plugin, Leiningen, Clojuresque [a Gradle plugin], and those brave souls that use Ant + Ivy) require Maven-style artifact repositories.  Some recent discussions on this topic reminded me of a technique I came across a few months ago for hosting a maven repository on Google Code (also available for those that use Kenai).  This approach (ab)uses a Google Code subversion repo as a maven repo (over either webdav or svn protocols). At the time, I thought that it would be nice to do the same with Github, since I’ve long since sworn off svn, but I didn’t pursue it any further then.

So, perhaps people might find hosting Maven artifacts on Github more approachable than running a proper Maven repository.  Thankfully, it’s remarkably easy to get set up; there’s no rocket science here, and the approach is fundamentally the same as using a subversion repository as a Maven repo, but a walkthrough is warranted nonetheless.  I’ll demonstrate the workflow using clutch, the Clojure CouchDB library that I’ve contributed to a fair bit.

1. Set up/identify your Maven repo project on Github

You need to figure out where you’re going to deploy (and then host) your project artifacts.  I’ve created a new repo, and checked it out at the root of my dev directory.

[catapult:~/dev] chas% git clone
Initialized empty Git repository in ~/dev/cemerick-mvn-repo/.git/
warning: You appear to have cloned an empty repository.

Because Maven namespaces artifacts based on their group and artifact IDs, you should probably have only one Github-hosted Maven repository for all of your projects and other miscellaneous artifact storage.  I can’t see any reason to have a repository-per-project.

2. Set up separate snapshots and releases directories.

Snapshots and releases should be kept separate in Maven repositories.  This isn’t a technical necessity, but will generally be expected by your repo’s consumers, especially if they’re familiar with Maven.  (Repository managers such as Nexus actually require that individual repositories’ types be declared upon creation, as either snapshot or release.)

[catapult:~/dev] chas% cd cemerick-mvn-repo/
[catapult:~/dev/cemerick-mvn-repo] chas% mkdir snapshots
[catapult:~/dev/cemerick-mvn-repo] chas% mkdir releases

3. Deploy your project’s artifacts to your Maven repo

A properly-useful pom.xml contains a <distributionManagement> configuration that specifies the repositories to which one’s project artifacts should be deployed.  If you’re only going to use Github-hosted Maven repositories, then we just need to stub this configuration out (doing this will not be necessary in the future1):
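
The original snippet didn’t survive here, so what follows is a reconstruction: the repository IDs are chosen to line up with the altDeploymentRepository arguments used later in this walkthrough, and the URLs are illustrative placeholders.

```xml
<!-- Reconstructed stub; IDs and URLs are illustrative placeholders -->
<distributionManagement>
  <repository>
    <id>releases</id>
    <url>http://github.com/cemerick/cemerick-mvn-repo/raw/master/releases/</url>
  </repository>
  <snapshotRepository>
    <id>snapshot-repo</id>
    <url>http://github.com/cemerick/cemerick-mvn-repo/raw/master/snapshots/</url>
  </snapshotRepository>
</distributionManagement>
```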


Usually, URLs provided here would describe a Maven repository server’s API endpoint (e.g. a webdav URL, etc).  That’s obviously not available if Github is going to be hosting the contents of the Maven repos, so I’m just using the root URLs where my git Maven repos will be hosted from; as a side effect, this will cause mvn deploy to fail if I  don’t provide a path to my clone of the Github Maven repo.

Now let’s run the clutch build and deploy our artifacts (which handily implies running all of the project’s tests), providing a path to our repo’s clone directory using the altDeploymentRepository system property (heavily edited console output below)2:

[catapult:~/dev/cemerick-mvn-repo/] chas% cd ../vendor/clutch
[catapult:~/dev/vendor/clutch] chas% mvn -DaltDeploymentRepository=snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots clean deploy
[INFO] Building jar: ~/dev/vendor/clutch/target/clutch-0.2.3-SNAPSHOT.jar
[INFO] Using alternate deployment repository snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots
[INFO] Retrieving previous build number from snapshot-repo
Uploading: file:../../cemerick-mvn-repo/snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar
729K uploaded  (clutch-0.2.3-SNAPSHOT.jar)

That looks happy-making.  Let’s take a look:

[catapult:~/dev/vendor/clutch] chas% find ~/dev/cemerick-mvn-repo/snapshots

That there is a Maven repository.  Just to briefly dissect the altDeploymentRepository argument:

snapshot-repo::default::file:../../cemerick-mvn-repo/snapshots

It’s a three-part descriptor of sorts:

  1. snapshot-repo is the ID of the repository we’re defining, and can refer to one of the repositories specified in the <distributionManagement> section of the pom.xml.  This allows one to change a repository’s URL while retaining other <distributionManagement> configuration that might be set.
  2. default is the repository type; unless you’re monkeying with Maven 1-style repositories (hardly anyone is these days), this is required.
  3. file:../../cemerick-mvn-repo/snapshots is the actual repository URL, and has to be relative to the root of your project, or absolute. No ~ here, etc.

4. Push to Github

Remember that your Maven repo is just like any other git repo, so changes need to be committed and pushed up in order to be useful.

[catapult:~/dev/cemerick-mvn-repo] chas% git add *
[catapult:~/dev/cemerick-mvn-repo] chas% git commit -m "clutch 0.2.3-SNAPSHOT"
[master f177c06] clutch 0.2.3-SNAPSHOT
 12 files changed, 164 insertions(+), 2 deletions(-)
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.jar.sha1
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/clutch-0.2.3-SNAPSHOT.pom.sha1
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.md5
 create mode 100644 snapshots/com/ashafa/clutch/0.2.3-SNAPSHOT/maven-metadata.xml.sha1
[catapult:~/dev/cemerick-mvn-repo] chas% git push origin master
Counting objects: 24, done.
Delta compression using 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (19/19), 669.07 KiB, done.
Total 19 (delta 1), reused 0 (delta 0)
 f57ccba..f177c06  master -> master

5. Use your new Maven repository

Your repository’s root will be at http://github.com/<your-github-username>/<your-github-maven-project>/raw/master/

Just append snapshots or releases to that root URL, as appropriate for your project’s dependencies.

You can use your Github-hosted Maven repository in all the same ways as you would use a “normal” Maven repo – configure projects to depend on artifacts from it, proxy and aggregate it with Maven repository servers like Nexus, etc. The most common case of projects depending upon artifacts in the repo only requires a corresponding <repository> entry in their pom.xml, e.g.:
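
Such an entry might look like the following sketch (the ID is arbitrary, and the username/repo placeholders need to be filled in with your own):

```xml
<!-- Hypothetical example; substitute your Github username and repo name -->
<repository>
  <id>github-snapshots</id>
  <url>http://github.com/YOUR-USERNAME/YOUR-MAVEN-REPO/raw/master/snapshots/</url>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>
```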



Pros

  1. Administrative simplicity: there are no servers to maintain, no additional accounts to obtain (i.e. compared to using Sonatype’s OSS Nexus hosting service), and the workflow is very familiar (at least to those that use git).
  2. Configuration simplicity: compared to (ab)using Google Code, Kenai, or any other subversion host as a Maven repository, the project configuration described above is far simpler.  Subversion options require adding a build extension (for either wagon-svn or wagon-webdav) and specifying svn credentials in one’s ~/.m2/settings.xml.
  3. Tighter alignment with the “real world”: ideally, every artifact would be in Maven Central, and every project would use Hudson and deploy SNAPSHOT artifacts to a Maven repo.  In reality, if you want to depend upon the bleeding edge of most projects – which aren’t regularly built in a continuous integration environment and which don’t regularly throw off artifacts from HEAD – having an artifact repository that is (roughly) co-located with the source repository that you control is very handy.  This is true even if you have your own hosted Maven repo, such as Nexus; especially for SNAPSHOTs of “vendor” artifacts, it’s often easier to simply deploy to a local clone of your git-hosted Maven repo and push that than it is to recall the URL for your Maven server’s third-party snapshots repository, or constantly be adding/modifying a <distributionManagement> element to the projects’ pom.xml.

Caveats / Cons

  1. This practice may be considered harmful by some.  Quoting here from comments on another description of how to host Maven repositories via svn:

    This basically introduces microrepository. Users of these projects require either duplicate entries in their pom: one for the dependency and one for the repository, or put great burden on the maintainer of the local repository to add each microrepository by hand….So please instead of this solution have your Maven build post artifacts to a real repository like Java’s, Maven’s or Sontatype’s.
    – tbee

    Please do NOT use this approach. Sonatype provide free hosting of a Maven Repo for open source projects and BONUS! you get syncing to Maven Central too for free!!!
    – Stephen Connolly

    tbee’s point is that, the more “microrepositories” there are, the more work there is for those that maintain their own “proper” Maven repository servers, all of which provide a proxying feature that allows users of such servers (almost always deployed in a corporate environment as a hedge against network issues, artifact rot, and security/provenance issues, among other things) to specify only one source repository in their projects’ pom.xml configurations, even if artifacts are actually originating from dozens or hundreds of upstream repositories.  I don’t really buy that argument, insofar as the cat is definitely out of the bag vis-à-vis the proliferation of Maven repositories. Our Nexus install proxies around 15 Maven repositories in addition to Maven Central, and I suspect that that’s very, very low compared to the typical Nexus site. I’ll bet there are hundreds – perhaps thousands – of moderately-active Maven repositories in the world already.

    I agree with Stephen that if you are willing to get set up with deployment rights to Sonatype’s OSS Maven repo, you should do so.  Not everyone is though – and I’d rather see people using dependency management than not, and sharing SNAPSHOT builds rather than not. In any case, if you use this method, know that you’ve strayed from the Maven pack to some extent.

  2. The notion of having to commit the results of a mvn deploy invocation is very foreign.  This is particularly odd for release artifacts, the deployment of which is ostensibly transactional (yes, the git commit / git push workflow is atomic as far as other users of the git/Maven repository are concerned, but I’m referring to the deployer’s perspective here).  The subversion-based Maven hosting arrangements don’t suffer from this workflow oddity, which is nice.  I suppose one could add a post-commit hook to immediately push results, but that’s just illustrating the law of conservation of strangeness, sloughing off unusual semantics from Maven-land to the Realm of git.  You could fall back to using Github’s svn write support, but then you’re back in subversion-land, with the configuration complexity I noted earlier.
  3. Without a proper Maven repository server receiving deployed artifacts, there will never be any indexes offered by these git-hosted Maven repos.  The same goes for subversion-hosted Maven repos as well. Such indexes are a bit of a niche feature, but are well-liked by those using Maven-capable tooling (such as the Eclipse and NetBeans IDEs).  It would be possible to generate and update those indexes in conjunction with the deployment process, but that would likely require a new Maven plugin – a configuration complication, and perhaps enough additional work to make deploying to Sonatype’s OSS repo or hosting one’s own Maven repository worthwhile.
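
For what it’s worth, the post-commit hook mentioned in the second caveat could be as small as the sketch below; the repo path, remote name, and branch are assumptions, so adjust to taste.

```shell
# Sketch only: installs a post-commit hook into a git-hosted Maven repo
# clone so every deployment commit is pushed immediately.  REPO, the
# remote name, and the branch are assumptions -- adjust to taste.
REPO=${REPO:-.}
mkdir -p "$REPO/.git/hooks"
cat > "$REPO/.git/hooks/post-commit" <<'EOF'
#!/bin/sh
# push each deployment commit to Github as soon as it lands
exec git push origin master
EOF
chmod +x "$REPO/.git/hooks/post-commit"
```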


  1. The fact that Maven currently requires a stubbed-out <distributionManagement> configuration is a known bug, slated to be fixed for Maven 2.5. Specifying the deployment repository location via the altDeploymentRepository will then be sufficient.
  2. An alternative to using the altDeploymentRepository option would be to make the git Maven repository a submodule of the project repository.  This would require that the git Maven repo be the canonical Maven repo for the project, and would imply that all the usual git submodule gymnastics be used when deploying to the git Maven repo.  I’ve not tried this workflow myself, but might be worth experimenting with.

Programming and software development, medium-rare

In a past life as a “software engineer” on contract, a favorite analogy of coworkers was to compare software development to construction (perhaps influenced by the early 2000’s housing boom?). Projects were like houses, plans were needed to properly architect the end result, and schedules were laid down to ensure that the “foundations” were done before we started “framing” the building and “furnishing” the details. Conceptualizing software development in such a way is common, and there’s a long history of people involved in what has casually been called “software engineering” thinking that what they do is, or should be, related to “old-world” engineering.

This is largely both nonsense and decidedly unnerving.

I’m not like my grandfathers

Both of my grandfathers were involved in engineering; knowing something of what they did makes me even more sure that what I do is not related to engineering.

One was an electrician, more a tradesman than an engineer, who worked in the construction of large commercial buildings in downtown Hartford and Denver. Each day, he would look at the blueprints painstakingly drafted by the project’s architects and engineers, and go about making those plans a reality – installing high-voltage switching equipment, stringing miles of Romex (or, whatever they used back then), and doing all of it in hazardous conditions.

My other grandfather was, as far as I remember, something of a process engineer at a subsidiary of Olin, helping to design the manufacturing processes that would heat, pound, roll, cut, stamp, and test thousands of varieties of copper and stainless steel foils, strips, and other formulations for later inclusion in all sorts of products, both industrial- and consumer-related.

These men’s careers were very different, but they were involved in what are clearly engineering tasks.

An art’s constraints are what define the art

There’s a lot that separates my discipline from my grandfathers’, but I think the most significant is that, as someone who builds software, I have far more discretion in how I achieve my end results than they had in their work. The degree to which this is the case cannot be overstated, but I’m at a loss for words as to how to concisely characterize it. Materials in the real world behave in ways that are, in the modern age anyway, understood: electricity and copper and steel and wood have known physical characteristics and respond in known ways to all of the forces that might be applied to them.1

In contrast, the world of software has so many degrees of freedom in so many vectors, the possibilities are functionally limitless. This is both a blessing and a curse, as it means that the programmer is something of a god within her domain, free to redefine its fundamental laws at will. Given this context, it’s no wonder that software projects fail at an astounding rate: we simply have nothing akin to the known natural constraints that are conveniently provided for our real-world engineer friends, so we have no option but to discover those constraints as we go along.

The software community’s response to this has been to erect artificial constraints in an effort to make it possible to get things done without simply going insane: machine memory models with defined semantics, managed allocation, garbage collection, object models, concurrency models, type systems, static analysis, frameworks, best practices, software development methodologies for all stripes and inclinations. This is natural, and good, and the only way we could possibly make sense of this thing called software given how young it is.

Yet, even after all this edifice, ask 100 software developers how to build a website, and you will get at least 500 answers – and the people that respond with a hesitant “It depends” are likely the most clueful of the group. If my grandfather had responded “It depends” to a question about how to produce a 1-ton spool of 2mm-thick copper strip, he’d have been fired on the spot.2

Confessions and snotty ego trips

Interlude: If Top Chefs were programmers

Susur’s work is remarkably intricate and exhibits a delicate precision unmatched by his peers. He prefers the crystalline perfection of a perfectly-balanced type system, and so chooses Haskell for most of his work.

Living up to his “Obi-Wan” moniker, Jonathan makes the most familiar things amazing, and people are never sure how. His secret is that he usually works in some lisp or scheme, which he then uses to generate whatever comforting form his customers prefer, often Java or C#.

Marcus likes to surprise people with the most esoteric things possible: his pièce de résistance is written using his own prolog variant, implemented in Factor. People love it either for the ballsiness of it all, or because his stuff works so well they don’t notice.

Rick is a madman, insisting on using C++ for everything, even web applications.

Okay, so software development isn’t engineering. It’s probably safe to say it’s a craft, though (in contrast to pure art like painting or sculpting, but that’s a different blog post). In particular, its similarities to cooking are legion, something that I noticed while watching one of the few bits of television I bother with (and a guilty pleasure), Top Chef (though I redeem myself oh-so-slightly by preferring the Masters incarnation by a mile). *blush*

Watching this show with an eye towards the methods and attitudes of the chefs is like watching a group of programmers struggle to build quality software. Functionally no constraints in process, methodology, and materials? Check. Infinitely many ways to get the job done depending on the skill of the craftsman, the tastes of the customers, and the specifics of the materials? Check. Identifying the best craftsmen is difficult, and most of those at the top of their field have a murky combination of ill-defined qualities? Check. A wide disparity between the effectiveness of individuals in the field? Check. At the edges, people experiment with hacks that sometimes come to be part of everyone’s repertoire? Check. Technical capability is not a necessary requirement for success, due to fluctuating trends (less charitably called “fashion”) and a variety of potentially-compensating personal characteristics? Check. Less mature (and often less capable) members of the community are given to snotty ego trips, bad attitudes, and frequent tantrums? Check.

Given all of the similarities, I think the fact that it’s difficult to assess the quality of individuals in either field is the most striking (and perhaps a key indicator that a field has not internalized, or intrinsically cannot internalize, a certain minimum degree of rigor in its methods). It’s telling that TopCoder predated Top Chef by a decade. Great hackers and those guys that can hit the high notes are few and far between, mixed in with scads of cooks that come to work stoned and “programmers” that put HTML on their resumés and still manage to get hired, somewhere. These are fields that are far from science, far from engineering, and well within the lush, rolling hills of craft.

Enough already. So what?

First, words matter and it’s important to call a spade a spade. Thoughtlessly (or disingenuously) call software development “engineering”, and people walk off with all sorts of notions that are inappropriate. This is particularly damning with customers and other nontechnical “civilians”.

Second, craft, as revered a concept as it is, is not desirable when businesses, livelihoods, and actual lives are at stake. I’ve never worked on aeronautics software or the like, but despite that field’s codified standards, rigorous testing protocols, and access to millions and billions of dollars in resources, we’ve crashed satellites into planets because of something as triflingly simple as a conversion between metric and standard measures. Similar examples are everywhere. We need software to be built with an engineering discipline, because everything we do depends upon it.

Of course, I have no tidy answers for that challenge. I think that if we can pull ourselves out of this primordial ooze of twiddling bits and get to a point where we describe how to do things in terms relevant to the domains we’re working in, then there’s a chance for the species to survive. Ideally, domain experts should be doing the bulk of the “programming” work, given that communication between such experts and programmers is probably the source of at least half, if not the vast majority, of software design flaws. This is an old chestnut, dating back to 4GL programming languages (now 5GL?), declarative programming, expert systems, and so on.

Programmers’ work will be done when we’ve put ourselves out of a job

Roughly, I think the point is to solve for X:

blueprint:building :: X:code 3

One strain of real-life examples – such as Pipes, DabbleDB, and WuFoo – allows people to collect and process data without touching a line of code. The best specimens of these kinds of systems are often dismissed with the “tool” slur by programmers, but the fact is that such things allow for greater aggregate productivity and less possibility for error than any similar purpose-built application. Meta-software that can deliver the same for any given domain is the future, and the result will be that programmers, as they are known today, should cease to exist in commercial settings.

The craft of programming will survive though, and continue to be the source of innovation. After all, there’s no shortage of demand for great chefs, even with all these McDonald’s and Olive Gardens dotting the landscape.

1 There are obviously still unknowns in the physical world, but those are irrelevant insofar as we’re contrasting engineering and software development within the scope of commercial projects. Research matters are an entirely separate issue in both camps.
2 Another good question to ask: Can you estimate time and materials for projects…and be correct within a 100% margin of error? If so, congratulations, yours might be an engineering discipline.
3 From an informal, but very high quality discussion on this topic in #clojure IRC:

Securing web services in a world with few options


We’re building a web service for which we aim to charge money. Further, the data being pushed around may be confidential or otherwise of a sensitive nature. We have good reasons to do everything we can to ensure that the service is secured “properly”:

  • We don’t want to have customers charged for work that is requested by a bad actor exploiting a security hole (of course, we’d issue a refund and an apology in such a case, but the impact to our business through unnecessary processing could be sizable).
  • We don’t want our customers’ data exposed; common vectors for this include sniffing, replay attacks, or simply the use of compromised credentials.

Of course, the impact on our relationship with our customers due to any security breach could be significant and devastating – to our business, our reputation, and potentially even to our customers’ affairs completely outside of their use of our web service. So again, we have a lot of reasons to be highly-motivated when it comes to security.

By way of context, let’s set the stage with regard to the moving pieces. The web service in question:

  • is built on a JVM stack (with the application itself built with Clojure, of course, using the Compojure framework)
  • has a user-facing, HTML browser interface as well as a “RESTful” API surface (“RESTful”, as in, pretty darn close to ROA “style”, so the set of URIs involved in delivering the user-facing interface vs. those delivering the REST API are nearly identical).
    • the user-facing interface offers standard form-based authentication, as well as OpenID authentication (which will be recommended only for more casual users and usage).
  • will always, always be delivered over SSL. We assume that every bit of data transferred is confidential, so cleartext is an absolute no-no.

OK, let’s go find an expert

It is with this mindset that I’ve been digging into how to approach web service security. Note that I’m no specialist or expert in this area – I’m merely a practitioner that is usually focused on things far, far away from anything security-related. (It may not surprise you that I’m coming to appreciate that fact more and more as I learn about the “state of the art” in web service security.)

Given this, I set out a few weeks ago to see where things stand on the web service security front. Of course, that realm is just as full of cliques and posturing and strawmen and ad hominem attacks as the broader software development world is, so finding a clear path forward is not easy. First, a bit of literature review, as it were, drawn in particular from a flurry of web service security chatter a few years ago (emphasis here and there is mine; I wish I had noticed and grokked the indicated bits earlier, as I’ll explain below):

  • I started by finding Gunnar Peterson’s pair of posts where he compares “REST security” with WS-Security stuffs, where the former (especially approaches like HTTP Basic authentication over SSL) come out sounding like a pretty bad choice:

    people who say REST is simpler than SOAP with WS-Security conveniently ignore things like, oh message level security

    Now if you are at all serious about putting some security mechanisms in to your REST there are some good examples [such as Amazon’s implementation of an HMAC authentication scheme].

    Some people in the REST community are able to see the need for message level security so this is heartening somewhat. If the data is distributed and the security model is point to point (at best), we have a problem.

  • Pete Lacey lays out the counterpoint, saying that SSL works just fine for tons and tons of use cases, thankyouverymuch.

    In summary, RESTful security, that is SSL and HTTP Basic/Digest, provides a stable and mature solution that addresses transport level credential passing, encryption, and integrity. It is ubiquitous, simple, and interoperable. It requires no out-of-band contract negotiation or a priori knowledge of how the resource (okay, service) is secured. It leverages your existing security infrastructure and expertise. And it addresses 99% of the use cases you are likely to encounter. SSL does not support message level security, and if that’s a requirement, then leveraging SOAP and WSS makes sense.

  • Unsurprisingly, Sam Ruby backs up Pete Lacey, but the comments on that post are interesting:
    • From Gunnar Peterson:

      I am no way suggesting there is only way to do this or that WS-Security came down on stone tablets. I am also not suggesting that a NSA level of security is appropriate for Google Maps. There are many shades of gray. “good enough” security is a big challenge, and it isnt about black and white security models, it is about risk management

    • From Bill de hÓra:

      I think this is where quantative analysis comes in and a measured assessement of the risk is taken. What has to be protected and what’s the worthwhile cost of doing so? Being software people, that’s beyond the general state of the art. We do gut feelings, flames and opinions.

  • There’s a variety of “REST security ‘best practices'” posts out there, but a question from StackOverflow links to a variety of additional discussions there that serve as good an indication as any that the accepted way of securing REST web services is Basic auth over SSL.
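For the curious, the client side of that consensus approach is almost embarrassingly small; Basic auth is nothing more than a base64-encoded username:password pair in a header, which is exactly why it’s only tolerable over SSL. A sketch in Python (the endpoint and credentials here are made up):

```python
import base64
import urllib.request

def basic_auth_request(url, username, password):
    """Build a request carrying an HTTP Basic Authorization header.

    Basic auth is just the base64 encoding of "username:password" -- no
    signing, no digesting -- so the transport (SSL/TLS), not the scheme,
    is what provides confidentiality.
    """
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {credentials}")
    return request

# Hypothetical endpoint; the header construction is the only point here.
req = basic_auth_request("https://api.example.com/jobs", "alice", "s3cret")
assert req.get_header("Authorization").startswith("Basic ")
```

The scheme’s simplicity is its whole appeal: no canonicalization, no signature verification, no clock skew to worry about. The transport does all of the heavy lifting.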

Before moving on, I just want to point out that Bill de hÓra’s comment above is sadly representative of so many corners of software development.  Let’s ponder that for a moment, while realizing that modern society and its continuation absolutely depends upon the software we build (I’m talking collectively, here).

Take a deep breath

Of course, the above is not an exhaustive survey, just the best tidbits I found over the course of a lot of browsing and searching. Here’s the upshot, as I see it:

  1. WS-Security et al. ostensibly provide message-level security, ensuring that your messages can be passed along by untrusted intermediaries.
  2. Standard HTTP authentication (generally Basic) over SSL transport is the de facto standard for securing REST services, but it does nothing for you if message security is important.
  3. More sophisticated authentication mechanisms are available – in particular HMAC, as exemplified by Amazon’s web services – which allow services to ensure that a message’s author has not been impersonated. This would resolve the potential holes of .

Unfortunately, I didn’t grok the whole message vs. transport security issue as quickly as I should have, where SSL provides the latter but the former would only be satisfied by something like WS-Security (again, ostensibly, I certainly can’t vouch for it) or HMAC-SHA1 if one were working in a REST environment. If I had come to grips with that point of tension earlier, I would have arrived at my two conclusions much faster:

  1. In our situation, message security is simply not relevant. As Peterson wrote (and I quoted above) “If the data is distributed and the security model is point to point (at best), [REST has] a problem.” Well, in our case, data is not distributed, it is transmitted point-to-point (between our customers and us, a third-party external web service), so transport security provided by SSL should be sufficient.
  2. Here’s the biggie: assuming we support form-based authentication (of course, over SSL) for browser-based UI interaction, supporting anything more sophisticated than HTTP Basic authentication over SSL for our REST API interactions would be a waste of resources. We could go full-tilt and require HMAC-SHA1 for the REST API or provide only a SOAP API that used WS-Security (and whatever else goes into that), but that would mean nothing if an attacker has the “REST API” provided for browser use available to him. Given this, transport security provided by SSL, and that alone, is simply all we can do.  Put another way: when browser-level security mechanisms improve, then so will our APIs’.

An alternative path would be to host a parallel service, available via a REST API secured via HMAC-SHA1 or a WS-Security-enabled SOAP API, that did not provide any kind of browser-capable entry point. Customers could opt into this if they thought the tradeoff was important. Doing this would be technically trivial (or, perhaps, only moderately difficult w.r.t. the SOAP option), but I’ve no idea whether the additional degree of security provided by such a parallel service would be of any interest to anyone.
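For reference, here’s roughly what an Amazon-style HMAC scheme involves on both ends. This is only a sketch, in Python rather than Clojure for brevity, and the canonical string below is a simplified stand-in – real schemes (Amazon’s included) specify the exact canonicalization of method, headers, and query parameters:

```python
import base64
import hashlib
import hmac

def sign_request(secret_key, method, path, timestamp):
    """Sign a request HMAC-SHA1 style, loosely after Amazon's scheme.

    The canonical string here is a simplified stand-in; a real scheme
    pins down the exact canonicalization. The secret itself never
    crosses the wire -- only the signature does.
    """
    canonical = "\n".join([method, path, timestamp])
    digest = hmac.new(secret_key.encode(), canonical.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def verify_request(secret_key, method, path, timestamp, signature):
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_request(secret_key, method, path, timestamp)
    return hmac.compare_digest(expected, signature)

sig = sign_request("secret", "GET", "/jobs/42", "2009-06-01T12:00:00Z")
assert verify_request("secret", "GET", "/jobs/42", "2009-06-01T12:00:00Z", sig)
```

Because the server recomputes the signature from its own copy of the secret, a request whose method, path, or timestamp has been tampered with (or, thanks to the timestamp, replayed much later) fails verification – exactly the property that plain Basic-over-SSL doesn’t provide, and exactly the property a browser client can’t take advantage of.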

By the way, if I’m totally blowing this, and my conclusions are completely broken, do speak up.

Coming soon: Part II of my investigation/thinking on the subject of web service security, related to OpenID and the management of credentials in general…which should give me all sorts of new opportunities to say foolish things!

Working with git submodules recursively

Git submodules are a relatively decent way to compose multiple source trees together, but they definitely fall short in a number of areas (which others have discussed at length elsewhere). One thing that immediately irritated me was that there is no way to recursively update, commit, push, etc., across all of one’s project’s submodules. This is something I ran into immediately upon moving to git from svn some months back, and it almost scared me away from git (we used a lot of svn:externals, and now a lot of git submodules).

Thankfully, the raw materials are there in git to work around this. (I’ve since noticed a bunch of other attempts to do similar things, but they all seem way more complicated than my approach…maybe it’s the perl? ;-))

Here’s the script we use for operating over git submodules recursively:


#!/bin/sh
case "$1" in
        "init") CMD="submodule update --init" ;;
        *) CMD="$*" ;;
esac

git $CMD
git submodule foreach "$0" $CMD

Throw that into your $PATH (I trim the .sh), chmod +x, and git submodules become pretty pleasant to work with. All this is doing is applying whatever arguments you would otherwise provide to git within each submodule, and their submodules, etc., all the way down. The one special invocation, git-submodule-recur init, just executes git submodule update --init in all submodules.

So, want to get the status of your current working directory, and all submodules? git-submodule-recur status. Want to commit all modifications in cwd and all submodules? git-submodule-recur commit -a -m "some comment". Want to push all commits? git-submodule-recur push. You get the picture.

This script has saved me a *ton* of typing over the past months. Hopefully, it finds a good home elsewhere, too.

Note: Starting in git 1.6.5, git submodule will grow a --recursive option for the foreach, update and status commands. That’s very helpful for the most common actions (and critical for building projects that have submodules in CI containers like hudson), but git-submodule-recur definitely still has a place IMO, especially for pushing.

Update 2009/09/28: I tweaked the git-submodule-recur script to quote the path to the script ("$0" instead of $0); this became necessary when I dropped the script into C:\Program Files\Git\bin in our Windows-hosted Hudson environment.

Whoa, Peter Norvig used some of my code!

I’m generally not one to be impressed by celebrity — you won’t catch me reading People or US Weekly, for example.  However, this morning I noticed with a shimmer of glee that Peter Norvig used some code that I wrote years ago in one of his recent projects.  So, just for the record, if Dr. Norvig ever shows up in US Weekly, I’ll pick one up!

In case you don’t know, Peter Norvig is the Director of Research at Google.  That’s interesting, but the real reason Dr. Norvig holds sway with me is his classic book, Paradigms of Artificial Intelligence Programming.  If it weren’t for that book, I almost certainly would not be doing what I’m doing today.  Its pages are where I came to understand lisps, and began to imagine what was possible and what I might be able to accomplish in computer science (final results yet to be determined, of course).  For that, I am extraordinarily grateful to him (and others, of course, but I’ll wait to talk about them when they get around to using some of my code! ;-) ).

Back to the story.  This morning, I decided to hop onto Google Analytics for a bit to check up on the traffic stats for our various websites.  Lo and behold, in the “top referrals” listing, I saw ‘’; “Well,” I thought to myself, “that’s interesting!”   A quick grep of the server logs (is there a screen in Google Analytics that actually provides you with the full referral URLs?) showed the referral URL to be Dr. Norvig’s “post” from last week, An Exercise in Species Barcoding.

A search of my name on that page shows that he needed a way to calculate the Levenshtein distance (also known as the edit distance) between two large strings — his quick implementation (like most) operated in O(n^2) space, which would have required weeks of processing time in his particular case.  So, he looked around for a more efficient implementation, and found one that I wrote in October of 2003 that operated in linear space bounds (and was, ironically enough, my first-ever contribution to an open source project).  With a couple of tweaks to suit his specific needs, the code I wrote worked out nicely for him.

This story is satisfying and funny (for me, anyway) in a couple of different ways:

First, there’s the fact that (what I would now consider) throwaway work of mine is still floating around the nets six years later.  Remember kids, the Internet never forgets!

Second, it reminded me of what I was doing when I wrote that particular code.  I was building what would later become PDFTextStream’s first ground-truthing system1 (although I don’t think I knew of that term at the time). It’s a lot more sophisticated now, but back in 2003, I was simply trying to set up a “ground truthing” system where the full (vetted and known-good) extracted text from each PDF document in our nascent test repository would be saved off somewhere, and later builds of PDFTextStream would compare its extracted PDF text to those saved files.

Of course, it wouldn’t be practical to require that PDFTextStream produce identical output forever — some amount of slop had to be allowable, because (for example) if an extracted word was output with four spaces before it instead of two, that would generally be acceptable.  For that and other reasons, I wanted to test that current PDF text extracts were the same as the known-good extracts within a defined margin of error.  Unfortunately, I was ground-truthing full document extracts at that time, and most Levenshtein implementations, with their quadratic space requirements, would take a lot of memory to diff the multi-megabyte strings that were involved.

Solution: write my own Levenshtein function (loosely based on a pedagogical implementation by Mike Gilleland that had been incorporated into the Apache commons-lang project) that operated in linear space bounds.  Thankfully, I opted to offer the improvement back to the Apache commons-lang project and to Dr. Gilleland — had I not, Dr. Norvig would never have found that code, and I wouldn’t be writing this right now.
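The linear-space trick itself is simple enough to sketch: the full dynamic-programming table for edit distance only ever consults its previous row, so two rows suffice. This isn’t the code in question — just a Python sketch of the same idea:

```python
def levenshtein(a, b):
    """Levenshtein distance in O(min(len(a), len(b))) space.

    The full DP table is (len(a)+1) x (len(b)+1), but each row depends
    only on the row above it, so keeping two rows is enough -- the key
    to diffing multi-megabyte strings without quadratic memory.
    """
    if len(a) < len(b):
        a, b = b, a  # size the rows by the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

assert levenshtein("kitten", "sitting") == 3
```

For the ground-truthing use, the margin of error falls out naturally: treat an extract as passing if the distance between the current and known-good text, divided by the length of the longer string, stays under some small threshold.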

Third and finally, this story is satisfying because, hell, Peter Norvig used some of my code.  A person I respect and admire has found it convenient to use some minor thing I created years ago, and was thoughtful enough to say so.  I hope I can follow that example as I go along in my travels.

See, Dr. Norvig, I’m still learning from you.


1 Ground truthing is a testing methodology often used in document processing systems where ideal or otherwise known-good output is cataloged, and then actual or current output is compared to it to determine relative accuracy.  PDFTextStream’s current ground-truthing system serves as a semi-rigorous smoke test of its aggregate text extraction accuracy while we’re doing active development, as well as an ironclad regression test for when we’re looking to cut a release.  Thankfully, it’s come a long, long way from the very naive approach I was pursuing in 2003.

Scala Makes Me Think

(…or, “Oh, Dear, Wasn’t I Thinking Before?”)

As my friends will attest, I really enjoy programming languages. I’m one of those language fetishists that talk about “expressiveness” and “concision”, and yes, I’m one of those very strange fellows who blurt out bad Lisp jokes while getting odd looks from innocent bystanders. And while my bread and butter is built in Java, I often find myself yearning for a more expressive language while deploying, customizing, or integrating PDFTextStream (there I go again with the “expressiveness” bit). That yearning can reach almost pathological extremes at times, prompting me to go so far as to sponsor projects that make it possible to use Java libraries (including PDFTextStream) from within Python.

Fortunately, things don’t always have to be so hard. Case in point, I recently dove head-first into Scala, a language that combines object orientation and functional programming into one very tasty stew. Scala has a number of characteristics that make it interesting aside from its merging of OO and FP mechanisms:

  • it is statically-typed, and provides moderately good type inference that enables one to skip most type declarations and annotations
  • it is compiled, which provides a minimum level of performance (sure, it’s actually byte-compiled, but let’s not quibble right now)
  • and the real kicker: it compiles down to Java class files (or .NET IL), thereby enabling it to be hosted on a JVM (or .NET’s CLR), and call (and be called by) other Java (or .NET) libraries

There’s a lot to like here, for programmers from many walks of life, and I could go on and on about how Scala has single-handedly created and filled a great niche, delivering most of the raw power of purely functional languages like Haskell and ML within a JVM-hosted environment with respectable performance. But what has really impressed me has been the way that Scala has improved how I work. In short, it’s made me really think about development again.

I generally have two working styles. In a classic statically-typed environment (say, Java or C#), I tend to generate pretty clean designs, but my level of productivity is very low. I attribute both of these characteristics to the copious amount of actual work (i.e. finger-typing) that has to go into writing Java or C# code, even with the best of tools. See, while I’m typing (and typing, and typing), I’m thinking two, three, four steps ahead, figuring out the design of the next chunk of code. The verbosity of the language gives me time to reason about the next step while my fingers are working off the previous results.

In a dynamically-typed environment (say, Python or Scheme), I tend to be extraordinarily productive, but unless I consciously step back and purposefully engage in design, the code I write is much more complex. In such environments, there’s less finger-typing going on, so I don’t have a natural backlog allowing me to think about the code before it’s already on the screen. Further, I know I can get from point A to point B relatively easily in many circumstances, so I end up skipping the design step, switching into Cowboy Coder mode, and hacking at things until everything works. Oddly enough, in certain circles, this isn’t so much frowned upon as it is recommended.

Scala is statically-typed, so the naive observer might speculate that my working style in Scala would be much the same as in Java. However, I’ve found that working with Scala has prompted (forced?) me to consciously step back and think about everything, at every step along the way: class hierarchies, type relationships in general, testing strategies, eliminating state where possible…the amount of actual thinking I’ve done while working with Scala has far outstripped the amount of reasoning that typically goes into any similar period of coding. Unsurprisingly, this has led to quite the spike in code quality, which translates into productivity through fewer bugs and less rework.

I attribute this to the strong, static typing that Scala enforces, combined with the type inference that Scala provides. The former forces me to reason about what I’m doing (as it does in Java, for instance), but because the latter eliminates so much of the finger-typing associated with static typing in other environments, I’m given the opportunity to realize that a concrete design phase would yield tremendous benefits, regardless of the scope of code in question. I suspect I would find working in Haskell or ML to be a similar experience, but because those languages don’t easily interoperate with the libraries I need to do my work, I’ve never really given them a chance.

Thankfully, I don’t think I’ll have to. Scala is a great environment, and even more important than its technical merits, its design has led me to engage in a more thoughtful, more conscious development process.

Thoughts on Martin Fowler’s Domain Specific Languages Overview

I’m way late in linking to this, but it’s worth it.

Last October, a presentation by Martin Fowler from JAOO 2006 popped up on InfoQ (which does a great job of simulating the actual experience of being at the session with its video/slideshow integration) where he gave a very high-level overview of domain specific languages (DSLs). He really only scratched the surface, but it’s a great introduction for those that haven’t yet thought about DSLs much.

(Of course, that population is getting smaller by the minute thanks to Ruby (and Rails), since it builds in the metaprogramming facilities necessary to implement internal DSLs.)

I recently had occasion to re-watch the presentation. This time around, I took the time to scribble down some thoughts:

  1. I think he played up the potential role of DSLs as “configuration code” too much. Yes, you can tailor a DSL to provide primarily configuration data, and that’s very useful as far as it goes. However, internal DSLs (given an appropriately useful host environment) are able to provide levels of abstraction and composability that go way beyond configuration.
  2. I think that casting the Java example he showed as a DSL is really over the top, and is a result of overemphasizing the potential configuration role DSLs can play. As Mr. Fowler said, the line between an internal DSL and just a bunch of specific goal-driven coding in the host language is fuzzy. However, a big part of that line (and therefore whether an environment can reasonably host a DSL) is how well the host language’s existing constructs can be recast as sensible constructs in the DSL. The Ruby DSL example fits this criterion well, as its range (4..12, etc) and block constructs mapped well to the domain at hand. On the other hand, the Java example is Java, unapologetically so — the explicit object creation, the function calls, return statements, etc., simply do not map to the domain. The fact that the integers and strings being passed in those function calls can be recast as an actual configuration file should not lead us to think that Java configuration code is a functional DSL.
  3. At least in my experience, external DSLs are dead-ends. There’s just too much heavy lifting that needs to be done to consume external DSL “source files” and align their contents with the host language’s environment. True, internal DSLs need to conform to the syntax of their host environment, but the advantages of “symbolic integration” (as Mr. Fowler puts it) and the fact that you get your IDE’s functionality for free are just too compelling to outweigh any nitpicky syntax quibbles that one might have with any DSL-capable language. And, if those syntax quibbles are significant enough, and the problem the DSL is going to solve is significant enough to make you come close to considering building all of the cruft necessary to implement an external DSL, then go find yourself a secondary language/environment that provides a more palatable syntax, and hook everything up with IPC of some kind.
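To make the internal-DSL point concrete — and this is a toy of my own devising, not one of Mr. Fowler’s examples — here’s the flavor in Python, where ordinary method chaining does the work that Ruby’s ranges and blocks do in his example:

```python
class Schedule:
    """A toy internal DSL for recurrence rules.

    Host-language method chaining and keyword arguments double as domain
    vocabulary, which is the hallmark of an internal DSL: no parser to
    build, and the IDE's completion and refactoring come along for free.
    """
    def __init__(self):
        self.rules = []

    def every(self, day):
        self.rules.append(("every", day))
        return self  # returning self is what makes the chaining read fluently

    def at(self, hour, minute=0):
        self.rules.append(("at", hour, minute))
        return self

backup = Schedule().every("sunday").at(3, 30).every("wednesday").at(12)
assert backup.rules == [("every", "sunday"), ("at", 3, 30),
                        ("every", "wednesday"), ("at", 12, 0)]
```

The telling detail is the return self: it costs nothing in the host language, yet it’s the entire difference between code that reads as Python and code that reads as scheduling vocabulary — which is exactly the criterion for whether a host language’s constructs can be sensibly recast as domain constructs.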