Adding Gzip compression to a Clojure webapp in 30 seconds

As you might have seen, I’m working on a new web project, which happens to involve shipping a metric ton of content to each user’s browser upon visiting the meat of the site.  We’re talking about something like 1.5MB of HTML, Javascript, and CSS, and that’s after best-effort minification and such.

Clearly, Gzipping the whole mess is called for.  I’ve never worked on any high-volume sites that called for such measures, so this is a new requirement for me.  The site’s backend is implemented in Clojure though, so my first instinct was to Google “gzip ring clojure” (Ring being the thoroughly spectacular Clojure web framework), whereupon I found Michael Stephens’ ring-gzip-middleware project.  Seems simple enough: Ring request handlers are just functions, and you can apply middleware trivially via function composition, so ring-gzip-middleware provides a function that wraps your assembled Ring handler(s) to compress outgoing responses appropriately.

Thankfully, I stopped short just in the nick of time: while Ring middleware is a damn fine hammer, Gzip response compression shouldn’t be a nail in this scenario.  I didn’t want to have to read through entirely unrelated infrastructure-related bits in my codebase henceforth, no matter how elegantly they folded themselves in.  There remains value in the notion of separation of concerns.

Again thankfully, Clojure web apps are Java web apps, universally deployed in a servlet container like Tomcat, Jetty, GlassFish, and so on. So, I have quite the tasty menu to choose from.

Container-provided Gzip compression

Most, if not all, Java servlet containers provide Gzip response compression out of the box.  Tomcat, for example, requires simply adding a couple of attributes to the Connector element in its server.xml file.  Jetty handily provides Gzip compression of static resources via its default servlet; just set its gzip init parameter to true in your web.xml file [1], and you’re done:
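The snippet ends up looking something like this (the servlet class shown assumes Jetty 6’s package naming; Jetty 7 and later use org.eclipse.jetty.servlet.DefaultServlet instead):

<servlet>
  <servlet-name>default</servlet-name>
  <servlet-class>org.mortbay.jetty.servlet.DefaultServlet</servlet-class>
  <init-param>
    <param-name>gzip</param-name>
    <param-value>true</param-value>
  </init-param>
</servlet>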


I think you can leave out the <servlet-class> element there; I didn’t experiment much on this path, because:

  1. I needed to be able to Gzip dynamically-generated content, and
  2. I use Jetty in my development environment, but deploy to Tomcat in production, so I wanted a general-purpose solution.

Thus, what I consider to be the ideal approach:

Gzip Servlet Filters

Servlet filters are independent, composable components that can dynamically modify requests and responses, the circa-1999 Java corollary to Ring middleware.  The difference is largely in packaging and context: while Ring middleware is just a function that can be folded into a codebase programmatically, servlet filters are specified statically as part of an application’s web.xml file, and in general are not modified from within the servlet at runtime.  (Things can get interesting when you implement servlet filters using Clojure, but that’s perhaps a topic for another post.)

There are various Gzip compression servlet filter implementations floating around the ‘nets, including one particularly bad example that appeared in some magazine in 2004 and has an unreasonable amount of Google juice associated with it for some reason.  Use any of them, and your Clojure web application will be Gzip-ready, regardless of which servlet container you deploy to.  For my money, the best one is provided by the Jetty project, simply because it’s impossible to argue with its provenance: Jetty is used everywhere, and if there were a problem with its Gzip servlet filter implementation, it certainly would’ve been found out by now.

Using it is cake; add the corresponding dependency to your pom.xml file:
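Depending on which Jetty line you’re tracking, that’s something like the following (in Jetty 6 the filter lives in the jetty-util artifact; in Jetty 7 it moved to org.eclipse.jetty:jetty-servlets, so adjust the coordinates and version to taste):

<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>6.1.24</version>
</dependency>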


and add it to your web.xml file with a corresponding URL mapping:
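Again assuming the Jetty 6 filter class (it’s org.eclipse.jetty.servlets.GzipFilter in Jetty 7+), that amounts to:

<filter>
  <filter-name>GzipFilter</filter-name>
  <filter-class>org.mortbay.servlet.GzipFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>GzipFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>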




All my content is now Gzip-compressed, both dynamically-generated and static (because I always use the servlet container’s default servlet for serving up static resources).  I didn’t have to make any changes to my codebase, and I’ll never once be reminded of Gzip compression when I fiddle with my Ring handlers.

The Jetty Gzip filter has various options for tweaking which mime types and sizes of content should be included and for excluding specific user agents from receiving Gzipped content, but I’ll just leave the defaults alone for now (i.e. compress everything for everyone).
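If you do want to turn those knobs later, they’re exposed as init-params on the filter; the exact parameter names vary a bit between Jetty releases, so check the GzipFilter javadoc for the version you’re using, but it’s roughly along these lines (added inside the <filter> element above):

<init-param>
  <param-name>minGzipSize</param-name>
  <param-value>1024</param-value>
</init-param>
<init-param>
  <param-name>mimeTypes</param-name>
  <param-value>text/html,text/css,application/javascript</param-value>
</init-param>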

Postscript: Wait, what’s with all the XML in this “Clojure web app”?

Some people are allergic to parentheses; some are allergic to XML; I choose to find peace in both as appropriate. :-)

That Jetty Gzip filter does a good job of something I don’t want to think about.  Just as core vs. context is a useful frame in business affairs, I think it’s handy when thinking about how to approach software development.  I don’t get points for having a “pure Clojure” stack for my web application, especially if it means giving up a reasonable separation of concerns, or entertaining FUD about whether a reimplementation of a fundamentally commodity operation is really up to spec.  It may very well be, but even when that bet pans out, there’s no real gain to be had when safer alternatives exist for such matters of context.

Thus, I chose the ~5-year-old Gzip filter from Jetty, just like I often choose to use boringly reliable Java-land libraries (e.g. Spring Security) and tools (e.g. Maven and Eclipse) to support far more interesting things for which I use bleeding-edge kit like Clojure.

  1. You know what a web.xml file is, right?  Every Java web application has one, whether it’s generated by your build process or you create it yourself.  The latter is generally preferable IMO, simply because you can take advantage of all the goodies that it opens up for you.  You can read generalities about web.xml files all over the web; there’s a sample Clojure web project over here that contains a simple example, and I talk about them a bit in my post and screencast here.

Oracle VP: “We have a strategy to run Java inside a Javascript environment”

This statement from Adam Messinger – the Vice President of Java Development at Oracle – was shocking to me (original podcast; transcript; emphasis mine):

Roger: One last question here. What’s Oracle going to do to make Java successful on the desktop?

Adam: …another way is this strategy we have around running Java inside of a JavaScript environment. So there the programming language is Java, but the platform is not a JVM platform.

This is a little bit of a scary thing, honestly, for Oracle, because while the language is something we know and love, a lot of the value we have comes from the stack underneath the JVM, the library, and so on and so forth.

But we think it’s something that we need to do for the community so we can make Java available more places from tablet devices like the iPad where there is not an easy way to get Java there today, to desktops where, while there are applets, some people choose not to use applets, and we want a solution that works there.

Trying to pull meaning from a brief statement like this is a dangerous thing to do, but it sounds like Oracle is working on a way to use Javascript as a compilation target.  It’s anyone’s guess whether this is Oracle’s stab at something akin to GWT…or, perhaps their objective is more along the lines of Orto (an apparently defunct implementation of the JVM in Javascript, as reported by John Resig in 2008, which would allow one to cross-compile existing Java libraries to Javascript – a far more interesting prospect, IMO).

I try to stay on the ball when it comes to developments in the JVM space, but I’ve never heard of an Oracle effort along these lines, and I can’t find anyone else talking about it online.  Am I just not looking in the right places, or is Mr. Messinger’s comment really the first public mention of the topic?

All my methods take 316 arguments, and I like it that way

Of course, I’m not so daft as to say that, but:

If you use an imperative programming language that provides for mutable state, that’s what you are saying.

For some background, I read this article yesterday, which contains this choice passage (emphasis mine):

Imagine you’ve implemented a large program in a purely functional way. All the data is properly threaded in and out of functions, and there are no truly destructive updates to speak of. Now pick the two lowest-level and most isolated functions in the entire codebase. They’re used all over the place, but are never called from the same modules. Now make these dependent on each other: function A behaves differently depending on the number of times function B has been called and vice-versa.

In C, this is easy! It can be done quickly and cleanly by adding some global variables. In purely functional code, this is somewhere between a major rearchitecting of the data flow and hopeless.

A comment on proggit very concisely summed up just how crazy the above passage is:

Considering that one of the major reasons to use FP is so that you don’t have such inter-dependencies, it’s odd to point that out as an issue.

The whole problem with imperative programming is that state gets threaded everywhere, and you can’t look at any function individually and know how it will behave. I won’t even go into problems associated with concurrency, where state becomes incredibly difficult to reason about if you allow that sort of thing.

I really appreciated the notion of imperative programming “threading state everywhere”. Let’s drive the point home, though.

Hey, I’m just the messenger

Consider a method you might see in any Java application (I oh-so-love the jvm, so I get to pick on Java), but the same sort of thing applies in C, C++, C#, python, ruby, perl, et al.:

public void doSomething (String arg1, int arg2, FooBar arg3) throws IOException;

Simple enough, right? Hey, we’re programming, life is good. But, what if you saw a signature like this:

public void doSomething (String arg1, int arg2, FooBar arg3, .....,
                         String arg316) throws IOException;

316 arguments to a method (which isn’t actually possible in the jvm, where methods are limited to 255 parameters, but bear with me)? “That’s absurd!”, you’d say. The problem, of course, is that the 3-arg doSomething actually has far more arguments than its signature implies:

The behaviour of every function in a mutable, imperative environment is dependent upon the state of all of the other (variables|attributes|bindings|whatever) in your program at the time the function is invoked.

So, if you have 313 other variables in your program, that 3-arg doSomething is functionally (ha!) operating over 316 arguments.
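Here’s a contrived Java sketch of what I mean (all names are made up for illustration): doSomething declares three parameters, but its behaviour also depends on two pieces of mutable state that appear nowhere in its signature.

import java.util.Locale;

public class HiddenArguments {
    private static int invocationCount = 0;   // hidden "argument": how many times has this been called?
    private static Locale locale = Locale.US; // hidden "argument": who set this last, and when?

    public static String doSomething(String arg1, int arg2, boolean flag) {
        invocationCount++;
        // The result depends on arg1, arg2, and flag... and also on invocationCount
        // and locale, neither of which appears in the signature.
        if (flag && invocationCount % 2 == 0) {
            return arg1.toUpperCase(locale) + ":" + arg2;
        }
        return arg1 + ":" + arg2;
    }

    public static void setLocale(Locale l) { locale = l; }  // anyone, anywhere can flip this

    public static void main(String[] args) {
        System.out.println(doSomething("report", 42, true)); // report:42
        System.out.println(doSomething("report", 42, true)); // REPORT:42 -- same arguments, different result
    }
}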

Would you ever intentionally write a method signature that takes 316 arguments? Would you use any library that contained such a function signature? No? Then why are you using tools that force such craziness upon you?


Of course, there is a place for mutable, imperative programming. The fellow who wrote the blog post to which I linked above appears to work on games, one of the few places where one could unapologetically use an imperative programming language with mutable state. Update: Looks like the state-of-the-art in game programming is heading towards FP languages more than I thought. Thanks to this comment, here’s a LtU thread, with slides, about the guys who wrote Gears of War and the Unreal engine recommending FP as the future of game development.

However, we need to collectively get past encouraging other software developers – the vast majority of whom do not have the particular requirements of game, systems, or embedded development – to inflict the pain of imperative languages and mutable state upon themselves, especially given the concurrency challenges that lie ahead (never mind the general problems such environments present, as I argue above). The languages are ready, the runtimes are widespread…let’s stop doing it wrong.

Java is dead, but you’ll learn to love it

A favorite hobby-horse among various programming-related communities is to talk about why “Java is dead”, and further, that programmers working in the Java ecosystem should really look for greener pastures elsewhere.  You see these sorts of posts pop up on proggit, for example, often enough for it to get old.  That’s a lot of hot air, with plenty blowing in the other direction from various folks that have been pushing hard for significant improvements and changes to Java. Both sides are wrong, though, because as a result of its success and a series of historical accidents:

Java-the-language is dead.

Get over it, and realize that because of that fact, you’ll probably come to depend upon Java more than you ever thought possible.

The JVM is probably one of the most vibrant platforms for developing new programming languages there is, in part because of the status of Java-the-language.

First, let’s settle the premise. In comments on one of his recent blog posts, Joe Darcy, one of the fellows who heads up Sun’s management of the JVM and JDK (I’m not sure of his exact title and portfolio), said a couple of key things about the never-ending saga regarding closures in Java:

There are millions upon millions of Java developers who would have to learn about closures if they were added in the platform.

…there is far from unanimity in the Java community on the underlying choice of whether or not closures would be an appropriate language change for Java at this time.

OK, there it is, closures are never going to be added to the Java language.  Done, and done.  And if closures aren’t going in, then you can surely bet that other things aren’t going to make it, either.  To further make the point, Joe commented on an earlier blog post of his [1], saying in reference to a question about why the Java standard libraries don’t slough off deprecated APIs:

To date, we have valued continued binary compatibility with code calling the deprecated elements more than cleaning up the API.

This sort of stuff pisses a lot of people off, and leads others to propose mildly absurd things IMO, like forking the Java language into “stable” and “experimental” versions. This is a lot of wasted effort.

It seems that Sun decided long ago, through pressure from its customers and developers, that compatibility is more important than innovating at the language level. With that, managing Java and the JDK became more an exercise in stewardship than anything else. The quotes above from an authoritative source are proof-positive that this is the case.

That may make the Java language dead with regard to features, but it’s hardly useless – it’s simply transitioned to be the stable “systems language” for the JVM that a large swath of programmers (who Sun likely correctly identifies as being uninterested in things like closures, syntactic improvements, etc. etc.) happen to use for applications as well.

Trading off “progress” for stability bestows upon Java at least two characteristics that are shared by other systems languages:

  • screaming into the void about how improvements and changes should be made yesterday is generally pointless and irrelevant
  • knowing that the language is essentially fixed for years to come means that it fades into the background as a very useful artifact for those that want to build on top of a system with well-known characteristics

A side effect of this is that the JVM is a very fertile spot for new(er) languages, where language implementers don’t have to worry about their building blocks being taken away or changed radically from year to year [2]. At the same time, the JVM itself has been getting tweaked and tuned heavily under the covers to support non-Java languages, not the least of which is Sun’s JavaFX, their entry into the post-Java JVM language fray [3]. So, you want your fork of Java that pushes boundaries? They are many and plentiful, so go choose one, already.

The upshot of all this is that it’s more likely than not that over the course of the coming years, your life (and quite likely your professional life as well, if you’re involved in software) will come to rely upon Java, the JVM behind it, and many different other language stacks built on one or both of those technologies.

Of course, interop between these languages is a concern: only APIs matching Java’s binary signatures are accessible by all languages, there’s no standard interface for closures, there’s no standard (sane) numeric tower, etc. etc. These things are frustrating if one happens to be working in a polyglot environment, but I’ve no doubt that necessity will draw the larger players in the JVM language space together to establish certain baselines to ensure interoperability.

In the end, we might have all been better off if the current state of affairs had arrived years ago. A steady drip, drip, drip of Java language improvements serves only to keep developers tethered to what is functionally a frozen language, and away from superior alternatives (on the same JVM platform!), if they’re so inclined to look up from their work. Since the state of play vis-à-vis Java-the-language is clear, maybe those that care so deeply about programming language productivity, innovation, and progress can set about enjoying the advantages of the future that Java has ensured for us all.

1 I don’t mean to pick on Joe, BTW. He just happens to have been relatively visible of late, in conjunction with his appearance on the Java Posse podcast, as well as in various chatter around the recent JVM Language Summit.

2 The reality is that if you’re a language implementer (or an aspiring one), you have two platforms to choose from, the JVM or the CLR, and it’s worth noting that the former appears to be outpacing the latter in terms of attracting innovation in language design. There’s a lot one can attribute that to, but having an essentially fixed baseline language (e.g. not what C# is at all) might be a minor contributing factor.

3 Worth noting is the fact that JavaFX has oodles of features that people have been banging on about for Java to get for years and years. This is further verification that Sun’s reluctance to evolve Java-the-language has nothing to do with its technical capabilities or general motivations, but with decisions about Java’s status driven by business considerations.

Memory-mapping Files in Java Causes Problems

Today, we released PDFTextStream v2.0.1, a minor patch release that contains a workaround for an interesting and unfortunate bug: on Windows, if one accesses a PDF file on disk using PDFTextStream, then closes the PDFTextStream instance (using PDFTextStream.close()), the PDF file will still be locked. It can’t be moved or deleted.

This is actually not a bug in PDFTextStream, but in Java, documented as Sun bug #4724038. In short, any file that is memory-mapped cannot reliably be “closed”: the `DirectByteBuffer` (or some native proxy, perhaps) that holds the OS-level file handle does not release those resources, even when the `FileChannel` that was used to create the `DirectByteBuffer` is closed. Reading the comments on that bug report shows a great deal of frustration, and rightly so: regardless of the technical reasons for the behavior, memory-mapping files isn’t rocket science (or, hasn’t been for 20 years or somesuch), and this kind of thing shouldn’t happen.
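If you want to see the behaviour for yourself, a minimal sketch looks like this (the file name is hypothetical; run it on Windows and the final delete will typically fail, even though everything that can be closed has been closed):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapLockDemo {
    public static void main(String[] args) throws Exception {
        File f = new File("sample.pdf");  // any existing file will do
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        FileChannel channel = raf.getChannel();

        // Map the whole file into memory.
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        System.out.println("first byte: " + buf.get(0));

        // Close everything we're allowed to close...
        channel.close();
        raf.close();

        // ...but there is no unmap(); the OS-level handle is only released when the
        // MappedByteBuffer is garbage collected (see Sun bug #4724038).  On Windows,
        // this delete typically fails while the mapping is still live.
        System.out.println("deleted? " + f.delete());
    }
}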

Since we can’t fix the bug, we devised a workaround: if you set the `pdfts.mmap.disable` system property to `Y`, then PDFTextStream won’t memory-map PDF files. Simple enough fix. FYI, there appears to be no performance degradation associated with using PDFTextStream in this mode.
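So, before any PDFTextStream instances are created (or, equivalently, pass -Dpdfts.mmap.disable=Y on the java command line):

// tell PDFTextStream not to memory-map PDF files
System.setProperty("pdfts.mmap.disable", "Y");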

Of course, this is only a problem on Windows, which does not allow files to be moved or deleted while a process has an open file handle. We have a number of customers that deploy on Windows Server (although that number is much smaller than those that deploy on a variety of *nix), but until last week, they hadn’t reported any problems. Our best guess is that, given the systems we know those customers are running, they are probably using PDFTextStream’s in-memory mode (where PDF data is in memory, and provided to PDFTextStream as a `ByteBuffer`). Of course, in that case, no file handles are ever opened, so all is well.

Working Together: Python and Java, Open Source and Commercial

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype, and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s no problem for PDFTextStream — we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That meant that, if you had a PDF file in memory in Python and wanted to use PDFTextStream’s in-memory extraction capability, JPype would make a copy of that PDF file data before passing it off to the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc.). This was no simple task, so we contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had sketched out the idea of exposing Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and we both assumed that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license, to be enjoyed by the rest of the JPype community.

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new mac mini, plus whatever else he felt like buying with his hard-earned cash

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.