Python, Growth, and Sandboxes

Well, I sure did step in it.

Consider: up until last week, I was simply using this space every now and then for some relatively bland navel-gazing related to selected goings-on at Snowtide. Then, a friend of mine decided to put my most recent post (probably the only potentially inflammatory post I’ve ever written) on reddit, and a variety of people weren’t very happy (in comments to the post itself, on reddit’s comment page, and to a lesser extent on a Joel On Software thread). For someone who can lay only a tenuous claim to being a blogger (never mind the title of A-, B-, C-, or D-list blogger!), it’s been an interesting experience to say the least.

I tried to participate in the discussions that were swirling around, but eventually the comments became too numerous for me to follow in a timely way given the amount of bandwidth I’ve allocated to such things. So, I’m taking the easy/cheap way out with a response post. I know this is frowned upon by many, but c’est la vie.  Here, I will respond in two parts:

  1. Python and the Growth “Problem”
  2. Sandbox Etiquette

Python and the Growth “Problem”

In reading over all of the commentary, there seem to be three types of responses:

Response Type A: Any lack of growth/”innovation” or a slowing of such growth in Python is good — stability makes it easier to concentrate on customer solutions, and encourages robust library development.

Regardless of your language or platform, if stability and operational continuity are overriding interests of yours, then lock yourself into a particular build, and stay there as long as you want. This is a significant part of the job of IT departments in large organizations – to standardize on environments and tools so as to shield the organization from unwanted change and cost.

(As an aside, Ruby’s Matz provides a positive spin on the “Python is stable, and that’s good” attitude, which may or may not be cheeky [it’s hard to tell through the translation]: “Perhaps Python has a sense of responsibility.”)

Response Type B: The “significant improvements” I’d like to see in Python are (take your pick): of academic use only; overhyped genius toys that only make it easier to build overly complex solutions; or distractions from other improvements that would be immediately useful to the majority of the Python userbase.

This attitude pops up frequently in any discussion of programming paradigms that are off the beaten track, any technique that is unfamiliar to the commenter, or anything that the commenter has had problems with in the past. Meek typified this kind of response with:

Python is not growing because you want programmable syntax and “esoteric” features? Features that 99% of software developers should never use. Let me guess, you have never maintained a project written in a language that supports programmable syntax where geniuses abuse meta-programming where simpler alternatives achieve the same goal.

This is a particularly disturbing line of thought, and one that I had always considered to be antithetical to a central principle of Python (at least in my eyes), that the programmer should always be trusted. I’ve always associated this with a variety of Python features, including duck typing, the lack of access controls around class members (modulo the slightly perverse double-underscore notation and associated name mangling of “private” attributes), the composability of namespaces, etc.
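For anyone who hasn't bumped into these features, here is a minimal sketch of two of them: duck typing, and the name mangling behind double-underscore "private" attributes. The class and function names (Duck, Robot, Account, announce) are made up purely for illustration.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    """Unrelated to Duck; it simply happens to offer speak() as well."""

    def speak(self):
        return "Beep"


class Account:
    def __init__(self, balance):
        # The leading double underscore triggers name mangling:
        # the attribute is actually stored as _Account__balance.
        self.__balance = balance


def announce(thing):
    # Duck typing: no isinstance() check, just call the method we need.
    return thing.speak()


sounds = [announce(x) for x in (Duck(), Robot())]
print(sounds)

acct = Account(100)
# The unmangled name is not visible from outside the class...
print(hasattr(acct, "__balance"))
# ...but nothing stops a determined caller from using the mangled one.
print(acct._Account__balance)
```

The mangling only renames the attribute; it is a convention with a speed bump attached, not an access control, which is the "trust the programmer" attitude in miniature.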

Lots of programming features are “esoteric”, depending on who you ask. Pointer access is esoteric to a web app developer and should never be used in such a context, but it’s critical to a C-language device driver programmer. Any number of language features can simultaneously be considered esoteric by some and necessary by others. Not recognizing this, and then implying that “simpler alternatives” could readily take the place of those “genius” toys is evidence of a lack of perspective. 28/w in the JOS thread makes my point better than I ever could:

It’s precisely because I want my projects to be on time that I don’t use assembly language for everything. That’s the same reason I don’t use C++ either. I’m about 5x as productive in Ocaml as I am in C++ i.e., at least 80% of my time spent coding C++ is spent dealing with language issues; it’s the equivalent of spending time making all my function calls out of gotos.

Most likely, 80% of my time coding Ocaml is wasted too, and I just don’t know it.

Bottom line: just because you don’t see a use for a particular language feature doesn’t mean that someone else doesn’t find it absolutely, positively necessary.

Response Type C: Python is growing, and if you were to pay attention, you’d notice. We’re just not working on what you want.

This point has been made by a variety of people, but I should give special attribution to Phillip J. Eby, since he’s a significant Python contributor:

Um, so you don’t think the “with” statement and coroutines were new features?

What about the new metaclass hook that’ll be in Python 3.0 (and maybe 2.6)? It’s actually a pretty significant step forward for implementing Ruby-like DSL’s in Python.

I suppose this is the nut of the problem, at least as far as this discussion has related specifically to the technical aspects of Python: I’m not bowled over by the improvements Phillip cites.  They’re very useful and handy to the vast majority of Python programmers, but they’re not game-changers (which I suppose is what I meant by “significant growth”). I think the description of the metaclass hook as “a pretty significant step forward for implementing Ruby-like DSL’s in Python” is very telling. The facilities for building DSLs in Ruby are good in so far as they make it possible to get the job done, but they’re by no means conceptually complete nor functionally clean (as pointed out by jerf in the reddit comments), so taking a “significant step” towards implementing such facilities isn’t the whole ballgame.
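To make the first two features Phillip mentions concrete, here is a small sketch of the "with" statement and of a generator-based coroutine driven through send(). It is written in current Python syntax rather than what was available at the time, and the names (tagged, running_average) are hypothetical, chosen only for this example.

```python
from contextlib import contextmanager


@contextmanager
def tagged(log, name):
    """Context manager: record entry and exit around a block of work."""
    log.append("enter " + name)
    try:
        yield log
    finally:
        # Runs whether the block finishes normally or raises.
        log.append("exit " + name)


def running_average():
    """Coroutine: receives numbers via send() and yields the mean so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count


events = []
with tagged(events, "job"):
    events.append("working")

averager = running_average()
next(averager)  # advance to the first yield so send() can be used
results = [averager.send(v) for v in (10, 20, 30)]

print(events)
print(results)
```

Both are tidy conveniences for resource handling and for passing values into a running generator, which is roughly why I call them useful and handy rather than game-changing.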

Regardless of that detail, the point is that progress is being made in Python — just not in the vector I need. And, that’s OK. Which brings me to…

Sandbox Etiquette

After all has been said and done, my original post was a mistake, in that I exhibited a similar type and degree of technological selfishness as those who replied with Type A responses.  As some of my friends will attest, I’ve personally been unhappy with Python and its direction for a variety of reasons for months now, especially as I’ve sunk further and further into a class of problems for which Python isn’t particularly well-suited at the moment.  While I had settled on that conclusion some time ago, I’ve obviously been suffering from a mental block that caused me to do drive-bys against Python.  This came to a head with my blog post.

The more mature (and zen) thing to do would have been to simply go looking for a different sandbox, and leave well enough alone with regard to Python.  (It is, after all, a fantastic language and will likely remain my favorite for most common tasks [especially web programming] for some time hence).  This is especially true given the fact that I am essentially a nobody in the Python community – I’ve contributed in my own small ways, but it’s not like I’m a core hacker or important library author.  Instead, I adopted the Response Type A attitude, but flipped it on its head, claiming that my favorite language should advance itself to suit my requirements, and to hell with the priorities of others.

So, let’s make a deal: I’ll stop sniping on Python, and maybe everyone else can stop making clever comments about “esoteric” language features.  Then we can all spend more time building bigger and better sandcastles.

Python 3 and Growth (or the lack thereof)

Paul Bissex just posted a simple three-step procedure for how one might become acquainted with the changes coming in Python 3 (née Python 3000). The mere mention of Python 3 prompted me to start writing a comment for Paul’s post, but it went on long enough that I figured it wiser to post here.

In understanding Python 3, I think it’s equally important (especially for those of us who might not walk on the beaten path with regard to domains, types of apps, etc) to review PEP 3099, which outlines what won’t be included in Python 3.

Personally, reading PEPs 3099 and 3100 (which Paul references in his post) is depressing. I’ll explain why by example and contrast. Consider this post from GvR from about a year ago:

But please save your breath. Programmable syntax is not in Python’s future — or at least it’s not for Python 3000. The problem IMO is that everybody will abuse it to define their own language. And the problem with that is that it will fracture the Python community because nobody can read each other’s code any more.

This is predictable, especially given GvR’s prior comments on various topics surrounding more “esoteric” features like multiline anonymous functions, operator overloading, etc., etc., etc. However, compare and contrast that to this slideshow from Ruby’s Matz from about two years ago. I don’t have a money quote, but the difference in attitude and approach is staggering. There’s a line in Annie Hall (I think, or some other Woody Allen movie) where Woody’s character says that relationships are like sharks: they must continue to move forward in order to survive.

Now, I’m not saying that Python is dying somehow — far from it. However, I think it’s safe to say that it has stopped growing (and hasn’t grown significantly since v2.2 with the great class/type unification). Meanwhile, Ruby is “a white-hot nexus of innovation”, according to Tim Bray, anyway; and really, anyone who knows and cares about what Python is lacking in terms of expressiveness and capability would agree.

At this point, you might fairly assume that I’m some kind of Ruby booster, but in fact, about 90% of the coding I do these days is in Python, and that has been the case for almost 3 years. And as much as I enjoy working in Python, it has not (and looks like it will not) grow along with the problems that I need to solve.

Part II: The feedback to this post has been significant, in comments here, on reddit, by email, and elsewhere on the net. You can read my summary follow-up here.

Introducing jsdifflib

Note: jsdifflib is now on github.

I’d like to introduce jsdifflib, an in-browser visual diff tool and library.

In the process of building a new web-based document-centric service, it became clear that I needed a good in-browser visual diff tool. I’ve become friends with a number of desktop “thick client” diff tools over the years, but the interface to this new service is 100% through the browser, and all those old friends aren’t amenable to diffing web-based resources.

Some web searches didn’t turn up anything particularly promising. I was looking for an in-browser diff tool, preferably in Javascript (but I suppose Flash would have done the trick, too). I found a few not-so-great Java applets that would do the bare minimum, but nothing ideal. There were a few javascript diff algorithm implementations (like this), but nothing that could be considered a complete solution.

So, I built jsdifflib over a weekend in February of 2007.

I hope you find jsdifflib useful. On its page, you’ll find some more background information, implementation details, examples, a live demo, and free downloads (with a BSD license).
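To give a feel for the gap between a bare diff algorithm and a visual diff, here is a rough sketch in Python. It is not jsdifflib's API; it uses Python's standard difflib module as a stand-in, and the side_by_side helper and the sample lines are invented for illustration. The algorithm yields edit operations, and the "visual" part is the extra step of turning them into aligned left and right rows.

```python
import difflib


def side_by_side(old, new):
    """Turn diff opcodes into (tag, left, right) rows for a two-column view."""
    rows = []
    matcher = difflib.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = old[i1:i2]
        right = new[j1:j2]
        # Pad the shorter side with blanks so both columns stay aligned.
        height = max(len(left), len(right))
        left += [""] * (height - len(left))
        right += [""] * (height - len(right))
        for l, r in zip(left, right):
            rows.append((tag, l, r))
    return rows


old = ["alpha", "beta", "gamma"]
new = ["alpha", "beta2", "gamma", "delta"]

rows = side_by_side(old, new)
for tag, left, right in rows:
    print(f"{tag:8} | {left:6} | {right:6}")
```

A real in-browser tool still has to render those rows as markup, handle context folding, and so on, which is the part the standalone algorithm implementations leave out.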

Working Together: Python and Java, Open Source and Commercial

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best-practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype, and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not a problem with PDFTextStream; we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.
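The copy-versus-reference distinction at the heart of this work can be illustrated without any Java at all. The sketch below is an analogy in pure Python, not JPype code: bytes() stands in for the copying behavior, and memoryview stands in for handing the other side a view onto the same memory. The sample data is obviously fake.

```python
# A pretend in-memory PDF; bytearray is used so the buffer can be modified.
pdf_bytes = bytearray(b"%PDF-1.4 pretend document body")

# Copying: a second, independent buffer of the same size is allocated.
copied = bytes(pdf_bytes)

# Sharing: a view onto the original buffer, with no data duplicated.
shared = memoryview(pdf_bytes)

# Change the original to show which of the two is tied to it.
pdf_bytes[0:4] = b"%XYZ"

print(copied[:4])         # the copy still holds the old header
print(bytes(shared[:4]))  # the view reflects the change
print(shared.obj is pdf_bytes)
```

For one small buffer the difference is academic; multiplied across millions of multi-megabyte documents per month, avoiding that extra allocation and copy is what made the approach viable.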

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new Mac mini, plus whatever else he felt like buying with his hard-earned cash

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.