PDFTextStream now available free (as in beer)

PDFTextStream v2.6.0 was released today with a variety of small new features and a couple of bugfixes.  The bigger change is that PDFTextStream is now available free for use in single-threaded applications.

Because of the realities of the economics around developing and maintaining a product like PDFTextStream, its pricing has often been out of reach of many projects and very small organizations that really need high-quality PDF content extraction functionality.  That’s not to say that PDFTextStream is overpriced — it’s actually less expensive than other options — but that is small comfort to many that simply cannot afford or cannot justify the expenditure yet.

This change should fix that: if you have a smaller project, are working on a startup, are involved in information research, etc., you can now benefit from all that PDFTextStream has to offer.  And, if and when your architecture requires concurrent PDF processing, or your PDF content extraction workload is large enough to need to worry about properly utilizing your hardware and compute resources, you can easily upgrade to the unlimited, licensed “edition” of PDFTextStream to parallelize that workload.

It will be fun to see what people build now that PDFTextStream is gratis. Try it out!

On the stewardship of mature software

I just flipped the switch on v2.5.0 of PDFTextStream.  It’s a fairly significant release, representing hundreds of distinct improvements and bugfixes, most in response to feedback and experiences reported by Snowtide customers.  If you find yourself needing to get data out of some PDF documents, you might want to give it a look…especially if existing open source libraries are falling down on certain documents or aren’t cutting it performance-wise.

But, this piece isn’t about PDFTextStream, not really.  After prepping the release last night, I realized that PDFTextStream is ten years old, by at least one reckoning: though the first public release was in early 2004, I started the project two years prior, in early 2002, ten years ago. Ten years.

It’s interesting to contemplate that I’m chiefly responsible for something that is ten years old, that is relied upon by lots of organizations internally, and by lots of companies as part of their own products.  Aside from the odd personal retrospectives that can be had by someone in my situation (e.g. friends of mine have children that are around the same age as PDFTextStream; am I better or worse off having “had” the latter when I did instead of a son or daughter?), some thought has to be given to what the longevity and particular role of PDFTextStream (or, really, any other piece of long-lived software) implies and requires.

I don’t know if there are any formal models for determining the maturity of a piece of software, but it seems that PDFTextStream should qualify by at least some measures, in addition to its vintage.  So, for your consideration, some observations and opinions from someone that has their hand in a piece of mature software:

Mature software transcends platforms and runtimes

PDFTextStream is in production on three different classes of runtimes: all flavours of the JVM, both Microsoft and Mono varieties of the .NET CLR, and the CPython implementation of Python.  This all flows from a single codebase, which reminds me of many kinds of mature systems (sometimes referred to as “legacy” once they’re purely in maintenance mode — a stage of life that PDFTextStream certainly hasn’t entered yet) that, once constructed, are often lifted out of their original runtime/platform/architecture to sit on top of whatever happens to be the flavour of the month, without touching the source tree.

Often, the effort required to make this happen simply isn’t worth it; the less mature a piece of software is, the easier it is at any point to port it by brute force, e.g. rewriting something in C# or Haskell that was originally written in Java.  This is how lots of libraries made the crossing from the JVM to .NET (NAnt and NHibernate are two examples off the top of my head).

However, the more mature a codebase, and the more challenging the domain, the more unthinkable such a plan becomes. For example, the prospect of rewriting PDFTextStream in C# to target .NET — or, if I had my druthers, rewriting PDFTextStream in Clojure to satisfy my geek id — is absolutely terrifying.  All those years of fixes and tweaks in the PDFTextStream sources…trying to port all of them to a new implementation would constitute both technical and business suicide.

In PDFTextStream’s case, going from its Java sources to a .NET assembly is fairly straightforward given the excellent IKVM cross-compiler.  However, there’s no easy Java->Python transpiler to reach for, and a bytecode cross-compiler wasn’t available either.  The best solution was to invest in making it possible to efficiently load and use a JVM from within CPython (via JNI).  With that, PDFTextStream, derived from Java sources, ran without a hitch in production CPython environments. Maybe it was a hack, but it was, in relative terms, easier and safer than any alternative, and had no downsides in terms of performance or capabilities.

(I eventually nixed the CPython option a few years ago due to a lack of broad commercial interest.)

Thou shalt not break mature APIs

When I first started programming in Java, I sat aghast in the ominous glow of java.util.Date. It was a horror then, and remains so. The whole thing has been marked as deprecated since 1997; and, despite the availability of all sorts of better options, it has not been removed from the standard library.  Similar examples abound throughout the JRE, and all sorts of decidedly mature libraries.
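The java.util.Date situation is easy to see in miniature. A small illustration (my own example, not from any particular codebase): the deprecated pieces still compile and run today, decades on, right alongside the Calendar replacement that shipped with the deprecation.

```java
import java.util.Calendar;
import java.util.Date;

public class DateDemo {
    public static void main(String[] args) {
        // Deprecated since JDK 1.1 (1997), but never removed: the year is
        // offset from 1900 and the month is zero-based.
        @SuppressWarnings("deprecation")
        Date d = new Date(112, 0, 15); // January 15, 2012

        // The replacement that arrived alongside the deprecation:
        Calendar c = Calendar.getInstance();
        c.clear();
        c.set(2012, Calendar.JANUARY, 15);

        System.out.println(d.getYear() + 1900); // prints 2012
        System.out.println(c.get(Calendar.YEAR)); // prints 2012
    }
}
```

Both lines print the same year; the broken API and its successor have coexisted in the standard library for the JRE’s entire adult life.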

For some time, I attributed this to sloth, or pointy-haired corporate policies, or accommodation of such characteristics amongst the broad userbase, or…god, I dunno, what are those guys thinking? In the abstract, if the physician’s creed is to “do no harm”, it seems that the engineer’s should be “fix what’s broken”; so, continual improvement should be the law of the land, API compatibility be damned.

Of course, it was naïve for me to think so.  Brokenness is often in the eye of the beholder, and formal correctness is a rare thing outside of mathematics.  Thus, the urge one has to “make things better” must be tempered by an understanding of the knock-on effects for whoever is living downstream of you.  In particular, while making “fixes” to APIs that manifest breaking changes — either in terms of signatures or semantics — might make you feel better, there are repercussions:

  • You’ll absolutely piss off all of your customers and users.  They had working code that now doesn’t work. Whether you are charging them money or benefiting from their trust, you are now asking them to take time out of their day to help you feel better about yourself.
  • Since their code is broken already, your customers and users might see this as the perfect opportunity to make their own changes to not have to cope with your self-interested “fixes” anymore.  Surely you can imagine the scene:

    Sarah: “Hey Gene, the new version of FooLib changes the semantics of the Bar(string) function. Do you want me to fix it now?”

    Gene: “Sheesh, again? Well, weren’t you looking at BazLib before?”

    Sarah: “Yeah; BazLib isn’t quite as slick, but Pete over in Accounts said he’s not had any troubles with it.”

    Gene: “I’m sold. Stick with the current version of FooLib for now, but next time you’re in that area of the code, swap it out for BazLib instead.”

This is why semantic versioning is so important: when used and understood properly, it allows you to communicate a great deal of information in a single token.  It’s also why I can often be found urging people to make good breaking changes in v0.0.X releases of libraries, and why PDFTextStream hasn’t had a breaking change in 6 years.
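The convention compresses down to almost nothing in code. Here’s a hypothetical sketch (names and structure are mine, not any particular semver library): under MAJOR.MINOR.PATCH, only a MAJOR bump signals possible breakage, while anything at all may break below v1.0.0.

```java
public class SemVerSketch {
    // Returns true if upgrading from 'from' to 'to' may contain breaking
    // changes under semantic versioning: any MAJOR bump, or any change at
    // all while MAJOR is 0 (the pre-1.0 "anything goes" zone).
    static boolean mayBreak(String from, String to) {
        int fromMajor = Integer.parseInt(from.split("\\.")[0]);
        int toMajor = Integer.parseInt(to.split("\\.")[0]);
        if (fromMajor == 0) return !from.equals(to);
        return toMajor > fromMajor;
    }

    public static void main(String[] args) {
        System.out.println(mayBreak("2.5.0", "2.6.0")); // prints false
        System.out.println(mayBreak("2.6.0", "3.0.0")); // prints true
        System.out.println(mayBreak("0.0.3", "0.0.4")); // prints true
    }
}
```

That first comparison is the whole point: a downstream user can see “2.5.0 → 2.6.0” and know, without reading a changelog, that their working code stays working.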

Of course there are parts of PDFTextStream’s API that I’m not super proud of; I’ve learned a ton over the course of its ten-year existence, and there are a lot of things I’d do differently if I knew then what I know now.  However, overall, it works, and it works very well, and it would be selfish (not to mention a bad business decision) to start whacking away at changes that make the API aesthetically more pleasant, or of marginally higher quality, but which make customers miss a beat.

It seems to me that a good guideline might be that any breaking change needs to be accompanied by a corresponding 10x improvement in capability in order to be justifiable.  This ties in well with the notion that a product new to the market must be 10x better than its competition in order to win; insofar as a new version of the same product with API breakage can be considered as foreign as competing products, that new version is a new product.

Managing risk is Job #1

If your hand is on the tiller of some mature software — or, some software that you would like to see live long enough to qualify as mature — your first priority at all times is to manage, a.k.a. minimize, risk for your users and customers.

As Prof. Christensen might say, software is hired to do a job.  Now, “managing risk” isn’t generally the job your software is hired to do, e.g. PDFTextStream’s job is to efficiently extract content from any PDF document that is thrown at it, and do so faster and more accurately than the other alternatives.  But, implicit in being hired for a job is not only that the task at hand will be completed appropriately, but that the thing being hired to do that job doesn’t itself introduce risk.

The scope of software as risk management is huge, and goes way beyond technical considerations:

  • API risk, as discussed above in the “breakage” section
  • Platform risk. Aside from doubling the potential market for PDFTextStream, offering it on .NET in addition to the JVM serves a purpose in mitigating platform risk for our customers on the JVM: they know that, if they end up having to migrate to .NET, they won’t have to go find, license, and learn a new PDF content extraction library.  In fact, because PDFTextStream licenses are sold in a platform-agnostic way, such a migration won’t cost a customer of ours a penny.  Of course, the same risk mitigation applies to our .NET customers, too.
  • Purchasing risk. Buying commercial software outside of the consumer realm can be a minefield: tricky licensing, shady sales tactics, pricing jumping all over the map (generally up), and so on.  PDFTextStream has had one price increase in eight years, and its licensing and support model hasn’t changed in six.  Our pricing is always public, as is our discount schedule.  When one of our customers needs to expand their installation, they know what they’re getting, how much it’s going to cost, and how much it’ll cost next time, too.

Even if one is selling a component library (which PDFTextStream essentially is), managing risk effectively for customers and users can be a key way to offer a sort of a whole product.  Indeed, for many customers, managing risk is something that you must do, or you will simply never be hired for that job, no matter how well you fulfill the explicit requirements.

Launching DocuHarvest – Turning documents into data

I’m happy today to introduce:

Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved.

This may seem strange to hear coming from me: you may or may not know that I’ve been principally involved in selling PDF content extraction software for the past six years.  Over that time, I’ve had the opportunity to come face-to-face with hundreds of content and data extraction challenges across dozens of industries. If there’s one takeaway I can offer up from that experience, it’s this:

No one cares about the process of data extraction: people only care about their data.

Seems simple enough, but ask anyone who’s been involved in any kind of data integration project, or tried to help a nontechnical user get useful data out of a directory full of documents, and you’ll know that people are forced to care. The situation is worse for e.g. small business owners and others that simply can’t afford additional software and the attendant consulting hours.

DocuHarvest is an alternative path: a web application that provides data extraction and content conversion services through the browser, usable by everyone, costing pennies per document processed. There’s even a free option, if you’re willing to process only one document at a time.

Available Now

We’re starting small, offering three types of document processing jobs:

  1. Extracting document metadata (date published, author, title, keywords, etc.),
  2. Conversion of documents to text, and
  3. Extraction of interactive PDF form data

DocuHarvest currently only accepts PDF documents as input, but that will change relatively soon – PDF just happens to be where we “come from”, so we’re rolling that out first. Support for additional file formats will come.

In addition, we have a variety of additional types of jobs in the pipeline, including:

  • conversion of documents to images (rasterization),
  • extraction of embedded images, and
  • thumbnail generation

That’s hardly a comprehensive list. This is just the beginning; we have a lot of tricks we’ve saved up over the course of those six years. :-)

I’d love it if you were to go to DocuHarvest.com now, and try it out. Remember, it’s free to try (and still free to use, one document at a time).

If you have any feedback, comments, questions, suggestions, or complaints, don’t hesitate to contact me; leave a comment below or in the feedback boxes on the DocuHarvest site, message me (@cemerick) or @docuharvest on Twitter, or email me directly.

Reducing purchase anxiety is a feature

Talk to anyone outside of the software world, and you’ll quickly realize that one of the most gut-wrenching, anxiety-inducing acts is buying software. Even if one has evaluated the product in question top to bottom, past experience of bugs, botched updates, missing features, and outright failures and crashes has tempered any enthusiasm or confidence that might be felt when the time comes to pull out the credit card or write the purchase order.

Of course, the blame for this lies squarely with the software industry itself – the failures in software quality are well known, both discrete instances as well as in aggregate. Those of us whose business and livelihood are tied to the sale of software (whether sent out the door or delivered as a service) must do whatever we can to reverse this zeitgeist.

Given that, we’ve decided to adopt a very simple, no-nonsense “Satisfaction Guaranteed” policy for PDFTextStream. Hopefully this will help take the anxiety out of someone’s day, somewhere.

This isn’t a new idea, of course. Lots of software companies have had guarantees of some sort or another for ages, but I think my first encounter with the concept as a business owner was Joel Spolsky’s post from a couple of years ago:

I think that our customers are nice because they’re not worried. They’re not worried because we have a ridiculously liberal return policy: “We don’t want your money if you’re not amazingly happy.”

Joel raised the issue again on a recent StackOverflow podcast, which prompted me to think about our own approach…

What do we do about unhappy customers?

To be honest, our customers are pretty happy. Of course, we occasionally receive a bug report, but we generally knock out patches within a couple of days, and sometimes faster. In the 5 years we’ve been selling PDFTextStream, we’ve never had a single request for a refund. Part of that is offering up a very liberal evaluation version, but I’d like to think it’s because what we sell does the job it’s meant to do very well.

Given that, I’ve never thought to make a big stink about a refund policy – it just never came up. But hearing Joel and Jeff talk about the ire that they felt towards various companies that refused to issue refunds when they weren’t happy with something motivated me to make our de facto policy explicit. Thus, the new “Satisfaction Guaranteed” statement.

Part II: the Open Source Influence

An elephant in the room is the influence of open source software on customers’ attitudes towards buying software, and the assessment of risk that goes along with it. As more and more users of technology (just to spread the net as widely as possible) are exposed and become accustomed to the value associated with open source software (which, in simple terms, is generally high because of its zero or near-zero price), it increases pressure on commercial vendors (like us) to up our game along the same vector.

But, the impact of open source software on pricing is a pretty stale story. The real impact is derivative, in that a zero or near-zero price means that the apparent risk associated with using open source software is zero or near-zero. The promise of proprietary, commercial software is that, if it does what the vendor claims (whatever that is), then that software will deliver benefits far in excess of its cost and far in excess of the aggregate benefit provided by the open source alternatives, even given the price differential.

The problem is that a lot of people only turn towards commercial options as a last resort because of the aforementioned historical failures of the software industry vis-à-vis quality: the apparent risk of commercial options is higher than that associated with open source options, simply because the latter’s super-low price is a psychological antidote to any anxiety about quality issues. So, there’s flight towards low-priced options, rather than a thorough search for optimal solutions. Injecting an explicit guarantee of performance and reliability (like our new “Satisfaction Guarantee”) might be enough to tip the relative apparent risk in favor of the commercial option – or, at the very least, minimize the imbalance so that it’s more likely that price won’t dominate other factors (which are potentially more relevant to overall benefits).

Of course, this can only work if one’s product is actually better than the open source alternatives, and by a good stretch to boot so as to compensate for the price differential. In any case, it’s a win-win for the formerly-anxious software user and buyer: they should feel like they have more choice overall, and therefore have a better chance of discovering and adopting the best solution for any given problem, regardless of software licenses and distribution models.

Activity is not Progress (or, ‘Did you really need to shave that yak’)

Anyone who is accountable for any sufficiently-complex objective is constantly having their focus being pulled away from that larger goal by a thousand different fiddly tasks. Christened as yak shaving some time ago by a fellow at the MIT media lab, the concept has become a favorite shorthand in various programming and software development circles. I only heard of it this year, but it’s helped to coalesce my thinking about focused work and the relationship between activity and progress.

In particular, I think it’s helpful to occasionally check one’s activity using what I’d call “root objective analysis”.

Many people in technical fields are familiar with root cause analysis, where a problem or failure is analyzed in such a way as to determine its root cause. There are lots of flavors of root cause analysis, with Five Whys being popular among programmers due to the Joel Effect and probably some loose association between Five Whys and the lean development/startup methodologies that are all the rage these days.

In contrast, root objective analysis runs in the “opposite direction”, so to speak: for any given activity, you trace the likely causal link between that activity you’re engaged in, and the progress you want to make. In short: “Is what you’re doing right now getting you closer to your end goal?”1 If you do this right, or at all, you’ll go down fewer dead-ends, waste less time, and prioritize the yaks you do shave so that you get to your desired end state sooner rather than later.

There’s obviously a lot of fuzziness in any kind of speculative analysis like this; if there weren’t, then project management would always bring jobs in on time and within budget. However, if your work often leads you far afield of your “main line” of focus, then asking yourself the question above from time to time may help you to ensure that every yak shaving you engage in is necessary, as opposed to a distraction caused by confusing activity for progress.

Yak shaving close to home

A yak shaving that is near and dear to my heart is the fable of the software developer and the PDF documents (not surprising, since we talk to a lot of developers who have worked with lots of PDF documents). There are many variations, but the most extreme goes something like this:

  1. Joe the developer needs to get some chunk of data into his company’s database (maybe it’s financial data, maybe he’s working with excerpts of academic journal articles – such details are mostly irrelevant)
  2. The data is only available in PDF documents, and there’s a lot of them. Thousands, perhaps millions of chunks of data in as many different PDF documents.
  3. Joe’s first thought is that he needs to build a function to extract text from these PDFs so that he can get at the data he needs.  But, after…
    • reading the 1,000+ page PDF specification,
    • adding support for the 8 different versions of the spec,
    • adding support for a half-dozen encryption protocols, and
    • adding support for extracting Chinese (or Japanese, or Korean, or Icelandic with its lovely ð (“eth”) character) along with the embedded fonts that go along with it
  4. …Joe now has spent nearly a year building a one-off PDF text extraction library that (again, depending on the version of the fable) fails on 24% of the documents his company needs to access, and still doesn’t run fast enough to finish in the batch window he has to work with.

Seriously, scouts-honor, I’ve heard this story at least 5 times…and each time right before or right after the developer/company in question purchased PDFTextStream to replace their homebrew PDF library. That, my friends, is activity without progress, yak shaving at its most epic.

1 Careful and clueful readers will recognize this as little more than a distilled version of OODA, the granddaddy of all decision-making formalisms.

Surprising Praise

I happen to work in a particular corner of the software industry that isn’t exactly the most happenin’ party zone.  Compared to whatever is “hot” at any point in time, extracting data from documents seems dull to most.  I’m not deterred though – quite the contrary, being able to deliver products into contexts where we have a big positive impact on the well-being of our customers’ organizations makes for a level of satisfaction that (I suspect) outpaces the quick, fleeting high of popularity.

That said, one downside of this is that many of our customers realize so much benefit from working with us that they are often very reluctant to allow us to discuss our relationships with them in public.  So, far more often than not, we can have the satisfaction of realizing that big impact on a customer’s organization, but are barred from talking about it.  Given that, we are doubly appreciative of those that have worked with us to develop case studies on how they use PDFTextStream.

But last week, I discovered that someone we’ve worked with in the past – Neil Gandhi, who was heavily involved with Zinio’s deployment of PDFTextStream – had added a recommendation to my LinkedIn profile.  Now, most recommendations are very pleasant to begin with, but Neil’s was so effusive and unsolicited that I thought I’d share it here (oddly, LinkedIn doesn’t show recommendations on the “public” profile pages, but if you log in and view my profile, you’ll see Neil’s comments):

I worked with Chas during my time at Zinio LLC, a digital publication company that specializes in online and offline delivery of digital magazines. At the time, we were implementing global search functionality but our PDF text extraction solution was really sub-par. We found Chas at Snowtide and worked with him and his team in implementing PDFTextStream; their PDF text extraction solution. We were also testing against a slew of other vendors and open-source solutions to find the best product based on accuracy and service. You can find the case-study here: http://snowtide.com/cs-zinio

Needless to say, PDFTextStream was by far the most accurate solution, but to my surprise, Chas and his team provided the best service a small company like Zinio could ask for. I never had to wait more than half a day for a response and the questions and requests were always answered with a can-do attitude. If something couldn’t be done, Chas always had the time to explain why and also suggested (many times, better) solutions. He could talk the talk as both a CEO and as a developer and could switch back and forth when talking to my Director of Engineering, VP of Technology, and me and was more than competent at all levels. In the end we ended up purchasing the solution for all of our extraction servers, and I made a connection to someone I can always turn to when I need anything PDF related.

Pretty cool.  We’ll obviously be publishing more case studies as the opportunities arise, but it’s hard to beat comments as personal and unsolicited as those.  Thanks, Neil.

New Year’s PDFTextStream Sale!

This morning, we put some limited-time-only discounts into place for PDFTextStream to celebrate the new year. You can now purchase PDFTextStream server deployment licenses for as little as $999 USD (optionally with Premium Support). These licenses carry no CPU restriction, so you can use them on your 1-CPU development box or your 64-CPU Superdome. And, as always, you can use the same license under Java, Python, or .NET. This sale starts today, and ends on January 31, 2007. You can place your order here (with payments handled by Google Checkout).

This is quite a deal — these unlimited-CPU server licenses usually cost $13,750 USD. That’s quite an insane discount, but I thought it was worth the chance. Theoretically, this will create a little buzz, increase our customer list by quite a bit, and maybe expose a different class of users to PDFTextStream that might have previously written it off because of its admittedly high (normal) price tag.

This is also a decent pricing experiment. We’ve never done much experimentation in the area of pricing, so we’ll now have one more data point on our demand curve (as described brilliantly by Joel). I don’t think that this particular experiment will have any lasting effect on our pricing for PDFTextStream, but it will be an interesting exercise nonetheless.

Free PDFTextStream for Academic Use

The title says it all.  Today we’re announcing that PDFTextStream is free for academic use: read the press release, and if you are a qualifying academic developer, go ahead and apply for a free PDFTextStream license file.

Don’t worry, the application “process” will take you 2 minutes, and assuming you are eligible (i.e. a student, faculty, academic researcher, or university IT staff), you’ll get your free PDFTextStream license file within a week.  Why a week?  Well, we want to set expectations properly, as we assume we’ll get a pretty solid barrage of applications — after all, everyone likes free stuff.

I’m hoping that this will make life easier for many, especially those who are building truly cool new search, content management, and other webby and/or document-oriented processing systems.  Too often, we’ve run across university-funded researchers who have bare-metal budgets, and are forced to use substandard tools and libraries (but who still manage to build amazing technologies).  PDF is obviously important (and will only become more prevalent), so making sure those folks can get the best PDF content extraction library available at no cost to them will hopefully enable even greater, faster progress.

It’s the very least we can do to “give back”.

Memory-mapping Files in Java Causes Problems

Today, we released PDFTextStream v2.0.1 — a minor patch release that contains a workaround for an interesting and unfortunate bug: on Windows, if one accesses a PDF file on disk using PDFTextStream, then closes the PDFTextStream instance (using PDFTextStream.close()), the PDF file will still be locked. It can’t be moved or deleted.

This is actually not a bug in PDFTextStream, but in Java, documented as Sun bug #4724038. In short, any file that is memory-mapped cannot reliably be “closed” (i.e. the `DirectByteBuffer` (or some native proxy, perhaps) that holds the OS-level file handle does not release those resources, even when the `FileChannel` that was used to create the `DirectByteBuffer` is closed). Reading the comments on that bug report shows a great deal of frustration, and rightly so: regardless of the technical reasons for the behavior, memory-mapping files isn’t rocket science (or, hasn’t been for 20 years or somesuch), and this kind of thing shouldn’t happen.

Since we can’t fix the bug, we devised a workaround: if you set the `pdfts.mmap.disable` system property to `Y`, then PDFTextStream won’t memory-map PDF files. Simple enough fix. FYI, there appears to be no performance degradation associated with using PDFTextStream in this mode.

Of course, this is only a problem on Windows, which does not allow files to be moved or deleted while a process has an open file handle. We have a number of customers that deploy on Windows Server (although that number is much smaller than those that deploy on a variety of *nix), but until last week, they hadn’t reported any problems. Our best guess is that, given the systems we know those customers are running, they are probably using PDFTextStream’s in-memory mode (where PDF data is in memory, and provided to PDFTextStream as a `ByteBuffer`). Of course, in that case, no file handles are ever opened, so all is well.
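The underlying behavior is easy to reproduce with plain java.nio, no PDFTextStream required. A sketch (the `pdfts.mmap.disable` property is our documented workaround; everything else here is illustrative):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapLockDemo {
    public static void main(String[] args) throws IOException {
        // The workaround: tell PDFTextStream not to memory-map at all.
        // (Must be set before PDFTextStream opens any files.)
        System.setProperty("pdfts.mmap.disable", "Y");

        Path p = Files.createTempFile("demo", ".pdf");
        Files.write(p, new byte[] { 1, 2, 3, 4 });

        MappedByteBuffer buf;
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4);
        }
        // The mapping outlives the closed channel: the buffer is still
        // readable, and per Sun bug #4724038 the OS-level file handle is
        // only released when the buffer is eventually garbage-collected.
        // On Windows that lingering handle blocks the delete below.
        System.out.println(buf.get(0)); // prints 1
        System.out.println(p.toFile().delete()); // true on *nix; false on Windows while mapped
    }
}
```

On any *nix the delete succeeds despite the live mapping; on Windows it fails, which is exactly the locked-file symptom our customers would otherwise hit.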

Working Together: Python and Java, Open Source and Commercial

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype (http://jpype.sourceforge.net), and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That wasn’t a problem with PDFTextStream — we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job was completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new Mac mini, plus whatever else he felt like buying with his hard-earned cash

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.