On the stewardship of mature software

I just flipped the switch on v2.5.0 of PDFTextStream.  It’s a fairly significant release, representing hundreds of distinct improvements and bugfixes, most in response to feedback and experiences reported by Snowtide customers.  If you find yourself needing to get data out of some PDF documents, you might want to give it a look…especially if existing open source libraries are falling down on certain documents or aren’t cutting it performance-wise.

But, this piece isn’t about PDFTextStream, not really.  After prepping the release last night, I realized that PDFTextStream is ten years old, by at least one reckoning: though the first public release was in early 2004, I started the project two years prior, in early 2002, ten years ago. Ten years.

It’s interesting to contemplate that I’m chiefly responsible for something that is ten years old, that is relied upon by lots of organizations internally, and by lots of companies as part of their own products.  Aside from the odd personal retrospectives that can be had by someone in my situation (e.g. friends of mine have children that are around the same age as PDFTextStream; am I better or worse off having “had” the latter when I did instead of a son or daughter?), some thought has to be given to what the longevity and particular role of PDFTextStream (or, really, any other piece of long-lived software) implies and requires.

I don’t know if there are any formal models for determining the maturity of a piece of software, but it seems that PDFTextStream should qualify by at least some measures, in addition to its vintage.  So, for your consideration, some observations and opinions from someone who has a hand in a piece of mature software:

Mature software transcends platforms and runtimes

PDFTextStream is in production on three different classes of runtimes: all flavours of the JVM, both Microsoft and Mono varieties of the .NET CLR, and the CPython implementation of Python.  This all flows from a single codebase, which reminds me of many other mature systems (sometimes referred to as “legacy” once they’re purely in maintenance mode — a stage of life that PDFTextStream certainly hasn’t entered yet) that, once constructed, are often lifted out of their original runtime/platform/architecture to sit on top of whatever happens to be the flavour of the month, without touching the source tree.

Often, the effort required to make this happen simply isn’t worth it; the less mature a piece of software is, the easier it is at any point to port it by brute force, e.g. rewriting something in C# or Haskell that was originally written in Java.  This is how lots of libraries made the crossing from the JVM to .NET (NAnt and NHibernate are two examples off the top of my head).

However, the more mature a codebase, and the more challenging the domain, the more unthinkable such a plan becomes. For example, the prospect of rewriting PDFTextStream in C# to target .NET — or, if I had my druthers, rewriting PDFTextStream in Clojure to satisfy my geek id — is absolutely terrifying.  All those years of fixes and tweaks in the PDFTextStream sources…trying to port all of them to a new implementation would constitute both technical and business suicide.

In PDFTextStream’s case, going from its Java sources to a .NET assembly is fairly straightforward given the excellent IKVM cross-compiler.  However, there’s no easy Java->Python transpiler to reach for, and a bytecode cross-compiler wasn’t available either.  The best solution was to invest in making it possible to efficiently load and use a JVM from within CPython (via JNI).  With that, PDFTextStream, derived from Java sources, ran without a hitch in production CPython environments. Maybe it was a hack, but it was, in relative terms, easier and safer than any alternative, and had no downsides in terms of performance or capabilities.

(I eventually nixed the CPython option a few years ago due to a lack of broad commercial interest.)

Thou shalt not break mature APIs

When I first started programming in Java, I sat aghast in the ominous glow of java.util.Date. It was a horror then, and remains so. Most of its methods have been deprecated since 1997; and, despite the availability of all sorts of better options, the class has never been removed from the standard library.  Similar examples abound throughout the JRE, and all sorts of decidedly mature libraries.

For some time, I attributed this to sloth, or pointy-haired corporate policies, or accommodation of such characteristics amongst the broad userbase, or…god, I dunno, what are those guys thinking? In the abstract, if the physician’s creed is to “do no harm”, it seems that the engineer’s should be “fix what’s broken”; so, continual improvement should be the law of the land, API compatibility be damned.

Of course, it was naïve for me to think so.  Brokenness is often in the eye of the beholder, and formal correctness is a rare thing outside of mathematics.  Thus, the urge one has to “make things better” must be tempered by an understanding of the knock-on effects for whoever is living downstream of you.  In particular, while making “fixes” to APIs that manifest breaking changes — either in terms of signatures or semantics — might make you feel better, there are repercussions:

  • You’ll absolutely piss off all of your customers and users.  They had working code that now doesn’t work. Whether you are charging them money or benefiting from their trust, you are now asking them to take time out of their day to help you feel better about yourself.
  • Since their code is broken already, your customers and users might see this as the perfect opportunity to make their own changes to not have to cope with your self-interested “fixes” anymore.  Surely you can imagine the scene:

    Sarah: “Hey Gene, the new version of FooLib changes the semantics of the Bar(string) function. Do you want me to fix it now?”

    Gene: “Sheesh, again? Well, weren’t you looking at BazLib before?”

    Sarah: “Yeah; BazLib isn’t quite as slick, but Pete over in Accounts said he’s not had any troubles with it.”

    Gene: “I’m sold. Stick with the current version of FooLib for now, but next time you’re in that area of the code, swap it out for BazLib instead.”

This is why semantic versioning is so important: when used and understood properly, it allows you to communicate a great deal of information in a single token.  It’s also why I can often be found urging people to get their breaking changes made early, in v0.0.X releases of libraries, and why PDFTextStream hasn’t had a breaking change in six years.

Of course, there are parts of PDFTextStream’s API that I’m not super proud of; I’ve learned a ton over the course of its ten-year existence, and there are a lot of things I’d do differently if I knew then what I know now.  However, overall, it works, and it works very well, and it would be selfish (not to mention a bad business decision) to start whacking away at changes that make the API aesthetically more pleasant, or of marginally higher quality, but which make customers miss a beat.
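To make that concrete, here is a minimal sketch of the non-breaking alternative (deprecate and delegate), using hypothetical names rather than PDFTextStream’s actual API:

    // Hypothetical API, for illustration only; not PDFTextStream's actual classes.
    public class Document {
        public enum ExtractionMode { READING_ORDER, PRESERVE_LAYOUT }

        /**
         * The original method: the bare boolean turned out to be cryptic.
         * Rather than changing its signature or semantics, keep it working
         * forever and point callers at the replacement.
         *
         * @deprecated use {@link #extractText(ExtractionMode)} instead
         */
        @Deprecated
        public String extractText(boolean preserveLayout) {
            return extractText(preserveLayout ? ExtractionMode.PRESERVE_LAYOUT
                                              : ExtractionMode.READING_ORDER);
        }

        /** The replacement: clearer and extensible, added without breaking anyone. */
        public String extractText(ExtractionMode mode) {
            // ...actual extraction logic...
            return "";
        }
    }

Existing callers keep compiling and behaving exactly as before; the deprecation notice communicates the better path without forcing anyone to take time out of their day.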

It seems to me that a good guideline might be that any breaking change needs to be accompanied by a corresponding 10x improvement in capability in order to be justifiable.  This ties in well with the notion that a product new to the market must be 10x better than its competition in order to win; insofar as a new version of the same product with API breakage can potentially be considered as foreign as competing products, that new version is a new product.

Managing risk is Job #1

If your hand is on the tiller of some mature software — or, some software that you would like to see live long enough to qualify as mature — your first priority at all times is to manage, a.k.a. minimize, risk for your users and customers.

As Prof. Christensen might say, software is hired to do a job.  Now, “managing risk” isn’t generally the job your software is hired to do, e.g. PDFTextStream’s job is to efficiently extract content from any PDF document that is thrown at it, and do so faster and more accurately than the other alternatives.  But, implicit in being hired for a job is not only that the task at hand will be completed appropriately, but that the thing being hired to do that job doesn’t itself introduce risk.

The scope of software as risk management is huge, and goes way beyond technical considerations:

  • API risk, as discussed above in the “breakage” section
  • Platform risk. Aside from doubling the potential market for PDFTextStream, offering it on .NET in addition to the JVM serves a purpose in mitigating platform risk for our customers on the JVM: they know that, if they end up having to migrate to .NET, they won’t have to go find, license, and learn a new PDF content extraction library.  In fact, because PDFTextStream licenses are sold in a platform-agnostic way, such a migration won’t cost a customer of ours a penny.  Of course, the same risk mitigation applies to our .NET customers, too.
  • Purchasing risk. Buying commercial software outside of the consumer realm can be a minefield: tricky licensing, shady sales tactics, pricing jumping all over the map (generally up), and so on.  PDFTextStream has had one price increase in eight years, and its licensing and support model hasn’t changed in six.  Our pricing is always public, as is our discount schedule.  When one of our customers needs to expand their installation, they know what they’re getting, how much it’s going to cost, and how much it’ll cost next time, too.

Even if one is selling a component library (which PDFTextStream essentially is), managing risk effectively for customers and users can be a key way to offer a sort of whole product.  Indeed, for many customers, managing risk is something that you must do, or you will simply never be hired for that job, no matter how well you fulfill the explicit requirements.

Stymied from within (an entrepreneurial experience)

In his interview with Andrew Warner, Jason Cohen provided a nice snapshot of the experience of being a software entrepreneur (taken from the transcription provided on the interview page, emphasis mine):

You hear all these stories of success, and luck also always plays a part. We didn’t talk about that much, but of course it’s true. And great networking helps. If your idea’s really good, maybe the customers will come easy. Again, you hear all these stories on Mixergy. When people condense the timeline, it’s almost inevitable that it sounds easy and straightforward. And one leads to two and that sort of thing.

But then as a listener, you come back to your own reality where it’s hard. You don’t have a network like Jason Baptiste has. You don’t have an article on Mashable. You don’t have a strong sense of your own philosophy and what you think is true in the world and who you want to be. You don’t know yet, because you haven’t gone and done a whole bunch of things and seen how you feel about it. And there isn’t a big rush of customers at your door.

You’re sitting there in front of the computer, with your inbox and a compiler and an A/B tester. You’re overwhelmed by all this stuff. How do you know – obviously, this is too abstract a question to have a specific answer – but how do I take that next step and go running? How do I just go running? Do I just have to say, “Look, there’s nothing but going and running.” That’s it? It’s as simple as that? It’s hard, but that’s all there is?

…where “running” is being used as a metaphor for getting the job done and achieving what you want to achieve.  What really resonated with me was this perspective of being one guy (or maybe just a couple guys/gals) who is at point A, wants to get to point B, and fundamentally must find his own path when none is visible.

As a solo entrepreneur / business owner, I feel like I face an uncertain, massive void every day, through which I must find my way, avoiding all sorts of pitfalls.  Sometimes, that void is the outer world, filled with challenges of all sorts related to technology, business, customers, and money.  However, that void is usually myself, where far more difficult riddles await: What do I want to achieve? What should I focus on now in order to be successful? How can I break out of unproductive habits and cycles? What do I not know today that would save my ass tomorrow?

Some days are easier than others, with answers coming easily.  It’s especially pleasant to have a run of time where one is simply doing, knocking down technology and business problems like so many piñatas. Other days, it’s impossible to make such obvious progress when it’s not clear what should be done in the first place.  And then there are the really bad days, when overwork or self-doubt breeds loss of focus, procrastination, and prodigious mental bonking: a particularly painful state where the flesh is willing, but the mind is weak.

For weeks now, I’ve been tangling with this dim state of being — probably the longest and deepest period of time I’ve ever spent coping with an inability to focus and execute.  As painful as it’s been, I take some small bit of solace that every other entrepreneur, engineer, artist, and writer seems to have been afflicted with similar problems from time to time. Unfortunately, each time this happens to me, I seem to forget for a time that the best salve is almost always the same: keep trying to build, write, create—and eventually the blocks and cobwebs and limits I place on myself fall away.  One really does just need to “go running”.

Recovering from and avoiding “cloud service” lock-in

We all love our shiny cloud services — until they break, die, or otherwise go away, turning all those unicorns and butterflies into what can only be described as a bark salad.  Case in point: sometime this week, I’ll be wasting time migrating data out of DabbleDB.

DabbleDB, if you don’t know, is was this great interactive “relational” database service: think of it as a massive spreadsheet where anything could be related to anything else — schema-free, BTW — with hooks into web forms for surveys and such, an excellent reporting and query engine, and all sorts of goodies like mapping geographically-related data, charting, etc. Similar services include WuFoo, Intuit’s Quickbase, and ZoHo Creator.

I say was because DabbleDB was acquired by Twitter last year; as a result, the service is shutting down next week, and I need to yank the data we were storing there and reconstitute it into some corresponding in-house applications.  Mind you, there’s nothing difficult about this, but I’m slightly irked at myself for being in this position.  While building the replacement apps will be far more costly over the long term than the $8/month we were paying DabbleDB, the real cost is the dislocation associated with relying upon a “cloud” service provider to provide a particular set of features and then being forced to roll back that reliance.

In this case, the precipitating event is a happy one for DabbleDB; they’re good guys (go Smalltalkers!), and I got my data out just fine, but the scenario isn’t so different than if they went out of business, or had a massive technical failure.

This perspective prompted me to think about what I would have done differently, what questions I should ask myself before again committing to use a particular “cloud” service, and what I should focus on as a vendor of such services to minimize the chances that my customers will be faced with the same grim drudgery that I’m facing now.  Obviously not a comprehensive treatment, but off the top of my head:

Things that (prospective) users of cloud services need to think about

  1. How likely is it that the providers of this service will be around in a year? Five years?  Do they have a reputation for retiring new services if they don’t take over the world? (viz. Google Wave)
  2. Do you have a suitable backup plan? Just because data is in the cloud doesn’t mean it can’t be lost; service providers go *poof*, data centers burn.  Get and keep snapshots of your data just like you do for data on your local workstations and in-house applications. If that’s not practical (i.e. you’ve too much data in the cloud to store locally), then at least push snapshots into another provider.
  3. Don’t be (too) swayed by the chrome and glitter.  Many online services put design at the center of their offerings, and it’s true that quality, functional design can be compelling — just make sure that you’re not taking on a ton of risk just to get a shinier dashboard.
  4. For each service you use, ensure that you can either reasonably recreate it in-house, or source a comparable service from another provider.

Like all rules, you have to know when to break them.  If a service is amazing enough, the risks of using it may be dwarfed by its benefits, and maybe your data is transient or otherwise not worth bothering with backups.  In any case, the key is to choose who to do business with wisely, and after properly considering alternatives and attendant risks.

Things that builders of cloud services need to think about

If you’re building (or already providing) a “cloud” service, you need to think about all of the issues your customers should be thinking about — to minimize the perceived risks associated with using your service at the very least, and ideally to maximize the actual trustworthiness of your service.

  1. Ensure that data loss is asymptotically impossible.  I could say “don’t ever lose data”, but that’s sort of like saying “don’t get into a car wreck”.
  2. If data is compromised (not lost but obtained by someone unauthorized to have the data), make it so that that data is unusable.  This isn’t always possible, but when it is, encrypting bits before shuffling them off to persistent storage is ideal (a sketch of that approach follows this list).
  3. Always provide an obvious way for customers to get their data out, and in the most useful form(s) possible.  A silly counterexample is the set of CSV files I got out of DabbleDB — a dreadful format for schemaless yet relational data.  It’s easy to view data export as an unnecessary early cost, but many potential early customers will rightfully view robust export capabilities as a necessary condition before they will trust your shiny new service.
  4. Assuming yours is not a commodity service, consider what it would take to reimplement or reprovision it if something went badly wrong.  Play out scenarios that include fatal design flaws in your software as well as failures of your upstream vendors, from a couple days’ outage, to the-CEO-was-indicted-and-their-servers-are-already-liquidated.  Is it possible to fail over to other providers? Is it possible to replace your service with another as a temporary bridge until you can restore service properly? The result may not be perfect, but customers will always prefer a degraded service to a disappeared service.
  5. If you do find yourself in the happy situation of being acquired, but your service is not relevant to your acquirer, do right by the customers that got you there.  The DabbleDB guys did pretty well on this count, providing sane data exports and nearly a year of live service allowing their customers to calmly migrate (certainly far better than many other cloud services that get shut down within days or weeks after being hoovered up into Google and Twitter and Facebook).  Going beyond this would, in DabbleDB’s case for example, mean partnering with a competitor or two to make migrations absolutely painless.  (It looks like Zoho Creator has done this to some extent on their own using DabbleDB’s APIs.)
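On the second point, here is a minimal sketch of the encrypt-before-persisting approach using the JDK’s built-in AES-GCM support; it is illustrative only (key management is elided, and in practice the key must live somewhere other than alongside the data, e.g. in an HSM or key-management service):

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;

    public class AtRestEncryption {
        private static final SecureRandom RANDOM = new SecureRandom();

        /** Encrypts a record before it is written to storage; the random
         *  nonce is prepended so the stored record is self-contained. */
        static byte[] encryptForStorage(SecretKey key, byte[] plaintext) throws Exception {
            byte[] nonce = new byte[12];               // 96-bit nonce, standard for GCM
            RANDOM.nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] record = new byte[nonce.length + ciphertext.length];
            System.arraycopy(nonce, 0, record, 0, nonce.length);
            System.arraycopy(ciphertext, 0, record, nonce.length, ciphertext.length);
            return record;
        }

        public static void main(String[] args) throws Exception {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            SecretKey key = kg.generateKey();
            byte[] stored = encryptForStorage(key, "customer data".getBytes(StandardCharsets.UTF_8));
            System.out.println(stored.length + " bytes ready for persistent storage");
        }
    }

With a scheme like this, a leaked disk image or stolen backup yields only ciphertext; the residual risk is concentrated in the key, which is a much smaller thing to protect.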

Ironically, enticing your customers to commit (née be locked in) to your service requires that you give them reason enough to believe that they can leave any time, and sanely survive your passing.

Ashton’s plight, fight or flight

Joel Spolsky is a stellar writer and a fine storyteller. His latest, the apocryphal tale of Ashton, tells of the plight of a software developer stuck in a crummy job, surrounded by sycophants and half-wits, desperate to escape the inevitable resulting ennui.  To top it off, this miserable environment happens to exist in…Michigan.

Of course, there could only be one solution:

…it was fucking 24 degrees in that part of Michigan, and it was gray, and smelly, and his Honda was a piece of crap, and he didn’t have any friends in town, and nothing he did mattered.

[…Ashton] just kept driving until he got to the airport over in Grand Rapids, and he left his crappy old Honda out right in front of the terminal, knowing perfectly well it would be towed, and didn’t even close the car door, and he walked right up to the Frontier Airlines counter and he bought himself a ticket on the very next flight to San Francisco, which was leaving in 20 minutes, and he got on the plane, and he left Michigan forever.

It’s a sad story, but mostly because of the thread of self-righteous metropolitan exceptionalism running through it that is all too typical of parts of the programming and business-of-software circles in which I often circulate.  If Ashton hadn’t bought into that worldview, he might have left his disaster of a job with the cubicle manufacturer for any of the small software companies nearby, filled with talented, engaging people working on challenging and rewarding problems. Or, he could have started his own company, with potential to make a big impact in his town, in Michigan, and in the world.

Instead, he flies to San Francisco, the Lake Wobegon of the software world: where the managers are sane, coworkers are friendly hacker geniuses, and everyone’s exit is above average. Of course, that’s apocryphal too. There are sycophants and half-wits everywhere, and chances are good our friend Ashton will be working on a backwoods project at Google/Facebook/eBay/Yahoo/etc., or at one of those earth-shattering startups you always hear about on TechCrunch (you can hear the pitch now: “it’s like Groupon for Facebook game assets!”). Maybe he’ll be happier, maybe he’ll be more satisfied, maybe he’ll be richer; or, not.

I love large cities dearly, and I wouldn’t mind living, working, or building a business in one. But, metropolitan life is not a salve for what ails you and your career, and not being in one certainly doesn’t doom you or your future prospects. In this age of decentralized work, people in more and more professions (perhaps software development in particular!) are uniquely positioned to work where and when and for whom they want regardless of geography. Opportunity is everywhere – and anywhere you are, if you’ve the talent and determination to thrive.

Xerox’s Inspirational Carlson and Wilson

I recently finished reading Xerox: American Samurai, an out-of-print business case study of sorts that tells the story of Xerox from a decidedly American, mid-1980’s perspective (the book was published in 1986), one preoccupied with how domestic industry would compete with the growing influence and capability of Asia, and Japan in particular. It’s a very entertaining read, something of a more business-side Soul of a New Machine: the core of the narrative is the engineering, marketing, manufacturing, and organizational effort that brought Xerox’s 10 Series copiers to market starting in 1982 (some of which appear to still be supported and in service!), which were to be Xerox’s response to the accelerating success of its Japanese competitors.  Along the way, the book weaves a story encompassing Xerox’s early days developing and commercializing electrophotography, its fantastic success in the late 1950’s and 1960’s, its “lost decade” in the 1970’s when innovation stagnated and the business began to fray, and finally its then-in-progress rejuvenation into the early 1980’s as Xerox slimmed down and refactored its business and engineering practices to compete effectively.

It’s a great story, but I’m not writing a book review.  Most striking about the book was the glimpse provided of Chester Carlson, the inventor of electrophotography, and Joseph C. Wilson, the co-founder of Xerox (née The Haloid Photographic Company).  As far as I can tell, these men were forces of nature unto themselves, and possessed an array of values and principles that I find inspirational.  Indeed, the book mentions more than once that part of what held Xerox together, especially in the bad years, was the legacy of its progenitors, Wilson and Carlson.

To illustrate, it would be best if I simply quoted from the book; here, from pages 54-56 (bold emphasis mine):

…Wilson never liked it when people referred to Xerox during its spectacular growth years as a Cinderella story.  The company earned its success, he said.  The only magic was the magic of hard work.

As a boy, Wilson grew up in the shadow of Kodak’s largest manufacturing facility in Rochester–Kodak Park.  His dream was to build a company as great as George Eastman’s.  He didn’t want to make a quick killing and then retire with his riches, he wanted his company to have an impact on the world. He wanted to make his company his life’s work, just as Eastman had done.

Chester Carlson and the 914 copier helped Wilson realize his dream. Carlson, the inventor of xerography, filed his first patent in 1937, calling his discovery electrophotography.  His first successful image was made in 1938.  Over the next nine years he tried to sell his idea to more than twenty companies, including RCA, Remington Rand, General Electric, Kodak, and IBM.  They all turned him down, wondering why anyone would need a machine to do something you could do with carbon paper.

Although Carlson was often frustrated by the lack of interest in his invention, he never quit.  Sometimes he put his idea and equipment on the shelf for a few months, but soon the enthusiasm would return.  He scraped together a few hundred dollars in 1939, a large sum during the Depression, and had a prototype of an automatic copier built by a model shop in New York.  It didn’t work.  Another model maker got it working, briefly, but soon the war diverted expert machinists to more urgent tasks.  Carlson went back to demonstrating his process with manual plates.  Finally, in 1944, Battelle Memorial Institute in Columbus, Ohio, signed a royalty-sharing agreement with him and began to develop the process.  A short time later, John Dessauer, Haloid’s director of research, showed Joe Wilson a technical article on Carlson’s electrophotography in Radio News. Haloid made the initial contacts with Battelle, and in 1947, it signed an agreement with Battelle and began funding research.  With the help of a professor from Ohio State University, the term “xerography,” Greek for “dry writing,” was coined.

The early manual copying process was excruciatingly slow, almost like developing a photographic print.  An early Haloid brochure describes Thirty-Nine Steps for making good copies on its first commercial copier, the Model A Xerox, which was sometimes called the Ox Box.  The best operators took two to three minutes to make a print, a long way from Carlson’s vision of an automatic machine.  Still, Wilson and Haloid pressed on.  Over the next thirteen years, Wilson committed more money than his company made to developing the process.

Carlson and Wilson both made fortunes on xerography; Carlson earned more than $200 million, Wilson more than $100 million. Their backgrounds and personalities were different, but both of them were reflective men who were concerned with more than money and business.  Carlson was a quiet, shy man from a poor family who struggled to put himself through college and never knew material comfort until late in life when the royalties from xerography finally started to arrive.  During the early years at Haloid, Dessauer once asked him out to lunch.  Carlson declined because he couldn’t afford to reciprocate.  When he made his great breakthrough in xerography he was working days in a patent office, going to law school at night, and doing his experiments on weekends.  He always felt uncomfortable in large groups and avoided public involvement in causes, although he anonymously donated millions of dollars to many of them.

Carlson was never on the regular Xerox payroll, though Wilson made several offers.  Instead, he preferred the independence of working as a consultant.  He died in 1968, at the age of sixty-two, of a heart attack.  A year before his death his wife asked him if he had any unfulfilled desires.  “Just one,” he said. “I would like to die a poor man.”  When he died he had given away more than $150 million. U Thant, secretary-general of the United Nations, sent this tribute to Carlson’s memorial service in honor of his substantial financial contributions: “His concern for the future of the human situation was genuine, and his dedication to the principles of the United Nations was profound.”

Wilson was a graduate of the Harvard Business School.  His father was president of Haloid before him and his grandfather had served as mayor of Rochester.  Unlike Carlson, Wilson was an outgoing person.  His speeches were as likely to contain quotes from Byron and Dostoyevski as they were to contain the latest earnings and revenue numbers.  Even after the company became successful, he would frequently lunch on peanut butter and jelly sandwiches at his desk so he could catch up on his reading.  He welcomed involvement in community affairs, often speaking about the obligation of successful enterprises to contribute to society.  Wilson died in 1971, at the age of sixty-one, of a heart attack, while having lunch with the governor of New York, Nelson Rockefeller.  A frayed, blue index card that he had carried since the early days of his career was found in his wallet.  It summarized his goals: “To be a whole man; to attain serenity through the creation of a family life of uncommon richness; through leadership of a business which brings happiness to its workers, serves well its customers and brings prosperity to its owners; by aiding a society threatened by fratricidal division to gain unity.”

The tenacity, dedication, and grounding principles of these individuals are remarkable, both on their own merits and compared to the fluff usually offered to entrepreneurs and business owners like myself as examples of success.  Carlson as the inventor and technologist and Wilson as the investor and clueful technical entrepreneur and executive would appear to be far better options.

For those that are interested, it looks like there are at least two other books specifically about Carlson and Wilson, at least in connection with their development of electrophotography and association with Xerox.

Open Source, Positioning, and Execution

In the past month, I’ve read no fewer than 8 articles and blog posts trying to thread a story around what is apparently the “big” question these days: how can software companies make money in an open source world? Well, we are, quite well, thank you very much. Here’s how and why.

Our primary product is PDFTextStream. It came onto the market a year ago, entering a market (Java libraries that can extract content from PDF documents) that was dominated by open source (or dual-licensed) offerings that are generally well-liked by the broader community.

OK, so why are we still here, thriving and growing?

  • Positioning. When I decided to enter this market three years ago, I knew we would have a good chance simply because it has characteristics that are uniquely suited to a strong, specialized commercial vendor. While generating PDF documents is generally quite easy (thereby leading to a glut of report-generating libraries), extracting content from PDF documents is not. There are numerous file-format ambiguities to address, as well as the details related to achieving the document-understanding accuracy that is demanded by corporate and government customers. Anyone not dedicated to serving this market with 100% of their effort will not meet the market’s true demands.
  • Execution. Anyone who strives to innovate eventually experiences some anxiety about sharing ideas with colleagues, with the irrational fear that those ideas might be misappropriated, leading to unnecessary competition. The thing is, dozens or hundreds of other people in the same field are likely having the same ideas simultaneously, so the only thing that will ever ensure business success is superior execution.
    Likewise, there are at least four open source Java libraries that extract content out of PDF documents. It’s not arrogant or smug to say that we’ll out-execute the teams or individuals that work on those libraries. We’re in this for the long haul and this is all we do 14 hours a day.
  • Serving a Niche. Very closely related to product positioning was the decision to enter a very demanding niche. We’re not trying to build yet another HTTP server, EJB container, etc. We’re not working on a commodity, and therefore we are much less likely to see competition from an open source library staffed by developers from IBM (for example). Beyond this market-centric reality is the fact that PDF content extraction is a much more difficult game than writing an HTTP server (again, for example) — there are no standards, there are no RFCs, there’s no easy way to tell if you’re doing things the right way. So, if someone wants to go head to head with PDFTextStream, they’ll have to grab their machete and start slicing through the same jungle of PDF specs, mangled documents (which nevertheless open in Acrobat without a hitch), and all of the other fun that goes into building a PDF extraction library.

I’m not saying that this formula we’ve worked out is simple, or that it can be easily replicated with a different product in a different market. However, at least from where I’m sitting, “living in an open source world” is pretty pleasant.

Benchmarks and honesty

…like oil and water, right? Not necessarily; we should hope not, for otherwise we’re all in trouble.

Last week, someone anonymously posted a comment to a previous entry of mine. In a nutshell, he or she implied that the benchmarks we publish comparing PDFTextStream text extraction performance to that of other Java PDF libraries were rubbish. Here’s the comment in its entirety:

If the product is so good, why are your speed comparisons using your latest version against 2 year old products.

Wow, that hurt. I responded with a comment to the same entry, but the original implication was serious enough that I felt compelled to make a more visible statement about the benchmark that we publish.

The core complaint in the comment was that we’re tilting the playing field by comparing PDFTextStream to other years-old Java PDF libraries. That was and is fundamentally untrue, except in the case of Etymon’s PJ library. Here, I’ll quote my response on this issue from my comment in the original entry:

Etymon PJ was abandoned in favor of PJx years ago; PJx hasn’t been under active development since April of 2004 though (see http://sourceforge.net/projects/pjx/), and in its current state provides no API for text extraction that we can see. However, our original benchmarks nevertheless showed the older PJ library to be the fastest of the available libraries (second to PDFTextStream), so we included it even though Etymon doesn’t appear to support it anymore.

Our perspective on this is that we have been trying to be as transparent and honest as possible with these benchmarks from day one; therefore, when searching out Java PDF libraries to compare to PDFTextStream, we wanted to find the toughest competition possible. We found Etymon’s PJ library to be the fastest text extraction library (second to PDFTextStream), so we included it in the benchmark.

I think that’s very fair, and very honest. Frankly, given the sometimes rabid nature of skepticism in some developer circles, we would likely have been suspected of hiding something if we had originally decided to exclude the PJ library because it’s no longer supported.

Benchmarks have long been viewed with suspicion by technologists of all stripes, but being a publisher of a benchmark has provided me with some perspective. Yes, benchmarks can be gamed; yes, internally-conducted benchmarks can be more vendor fantasy than reality. We knew this from the start, which is why we made extraordinary efforts to make the benchmark as transparent as possible (by publishing the benchmark code, test files, and methodology along with the bottom-line results). Any skeptics are free to run the benchmarks themselves, and report any observed discrepancies.
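For a sense of what “publishing the benchmark code” amounts to, here is a minimal sketch of the general shape of such a harness (the Extractor interface and its wiring are hypothetical placeholders, not our published benchmark code): each library is timed over the same corpus of files, after a warm-up pass so JIT compilation doesn’t skew the numbers.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class ExtractionBenchmark {
        /** Placeholder for "extract all text from this PDF using library X". */
        interface Extractor { String extract(File pdf) throws Exception; }

        /** Returns wall-clock seconds for one timed pass over the corpus. */
        static double timeCorpus(Extractor e, List<File> corpus) throws Exception {
            for (File f : corpus) e.extract(f);   // warm-up pass for the JIT
            long start = System.nanoTime();
            for (File f : corpus) e.extract(f);   // timed pass
            return (System.nanoTime() - start) / 1e9;
        }

        public static void main(String[] args) throws Exception {
            File[] files = new File(args[0]).listFiles();
            if (files == null) throw new IllegalArgumentException("not a directory: " + args[0]);
            List<File> corpus = new ArrayList<File>();
            for (File f : files)
                if (f.getName().endsWith(".pdf")) corpus.add(f);
            // One Extractor per library under test would be registered and timed
            // here, with per-library results printed alongside the corpus details
            // so that anyone can reproduce the run.
        }
    }

Publishing something of this shape, along with the test files and methodology, is what lets skeptics re-run the comparison rather than take our word for it.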

If that’s not the gold standard of honesty when it comes to benchmarking, and if a benchmark conducted and published in this manner cannot be trusted by the broader developer community, then we’re all in trouble. There are thousands of software products out there, all of which claim a particular advantage over their competition. Some advantages are qualitative, and cannot be measured — that’s fine. However, other advantages are quantifiable; for these claims, we should all welcome a transparent, published benchmark. Otherwise, the process of selecting software products descends into a matter of who has the better marketing and PR game (not that that hasn’t already happened to a very large extent, but that’s a different post!).

Fundamentally, I hope the benchmark doesn’t matter. In the end, I would hope that every developer that is looking for a PDF extraction solution for Java would download all of the available libraries and do some real due diligence to determine which library delivers the best features and throughput in their environment. Voilà, everyone wins.

There’s little left to say, except that, if you find our benchmark to be unconvincing, we remain open and receptive to feedback. If there’s a way we can improve the benchmark, whether by changing the methodology, swapping test files, or tweaking the timing code, we’ll do it.

Totally flattened

The past 10 days have been just nuts.

When it rains, it’s buckets.

We got hit last week with serious inquiries from half a dozen very large organizations — a good mix of governmental, corporate, and nonprofit/research. Each of them already had a grasp of what PDFTextStream could mean to them and their projects, especially on the performance and text extraction quality fronts. However, each of them was also looking for some broader extraction functionality: bookmarks, annotations, tagged PDF structures, etc.

This is stuff we were already working on and planning to add into the mix, but these new requests certainly kicked the pace up quite a bit. Some of it was pretty quick and easy to finish up and move into beta phase — that will find its way into released versions very soon.

Other stuff is a little harder though, to put it mildly: OCR of text in images in PDFs, decryption of digitally-signed documents, and other higher-order functionality. Again, all stuff we’ve been positioning ourselves to jump on, but when there’s fish to fry, we all start cooking a little faster. (Now’s when you’re supposed to groan at the horrible pun….)

So, we’re definitely busy. Now, who said software slowed down in the summer?

Marketing is hard and scary

Marketing is really hard, despite the rumors you’ve heard. The more I get into it, the more I’ve come to respect the skills (if not necessarily the tactics) necessary to deliver a message to prospective customers.

Up until this point, Snowtide has done virtually no marketing, and we’ve made out very nicely. We now have a mature product that really kicks ass. I’m proud of what PDFTextStream is doing for its users, some of whom simply would not be able to do their jobs without it.

But we’re past the point of working small niches. Scores of development shops, large and small, would have fewer bad days if they had PDFTextStream humming on their servers and in their products. So, the time has come to spread the gospel and make sure they know that.

To that end, we’re starting a new marketing strategy in July. It’s going to start slow as we learn our footing (the conventional wisdom is that summertime sees a slowdown in corporate software purchases because of vacationing). It will build through the end of the year. And, it will end with PDFTextStream being the only serious choice for developers in enterprise-class environments.

There’s the tricky part, though: convincing people that our product is better than its competition. The foundation for that has been laid for PDFTextStream — it’s been borne out in customer experiences. The problem is that, without appropriate marketing, the people who are likely to appreciate that fact will never even know about it. In order to change that, we’ve got to write good ad copy, hire good designers to craft and mold that copy into digestible elements (ad banners, text ads, white papers, editorial placements, etc.), and feed those elements into a cacophony of interruptive marketing noise to be noticed and not ignored.

Technical people and marketing folks have always had their differences; each side simply does not understand the difficulties inherent in the other’s trade, and that often leads to disrespect. That is ever so slowly changing, in part because of pieces like this post, typically written by an in-the-trenches software company founder (like myself, I suppose), who inevitably describes how difficult marketing is. And seriously — it’s really, really hard.

Every step in the progression of tasks I enumerated that leads to a prospective customer seeing, noticing, and acting on a piece of advertising is hard. And personally, I find it very unpleasant, simply because I am, by nature, technical. I know how the bits in software work, and I know those types of things very well. It’s a perfect occupation for someone who is a bit of a control nut. Yes, I am that.

So it makes me very uneasy to engage in an activity (like marketing) where I cannot readily control the outcome. It makes me even more uneasy to engage in an activity (like marketing) where I am less than fully confident in my (and in this case, our) abilities. We are fundamentally technical; we know how the bits work. Even with help, we find the fuzzy, soft, vague world of marketing just a little scary.

That will get better in time, as we fail a little, succeed a little, and do a little more of the latter and a little less of the former each time we try. It would be a high crime to not try, try hard, and try often; we have a great product, it should be seen, and it will be seen.

Clients and Customers

Many times I’ve been told, ‘Snowtide should forget about doing custom development work, and concentrate on selling product’. Of course, few people realize just how important our custom development clients are to our overall business and to the quality of our product.

It should all be very simple, right? We have a great product (PDFTextStream) that does a fantastic job of extracting text and metadata out of PDF documents. It’s an ideal solution for Java developers that need to integrate PDF workflows into their desktop or web applications. Many of the more technically-oriented circles we travel in think that that’s enough, a clear indication that they fall prey to the myth of ‘if you build it, they will come’.

However, one of the ongoing keys to our business is how we maintain a divide between customers and clients:

  • Customers keep the trains running on time. They purchase licenses for PDFTextStream as-is, and require very little hand-holding and advice — after all, they’re software developers themselves, and in general know what they’re doing.
  • Clients form our inspiration. They need the base functionality of PDFTextStream, but on its own it doesn’t address their business needs — our clients by necessity look to us to help them build features and application functionality on top of PDFTextStream that will give them a competitive edge in their industry.

It’s no secret that there’s a lot more money to be had in selling product to customers instead of building new applications for clients. However, few understand just how important the latter are in informing our product development strategy, which makes PDFTextStream that much more attractive to new customers in more and more industries.

As it stands, PDFTextStream incorporates a number of features that have been specifically requested by various clients from various industries. And, of course, we’ll continue rolling out new sets of features that complement specialized needs. So, we thank our clients; without them, we would never have made it this far.