Update: I’ve shut down IMHO. It was amusing while it lasted, but it’s clearly not a domain in which I have a fundamental interest. I hope you all enjoyed playing with it while it was among us!

I don’t write stupid Twitter apps often…but when I do, they’re really stupid.

— Me, wearing a smoking jacket

I’ve always enjoyed Twitter, but I’ve never built an app of any kind around it, or done anything with Twitter data.  Insofar as that’s roughly the modern equivalent of a ‘hello world’ program, I was perhaps lacking in some critical way.

Well, no longer.  I, too, have built a Stupid Twitter App™: go check out IMHO (or its companion Twitter handle, @IMHO)!

While Twitter has many roles (some quite important), everyone uses it as a dumping ground for their opinions.  IMHO (a common Internet colloquialism meaning “in my humble opinion”) will maybe provide an entertaining and perhaps informative view of those opinions, in aggregate.  I’ve seeded the site with just short of 2 million opinions culled from an archive of 90 million tweets; but, from here on out, new opinions will only be added if they are tweeted at @IMHO, like so:

.@IMHO The Celtics will get revenge next year!

Through a combination of some crude natural language processing and a lot of hard-working squirrels, opinions are parsed and indexed by their subject/topic.  Oh, but leave your nuanced stances at the door: only simpler, discrete assertions will be recognized.  Madness?  No, this is Twitter!

Right now, only reverse-chronological listings of opinions are available (far more interesting visualizations are of course close at hand); perhaps I’m just that puerile, but I’ve found even that simplistic view entertaining enough for now.  Check out some popular/contentious topics:

There are plenty of snort-worthy gems in there, if you care to go fishing.  Again, a better UI will make it easier to surface them, and it will be added if the service catches on in any way.  So, if you have an opinion you’re going to put on Twitter, put a .@IMHO on it.

PDFTextStream now available free (as in beer)

PDFTextStream v2.6.0 was released today with a variety of small new features and a couple of bugfixes.  The bigger change is that PDFTextStream is now available free for use in single-threaded applications.

Because of the realities of the economics around developing and maintaining a product like PDFTextStream, its pricing has often been out of reach of many projects and very small organizations that really need high-quality PDF content extraction functionality.  That’s not to say that PDFTextStream is overpriced — it’s actually less expensive than other options — but that is small comfort to many that simply cannot afford or cannot justify the expenditure yet.

This change should fix that: if you have a smaller project, are working on a startup, are involved in information research, etc., you can now benefit from all that PDFTextStream has to offer.  And, if and when your architecture requires concurrent PDF processing, or your PDF content extraction workload is large enough to need to worry about properly utilizing your hardware and compute resources, you can easily upgrade to the unlimited, licensed “edition” of PDFTextStream to parallelize that workload.

It will be fun to see what people build now that PDFTextStream is gratis.  Try it out!

A refresh of Clojure Atlas

I’m sorry to admit that I let the Clojure Atlas wilt a bit over the past year or so. (I was a little busy!)  However, I am conversely quite happy to say that that’s over now; Clojure Atlas has been refreshed to add editions for Clojure v1.3.0 and v1.4.0.

(If you don’t know what Clojure Atlas is, head on over and check out the snazzy new demo/tour video.)

Other highlights include:

Pricing changes

I think the previous pricing was too high.  (You never know until you try.)  Pricing has been lowered, and I’ve added a fun option whereby you can get any edition of Clojure Atlas for just $5.  I don’t quite know what I’ll end up doing for upgrades going forward, but you will definitely be able to stay current without paying the full boat each time.

Free upgrades

Between the too-high pricing and the far-too-long period between the initial release of Clojure Atlas and now, those that prepaid for access to the Clojure v1.3.0 Atlas now have access to all of them, up to and including v1.4.0.  Those early significant supporters will also get free upgrades to all future Clojure Atlas revisions.  Thanks, guys and gals.

If you only purchased the Atlas for Clojure v1.2.0 previously, your account has been upgraded to include the Atlas for v1.3.0.

Ontology improvements

Aside from the obvious additions that needed to go in to reflect changes in Clojure v1.3.0 and v1.4.0, the ontology has been improved significantly to be more comprehensive and more accurate.  In addition, I’ve started adding detailed documentation (for example) to subjects/nodes within the ontology that I’ve added (in contrast to vars, which in general already have documentation of their own).

Visualization improvements

The graph visualization is certainly far from perfect, but I’ve tweaked it a fair bit to get it to “settle” faster than it did before.  I’m also pondering a complete reworking of the visualization to make it deterministic (rather than using a particle simulation as it does now).

No more PayPal

Many people balked at using PayPal — and believe me, no one is happier than I to be rid of it at this point.  Payments are now all handled courtesy of Stripe, which has been a dream to work with.

Friend: an extensible authentication and authorization library for Clojure Ring webapps and services

Say hello to my little Friend.

There’s plenty of technical stuff in the README to chew on if you like.  In short, I’m hoping this can eventually be a warden/spring-security/everyauth/omniauth for Clojure; that is, a common abstraction for authentication and authorization mechanisms.  Clojure has been around long enough that adding pedestrian things like form and HTTP Basic and $AUTH_METHOD_HERE to a Ring application should be easy.  Right now, it’s not: either you’re pasting together a bunch of different libraries that don’t necessarily compose well together, or you get drawn into shaving the authentication and authorization yaks for the fifth time in your life so you can sleep well at night.

Hopefully Friend will make this a solved problem, or at least push things in that direction.  It plays nice with all of the best principles of Ring, and includes support for:

  • form, HTTP Basic, and OpenID authentication
  • role-based authorization (optionally using hierarchical roles via Clojure’s derive and isa?)
  • su capabilities (multiple login support / a.k.a. “log in as”)
  • channel security (i.e. HTTPS-only for certain Ring routes)
  • …and more
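To make the “hierarchical roles” bullet concrete, here is a minimal sketch of what a transitive role check looks like.  It is written in Python purely for illustration and is not Friend’s API: Friend relies on Clojure’s own derive and isa?, and every name below (the hierarchy dict, the derive, isa and authorized helpers, the role names) is a hypothetical stand-in.

```python
# Illustrative only: a tiny role hierarchy with a transitive "is-a" check.
# All names here are hypothetical; Friend itself uses Clojure's derive/isa?.

def derive(hierarchy, child, parent):
    """Record that `child` inherits from `parent` (returns a new dict)."""
    updated = {role: set(parents) for role, parents in hierarchy.items()}
    updated.setdefault(child, set()).add(parent)
    return updated

def isa(hierarchy, role, target):
    """True if `role` is `target` or inherits from it, directly or not."""
    seen = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(hierarchy.get(current, ()))
    return False

def authorized(hierarchy, user_roles, required_roles):
    """A user passes if any of their roles satisfies any required role."""
    return any(isa(hierarchy, have, need)
               for have in user_roles
               for need in required_roles)

# Assumed example hierarchy: owner -> admin -> user
roles = derive({}, "admin", "user")
roles = derive(roles, "owner", "admin")

print(authorized(roles, {"owner"}, {"user"}))   # owner inherits user
print(authorized(roles, {"user"}, {"admin"}))   # user does not inherit admin
```

The point of the hierarchy is that a route only has to name the least-privileged role it requires; anything derived from that role is accepted without being listed explicitly.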

Most importantly, it takes a stab at a couple of core abstractions for others to drop in other authentication workflows, e.g. OAuth in all of its incarnations, NTLM, BrowserID, etc. etc. etc.  There are already plenty of Clojure implementations for all sorts of authentication methods; hopefully someone (you?!) will step up and bring one of them to the party, so anyone’s Friend-empowered Clojure webapp can easily offer any or all of them with a minimum of suffering.

Finally: frankly, it’s absurd that I’m writing security-related stuffs.  (I know it hardly ever works out that way, but it seems like some experts somewhere should be taking care of this.)  It would be a great thing if you were to beat on Friend and try to find exploits, general breakage, etc., especially if you have prior experience in this area.

‘Clojure Programming’ book now available

Update [2011-08-23 18:49 UTC]: The Rough Cut of Clojure Programming has been updated significantly since this post originally went live.  Go check it out. :-)

Some time ago, I announced that I was coauthoring a book on Clojure for O’Reilly (see original announcement).  I’m very happy to report that an early and incomplete version of Clojure Programming is now available in Rough Cuts.

Rough Cuts is O’Reilly’s early-access program, similar to Manning’s MEAP.  By purchasing it now, you will be able to read the ebook via Safari as it progresses through its final stages, and leave feedback that we will take into account through that process.  Please make use of the comment/feedback facility on the book’s Safari page; we are eager to hear what you have to say about the book — though personally, I vacillate between hoping you’ll be gentle and hoping you’ll be brutal.

What’s in the first Rough Cut is actually the state of the book from about two months ago.  I dropped the ball on giving the final word to our editor to go ahead with the release, so I’m afraid you’re all getting this much later than you could (and should) have.  On the upside, there’s a lot of content queued up to be added to the Rough Cut, so you’ll be seeing new stuff stream in very rapidly from here on out.

I do want to apologize about (inadvertently) maintaining radio silence about the book since my original announcement.  Writing the book has ended up overlapping with a very busy time in my life, and I needed to recruit new coauthors mid-stream to boot.  Dave had some killer opportunities that he simply couldn’t turn down; his departure was unfortunate, but it gave me the great opportunity to work with two very well-known figures in the Clojure community:

  • Brian Carper, a stellar writer (I’d been a fan of his blog for some time) and former Ruby hacker (a perspective I wanted to make sure we serviced in the book well)
  • Christophe Grand, the author of a host of popular Clojure libraries such as Enlive, Parsley, and Moustache, and blogger of all things bleeding-edge in Clojure

I’m biased of course, but the book is shaping up to be what I think will be a great introduction to Clojure — especially for those coming from Java, Ruby, and Python — and simply none of it would have been possible if it were not for Brian and Christophe.  Thanks, guys! :-D

Preview and purchase the book: Clojure Programming

P.S. I just want to take a moment to let it settle in that, yes, O’Reilly is publishing a Lisp book, despite their explicitly discouraging Lisp topics in their book proposal guidelines.  (Sorry guys, a single friendly needling is warranted. ;-)) I know it’s not a new concept (they accepted our proposal, after all, and then there was the sadly ill-fated Lisp: Out of the Box), but now the bits are flowing, orders are being taken, and it can’t get much more official. Happy days indeed.

jsdifflib now on Github

jsdifflib is a Javascript library that provides:

  1. a partial reimplementation of Python’s difflib module (specifically, the SequenceMatcher class)
  2. a visual diff view generator, that offers side-by-side as well as inline formatting of file data
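For anyone unfamiliar with the Python class being reimplemented, here is a short sketch of what SequenceMatcher produces.  This uses Python’s standard difflib, not jsdifflib itself, and the sample lines and the describe_changes helper are made up for illustration; I make no claim here about how closely jsdifflib’s JavaScript API mirrors these calls.

```python
from difflib import SequenceMatcher

# Two made-up "files", already split into lines.
old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "beta2", "gamma", "delta"]

def describe_changes(a, b):
    """Return (tag, old_slice, new_slice) for each opcode turning a into b."""
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

changes = describe_changes(old_lines, new_lines)
for tag, old, new in changes:
    print(tag, old, new)
```

Each opcode is tagged equal, replace, insert, or delete, along with the affected ranges in both inputs; a side-by-side or inline diff view is essentially a rendering of that list.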

Some years ago, I needed a good in-browser visual diff tool, and couldn’t find anything suitable.  So, I built jsdifflib in 2007 and open-sourced it soon thereafter.

It’s apparently been used a fair bit by others, even though I cruelly sequestered it on one of Snowtide’s web servers for years (sorry, but Github wasn’t around and I had little patience for SourceForge).  I’ve promised to put the source somewhere more useful many times, but have unfortunately only just gotten around to it today: jsdifflib is now on Github.

I’ve not used the library in some time, but it still works well enough.  Send in your patches if you have ’em.

Clojure Atlas now available

A few weeks ago, I previewed Clojure Atlas; I’m happy to announce that it is now publicly available:

Clojure Atlas, an experimental visualization of the Clojure language and its standard library

There’s a free demo available which will nag you after a bit; if you find it useful, interesting, helpful, or even just a little fun, it would be great if you purchased Clojure Atlas for whichever version(s) of Clojure you’re interested in.

For some limited period of time, you can pre-order an Atlas for the forthcoming Clojure v1.3 at a $10 discount, so grab that soon if you are interested in it.

Please note that Clojure Atlas is by no means Done, or even “done”.  I’m opening it up for use and purchase now in the spirit of release early, release often, so you are sure to find rough edges in the UI and straightforward incompleteness in the ontology that drives the Atlas.  You can read more about the current status here.

As for what’s in the future for Clojure Atlas, my personal to-do list is far too long to go into.  Part of why I’m releasing it “early” is to get a sense of what people want, and what will be most useful.  Doing otherwise would surely lead me to fritter away precious days and weeks honing features interesting to only a few.  There are links all over the site and in the Atlas itself where you can submit ideas, suggestions, and bug reports via all sorts of channels; I’ll be keeping a close eye on the email, tweets, and UserVoice threads submitted in order to filter and prioritize future work.

Why are you still reading this post?  Go check out the Clojure Atlas!

P.S. I’d like to thank all of the early-access testers that provided valuable feedback leading up to this release.  In particular, Edmund Jackson spent far more time with me on irc than he needed to, helping to ferret out issues in the earlier revisions of the graph visualization, and making a variety of excellent suggestions for future development.

“Clojure Programming”, the book

Update: Clojure Programming is now available!

I’m very happy to announce that Dave Fayram (formerly of Powerset and Microsoft, and now of BankSimple) and I have recently committed to writing a book on Clojure, tentatively titled “Clojure Programming”, to be published by O’Reilly Media.

This is pretty significant news for me, but likely also for the broader Clojure community.  Having another Clojure book on the shelves is always a Good Thing™, even better if it’s from O’Reilly, the granddaddy of modern technology publishers.  That imprimatur will do nothing but help Clojure gain exposure, and perhaps in circles as yet unaware of the language.

I think the fabulous growth of the community and the (apparent) success of the other books out there have already made it clear that Clojure is here to stay as a serious language, more than ready for use by a broad population of programmers in real, production systems.  Dave and I are just thrilled that we have the opportunity to introduce the language, its facilities, and its general approach to the next wave or two of Clojure programmers.

I’m a better programmer and a better person for having wandered into #clojure in early 2008, and I’m incredibly grateful to have had the opportunity to meet and know the array of wonderful people that have gathered around the language.  I’m hoping this will prove to be an opportunity for me to give back to the Clojure community as it has given to me.

Quickie FAQs

What will be the target audience, table of contents, publication date, &c?

At this point, writing has only recently begun, so there’s much to do and it would be foolish to discuss any specifics.  But, I’m excited, Dave’s excited, and I thought others might be too.

How will this affect Snowtide and DocuHarvest?

It won’t.  Development of both PDFTextStream and DocuHarvest will continue apace, if not accelerate over the coming months.

That’s all for now.  Wish us luck!

Launching DocuHarvest – Turning documents into data

I’m happy today to introduce DocuHarvest:

Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved.

This may seem strange to hear coming from me: you may or may not know that I’ve been principally involved in selling PDF content extraction software for the past six years.  Over that time, I’ve had the opportunity to come face-to-face with hundreds of content and data extraction challenges across dozens of industries. If there’s one takeaway I can offer up from that experience, it’s this:

No one cares about the process of data extraction: people only care about their data.

Seems simple enough, but ask anyone who’s been involved in any kind of data integration project, or tried to help a nontechnical user get useful data out of a directory full of documents, and you’ll know that people are forced to care. The situation is worse for e.g. small business owners and others that simply can’t afford additional software and the attendant consulting hours.

DocuHarvest is an alternative path: a web application that provides data extraction and content conversion services through the browser, usable by everyone, costing pennies per document processed. There’s even a free option, if you’re willing to process only one document at a time.

Available Now

We’re starting small, offering three types of document processing jobs:

  1. Extracting document metadata (date published, author, title, keywords, etc.),
  2. Conversion of documents to text, and
  3. Extraction of interactive PDF form data

DocuHarvest currently only accepts PDF documents as input, but that will change relatively soon – PDF just happens to be where we “come from”, so we’re rolling that out first. Support for additional file formats will come.

In addition, we have a variety of additional types of jobs in the pipeline, including:

  • conversion of documents to images (rasterization),
  • extraction of embedded images, and
  • thumbnail generation

That’s hardly a comprehensive list. This is just the beginning; we have a lot of tricks we’ve saved up over the course of those six years. :-)

I’d love it if you were to go to DocuHarvest.com now, and try it out. Remember, it’s free to try (and still free to use, one document at a time).

If you have any feedback, comments, questions, suggestions, or complaints, don’t hesitate to contact me; leave a comment below or in the feedback boxes on the DocuHarvest site, message me (@cemerick) or @docuharvest on Twitter, or email me directly.

Reducing purchase anxiety is a feature

Talk to anyone outside of the software world, and you’ll quickly realize that one of the most gut-wrenching, anxiety-inducing acts is buying software. Even if one has evaluated the product in question top to bottom, past experience of bugs, botched updates, missing features, and outright failures and crashes has tempered any enthusiasm or confidence that might be felt when the time comes to pull out the credit card or write the purchase order.

Of course, the blame for this lies squarely with the software industry itself – the failures in software quality are well known, both discrete instances as well as in aggregate. Those of us whose business and livelihood are tied to the sale of software (whether sent out the door or delivered as a service) must do whatever we can to reverse this zeitgeist.

Given that, we’ve decided to adopt a very simple, no-nonsense “Satisfaction Guaranteed” policy for PDFTextStream. Hopefully this will help take the anxiety out of someone’s day, somewhere.

This isn’t a new idea, of course. Lots of software companies have had guarantees of some sort or another for ages, but I think my first encounter with the concept as a business owner was Joel Spolsky’s post from a couple of years ago:

I think that our customers are nice because they’re not worried. They’re not worried because we have a ridiculously liberal return policy: “We don’t want your money if you’re not amazingly happy.”

Joel raised the issue again on a recent StackOverflow podcast, which prompted me to think about our own approach…

What do we do about unhappy customers?

To be honest, our customers are pretty happy. Of course, we occasionally receive a bug report, but we generally knock out patches within a couple of days, and sometimes faster. In the 5 years we’ve been selling PDFTextStream, we’ve never had a single request for a refund. Part of that is offering up a very liberal evaluation version, but I’d like to think it’s because what we sell does the job it’s meant to do very well.

Given that, I’ve never thought to make a big stink about a refund policy – it just never came up. But hearing Joel and Jeff talk about the ire that they felt towards various companies that refused to issue refunds when they weren’t happy with something motivated me to make our de facto policy explicit. Thus, the new “Satisfaction Guaranteed” statement.

Part II: the Open Source Influence

An elephant in the room is the influence of open source software on customers’ attitudes towards buying software, and the assessment of risk that goes along with it. As more and more users of technology (just to spread the net as widely as possible) are exposed and become accustomed to the value associated with open source software (which, in simple terms, is generally high because of its zero or near-zero price), it increases pressure on commercial vendors (like us) to up our game along the same vector.

But, the impact of open source software on pricing is a pretty stale story. The real impact is derivative, in that a zero or near-zero price means that the apparent risk associated with using open source software is zero or near-zero. The promise of proprietary, commercial software is that, if it does what the vendor claims (whatever that is), then that software will deliver benefits far in excess of its cost and far in excess of the aggregate benefit provided by the open source alternatives, even given the price differential.

The problem is that a lot of people only turn towards commercial options as a last resort because of the aforementioned historical failures of the software industry vis-à-vis quality: the apparent risk of commercial options is higher than that associated with open source options, simply because the latter’s super-low price is a psychological antidote to any anxiety about quality issues. So, there’s a flight towards low-priced options, rather than a thorough search for optimal solutions. Injecting an explicit guarantee of performance and reliability (like our new “Satisfaction Guaranteed” policy) might be enough to tip the relative apparent risk in favor of the commercial option – or, at the very least, minimize the imbalance so that it’s more likely that price won’t dominate other factors (which are potentially more relevant to overall benefits).

Of course, this can only work if one’s product is actually better than the open source alternatives, and by a good stretch to boot so as to compensate for the price differential. In any case, it’s a win-win for the formerly-anxious software user and buyer: they should feel like they have more choice overall, and therefore have a better chance of discovering and adopting the best solution for any given problem, regardless of software licenses and distribution models.