Launching DocuHarvest – Turning documents into data

I’m happy today to introduce DocuHarvest.

Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved.

This may seem strange to hear coming from me: you may or may not know that I’ve been principally involved in selling PDF content extraction software for the past six years. Over that time, I’ve had the opportunity to come face-to-face with hundreds of content and data extraction challenges across dozens of industries. If there’s one takeaway I can offer up from that experience, it’s this:

No one cares about the process of data extraction: people only care about their data.

Seems simple enough, but ask anyone who’s been involved in any kind of data integration project, or who has tried to help a nontechnical user get useful data out of a directory full of documents, and you’ll learn that people are forced to care about the process anyway. The situation is worse for small business owners and others who simply can’t afford additional software and the attendant consulting hours.

DocuHarvest is an alternative path: a web application that provides data extraction and content conversion services through the browser, usable by everyone, costing pennies per document processed. There’s even a free option, if you’re willing to process only one document at a time.

Available Now

We’re starting small, offering three types of document processing jobs:

  1. Extracting document metadata (date published, author, title, keywords, etc.),
  2. Conversion of documents to text, and
  3. Extraction of interactive PDF form data

DocuHarvest currently only accepts PDF documents as input, but that will change relatively soon – PDF just happens to be where we “come from”, so we’re rolling that out first. Support for additional file formats will come.

We also have a variety of additional job types in the pipeline, including:

  • conversion of documents to images (rasterization),
  • extraction of embedded images, and
  • thumbnail generation

That’s hardly a comprehensive list. This is just the beginning; we have a lot of tricks we’ve saved up over the course of those six years. :-)

I’d love it if you went to DocuHarvest.com now and tried it out. Remember, it’s free to try (and still free to use, one document at a time).

If you have any feedback, comments, questions, suggestions, or complaints, don’t hesitate to contact me; leave a comment below or in the feedback boxes on the DocuHarvest site, message me (@cemerick) or @docuharvest on Twitter, or email me directly.

This entry was posted in Announcements, DocuHarvest, PDFTextStream.

2 Responses to Launching DocuHarvest – Turning documents into data

  1. Sandeep says:

    Very nice – it would be good if you could add TIFF next, since it is the format used by most scanners, at least by default (especially for stuff like contract entry).
    It would be interesting to look at this then.

    • Chas Emerick says:

      Sandeep,

      You can do a lot of things with TIFFs. Do you mean:

      - converting PDFs to TIFF images
      - applying an OCR process to images (including TIFFs) to yield text
      - or something else?
