I’m happy today to introduce DocuHarvest.
Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data are involved.
This may seem strange to hear coming from me: you may or may not know that I’ve been principally involved in selling PDF content extraction software for the past six years. Over that time, I’ve had the opportunity to come face-to-face with hundreds of content and data extraction challenges across dozens of industries. If there’s one takeaway I can offer up from that experience, it’s this:
No one cares about the process of data extraction: people only care about their data.
Seems simple enough, but ask anyone who’s been involved in any kind of data integration project, or tried to help a nontechnical user get useful data out of a directory full of documents, and you’ll learn that people are forced to care. The situation is worse for small business owners and others who simply can’t afford additional software and the attendant consulting hours.
DocuHarvest is an alternative path: a web application that provides data extraction and content conversion services through the browser, usable by everyone, costing pennies per document processed. There’s even a free option, if you’re willing to process only one document at a time.
We’re starting small, offering three types of document processing jobs:
- Extracting document metadata (date published, author, title, keywords, etc.),
- Conversion of documents to text, and
- Extraction of interactive PDF form data
DocuHarvest currently only accepts PDF documents as input, but that will change relatively soon – PDF just happens to be where we “come from”, so we’re rolling that out first. Support for additional file formats will come.
We also have a variety of other job types in the pipeline, including:
- conversion of documents to images (rasterization),
- extraction of embedded images, and
- thumbnail generation
That’s hardly a comprehensive list. This is just the beginning; we have a lot of tricks we’ve saved up over the course of those six years. :-)
I’d love it if you were to go to DocuHarvest.com now and try it out. Remember, it’s free to try (and still free to use, one document at a time).
If you have any feedback, comments, questions, suggestions, or complaints, don’t hesitate to contact me: leave a comment below or in the feedback boxes on the DocuHarvest site, message me (@cemerick) or @docuharvest on Twitter, or email me directly.