Launch HN: Midship (YC S24) – Turn PDFs, docs, and images into usable data
I’m curious to hear more about your pivot from AI workflow builder to document parsing. I can see correlations there, but that original idea seems like a much larger opportunity than parsing PDFs to tables in what is an already very crowded space. What verticals did you find have this problem specifically that gave you enough conviction to pivot?
Firstly, as a function of the independent components in our pipeline. For example, we rely on commercial models for document layout and character recognition. We evaluate each of these, select the most accurate, and fine-tune where required.
Secondly, we evaluate accuracy per customer. This is because however good the individual components are, if the model "misinterprets" a single column, every row of data will be wrong in some way. This is harder to put a top-level number on, and something we're still working out how to scale per customer, but it's much easier when the customer has historic extractions they've done by hand.
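For the per-customer case, comparing against those historic hand-done extractions can be as simple as cell-level agreement. A minimal sketch, assuming both sides are lists of dicts keyed by column name (the cell_accuracy helper and its normalization are illustrative only, not our actual metric):

    def normalize(value) -> str:
        """Crude normalization so '1,000.00' and '1000' don't count as mismatches."""
        return str(value).strip().lower().replace(",", "").replace("$", "")

    def cell_accuracy(extracted: list[dict], ground_truth: list[dict]) -> float:
        """Fraction of ground-truth cells the extraction got right.

        Assumes rows are already aligned by index; a real comparison also
        needs row matching, since one missed row shifts everything after it.
        """
        total, correct = 0, 0
        for gt_row, ex_row in zip(ground_truth, extracted):
            for column, gt_value in gt_row.items():
                total += 1
                if normalize(ex_row.get(column, "")) == normalize(gt_value):
                    correct += 1
        # Ground-truth rows with no extracted counterpart count as misses.
        for gt_row in ground_truth[len(extracted):]:
            total += len(gt_row)
        return correct / total if total else 0.0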
Great Q - there is definitely a lot of competition in dev-tool offerings, but less so in end-to-end experiences for non-technical users.
Some of the things we offer above and beyond dev tools:

1. Schema building to define “what data to extract”

2. A hosted web app to review, audit, and export extracted data

3. Integrations into downstream applications like spreadsheets
Outside of those user-facing pieces, the biggest engineering effort for us has been dealing with very complex inputs, like 100+ page PDFs. Just dumping the document into ChatGPT and asking nicely for structured data falls over in both obvious ways (input/output token limits exceeded) and subtle ones (e.g. missing a row in the middle of the extraction).
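To make that concrete, the obvious mitigation is to chunk the document, extract per chunk, and stitch the results back together. A rough sketch, assuming pypdf for page splitting and a hypothetical extract_rows call wrapping the layout/OCR + LLM step (our actual pipeline does more than this; chunk-boundary rows and de-duplication are exactly where the subtle errors creep back in):

    from io import BytesIO
    from pypdf import PdfReader, PdfWriter

    PAGES_PER_CHUNK = 10  # keep each chunk well under the model's context limit

    def split_pdf(path: str, pages_per_chunk: int = PAGES_PER_CHUNK) -> list[bytes]:
        """Split a large PDF into smaller PDFs of at most pages_per_chunk pages."""
        reader = PdfReader(path)
        chunks = []
        for start in range(0, len(reader.pages), pages_per_chunk):
            writer = PdfWriter()
            for page in reader.pages[start:start + pages_per_chunk]:
                writer.add_page(page)
            buffer = BytesIO()
            writer.write(buffer)
            chunks.append(buffer.getvalue())
        return chunks

    def extract_document(path: str, schema: dict) -> list[dict]:
        """Extract rows chunk by chunk and concatenate the results."""
        rows: list[dict] = []
        for chunk in split_pdf(path):
            # extract_rows is hypothetical: layout/OCR plus an LLM call that
            # returns a list of dicts matching `schema`. Rows that span a
            # chunk boundary still need special handling.
            rows.extend(extract_rows(chunk, schema))
        return rows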
Can you do this with emails?
Saving the email as a PDF would work!
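If you want to script that step, here's a minimal sketch using Python's standard-library email parser plus WeasyPrint to render the body to PDF (any HTML-to-PDF renderer would do; attachments and inline images are ignored here):

    from email import policy
    from email.parser import BytesParser
    from weasyprint import HTML  # pip install weasyprint

    def email_to_pdf(eml_path: str, pdf_path: str) -> None:
        """Render an .eml file to a PDF that can go through the normal extraction flow."""
        with open(eml_path, "rb") as f:
            msg = BytesParser(policy=policy.default).parse(f)

        # Prefer the HTML part; fall back to plain text wrapped in <pre>.
        body = msg.get_body(preferencelist=("html", "plain"))
        content = body.get_content() if body else ""
        if body is None or body.get_content_type() == "text/plain":
            content = f"<pre>{content}</pre>"

        header = (
            f"<p><b>From:</b> {msg['From']}<br>"
            f"<b>To:</b> {msg['To']}<br>"
            f"<b>Subject:</b> {msg['Subject']}</p><hr>"
        )
        HTML(string=header + content).write_pdf(pdf_path)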
- https://www.ycombinator.com/companies/tableflow
- https://www.ycombinator.com/companies/reducto
- https://www.ycombinator.com/companies/mindee
- https://www.ycombinator.com/companies/omniai
- https://www.ycombinator.com/companies/trellis
At the same time, accurate document extraction is becoming a commodity with powerful VLMs. Are you planning to focus on a specific industry, or how do you plan to differentiate?
We see a ton of industries/use-cases still bogged down by manual workflows that start with data extraction. These are often large companies throwing many people at the problem ($$). The vast majority of these companies lack the technical teams required to leverage VLMs directly (or at least the desire to manage their own software). There’s a ton of room for tailored solutions here, and I don't think it's a winner-take-all space.
Agree.
The capability is fairly trivial for orgs with decent technical talent. The tech / processes all look similar:
User uploads file --> Azure prebuilt-layout returns .MD --> prompt + .MD + schema sent to LLM --> JSON returned. Do whatever you want with it.
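A minimal sketch of that flow, assuming the azure-ai-documentintelligence and openai Python SDKs (exact parameter names vary by SDK version; the schema and model choice are placeholders):

    import json
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential
    from openai import OpenAI

    def extract(pdf_path: str, schema: dict, endpoint: str, key: str) -> dict:
        # 1. Azure prebuilt-layout: PDF in, markdown out.
        di_client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
        with open(pdf_path, "rb") as f:
            poller = di_client.begin_analyze_document(
                "prebuilt-layout", f, output_content_format="markdown"
            )
        markdown = poller.result().content

        # 2. Prompt + markdown + schema to the LLM, JSON back.
        llm = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = llm.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "Extract the data described by this JSON schema from "
                            f"the document. Return only JSON.\n{json.dumps(schema)}"},
                {"role": "user", "content": markdown},
            ],
        )
        return json.loads(response.choices[0].message.content)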