Open-source platform for extracting structured data from documents using AI. - DocumindHQ/documind
From the source, Documind appears to:
1. Install tools like Ghostscript, GraphicsMagick, and LibreOffice via a JS script.
2. Convert document pages to Base64 PNGs and send them to OpenAI for data extraction.
3. Use Supabase, for unclear reasons.
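A minimal sketch of what step 2 amounts to, assuming the OpenAI Node SDK and a local Ghostscript install (an illustration, not Documind's actual code):

```typescript
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

// Render the first PDF page to a PNG with Ghostscript (assumes `gs` on PATH).
execFileSync("gs", [
  "-sDEVICE=png16m", "-r150", "-dFirstPage=1", "-dLastPage=1",
  "-o", "page1.png", "invoice.pdf",
]);

// Base64-encode the page and ask a vision model for structured JSON.
const client = new OpenAI();
const b64 = readFileSync("page1.png").toString("base64");
const res = await client.chat.completions.create({
  model: "gpt-4o-mini",
  response_format: { type: "json_object" },
  messages: [{
    role: "user",
    content: [
      { type: "text", text: "Extract invoice number, date, and total as JSON." },
      { type: "image_url", image_url: { url: `data:image/png;base64,${b64}` } },
    ],
  }],
});
console.log(res.choices[0].message.content);
```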
Some issues with this approach:
* OpenAI may retain and use your data for training, raising privacy concerns [1].
* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized pdf-to-json solution, complete with OCR support and an HTTP API (see the command after this list).
* GPT-4 Vision seems like a costly, error-prone, and unreliable solution, not really suited to extracting data from sensitive docs like invoices without review.
* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. These tools do require some plumbing, though... LLMs can probably help with that!
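For instance, running Parsr locally is a single command (image name as given in the Parsr README):

```sh
# Run the Parsr server locally; it exposes an HTTP API for pdf-to-json
docker run -p 3001:3001 axarev/parsr
```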
While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.
---
1: https://platform.openai.com/docs/models#how-we-use-your-data
2: https://github.com/axa-group/Parsr
Disappointed to see this is an exact rip of our open source tool zerox [1]. With no attribution. They also took the MIT License and changed it out for an AGPL.
If you inspect the source code, it's a verbatim copy. They literally just renamed the ZeroxOutput to DocumindOutput [2][3]
[1] https://github.com/getomni-ai/zerox
[2] https://github.com/DocumindHQ/documind/blob/main/core/src/ty...
[3] https://github.com/getomni-ai/zerox/blob/main/node-zerox/src...
Are there any reputation mechanisms or GitHub flagging systems to alert users to such scams?
It's pretty unethical behavior, if what you describe is the full story. As a user of many open-source projects, how can one become aware of this kind of behavior?
Hello. I apologize that it came across this way. This was not the intention. Zerox was definitely used, and I made sure to copy and include the MIT license exactly as it was inside the part of the code that uses Zerox.
If there's anything else I can do, please let me know and I will make all amendments immediately.
You took their code, did a search-and-replace on the product name, and relicensed the code as AGPL?
You're going to have to delete this thing and start over, man.
It appears that the MIT license was correctly included to apply to the zerox code used, while the AGPL license applies to their own code. Isn’t this how it should be?
For the MIT license to make sense it needs a copyright notice, and I don’t actually see one in the original license. It just says “The MIT License”, but then the text below references “the above copyright notice”, which doesn’t exist.
I think both sides here can learn from this: copyright notices are technically not required, but when the license text references one, including it is very useful. The original author should have added one. The user of the code could also have asked about the copyright. If this were to go to court, having the original license not make sense could create more questions than it should.
tl;dr: add a copyright line at the top of the file when you’re using the MIT license.
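Concretely, something like this at the top of each source file (name and year are placeholders):

```typescript
// Copyright (c) 2024 Jane Example
// SPDX-License-Identifier: MIT
```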
If you are looking for the latest/greatest in file processing, I'd recommend checking out vision language models. They generate embeddings of the images themselves (as a collection of patches), and you can see query matching displayed as a heatmap over the document. They pick up text that OCR misses. My company DataFog has an open-source demo if you want to try it out: https://github.com/DataFog/vlm-api
If you're looking for an all-in-one solution, a little plug for our new platform that does the above and also allows you to create custom 'patterns' that get picked up via semantic search. It uses open-source models by default and can deploy into your internal network: www.datafog.ai. In beta now and onboarding manually. Shoot me an email if you'd like to learn more!
That's not what [1] says, though? Quoth: "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt-in to share data with us, such as by providing feedback in the Playground). "
"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"
Not sure on the reliability; the ones I'm using all fail at structured data. If you want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome.)
We found that for extracting tables, OpenAI's LLMs aren't great. What is working well for us is Docling (https://github.com/DS4SD/docling/).
Haven't seen Docling before, it looks great! Thanks for sharing.
Agreed; extracting tables from PDFs using any of the available OpenAI models has been a waste of prompting time here too.
> That's not what [1] says, though?
Documind is using https://api.openai.com/v1/chat/completions, check the docs at the end of the long API table [1]:
> * Chat Completions:
> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention.
--
1: https://platform.openai.com/docs/models#how-we-use-your-data
Thanks for pointing that out!
It's still not used for training, though, and the retention period is 30 days. It's... a livable compromise for some (many) use cases.
I kind of get the abuse-policy reason for image inputs. Requiring 1-hour audio retention for multi-turn conversations makes sense, too. I'm just incredibly puzzled why schemas for structured outputs aren't eligible for zero retention.
It takes >50 seconds to generate these schemas for some pretty simple use-cases with large enums, for example. Imagine that latency added to each request...
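For context, this is roughly what such a request looks like with the OpenAI Node SDK; the invoice schema below is a made-up example, and it's this per-schema processing that doesn't qualify for zero retention:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical invoice schema; a large enum (imagine hundreds of values
// instead of three) is the kind of thing that makes schema processing slow.
const res = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Extract the invoice fields: ..." }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "invoice",
      strict: true,
      schema: {
        type: "object",
        properties: {
          number: { type: "string" },
          currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
          total: { type: "number" },
        },
        required: ["number", "currency", "total"],
        additionalProperties: false,
      },
    },
  },
});
console.log(res.choices[0].message.content);
```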
Gotcha, from what I could find online I think you are right. I was conflating data not covered by the zero-retention policy with data used for training.
OpenAI isn't retaining data sent via the API for training purposes. Stop.
OP, you've been accused of literally ripping off somebody's more popular repository and posing it as your own.
https://news.ycombinator.com/item?id=42178413
You may wanna get ahead of this because the evidence is fairly damning. Failing to even give credit to the original project is a pretty gross move.
Hi. This was definitely not the intention.
I made sure to copy and paste the MIT license in Zerox exactly as it was into the folder of the code that uses it. I also included it in the main license file as well. If there's anything I can do to make corrections, please let me know and I'll change it ASAP.
Your initial commit makes it look like you wrote all the code. https://github.com/DocumindHQ/documind/commit/d91121739df038... This is because you copied and uploaded the code instead of forking. You could do a lot by restoring attribution. Your history would look the same as https://github.com/getomni-ai/zerox/commits/main/ and diverge from where you forked.
People are getting upset because this is not a nice thing to do. Attribution is significant. No one would care if you replaced all the names with the new ones in a fork because they would see commits that do that.
Hi. Thank you for pointing this out. I totally understand now that forking would have kept the commit history visible and made the attribution clearer. I have since added a direct note in the repo acknowledging that it is built on the original Zerox project and also linked back to it. If there’s anything else you’d suggest, happy to hear it. Thanks again.
It would be better to attribute. You can still do this by fixing the git commit history and doing a force push. It would do a lot to make people feel better.
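A rough sketch of one way to do that, using the repo URLs from this thread (branch names are assumptions, and deletions between the two trees would need extra care):

```sh
# Start from the original zerox history so attribution is preserved
git remote add zerox https://github.com/getomni-ai/zerox.git
git fetch zerox
git checkout -b attributed zerox/main

# Bring in the current documind tree as a new commit on top of that history
git checkout main -- .
git commit -m "Rename zerox to documind and apply local changes"

# Replace the old history on GitHub
git push --force origin attributed:main
```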
Multimodal LLMs are not the way to do this for a business workflow yet.
In my experience you're much better off starting with Azure Document Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.
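A minimal sketch of that first step with the AWS SDK for JavaScript v3 (bucket and file names are placeholders; multi-page PDFs need the async StartDocumentAnalysis API instead):

```typescript
import {
  TextractClient,
  AnalyzeDocumentCommand,
} from "@aws-sdk/client-textract";

const client = new TextractClient({ region: "us-east-1" });

// Ask Textract for table and form structure, not just raw text.
const out = await client.send(new AnalyzeDocumentCommand({
  Document: { S3Object: { Bucket: "my-docs", Name: "bill-of-lading.png" } },
  FeatureTypes: ["TABLES", "FORMS"],
}));

// Blocks come back as a flat list (PAGE, LINE, TABLE, CELL, KEY_VALUE_SET...)
// that you reassemble, then hand to an LLM for the final shaping.
const tables = out.Blocks?.filter((b) => b.BlockType === "TABLE") ?? [];
console.log(`Found ${tables.length} table(s)`);
```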
> AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them.
Do they work for Bills of Lading yet? When I tested a sample of these bills a few years back (2022 I think), the results were not good at all. But I honestly wouldn't be surprised if they'd massively improved lately.
Have not used it on your docs, but I can say that it definitely works well with forms and forms with tables, like a Bill of Lading. It costs extra, but you need to turn on table extraction (at least in AWS). You can then get a markdown representation of that page, including tables; you can of course pull out the table itself, but unless it's standardized you will need the middleman LLM figuring out the exact data/structure you are looking for.
Huh, interesting. I'll have to try again next time I need to parse stuff like this.
Plus one, using the exact same setup at scale. If Azure Doc Intelligence gets too expensive, VLMs also work great.
What is a VLM?
Vision Language Model, like Qwen VL https://github.com/QwenLM/Qwen2-VL or ColPali https://huggingface.co/blog/manu/colpali
VLMs are cool - they generate embeddings of the images themselves (as a collection of patches), and you can see query matching displayed as a heatmap over the document. They pick up text that OCR misses. Here's an open-source API demo I built if you want to try it out: https://github.com/DataFog/vlm-api