Datashare is a self-hosted search engine for documents. It uses Apache Tika and Tesseract OCR to read more than a thousand file formats. It is developed by the International Consortium of Investigative Journalists (ICIJ), known for its investigations into the offshore world (Pandora Papers, Panama Papers, etc.).
It also provides:
- Many search filters (file type, creation date, language, tags, etc.)
- Search in batch (with a CSV)
- Search results download
- Tagging and recommendation
- Named-entity recognition with CoreNLP
- Optical character recognition (OCR) with Tesseract
After the installation, open a terminal and use the following command to start Datashare:
datashare
Datashare should now be available on http://localhost:8080 🚀
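Datashare may take a moment to start listening after the command is launched. The Python sketch below polls the address above until something answers. It is not part of Datashare: the function name `wait_until_up` and its `timeout` and `interval` parameters are made up for this illustration, and it only checks that an HTTP server responds, not that the responder is Datashare.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url="http://localhost:8080", timeout=60.0, interval=1.0):
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass.

    Hypothetical helper for illustration; not shipped with Datashare.
    """
    # The server is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except OSError:
            # Connection refused or timed out: nothing is listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_up()` with its defaults waits up to a minute for http://localhost:8080 and returns `True` as soon as the server responds, or `False` if it never does.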