Python library to extract text from any file type compatible with Apache Tika. It falls back to OCR when text extraction from a PDF file fails.
- Download tika-server-1.7.jar from Apache Tika
- Install Ghostscript:
  - Mac: brew install ghostscript
  - Ubuntu: sudo apt-get install ghostscript
- Install Tesseract:
  - Mac: brew install tesseract
  - Ubuntu: sudo apt-get install tesseract-ocr
- Install xpdf (Mac) or poppler-utils (Ubuntu):
  - Mac: brew tap homebrew/x11 and brew install xpdf
  - Ubuntu: sudo apt-get install poppler-utils
- Install Python dependencies with
  pip install -r requirements.txt
These scripts assume that an instance of the Tika server is running.
Starting Tika Servers
java -jar tika-server-1.7.jar --port 9998
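Because the extractor expects the server to be up already, it can help to check the port before calling it. The following is a minimal sketch, not part of this library: the helper name `tika_server_reachable` is hypothetical, and it only confirms that something accepts TCP connections on the port used above (9998), not that it is actually Tika.

```python
import socket


def tika_server_reachable(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: it does not verify that the listener is Tika,
    only that the port given to `--port` is accepting connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

A script could call `tika_server_reachable()` first and print a reminder to start the server if it returns `False`.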
In a Python script:
from textextraction.extractors import text_extractor
text_extractor(doc_path=doc_path, force_convert=False)

In order to run tests:
- All requirements must be installed
- Both Tika servers need to be running
Tests are run with nose
Installation
pip install -r test-requirements.txt
Running tests
nosetests
When OCR is used, documents are converted to grayscale PNGs at 300 DPI using Ghostscript and then OCRed with Tesseract. The OCR settings are adapted from "Optimal Image Conversion Settings for Tesseract OCR" and the Free Law Project's CourtListener.
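The conversion described above can be sketched roughly as follows. This is an illustration under assumptions, not the library's actual code: the function names, the `page-%03d.png` naming pattern, and the `eng` language default are hypothetical, and the exact flags this project passes are not documented here. The Ghostscript options shown (`-sDEVICE=pnggray`, `-r300`) are the standard ones for grayscale PNG output at 300 DPI.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript call that renders each page as a grayscale PNG."""
    pattern = str(Path(out_dir) / "page-%03d.png")  # assumed naming scheme
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray",            # grayscale PNG output device
        "-r{}".format(dpi),            # resolution in DPI
        "-sOutputFile={}".format(pattern),
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract call; text is written to <image stem>.txt."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # Tesseract appends ".txt"
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, out_dir):
    """Render the PDF to PNGs, OCR each page, and return the joined text."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    pages = []
    for image in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Keeping the command construction separate from the `subprocess` calls makes the settings easy to inspect or adjust without having Ghostscript or Tesseract installed.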
