Extracting Text
Boilerpipe (https://code.google.com/p/boilerpipe/) for HTML
POI (http://poi.apache.org/index.html) for Word
PDFBox (http://pdfbox.apache.org/) for PDF