Extracting Text

Boilerpipe (https://code.google.com/p/boilerpipe/) for HTML

Apache POI (http://poi.apache.org/index.html) for Microsoft Word documents

Apache PDFBox (http://pdfbox.apache.org/) for PDF
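
One way to put the list above to work is to pick the library from the file extension before handing the file off for extraction. The sketch below is only an illustration: the class ExtractorChooser, the method libraryFor, and the extension-to-library mapping are assumptions made for this example and are not part of Boilerpipe, POI, or PDFBox. It uses only the Java standard library and returns the library name rather than calling any extraction API.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Boilerpipe, POI, or PDFBox.
final class ExtractorChooser {
    private ExtractorChooser() {}

    /**
     * Returns the name of the library from the list above that would handle
     * the given file, judged by extension alone, or null if none applies.
     */
    static String libraryFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to go on
        }
        // Normalize case so "REPORT.PDF" and "report.pdf" are treated alike.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "html":
            case "htm":
                return "Boilerpipe";
            case "doc":
            case "docx":
                return "POI";
            case "pdf":
                return "PDFBox";
            default:
                return null; // assumed: anything else is out of scope
        }
    }
}
```

For example, ExtractorChooser.libraryFor("report.pdf") yields "PDFBox", and a file with no recognized extension yields null. Extension matching is a simplification; checking the actual content type would be more robust in practice.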