Published 2016-11-11 01:20:58 | Source: reader submission
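Apache Tika: a content extraction toolkit
Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in many formats (for example HTML, PDF, and DOC).
Apache Tika 1.14 has been released. This version includes a number of improvements and bug fixes.
The changes are as follows: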
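As a rough illustration of the detect/extract workflow described above, the following Python sketch prepares requests for a tika-server instance. It assumes a server is already running at `http://localhost:9998` (the default tika-server port); the helper names (`build_request`, `detect_request`, `metadata_request`, `text_request`, `send`) are hypothetical and exist only for this example.

```python
import urllib.request

# Assumption: a tika-server instance is listening here (9998 is its default port).
TIKA_SERVER = "http://localhost:9998"


def build_request(endpoint, data, accept):
    """Build (but do not send) a PUT request for a tika-server endpoint."""
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def detect_request(data):
    # /detect/stream responds with the detected media type.
    return build_request("detect/stream", data, "text/plain")


def metadata_request(data):
    # /meta responds with the document metadata; JSON is requested here.
    return build_request("meta", data, "application/json")


def text_request(data):
    # /tika responds with the extracted text content.
    return build_request("tika", data, "text/plain")


def send(request):
    """Send a prepared request and return the decoded response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, something like `send(metadata_request(open("report.pdf", "rb").read()))` would return the metadata of a (hypothetical) `report.pdf`. Building the requests separately from sending them keeps the sketch easy to inspect without a live server.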
Extract all headers from MSG/RFC822 (TIKA-2122).
Upgrade metadata-extractor to 2.9.1 (TIKA-2113).
Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).
Extract macros from MSOffice files (TIKA-2069).
Maintain passed-in mime in TXTParser (TIKA-2047).
Upgrade to POI 3.15 (TIKA-2013).
Upgrade to PDFBox 2.0.3 (TIKA-2051).
Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078).
Tika is now integrated with Google's TensorFlow library and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).
Parser configuration is now type-safe and parameters for parsers can have assigned types (TIKA-1508, TIKA-1986).
Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).
Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).
Upgrade to Jackcess 2.1.4 (TIKA-2039).
Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).
Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).
Improve the accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).
Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).
Add parser for applefile (AppleSingle) (TIKA-2022).
Add mime types, mime magic and/or globs for:
Endnote Import File (TIKA-2011)
DJVU files (TIKA-2009)
MS Owner File (TIKA-2008)
Windows Media Metafile (TIKA-2004)
iCal and vCalendar (TIKA-2006)
MBOX (TIKA-2042)
Stata DTA (TIKA-2064)
Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).
Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).
Add mime detection (via Nick C) and a parser for DBF files (TIKA-1513).
Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).
Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).
Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358).
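Several items in the list above add "mime magic" and "globs" for new file types. The Python sketch below shows the general idea behind those two detection signals: match well-known leading bytes first, then fall back to filename patterns. It is a simplified illustration, not Tika's implementation. Tika's real rules live in its `tika-mimetypes.xml` registry and support priorities and byte offsets that are ignored here. The `RULES` table and `detect` function are hypothetical.

```python
import fnmatch

# Illustrative rules only: (media type, magic bytes at offset 0, filename globs).
# The magic values are the well-known leading bytes of each format.
RULES = [
    ("application/pdf", b"%PDF-", ["*.pdf"]),
    ("image/vnd.djvu", b"AT&TFORM", ["*.djvu", "*.djv"]),
    ("application/mbox", b"From ", ["*.mbox"]),
    ("text/calendar", b"BEGIN:VCALENDAR", ["*.ics", "*.ifb"]),
]
FALLBACK = "application/octet-stream"


def detect(data, filename=None):
    """Guess a media type from leading bytes, then from the filename."""
    # Magic bytes are the stronger signal, so they are checked first.
    for media_type, magic, _globs in RULES:
        if data.startswith(magic):
            return media_type
    # Fall back to case-insensitive glob matching on the filename.
    if filename:
        name = filename.lower()
        for media_type, _magic, globs in RULES:
            if any(fnmatch.fnmatch(name, pattern) for pattern in globs):
                return media_type
    return FALLBACK
```

Checking content before the filename means a mislabeled file (for example, a PDF saved with an `.ics` extension) is still classified by what it actually contains.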
Download link: