Published 2015-04-21 00:54:30 | Source: reader submission
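Apache Tika: a content extraction toolkit
Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in many formats (for example HTML, PDF, and DOC).
Apache Tika 1.8 has been released. The main changes in this version are: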
Fix null pointer when processing ODT footer styles (TIKA-1600).
Upgrade com.drewnoakes' metadata-extractor to 2.8.0 and
add parser for webp metadata (TIKA-1594).
Duration extracted from MP3s with no ID3 tags (TIKA-1589).
Upgraded to PDFBox 1.8.9 (TIKA-1575).
Tika now supports the IsaTab data standard for bioinformatics
both in terms of MIME identification and in terms of parsing
(TIKA-1580).
Tika server can now enable CORS requests with the command line
"--cors" or "-C" option (TIKA-1586).
Update jhighlight dependency to avoid using LGPL license. Thanks to
@kkrugler for his great contribution (TIKA-1581).
Updated HDF and NetCDF parsers to output file version in
metadata (TIKA-1578 and TIKA-1579).
Upgraded to POI 3.12-beta1 (TIKA-1531).
Added tika-batch module for directory to directory batch
processing. This is a new, experimental capability, and the API will
likely change in future releases (TIKA-1330).
Translator.translate() Exceptions are now restricted to
TikaException and IOException (TIKA-1416).
Tika now supports MIME detection for Microsoft Enhanced
Metafiles (EMF) (TIKA-1554).
Tika has improved delineation in XML and HTML MIME detection
(TIKA-1365).
Upgraded the Drew Noakes metadata-extractor to version 2.7.2
(TIKA-1576).
Added basic style support for ODF documents, contributed by
Axel Dörfler (TIKA-1063).
Move Tika server resources and writers to separate
org.apache.tika.server.resource and writer packages (TIKA-1564).
Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).
Fix Paths in Tika server welcome page (TIKA-1567).
Fixed infinite recursion while parsing some PDFs (TIKA-1038).
XHTMLContentHandler now properly passes along body attributes,
contributed by Markus Jelsma (TIKA-995).
TikaCLI option --compare-file-magic to report mime types known to
the file(1) tool but not known / fully known to Tika.
MediaTypeRegistry support for returning known child types.
Support for excluding (blacklisting) certain Parsers from being
used by DefaultParser via the Tika Config file, using the new
parser-exclude tag (TIKA-1558).
For details, see the release page.
This version is now available for download:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip
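Tika is a toolkit for text extraction. It integrates POI and PDFBox and puts a single, unified interface in front of them for text extraction work. Tika also provides a convenient extension API for adding support for third-party file formats.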
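Tika's API is compact. Its core is the Parser interface, which defines a parse method:
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
The stream argument supplies the file to be parsed, the extracted text is passed to handler as SAX events, and the document's metadata is written into metadata. In Tika 1.x the method also takes a ParseContext argument, which carries per-parse configuration to the parser.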
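The division of labour between the stream, the handler, and the metadata can be illustrated with a small sketch that compiles against the JDK alone. Everything named here (SketchParser, PlainTextSketchParser, TextCollector, ParseContractDemo, and the use of a plain Map for metadata) is a hypothetical stand-in introduced for illustration and is not part of Tika; only org.xml.sax.ContentHandler is the same type that Tika's Parser uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser contract; this is NOT Tika's Parser interface. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Toy parser that treats its input as UTF-8 plain text. */
class PlainTextSketchParser implements SketchParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Read the whole stream into memory (fine for a sketch, not for large files).
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        byte[] bytes = buffer.toByteArray();
        char[] chars = new String(bytes, StandardCharsets.UTF_8).toCharArray();

        // Metadata is updated in place; the caller reads it after parse returns.
        metadata.put("Content-Type", "text/plain; charset=UTF-8");
        metadata.put("Content-Length", Integer.toString(bytes.length));

        // Text content is delivered to the handler as SAX events.
        handler.startDocument();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }
}

/** Handler that collects every characters() event into one string. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }
}

class ParseContractDemo {
    public static void main(String[] args) throws Exception {
        InputStream stream =
                new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        TextCollector handler = new TextCollector();
        Map<String, String> metadata = new LinkedHashMap<>();

        new PlainTextSketchParser().parse(stream, handler, metadata);

        System.out.println(handler.text());
        System.out.println(metadata);
    }
}
```

The caller owns both the handler and the metadata object, so one parse call can feed any SAX consumer (a text collector, an indexer, a serializer) without the parser knowing which one it is.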
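Tika's ParserUtils utility can return a suitable Parser for a file based on its MIME type. Alternatively, Tika provides AutoDetectParser, which finds a suitable Parser by inspecting format-specific byte signatures in the file (for example, its magic bytes).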
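To make the idea of detection by magic bytes concrete, here is a minimal sketch that uses only the JDK. It is not how Tika implements detection: the MagicSniffer class and its three-entry signature table are assumptions made for illustration, and a real detector knows far more formats and also looks beyond the first few bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of magic-byte sniffing; NOT part of Tika. */
class MagicSniffer {
    // Well-known leading byte signatures for three common formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Returns a MIME type guessed from the first bytes of a file. */
    static String detect(byte[] prefix) {
        if (startsWith(prefix, PDF)) {
            return "application/pdf";
        }
        if (startsWith(prefix, PNG)) {
            return "image/png";
        }
        if (startsWith(prefix, ZIP)) {
            return "application/zip";
        }
        // Unknown signature: fall back to the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

Once a MIME type has been guessed this way, choosing a parser reduces to a lookup from type to parser, which is the step the text above attributes to AutoDetectParser.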