Published 2015-01-17 01:33:18 | Source: reader submission
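Apache Tika: a content extraction toolkit
Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in many formats, such as HTML, PDF, and DOC.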
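Apache Tika 1.7 has been released. Tika is a toolkit for text extraction: it integrates POI and PDFBox and provides a unified interface for text extraction work. Tika also offers a convenient extension API for adding support for third-party file formats.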
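The idea of one interface in front of several format-specific libraries can be pictured with a small, self-contained sketch. It is not Tika's actual API: `ExtractorFacade`, `TextExtractor`, and the two toy backends are hypothetical names invented for illustration, standing in for the way a facade might delegate to libraries such as POI or PDFBox. The sketch also picks a backend from the file extension only, which is a simplifying assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are part of Apache Tika.
public class ExtractorFacade {

    /** One extraction contract shared by every format-specific backend. */
    public interface TextExtractor {
        String extract(String rawContent);
    }

    // Maps a lower-cased file extension to the backend that handles it.
    private final Map<String, TextExtractor> extractors = new LinkedHashMap<>();

    /** Extension point: register a backend for an additional format. */
    public void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(), extractor);
    }

    /** Single entry point: choose a backend from the file name, then delegate. */
    public String extract(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
        TextExtractor extractor = extractors.get(extension);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(rawContent);
    }

    public static void main(String[] args) {
        ExtractorFacade facade = new ExtractorFacade();
        // Toy backends standing in for real parsing libraries.
        facade.register("html", raw -> raw.replaceAll("<[^>]*>", "").trim());
        facade.register("txt", String::trim);

        System.out.println(facade.extract("page.html", "<p>Hello <b>Tika</b></p>"));
        System.out.println(facade.extract("notes.TXT", "  plain text  "));
    }
}
```

Callers only ever see `extract`, so supporting another format means registering one more backend rather than changing calling code.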
This release includes many improvements and bug fixes. The full list follows:
* Fixed resource leak in OutlookPSTParser that caused TikaException when invoked via AutoDetectParser on Windows (TIKA-1506).
* HTML tags are properly stripped from content by FeedParser (TIKA-1500).
* Tika Server support for selecting a single metadata key; wrapped MetadataEP into MetadataResource (TIKA-1499).
* Tika Server support for JSON and XMP views of metadata (TIKA-1497).
* Tika Parent uses dependency management to keep duplicate dependencies in different modules at the same version (TIKA-1384).
* Upgraded slf4j to version 1.7.7 (TIKA-1496).
* Tika Server support for RecursiveParserWrapper's JSON output (endpoint=rmeta) equivalent to (TIKA-1451's) -J option in tika-app (TIKA-1498).
* Tika Server support for providing the password for files on a per-request basis through the Password http header (TIKA-1494).
* Simple support for the BPG (Better Portable Graphics) image format (TIKA-1491, TIKA-1495).
* Prevent exceptions from being thrown for some malformed mp3 files (TIKA-1218).
* Reformat pom.xml files to use two spaces per indent (TIKA-1475).
* Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
* Tika CLI and GUI now have option to view JSON rendering of output of RecursiveParserWrapper (TIKA-1451).
* Tika now integrates the Geospatial Data Abstraction Library (GDAL) for parsing hundreds of geospatial formats (TIKA-605, TIKA-1503).
* ExternalParsers can now use regexes to specify dynamic keys (TIKA-1441).
* Thread safety issues in ImageMetadataExtractor were resolved (TIKA-1369).
* The ForkParser service is now registered in Activator (TIKA-1354).
* The Rome Library was upgraded to version 1.5 (TIKA-1435).
* Add markup for files embedded in PDFs (TIKA-1427).
* Extract files embedded in annotations in PDFs (TIKA-1433).
* Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
* Add RecursiveParserWrapper (aka Jukka's and Nick's) RecursiveMetadataParser (TIKA-1329).
* Add example for how to dump TikaConfig to XML (TIKA-1418).
* Allow users to specify a tika config file for tika-app (TIKA-1426).
* PackageParser includes the last-modified date from the archive in the metadata, when handling embedded entries (TIKA-1246).
* Created a new Tesseract OCR Parser to extract text from images. Requires installation of Tesseract before use (TIKA-93).
* Basic parser for older Excel formats, such as Excel 4, 5 and 95, which can get simple text, and metadata for Excel 5+95 (TIKA-1490).
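
Two of the Tika Server items above, the `rmeta` endpoint and the per-request `Password` http header, can be combined in a single request. The following is a minimal sketch using the JDK's `java.net.http` API (Java 11 or later) that only builds the request and does not send it. The `rmeta` path and the `Password` header name come from the list above; the use of `PUT`, the `Accept` header, the `http://localhost:9998` address, and the class and method names are assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: class and method names are hypothetical.
public class RmetaRequestSketch {

    /** Builds (but does not send) a request for the rmeta endpoint. */
    public static HttpRequest build(String baseUrl, byte[] document, String password) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/rmeta"))
                // Assumption: the server accepts the document as a PUT body
                // and returns JSON when asked for it.
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
        if (password != null) {
            // Per-request password for protected files (TIKA-1494).
            builder.header("Password", password);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Assumed values: address, payload, and password are placeholders.
        HttpRequest request = build("http://localhost:9998",
                "placeholder bytes".getBytes(StandardCharsets.UTF_8), "secret");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Password header set: "
                + request.headers().firstValue("Password").isPresent());
    }
}
```

To actually send the request, it could be passed to `java.net.http.HttpClient.send` against a running Tika Server instance.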