Apache Tika 1.7 发布，文本内容抽取集

发布于 2015-01-17 01:33:18 | 219 次阅读 | 评论: 0 | 来源: 网友投递

Apache Tika 内容抽取工具集合

Apache Tika 利用现有的解析类库，从不同格式的文档中（例如HTML, PDF, Doc)，侦测和提取出元数据和结构化内容。

Apache Tika 1.7 发布了，Tika是一个内容抽取的工具集合(a toolkit for text extracting)。它集成了POI, Pdfbox 并且为文本抽取工作提供了一个统一的界面。其次，Tika也提供了便利的扩展API，用来丰富其对第三方文件格式的支持。

该版本包含很多改进和 bug 修复，详细列表如下：

  * Fixed resource leak in OutlookPSTParser that caused TikaException 
    when invoked via AutoDetectParser on Windows (TIKA-1506).

  * HTML tags are properly stripped from content by FeedParser
    (TIKA-1500).

  * Tika Server support for selecting a single metadata key;
    wrapped MetadataEP into MetadataResource (TIKA-1499).

  * Tika Server support for JSON and XMP views of metadata (TIKA-1497).

  * Tika Parent uses dependency management to keep duplicate 
    dependencies in different modules the same version (TIKA-1384).

  * Upgraded slf4j to version 1.7.7 (TIKA-1496).

  * Tika Server support for RecursiveParserWrapper's JSON output
    (endpoint=rmeta) equivalent to (TIKA-1451's) -J option 
    in tika-app (TIKA-1498).

  * Tika Server support for providing the password for files on a 
    per-request basis through the Password http header (TIKA-1494).

  * Simple support for the BPG (Better Portable Graphics) image format
    (TIKA-1491, TIKA-1495).

  * Prevent exceptions from being thrown for some malformed
    mp3 files (TIKA-1218).

  * Reformat pom.xml files to use two spaces per indent (TIKA-1475).

  * Fix warning of slf4j logger on Tika Server startup (TIKA-1472).

  * Tika CLI and GUI now have option to view JSON rendering of output
    of RecursiveParserWrapper (TIKA-1451).

  * Tika now integrates the Geospatial Data Abstraction Library
    (GDAL) for parsing hundreds of geospatial formats (TIKA-605,
    TIKA-1503).

  * ExternalParsers can now use Regexs to specify dynamic keys
   (TIKA-1441).

  * Thread safety issues in ImageMetadataExtractor were resolved
    (TIKA-1369).
 
  * The ForkParser service is now registered in Activator
    (TIKA-1354).

  * The Rome Library was upgraded to version 1.5 (TIKA-1435).

  * Add markup for files embedded in PDFs (TIKA-1427).
 
  * Extract files embedded in annotations in PDFS (TIKA-1433).

  * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).

  * Add RecursiveParserWrapper (aka Jukka's and Nick's) 
    RecursiveMetadataParser (TIKA-1329)

  * Add example for how to dump TikaConfig to XML (TIKA-1418).

  * Allow users to specify a tika config file for tika-app (TIKA-1426).

  * PackageParser includes the last-modified date from the archive
    in the metadata, when handling embedded entries (TIKA-1246)

  * Created a new Tesseract OCR Parser to extract text from images.
    Requires installation of Tesseract before use (TIKA-93).

  * Basic parser for older Excel formats, such as Excel 4, 5 and 95,
    which can get simple text, and metadata for Excel 5+95 (TIKA-1490)

Apache Tika 1.7 发布，文本内容抽取集

Apache Tika 内容抽取工具集合

后端技术

前端技术

数据库

热门框架

常用IDE

其他