Published 2017-06-12 04:57:13 | Source: reader submission
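jsoup HTML parser
jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string. It provides a very convenient API for extracting and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods.
jsoup 1.10.3 has been released. This version brings better CSS selector performance, Jsoup.Connection improvements, and other bug fixes.
Details: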
Improvements
Added Elements.eachText() and Elements.eachAttr(), which return a list of each Element's text or attribute values, respectively. This makes it simpler to, for example, get a list of each URL on a page: List<String> urls = doc.select("a").eachAttr("abs:href");
Improved selector validation for :contains(...) with unbalanced quotes.
Improved the speed of index based CSS selectors and other methods that use elementSiblingIndex, by a factor of 34x.
Added Node.clearAttributes(), to simplify removing all attributes of a Node / Element.
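The new Elements.eachAttr(), Elements.eachText(), and Node.clearAttributes() methods listed above can be tried out with a short sketch. It assumes jsoup 1.10.3 or later is on the classpath; the sample markup, the base URI, and the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupNewApiSketch {
    // Hypothetical sample markup and base URI (not from the release notes).
    static final String HTML =
        "<p id='intro' class='lead'>"
        + "<a href='guide.html'>Guide</a> <a href='/about'>About</a></p>";
    static final String BASE_URI = "https://example.com/docs/";

    // Collects the absolute URL of every link; "abs:href" resolves
    // relative hrefs against the document's base URI.
    static List<String> linkUrls() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("a").eachAttr("abs:href");
    }

    // Collects the text of every link that has text.
    static List<String> linkTexts() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("a").eachText();
    }

    // Removes all attributes from the <p> element and reports how many remain.
    static int attributeCountAfterClear() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        Element p = doc.select("p").first();
        p.clearAttributes();
        return p.attributes().size();
    }

    public static void main(String[] args) {
        System.out.println(linkUrls());
        System.out.println(linkTexts());
        System.out.println(attributeCountAfterClear());
    }
}
```

Passing a base URI to Jsoup.parse matters here: without one, relative links cannot be resolved to absolute URLs.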
Fixes
Bugfix: if an attribute name started or ended with a control character, the parse would fail with a validation exception.
Bugfix: Element.hasClass() and the .classname selector would not find the class attribute case-insensitively.
Bugfix: In Jsoup.Connection, if a redirect contained a query string with %xx escapes, they would be double escaped before the redirect was followed, leading to fetching an incorrect location.
Bugfix: In Jsoup.Connection, if a request body was set and the connection was redirected, the body would incorrectly still be sent.
Bugfix: In DataUtil, when detecting the character set from metadata and two Content-Types are defined, use the one that defines a character set.
Bugfix: when parsing unknown tags in case-sensitive HTML mode, end tags would not close scope correctly.
In Jsoup.Connection, ensure there is no Content-Type set when being redirected to a GET.
Bugfix: in certain locales (Turkish specifically), lowercasing and case insensitivity could fail for specific items.
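The Turkish-locale issue in the last fix is a general Java pitfall, and it can be reproduced with the JDK alone. The sketch below is not jsoup's actual code. It only illustrates one common cause of this class of bug: lowercasing with Turkish rules maps "I" to the dotless "ı" (U+0131), so a comparison against an ASCII name such as "title" fails. The class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleSafeLowercase {
    // Lowercases using the rules of the given locale. This is what the
    // no-argument toLowerCase() does with the JVM's default locale.
    static String lowerIn(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Locale-independent lowercasing, suitable for identifiers such as
    // tag or attribute names that are not natural-language text.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String tag = "TITLE";
        // Turkish rules produce a dotless i, so the match fails.
        System.out.println(lowerIn(tag, turkish).equals("title")); // prints false
        // A fixed locale gives the expected ASCII result everywhere.
        System.out.println(lowerRoot(tag).equals("title"));        // prints true
    }
}
```

Pinning the locale explicitly whenever identifiers are normalized keeps behavior identical regardless of the machine's regional settings.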
Download: https://jsoup.org/download