Wikipedia

Comparison of HTML parsers

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offering an interface that lets programmers easily access and modify the HTML source. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indentation style of the resulting markup. Canonical example: HTML Tidy.
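
The "access" half of HTML traversal can be illustrated with a minimal sketch. It assumes Python's standard-library html.parser module, which is an event-based parser rather than a DOM parser and is not one of the parsers compared here. The names LinkCollector and extract_links are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> elements in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the event-based approach above only reports elements as they are encountered.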

| Parser         | License            | Implementation language(s) | Latest date*  | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|----------------|--------------------|----------------------------|---------------|-----------------|-------------------------|--------------|----------------|
| HTML Tidy      | W3C license        | ANSI C                     | 2021-07-17[2] | Yes[3]          | Yes                     | Yes[3]       | Yes            |
| HtmlUnit       | Apache License 2.0 | Java                       | 2023-10-31[4] | Yes             | ?                       | No           | No             |
| Beautiful Soup | MIT License        | Python                     | 2023-04-07[5] | Yes             | Yes                     | ?            | No             |
| jsoup          | MIT License        | Java                       | 2023-11-27[6] | Yes             | Yes                     | Yes          | Yes            |

* Date of the latest release containing significant changes.
** Sanitizing (generating standards-compliant web pages, reducing spam, etc.) and cleaning (stripping out surplus presentational tags, removing XSS code, etc.) HTML code.
*** Updating HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
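
The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows. This is a simplified illustration built on Python's standard-library html.parser, not the method used by any parser in the table. The names CenterRewriter and rewrite_center are hypothetical, and the sketch assumes that any attributes on the CENTER tag itself can be discarded.

```python
from html import escape
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Hypothetical helper: re-emit markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written so they can be passed through.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _format_attrs(attrs):
        parts = []
        for name, value in attrs:
            if value is None:
                parts.append(f" {name}")  # bare attribute, e.g. "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: attributes on <center> are dropped.
            self.out.append('<div style="text-align:center;">')
        else:
            self.out.append(f"<{tag}{self._format_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._format_attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_center(html):
    """Return html with every <center> element turned into a centered <div>."""
    rewriter = CenterRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

Because tag names are lowercased on output and the markup is re-serialized rather than copied, the result is equivalent but not byte-identical to the input. A dedicated tool such as HTML Tidy handles many more cases than this sketch.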

References

  1. ^ 12.2 Parsing HTML documents, HTML Standard. Archived 2013-01-16 at the Wayback Machine.
  2. ^ HTML Tidy release 5.8.0
  3. ^ a b What is Tidy?
  4. ^ HtmlUnit 3.7.0
  5. ^ Beautiful Soup release 4.10
  6. ^ jsoup Java HTML Parser release 1.17.1
