MAN page from openSUSE Tumbleweed perl-HTML-Tree-5.07-1.4.noarch.rpm

HTML::TreeBuilder

Section: User Contributed Perl Documentation (3)
Updated: 2017-09-01

NAME

HTML::TreeBuilder - Parser that builds a HTML syntax tree 

VERSION

This document describes version 5.07 of HTML::TreeBuilder, released August 31, 2017 as part of HTML-Tree.

SYNOPSIS

  use HTML::TreeBuilder 5 -weak; # Ensure weak references in use

  foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    $tree->dump; # a method we inherit from HTML::Element
    print "And here it is, bizarrely rerendered as HTML:\n",
      $tree->as_HTML, "\n";

    # Now that we're done with it, we must destroy it.
    # $tree = $tree->delete; # Not required with weak references
  }
 

DESCRIPTION

(This class is part of the HTML::Tree dist.)

This class is for HTML syntax trees that get built out of HTML source. The way to use it is to:

1. start a new (empty) HTML::TreeBuilder object,

2. then use one of the methods from HTML::Parser (presumably with "$tree->parse_file($filename)" for files, or with "$tree->parse($document_content)" and "$tree->eof" if you've got the content in a string) to parse the HTML document into the tree $tree.

(You can combine steps 1 and 2 with the ``new_from_file'' or ``new_from_content'' methods.)

2b. call "$root->elementify()" if you want.

3. do whatever you need to do with the syntax tree, presumably involving traversing it looking for some bit of information in it,

4. previous versions of HTML::TreeBuilder required you to call "$tree->delete()" to erase the contents of the tree from memory when you're done with the tree. This is not normally required anymore. See ``Weak References'' in HTML::Element for details.
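The steps above can be sketched as follows. This is only a minimal illustration: the HTML string and the variable names are made up for the example, and the traversal relies on look_down, as_text, and attr, which are inherited from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Made-up sample document, used only for this sketch.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>See <a href="http://example.com/">this link</a>.</p></body></html>';

# Steps 1 and 2 combined: build the tree straight from a string.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Step 3: look for information using methods inherited from HTML::Element.
my $title = $tree->look_down(_tag => 'title');   # scalar context: first match
my @links = $tree->look_down(_tag => 'a');       # list context: all matches

print "Title: ", $title->as_text, "\n";
print "Link:  ", $_->attr('href'), "\n" for @links;

# Step 4: no explicit $tree->delete, because -weak was requested above.
```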

ATTRIBUTES

Most of the following attributes native to HTML::TreeBuilder control how parsing takes place; they should be set before you try parsing into the given object. You can set the attributes by passing a TRUE or FALSE value as argument. E.g., "$root->implicit_tags" returns the current setting for the "implicit_tags" option, "$root->implicit_tags(1)" turns that option on, and "$root->implicit_tags(0)" turns it off.
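As a small sketch of the accessor convention just described (the variable names are illustrative only), the following reads an option, turns it off, and reads it again, all before any parsing has taken place:

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

my $root = HTML::TreeBuilder->new;

# No argument: just report the current setting.
my $default = $root->implicit_tags;

# A false argument turns the option off; read it back to confirm.
$root->implicit_tags(0);
my $changed = $root->implicit_tags;

# Turn it back on before actually parsing anything.
$root->implicit_tags(1);

print "default: ", ($default ? 'on' : 'off'), "\n";
print "changed: ", ($changed ? 'on' : 'off'), "\n";
```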

implicit_tags

Setting this attribute to true will instruct the parser to try to deduce implicit elements and implicit end tags. If it is false you get a parse tree that just reflects the text as it stands, which is unlikely to be useful for anything but quick and dirty parsing. (In fact, I'd be curious to hear from anyone who finds it useful to have "implicit_tags" set to false.) Default is true.

Implicit elements have the ``implicit'' attribute set (see ``implicit'' in HTML::Element).

implicit_body_p_tag

This controls an aspect of implicit element behavior, if "implicit_tags" is on: If a text element (PCDATA) or a phrasal element (such as "<em>") is to be inserted under "<body>", two things can happen: if "implicit_body_p_tag" is true, it's placed under a new, implicit "<p>" tag. (Past DTDs suggested this was the only correct behavior, and this is how past versions of this module behaved.) But if "implicit_body_p_tag" is false, nothing is implicated --- the PCDATA or phrasal element is simply placed under "<body>". Default is false.
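To make the difference concrete, here is a hedged sketch that parses the same made-up fragment twice and reports what ends up as the first child of "<body>". The helper name first_body_child is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: parse a fixed fragment with implicit_body_p_tag
# set to $flag and describe the first child of <body>.
sub first_body_child {
    my ($flag) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_body_p_tag($flag);    # must be set before parsing
    $tree->parse_content('<body>Hello world</body>');
    my $body = $tree->look_down(_tag => 'body');
    my ($first) = $body->content_list;
    return ref $first ? 'element <' . $first->tag . '>' : "text '$first'";
}

my $off = first_body_child(0);
my $on  = first_body_child(1);

print "implicit_body_p_tag off: $off\n";
print "implicit_body_p_tag on:  $on\n";
```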

no_expand_entities

This attribute controls whether entities are decoded during the initial parse of the source. Enable this if you don't want entities decoded to their character value. For example, '&amp;' is decoded to '&' by default, but will be unchanged if this is enabled. Default is false (entities will be decoded).
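A short sketch of the effect, assuming the attribute is set through its accessor before parsing as described under ATTRIBUTES. The helper name text_of_p and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: parse one paragraph and return its text content.
sub text_of_p {
    my ($no_expand) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->no_expand_entities($no_expand);    # set before parsing
    $tree->parse_content('<p>Fish &amp; chips</p>');
    return $tree->look_down(_tag => 'p')->as_text;
}

my $decoded = text_of_p(0);    # default: entities are decoded
my $raw     = text_of_p(1);    # entities left as written in the source

print "decoded: $decoded\n";
print "raw:     $raw\n";
```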

ignore_unknown

This attribute controls whether unknown tags should be represented as elements in the parse tree, or whether they should be ignored. Default is true (to ignore unknown tags).
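The following sketch shows both settings side by side. The tag name "gadget" is an invented, non-HTML tag chosen for the example, and unknown_tag_status is a hypothetical helper.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: report whether the invented <gadget> tag
# made it into the tree as an element.
sub unknown_tag_status {
    my ($ignore) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown($ignore);    # set before parsing
    $tree->parse_content('<body><gadget>hi</gadget></body>');
    my $hit = $tree->look_down(_tag => 'gadget');
    return $hit ? 'present' : 'absent';
}

my $ignored = unknown_tag_status(1);    # the default
my $kept    = unknown_tag_status(0);

print "ignore_unknown on:  $ignored\n";
print "ignore_unknown off: $kept\n";
```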

ignore_text

Do not represent the text content of elements. This saves space if all you want is to examine the structure of the document. Default is false.

ignore_ignorable_whitespace

If set to true, TreeBuilder will try to avoid creating ignorable whitespace text nodes in the tree. Default is true. (In fact, I'd be interested in hearing if there's ever a case where you need this off, or where leaving it on leads to incorrect behavior.)

no_space_compacting

This determines whether TreeBuilder compacts all whitespace strings in the document (well, outside of PRE or TEXTAREA elements), or leaves them alone. Normally (default, value of 0), each string of contiguous whitespace in the document is turned into a single space. But that's not done if "no_space_compacting" is set to 1.

Setting "no_space_compacting" to 1 might be useful if you want to read in a tree just to make some minor changes to it before writing it back out.

This method is experimental. If you use it, be sure to report any problems you might have with it.
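As an illustration only, the sketch below parses a paragraph containing runs of spaces with compacting on and off. The helper name p_text and the sample text are made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the text of the single <p> in a sample.
sub p_text {
    my ($no_compact) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->no_space_compacting($no_compact);    # set before parsing
    $tree->parse_content('<p>two   wide    gaps</p>');
    return $tree->look_down(_tag => 'p')->as_text;
}

my $compacted = p_text(0);    # default: whitespace runs become one space
my $preserved = p_text(1);    # whitespace left as written

print "compacted: [$compacted]\n";
print "preserved: [$preserved]\n";
```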

p_strict

If set to true (and it defaults to false), TreeBuilder will take a narrower than normal view of what can be under a "<p>" element; if it sees a non-phrasal element about to be inserted under a "<p>", it will close that "<p>". Otherwise it will close "<p>" elements only for other "<p>"'s, headings, and "<form>" (although the latter may be removed in future versions).

For example, when going through this snippet of code,

  <p>stuff
  <ul>

TreeBuilder will normally (with "p_strict" false) put the "<ul>" element under the "<p>" element. However, with "p_strict" set to true, it will close the "<p>" first.

In theory, there should be strictness options like this for other/all elements besides just "<p>"; but I treat this as a special case simply because of the fact that "<p>" occurs so frequently and its end-tag is omitted so often; and also because application of strictness rules at parse-time across all elements often makes tiny errors in HTML coding produce drastically bad parse-trees, in my experience.

If you find that you wish you had an option like this to enforce content-models on all elements, then I suggest that what you want is content-model checking as a stage after TreeBuilder has finished parsing.
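The "<p>stuff <ul>" case above can be checked with a sketch like the following. The helper name ul_parent_tag is hypothetical, and a list item has been added to the snippet so that the "<ul>" is not empty.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: return the tag name of the <ul>'s parent element.
sub ul_parent_tag {
    my ($strict) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->p_strict($strict);    # set before parsing
    $tree->parse_content('<p>stuff<ul><li>item</li></ul>');
    return $tree->look_down(_tag => 'ul')->parent->tag;
}

my $lenient = ul_parent_tag(0);    # default behavior
my $strict  = ul_parent_tag(1);    # <p> is closed before the <ul>

print "p_strict off: <ul> is under <$lenient>\n";
print "p_strict on:  <ul> is under <$strict>\n";
```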

store_comments

This determines whether TreeBuilder will normally store comments found while parsing content into $root. Currently, this is off by default.
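A hedged sketch of turning comment storage on. It relies on the HTML::Element convention that comments are represented as pseudo-elements with the tag "~comment" and their content in the "text" attribute; the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new;
$tree->store_comments(1);    # must be set before parsing
$tree->parse_content('<body><p>text</p><!-- reviewed --></body>');

# Comments are stored as "~comment" pseudo-elements (see HTML::Element).
my @comments = $tree->look_down(_tag => '~comment');
my @texts    = map { $_->attr('text') } @comments;

print "Found ", scalar(@texts), " comment(s)\n";
print "Comment text: [$_]\n" for @texts;
```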

store_declarations

This determines whether TreeBuilder will normally store markup declarations found while parsing content into $root. This is on by default.

store_pis

This determines whether TreeBuilder will normally store processing instructions found while parsing content into $root --- assuming a recent version of HTML::Parser (old versions won't parse PIs correctly). Currently, this is off (false) by default.

It is somewhat of a known bug (to be fixed one of these days, if anyone needs it?) that PIs in the preamble (before the "<html>" start-tag) end up actually under the "<html>" element.

warn

This determines whether syntax errors during parsing should generate warnings, emitted via Perl's "warn" function.

This is off (false) by default. 

METHODS

Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation --- probably only its parse and parse_file methods are of interest.

new_from_file

  $root = HTML::TreeBuilder->new_from_file($filename_or_filehandle);

This ``shortcut'' constructor merely combines constructing a new object (with the ``new'' method, below), and calling "$new->parse_file(...)" on it. Returns the new object. Note that this provides no way of setting any parse options like "store_comments" (for that, call "new", and then set options, before calling "parse_file"). See the notes (below) on parameters to ``parse_file''.

If HTML::TreeBuilder is unable to read the file, then "new_from_file" dies. The error can also be found in $!. (This behavior is new in HTML-Tree 5. Previous versions returned a tree with only implicit elements.)

new_from_content

  $root = HTML::TreeBuilder->new_from_content(...);

This ``shortcut'' constructor merely combines constructing a new object (with the ``new'' method, below), and calling "for(...){$new->parse($_)}" and "$new->eof" on it. Returns the new object. Note that this provides no way of setting any parse options like "store_comments" (for that, call "new", and then set options, before calling "parse"). Example usages: "HTML::TreeBuilder->new_from_content(@lines)", or "HTML::TreeBuilder->new_from_content($content)".
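For instance, the "@lines" form might look like the sketch below. The list content is invented, and as_trimmed_text (from HTML::Element) is used so the result does not depend on how surrounding whitespace was handled.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Made-up input, passed as a list of lines rather than one string.
my @lines = ("<ul>\n", "  <li>one\n", "  <li>two\n", "</ul>\n");

my $tree  = HTML::TreeBuilder->new_from_content(@lines);
my @items = map { $_->as_trimmed_text } $tree->look_down(_tag => 'li');

print "Item: $_\n" for @items;
```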

new_from_url

  $root = HTML::TreeBuilder->new_from_url($url)

This ``shortcut'' constructor combines constructing a new object (with the ``new'' method, below), loading LWP::UserAgent, fetching the specified URL, and calling "$new->parse( $response->decoded_content)" and "$new->eof" on it. Returns the new object. Note that this provides no way of setting any parse options like "store_comments".

If LWP is unable to fetch the URL, or the response is not HTML (as determined by ``content_is_html'' in HTTP::Headers), then "new_from_url" dies, and the HTTP::Response object is found in $HTML::TreeBuilder::lwp_response.

You must have installed LWP::UserAgent for this method to work. LWP is not installed automatically, because it's a large set of modules and you might not need it.

new

  $root = HTML::TreeBuilder->new();

This creates a new HTML::TreeBuilder object. This method takes no attributes.

parse_file

 $root->parse_file(...)

[An important method inherited from HTML::Parser, which see. Current versions of HTML::Parser can take a filespec, or a filehandle object, like *FOO, or some object from a class such as IO::Handle, IO::File, IO::Socket, or the like. I think you should check that a given file exists before calling "$root->parse_file($filespec)".]

When you pass a filename to "parse_file", HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.

One solution is to open the file yourself using the proper ":encoding" layer, and pass the filehandle to "parse_file". You can automate this process by using ``html_file'' in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper ":encoding" layer and apply it.

In the next major release of HTML-Tree, I plan to have it use IO::HTML automatically. If you really want your file opened in binary mode, you should open it yourself and pass the filehandle to "parse_file".

The return value is "undef" if there's an error opening the file. In that case, the error will be in $!.
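The "open it yourself" approach might look like the sketch below. To keep it self-contained it first writes a small UTF-8 sample to a temporary file; the file content and variable names are invented for the example, and the encoding is assumed to be known in advance (IO::HTML would be the way to detect it).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder 5 -weak;

# Write a small UTF-8 encoded sample file (for demonstration only).
my ($out, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "<html><head><title>Caf\x{e9}</title></head><body></body></html>";
close $out or die "Cannot close $path: $!";

# Open the file with the proper layer and pass the filehandle along,
# instead of letting parse_file open it in binary mode.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($in);
close $in;

my $title = $tree->look_down(_tag => 'title')->as_text;

binmode STDOUT, ':encoding(UTF-8)';
print "Title: $title\n";
```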

parse

  $root->parse(...)

[An important method inherited from HTML::Parser, which see. See the note below for "$root->eof()".]

eof

  $root->eof();

This signals that you're finished parsing content into this tree; this runs various kinds of crucial cleanup on the tree. This is called for you when you call "$root->parse_file(...)", but not when you call "$root->parse(...)". So if you call "$root->parse(...)", then you must call "$root->eof()" once you've finished feeding all the chunks to "parse(...)", and before you actually start doing anything else with the tree in $root.
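A minimal sketch of chunked parsing, with made-up chunks that deliberately split the markup in the middle of words to show that the chunk boundaries need not line up with tags:

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Invented input, split at arbitrary points.
my @chunks = (
    '<html><body><h1>Chu',
    'nked</h1><p>Strea',
    'ming</p></body></html>',
);

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;
$root->eof;    # required after parse(), before using the tree

my $heading = $root->look_down(_tag => 'h1')->as_text;
my $para    = $root->look_down(_tag => 'p')->as_text;

print "h1: $heading\n";
print "p:  $para\n";
```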

parse_content

  $root->parse_content(...);

Basically a handy alias for "$root->parse(...); $root->eof". Takes the exact same arguments as "$root->parse()".

delete

  $root->delete();

[A previously important method inherited from HTML::Element, which see.]

elementify

  $root->elementify();

This changes the class of the object in $root from HTML::TreeBuilder to the class used for all the rest of the elements in that tree (generally HTML::Element). Returns $root.

For most purposes, this is unnecessary, but if you call this after (after!!) you've finished building a tree, then it keeps you from accidentally trying to call anything but HTML::Element methods on it. (I.e., if you accidentally call "$root->parse_file(...)" on the already-complete and elementified tree, then instead of charging ahead and wreaking havoc, it'll throw a fatal error --- since $root is now an object just of class HTML::Element which has no "parse_file" method.)

Note that "elementify" currently deletes all the private attributes of $root except for ``_tag'', ``_parent'', ``_content'', ``_pos'', and ``_implicit''. If anyone requests that I change this to leave in yet more private attributes, I might do so, in future versions.
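The reblessing can be observed with a sketch like this one, which assumes the default element class (HTML::Element); the file name passed to parse_file is a placeholder that is never opened, because the method call itself fails.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

my $root = HTML::TreeBuilder->new_from_content('<p>done</p>');

my $class_before = ref $root;
$root->elementify;    # only after the tree is completely built
my $class_after = ref $root;

# Parser methods are no longer available on the elementified root.
my $can_parse = $root->can('parse_file') ? 1 : 0;
my $error     = eval { $root->parse_file('placeholder.html'); 1 } ? '' : $@;

print "before: $class_before\n";
print "after:  $class_after\n";
print "calling parse_file now fails: $error" if $error;
```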

guts

  @nodes = $root->guts();
  $parent_for_nodes = $root->guts();

In list context (as in the first case), this method returns the topmost non-implicit nodes in a tree. This is useful when you're parsing HTML code that you know is not a complete HTML document, but instead just a fragment of one. For example, if you wanted the parse tree for a file consisting of just this:

  <li>I like pie!

Then you would get that with "@nodes = $root->guts();". It so happens that in this case, @nodes will contain just one element object, representing the "<li>" node (with ``I like pie!'' being its text child node). However, consider if you were parsing this:

  <hr>Hooboy!<hr>

In that case, "$root->guts()" would return three items: an element object for the first "<hr>", a text string ``Hooboy!'', and another "<hr>" element object.

For cases where you want definitely one element (so you can treat it as a ``document fragment'', roughly speaking), call "guts()" in scalar context, as in "$parent_for_nodes = $root->guts()". That works like "guts()" in list context, with these differences: if "guts()" in list context would have returned exactly one value, and that value is an object (as opposed to a text string), then that's what "guts" in scalar context will return. Otherwise, if "guts()" in list context would have returned no values at all, then "guts()" in scalar context returns undef. In all other cases, "guts()" in scalar context returns an implicit "<div>" element node, with children consisting of whatever nodes "guts()" in list context would have returned. Note that that may detach those nodes from $root's tree.
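Both calling conventions can be sketched with the two fragments discussed above. Separate trees are used for the list-context and scalar-context calls, since the scalar form may detach nodes; the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# List context: the topmost non-implicit nodes, text strings included.
my $tree  = HTML::TreeBuilder->new_from_content('<hr>Hooboy!<hr>');
my @nodes = $tree->guts;
my @kinds = map { ref $_ ? '<' . $_->tag . '>' : "text:$_" } @nodes;

# Scalar context: exactly one element to treat as the fragment.
my $li_tree  = HTML::TreeBuilder->new_from_content('<li>I like pie!');
my $fragment = $li_tree->guts;

print "list context:   @kinds\n";
print "scalar context: <", $fragment->tag, "> ", $fragment->as_text, "\n";
```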

disembowel

  @nodes = $root->disembowel();
  $parent_for_nodes = $root->disembowel();

The "disembowel()" method works just like the "guts()" method, except that disembowel definitively destroys the tree above the nodes that are returned. Usually when you want the guts from a tree, you're just going to toss out the rest of the tree anyway, so this saves you the bother. (Remember, ``disembowel'' means ``remove the guts from''.)
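A brief sketch of the destructive variant, using the same made-up fragment as in the "guts" description. After the call, the returned node is expected to stand on its own, with no parent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new_from_content('<li>I like pie!');

# Take the guts and discard the implicit scaffolding above them.
my ($li) = $tree->disembowel;
undef $tree;    # the rest of the tree is not needed any more

my $has_parent = defined $li->parent ? 1 : 0;

print "<", $li->tag, "> ", $li->as_text, "\n";
print "detached from old tree: ", ($has_parent ? 'no' : 'yes'), "\n";
```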

INTERNAL METHODS

You should not need to call any of the following methods directly. 

element_class

  $classname = $h->element_class;

This method returns the class which will be used for new elements. It defaults to HTML::Element, but can be overridden by subclassing or esoteric means best left to those who will read the source and then not complain when those esoteric means change. (Just subclass.)

comment

Accept a ``here's a comment'' signal from HTML::Parser. 

declaration

Accept a ``here's a markup declaration'' signal from HTML::Parser. 

done

TODO: document 

end

Either: Accept an end-tag signal from HTML::Parser.
Or: Method for closing currently open elements in some fairly complex way, as used by other methods in this class.

TODO: Why is this hidden? 

process

Accept a ``here's a PI'' signal from HTML::Parser. 

start

Accept a signal from HTML::Parser for start-tags.

TODO: Why is this hidden? 

stunt

TODO: document 

stunted

TODO: document 

text

Accept a ``here's a text token'' signal from HTML::Parser.

TODO: Why is this hidden? 

tighten_up

Legacy

Redirects to ``delete_ignorable_whitespace'' in HTML::Element. 

warning

Wrapper for CORE::warn

TODO: why not just use carp? 

SUBROUTINES

 

DEBUG

Are we in Debug mode? This is a constant subroutine, to allow compile-time optimizations. To control debug mode, set $HTML::TreeBuilder::DEBUG before loading HTML::TreeBuilder.

HTML AND ITS DISCONTENTS

HTML is rather harder to parse than people who write it generally suspect.

Here's the problem: HTML is a kind of SGML that permits ``minimization'' and ``implication''. In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it), and if you use a tag that can't occur in the context you seem to be using it in, under certain conditions the parser will be able to realize you mean to leave the current context and enter the new one, that being the only one that your code could correctly be interpreted in.

Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules. That's why the internals of HTML::TreeBuilder consist of lots and lots of special cases --- instead of being just a generic SGML parser with HTML DTD rules plugged in.

TRANSLATIONS?

The techniques that HTML::TreeBuilder uses to perform what I consider very robust parses on everyday code are not things that can work only in Perl. To date, the algorithms at the center of HTML::TreeBuilder have been implemented only in Perl, as far as I know; and I don't foresee getting around to implementing them in any other language any time soon.

If, however, anyone is looking for a semester project for an applied programming class (or if they merely enjoy extra-curricular masochism), they might do well to see about choosing as a topic the implementation/adaptation of these routines to any other interesting programming language that you feel currently suffers from a lack of robust HTML-parsing. I welcome correspondence on this subject, and point out that one can learn a great deal about languages by trying to translate between them, and then comparing the result.

The HTML::TreeBuilder source may seem long and complex, but it is rather well commented, and symbol names are generally self-explanatory. (You are encouraged to read the Mozilla HTML parser source for comparison.) Some of the complexity comes from little-used features, and some of it comes from having the HTML tokenizer (HTML::Parser) being a separate module, requiring somewhat of a different interface than you'd find in a combined tokenizer and tree-builder. But most of the length of the source comes from the fact that it's essentially a long list of special cases, with lots and lots of sanity-checking, and sanity-recovery --- because, as Roseanne Rosannadanna once said, "it's always something".

Users looking to compare several HTML parsers should look at the source for Raggett's Tidy ("<http://www.w3.org/People/Raggett/tidy/>"), Mozilla ("<http://www.mozilla.org/>"), and possibly root around the browsers section of Yahoo to find the various open-source ones ("<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>").

BUGS

* Framesets seem to work correctly now. Email me if you get a strange parse from a document with framesets.

* Really bad HTML code will, often as not, make for a somewhat objectionable parse tree. Regrettable, but unavoidably true.

* If you're running with "implicit_tags" off (God help you!), consider that "$tree->content_list" probably contains the tree or grove from the parse, and not $tree itself (which will, oddly enough, be an implicit "<html>" element). This seems counter-intuitive and problematic; but seeing as how almost no HTML ever parses correctly with "implicit_tags" off, this interface oddity seems the least of your problems.

BUG REPORTS

When a document parses in a way different from how you think it should, I ask that you report this to me as a bug. The first thing you should do is copy the document, trim out as much of it as you can while still producing the bug in question, and then email me that mini-document and the code you're using to parse it, to the HTML::Tree bug queue at "<bug-html-tree at rt.cpan.org>".

Include a note as to how it parses (presumably including its "$tree->dump" output), and then a careful and clear explanation of where you think the parser is going astray, and how you would prefer that it work instead.

SEE ALSO

For more information about the HTML-Tree distribution: HTML::Tree.

Modules used by HTML::TreeBuilder: HTML::Parser, HTML::Element, HTML::Tagset.

For converting between XML::DOM::Node, HTML::Element, and XML::Element trees: HTML::DOMbo.

For opening a HTML file with automatic charset detection: IO::HTML. 

AUTHOR

Current maintainers:

* Christopher J. Madsen "<perl AT cjmweb.net>"
* Jeff Fearn "<jfearn AT cpan.org>"

Original HTML-Tree author:

* Gisle Aas

Former maintainers:

* Sean M. Burke
* Andy Lester
* Pete Krawczyk "<petek AT cpan.org>"

You can follow or contribute to HTML-Tree's development at <https://github.com/kentfredric/HTML-Tree>.

COPYRIGHT AND LICENSE

Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn, 2012 Christopher J. Madsen.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The programs in this library are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.


 

This document was created by man2html, using the manual pages.