.\" Automatically generated by Pod::Man v1.34, Pod::Parser v1.13
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  | will give a
.\" real vertical bar.  \*(C+ will give a nicer C++.  Capital omega is used to
.\" do unbreakable dashes and therefore won't be available.  \*(C` and \*(C'
.\" expand to `' in nroff, nothing in troff, for use with C<>.
.tr \(*W-|\(bv\*(Tr
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.if \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.\"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.hy 0
.if n .na
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "HTML::Tagset 3"
.TH HTML::Tagset 3 "2000-10-20" "perl v5.8.0" "User Contributed Perl Documentation"
.SH "NAME"
HTML::Tagset \- data tables useful in parsing HTML
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 3
\&  use HTML::Tagset;
\&  # Then use any of the items in the HTML::Tagset package
\&  #  as need arises
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This module contains several data tables useful in various kinds of
\&\s-1HTML\s0 parsing operations.
.PP
Note that all tag names used are lowercase.
.PP
In the following documentation, a \*(L"hashset\*(R" is a hash being used as a
set: only the presence of a key is meaningful, and the values associated
with the keys are not significant.  (The values that are present, however,
are always true.)
.ie n .IP "hashset %HTML::Tagset::emptyElement" 4
.el .IP "hashset \f(CW%HTML::Tagset::emptyElement\fR" 4
.IX Item "hashset %HTML::Tagset::emptyElement"
This hashset has as keys the tag-names (GIs) of elements that cannot
have content.  (For example, \*(L"base\*(R", \*(L"br\*(R", \*(L"hr\*(R".)  So
\&\f(CW$HTML::Tagset::emptyElement{'hr'}\fR exists and is true.
\&\f(CW$HTML::Tagset::emptyElement{'dl'}\fR does not exist, and so is not true.
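.Sp
A membership test on a hashset can be wrapped in a small helper.  The
following is a minimal sketch, not something this module provides; the name
\&\f(CW\*(C`is_empty_element\*(C'\fR is hypothetical:
.Sp
.Vb 8
\&  use HTML::Tagset;
\&
\&  # True if $tag names an element that cannot have content.
\&  # Tag names in the tables are lowercase, so fold case first.
\&  sub is_empty_element {
\&      my ($tag) = @_;
\&      return exists $HTML::Tagset::emptyElement{lc $tag} ? 1 : 0;
\&  }
.Ve
.Sp
With this helper, \f(CW\*(C`is_empty_element('hr')\*(C'\fR is true and
\&\f(CW\*(C`is_empty_element('dl')\*(C'\fR is false.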
.ie n .IP "hashset %HTML::Tagset::optionalEndTag" 4
.el .IP "hashset \f(CW%HTML::Tagset::optionalEndTag\fR" 4
.IX Item "hashset %HTML::Tagset::optionalEndTag"
This hashset lists tag-names for elements that can have content, but whose
end-tags are generally, \*(L"safely\*(R", omissible.  Example:
\&\f(CW$HTML::Tagset::optionalEndTag{'li'}\fR exists and is true.
.ie n .IP "hash %HTML::Tagset::linkElements" 4
.el .IP "hash \f(CW%HTML::Tagset::linkElements\fR" 4
.IX Item "hash %HTML::Tagset::linkElements"
Keys in this hash are tagnames for elements that might contain
links, and the value for each is a reference to an array of the names
of attributes whose values can be links.
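.Sp
As a sketch of how this table might be consulted (the helper name
\&\f(CW\*(C`link_attributes\*(C'\fR is hypothetical, not part of this module):
.Sp
.Vb 9
\&  use HTML::Tagset;
\&
\&  # Return the names of the attributes of $tag whose values can be
\&  # links, or an empty list if $tag is not in the table.
\&  sub link_attributes {
\&      my ($tag) = @_;
\&      my $attrs = $HTML::Tagset::linkElements{lc $tag};
\&      return $attrs ? @$attrs : ();
\&  }
.Ve
.Sp
For example, the list returned by \f(CW\*(C`link_attributes('a')\*(C'\fR should
include \*(L"href\*(R".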
.ie n .IP "hash %HTML::Tagset::boolean_attr" 4
.el .IP "hash \f(CW%HTML::Tagset::boolean_attr\fR" 4
.IX Item "hash %HTML::Tagset::boolean_attr"
This hash (not hashset) lists which attributes of which elements can be
printed without showing the value (for example, the \*(L"noshade\*(R" attribute
of \*(L"hr\*(R" elements).  For elements with only one such attribute, the hash
value is simply that attribute name.  For elements with many such attributes,
the value is a reference to a hashset containing all such attributes.
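.Sp
Because a value in this hash may be either a plain string or a reference to
a hashset, code that consults it has to handle both shapes.  A minimal
sketch, assuming a hypothetical helper named \f(CW\*(C`is_boolean_attr\*(C'\fR:
.Sp
.Vb 14
\&  use HTML::Tagset;
\&
\&  # True if attribute $attr of element $tag can be printed without
\&  # showing its value.
\&  sub is_boolean_attr {
\&      my ($tag, $attr) = @_;
\&      my $name  = lc $attr;
\&      my $entry = $HTML::Tagset::boolean_attr{lc $tag};
\&      return 0 unless defined $entry;
\&      # Many such attributes: $entry is a reference to a hashset.
\&      return ($$entry{$name} ? 1 : 0) if ref $entry;
\&      # Only one such attribute: $entry is the attribute name itself.
\&      return $entry eq $name ? 1 : 0;
\&  }
.Ve
.Sp
With this helper, \f(CW\*(C`is_boolean_attr('hr', 'noshade')\*(C'\fR is true.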
.ie n .IP "hashset %HTML::Tagset::isPhraseMarkup" 4
.el .IP "hashset \f(CW%HTML::Tagset::isPhraseMarkup\fR" 4
.IX Item "hashset %HTML::Tagset::isPhraseMarkup"
This hashset contains all phrasal-level elements.
.ie n .IP "hashset %HTML::Tagset::is_Possible_Strict_P_Content" 4
.el .IP "hashset \f(CW%HTML::Tagset::is_Possible_Strict_P_Content\fR" 4
.IX Item "hashset %HTML::Tagset::is_Possible_Strict_P_Content"
This hashset contains all phrasal-level elements that can be content of a
P element, for a strict model of \s-1HTML\s0.
.ie n .IP "hashset %HTML::Tagset::isHeadElement" 4
.el .IP "hashset \f(CW%HTML::Tagset::isHeadElement\fR" 4
.IX Item "hashset %HTML::Tagset::isHeadElement"
This hashset contains all elements that should be
present only in the 'head' element of an \s-1HTML\s0 document.
.ie n .IP "hashset %HTML::Tagset::isList" 4
.el .IP "hashset \f(CW%HTML::Tagset::isList\fR" 4
.IX Item "hashset %HTML::Tagset::isList"
This hashset contains all elements that can contain \*(L"li\*(R" elements.
.ie n .IP "hashset %HTML::Tagset::isTableElement" 4
.el .IP "hashset \f(CW%HTML::Tagset::isTableElement\fR" 4
.IX Item "hashset %HTML::Tagset::isTableElement"
This hashset contains all elements that are to be found only in/under
a \*(L"table\*(R" element.
.ie n .IP "hashset %HTML::Tagset::isFormElement" 4
.el .IP "hashset \f(CW%HTML::Tagset::isFormElement\fR" 4
.IX Item "hashset %HTML::Tagset::isFormElement"
This hashset contains all elements that are to be found only in/under
a \*(L"form\*(R" element.
.ie n .IP "hashset %HTML::Tagset::isBodyMarkup" 4
.el .IP "hashset \f(CW%HTML::Tagset::isBodyMarkup\fR" 4
.IX Item "hashset %HTML::Tagset::isBodyMarkup"
This hashset contains all elements that are to be found only in/under
the \*(L"body\*(R" element of an \s-1HTML\s0 document.
.ie n .IP "hashset %HTML::Tagset::isHeadOrBodyElement" 4
.el .IP "hashset \f(CW%HTML::Tagset::isHeadOrBodyElement\fR" 4
.IX Item "hashset %HTML::Tagset::isHeadOrBodyElement"
This hashset includes all elements that I notice can fall either in
the head or in the body.
.ie n .IP "hashset %HTML::Tagset::isKnown" 4
.el .IP "hashset \f(CW%HTML::Tagset::isKnown\fR" 4
.IX Item "hashset %HTML::Tagset::isKnown"
This hashset lists all known \s-1HTML\s0 elements.
.ie n .IP "hashset %HTML::Tagset::canTighten" 4
.el .IP "hashset \f(CW%HTML::Tagset::canTighten\fR" 4
.IX Item "hashset %HTML::Tagset::canTighten"
This hashset lists elements that might have ignorable whitespace as
children or siblings.
.ie n .IP "array @HTML::Tagset::p_closure_barriers" 4
.el .IP "array \f(CW@HTML::Tagset::p_closure_barriers\fR" 4
.IX Item "array @HTML::Tagset::p_closure_barriers"
This array has a meaning that I have only seen a need for in
\&\f(CW\*(C`HTML::TreeBuilder\*(C'\fR, but I include it here on the off chance that someone
might find it of use:
.Sp
When we see a "<p>" token, we go looking up the lineage for a p
element we might have to minimize.  At first sight, we might say that
if there's a p anywhere in the lineage of this new p, it should be
closed.  But that's wrong.  Consider this document:
.Sp
.Vb 17
\&  <html>
\&    <head>
\&      <title>foo</title>
\&    </head>
\&    <body>
\&      <p>foo
\&        <table>
\&          <tr>
\&            <td>
\&               foo
\&               <p>bar
\&            </td>
\&          </tr>
\&        </table>
\&      </p>
\&    </body>
\&  </html>
.Ve
.Sp
The second p is quite legally inside a much higher p.
.Sp
My formalization of the reason why this is legal, but this:
.Sp
.Vb 1
\&  <p>foo<p>bar</p></p>
.Ve
.Sp
isn't, is that something about the table constitutes a \*(L"barrier\*(R" to
the application of the rule about what p must minimize.
.Sp
So \f(CW@HTML::Tagset::p_closure_barriers\fR is the list of all such
barrier\-tags.
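.Sp
The lookup described above might be sketched as follows.  This only
illustrates the idea and is not the code that \f(CW\*(C`HTML::TreeBuilder\*(C'\fR
actually uses; the function name \f(CW\*(C`p_to_close\*(C'\fR and the representation
of the lineage as a list of tag names are assumptions made for the example:
.Sp
.Vb 16
\&  use HTML::Tagset;
\&
\&  # Build a hashset of the barrier tags for quick lookup.
\&  my %is_barrier = map { $_ => 1 } @HTML::Tagset::p_closure_barriers;
\&
\&  # Given the tag names of the open elements, outermost first, return
\&  # the index of the open p that a new <p> must close, or nothing if
\&  # a barrier (or the top of the tree) is reached first.
\&  sub p_to_close {
\&      my @lineage = @_;
\&      for my $i (reverse 0 .. $#lineage) {
\&          return $i if $lineage[$i] eq 'p';
\&          return    if $is_barrier{ $lineage[$i] };
\&      }
\&      return;
\&  }
.Ve
.Sp
For the lineage of the second p in the document above,
\&\f(CW\*(C`qw(html body p table tr td)\*(C'\fR, the walk meets a barrier before it
reaches the outer p, so nothing is returned.  For \f(CW\*(C`qw(html body p)\*(C'\fR
it returns 2, the index of the p to close.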
.ie n .IP "hashset %HTML::Tagset::isCDATA_Parent" 4
.el .IP "hashset \f(CW%HTML::Tagset::isCDATA_Parent\fR" 4
.IX Item "hashset %HTML::Tagset::isCDATA_Parent"
This hashset includes all elements whose content is \s-1CDATA\s0.
.SH "CAVEATS"
.IX Header "CAVEATS"
You may find it useful to alter the behavior of modules (like
\&\f(CW\*(C`HTML::Element\*(C'\fR or \f(CW\*(C`HTML::TreeBuilder\*(C'\fR) that use \f(CW\*(C`HTML::Tagset\*(C'\fR's
data tables by altering the data tables themselves.  You are welcome
to try, but be careful; and be aware that different modules may or may
not react differently to the data tables being changed.
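.PP
One way to limit the reach of such a change is to make it with
\&\f(CW\*(C`local\*(C'\fR, so that the original entry is restored automatically.  The
following is only a sketch: the helper name \f(CW\*(C`with_empty_element\*(C'\fR and
the tag \*(L"mytag\*(R" are hypothetical, and whether a given module notices
the change depends on when and how it reads the tables:
.PP
.Vb 9
\&  use HTML::Tagset;
\&
\&  # Run $code with $tag temporarily treated as an empty element;
\&  # local() restores the original table entry on exit.
\&  sub with_empty_element {
\&      my ($tag, $code) = @_;
\&      local $HTML::Tagset::emptyElement{$tag} = 1;
\&      return &$code();
\&  }
.Ve
.PP
While the code reference passed to \f(CW\*(C`with_empty_element('mytag', ...)\*(C'\fR
is running, \f(CW$HTML::Tagset::emptyElement{'mytag'}\fR exists and is true; once
the call returns, the table is back in its previous state.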
.PP
Note that it may be inappropriate to use these tables for \fIproducing\fR
\&\s-1HTML\s0 \*(-- for example, \f(CW%isHeadOrBodyElement\fR lists the tagnames
for all elements that can appear either in the head or in the body,
such as \*(L"script\*(R".  That doesn't mean that I am saying your code that
produces \s-1HTML\s0 should feel free to put script elements in either place!
If you are producing programs that spit out \s-1HTML\s0, you should be
\&\fIintimately\fR familiar with the DTDs for \s-1HTML\s0 or \s-1XHTML\s0 (available at
\&\f(CW\*(C`http://www.w3.org/\*(C'\fR), and you should slavishly obey them, not
the data tables in this document.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
HTML::Element, HTML::TreeBuilder, HTML::LinkExtor
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 1995\-2000 Gisle Aas; copyright 2000 Sean M. Burke.
.PP
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
.SH "AUTHOR"
.IX Header "AUTHOR"
Current maintainer: Sean M. Burke, <sburke@cpan.org>
.PP
Most of the code/data in this module was adapted from code written by
Gisle Aas <gisle@aas.no> for \f(CW\*(C`HTML::Element\*(C'\fR,
\&\f(CW\*(C`HTML::TreeBuilder\*(C'\fR, and \f(CW\*(C`HTML::LinkExtor\*(C'\fR.