Thread: Processing utf8 with HTML::Treebuilder (8 answers)
Opened by Nordlicht at 2011-11-09 06:22

bianca
 2011-11-09 07:37

The HTML::Parser documentation has something on this:
http://search.cpan.org/~gaas/HTML-Parser-3.69/Pars...
Quote
Parsing of undecoded UTF-8 will give garbage when decoding entities
(W) The first chunk parsed appears to contain undecoded UTF-8 and one or more argspecs that decode entities are used for the callback handlers.

The result of decoding will be a mix of encoded and decoded characters for any entities that expand to characters with code above 127. This is not a good thing.

The recommended solution is to apply Encode::decode_utf8() on the data before feeding it to the $p->parse(). For $p->parse_file() pass a file that has been opened in ":utf8" mode.

The alternative solution is to enable the utf8_mode and not decode before passing strings to $p->parse(). The parser can process raw undecoded UTF-8 sanely if the utf8_mode is enabled, or if the "attr", "@attr" or "dtext" argspecs are avoided.
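
A minimal sketch of the two options from the quote, in case it helps. The sample input and the helper names text_decoded and text_utf8_mode are made up for illustration; only Encode::decode_utf8, utf8_mode and the "dtext" argspec come from the documentation quoted above. Since HTML::TreeBuilder is a subclass of HTML::Parser, the same approach should carry over, but I have only written this against HTML::Parser itself.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use HTML::Parser ();

# Hypothetical sample input: raw UTF-8 bytes, as they would arrive
# from a file or socket read without any decoding layer.
my $bytes = encode_utf8("<p>gr\x{fc}n &eacute;</p>");

# Option 1 (recommended in the docs): decode to Perl characters first,
# then feed the character string to the parser.
sub text_decoded {
    my ($octets) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= shift }, 'dtext' ],
    );
    $p->parse(decode_utf8($octets));
    $p->eof;
    return $out;    # character string
}

# Option 2 (alternative): pass the undecoded bytes and enable utf8_mode,
# so decoded entities come back as UTF-8 bytes as well.
sub text_utf8_mode {
    my ($octets) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= shift }, 'dtext' ],
    );
    $p->utf8_mode(1);
    $p->parse($octets);
    $p->eof;
    return $out;    # byte string, still UTF-8 encoded
}
```

Which one fits depends on what the rest of the program expects: option 1 gives character strings that need encoding again on output, option 2 keeps everything as bytes from input to output. Mixing the two is what triggers the warning.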
