The HTML::Parser documentation has something on this:
http://search.cpan.org/~gaas/HTML-Parser-3.69/Pars...
Quote:
Parsing of undecoded UTF-8 will give garbage when decoding entities
(W) The first chunk parsed appears to contain undecoded UTF-8 and one or more argspecs that decode entities are used for the callback handlers.
The result of decoding will be a mix of encoded and decoded characters for any entities that expand to characters with code above 127. This is not a good thing.
The recommended solution is to apply Encode::decode_utf8() on the data before feeding it to $p->parse(). For $p->parse_file() pass a file that has been opened in ":utf8" mode.
The alternative solution is to enable the utf8_mode and not decode before passing strings to $p->parse(). The parser can process raw undecoded UTF-8 sanely if the utf8_mode is enabled, or if the "attr", "@attr" or "dtext" argspecs are avoided.
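As a rough sketch of the two options described in the quote, something like the following should work. The input string and the variable names ($bytes, $text1, $text2) are made up for illustration; only decode_utf8(), utf8_mode() and the "dtext" argspec come from the documentation quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser ();

# Hypothetical input: raw, undecoded UTF-8 bytes plus two entities.
my $bytes = "<p>Gr\xC3\xBC\xC3\x9Fe &amp; K&uuml;sse</p>";

# Option 1 (recommended in the docs): decode first, then parse characters.
my $text1 = '';
my $p1 = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text1 .= shift }, 'dtext' ],
);
$p1->parse(decode_utf8($bytes));
$p1->eof;

# Option 2: feed the raw bytes and enable utf8_mode, so that decoded
# entities are emitted as UTF-8 bytes matching the surrounding text.
my $text2 = '';
my $p2 = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text2 .= shift }, 'dtext' ],
);
$p2->utf8_mode(1);
$p2->parse($bytes);
$p2->eof;
```

With option 1 the handler receives character strings; with option 2 it receives UTF-8 encoded bytes, so the output still has to be decoded (or written to a raw filehandle) afterwards. Either way the warning should no longer appear.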