Thread: Finding a "wide character" with Perl (3 answers)
Opened by smallish at 2007-09-24 00:10

ptk
 2007-09-24 23:09
#99948
If you want a general solution, you can use the CPAN module Text::Unidecode. It transliterates any Unicode character to an ASCII approximation. With the function below, you can also use the module to convert to an arbitrary encoding instead of ASCII.

Code (perl):
=head2 unidecode_any($text, $encoding)

Similar to C<Text::Unidecode::unidecode>, but converts to the given
I<$encoding> instead of ASCII. Returns an octet string in I<$encoding>.
If all you want is a character string restricted to the repertoire of
that encoding, C<Encode::decode> the result again with I<$encoding>.

=cut

sub unidecode_any {
    my($text, $encoding) = @_;

    require Text::Unidecode;
    require Encode;

    # provide better conversions for German umlauts (keys are the
    # characters U+00C4, U+00D6, U+00DC, U+00E4, U+00F6, U+00FC)
    my %override = ("\xc4" => "Ae",
                    "\xd6" => "Oe",
                    "\xdc" => "Ue",
                    "\xe4" => "ae",
                    "\xf6" => "oe",
                    "\xfc" => "ue",
                   );
    my $override_rx = "(" . join("|", map { quotemeta } keys %override) . ")";
    $override_rx = qr{$override_rx};

    my $res = "";

    if (!eval {
        Encode->VERSION(2.12); # need v2.12 to support coderef
        $res = Encode::encode($encoding, $text,
                              sub {
                                  my $ch = chr $_[0];
                                  if ($ch =~ $override_rx) {
                                      return $override{$ch};
                                  } else {
                                      # fully qualified: require() does not import unidecode()
                                      my $ascii = Text::Unidecode::unidecode($ch);
                                      Encode::_utf8_off($ascii);
                                      $ascii;
                                  }
                              });
        1;
    }) {
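        # Fallback for Encode < 2.12 (or if the coderef path died):
        # convert character by character.
        for my $ch (split //, $text) {
            my $conv = eval {
                Encode::encode($encoding, $ch,
                               Encode::FB_CROAK() | Encode::LEAVE_SRC());
            };
            if (defined $conv) {
                $res .= $conv;
            } elsif (exists $override{$ch}) {
                # same umlaut overrides as in the coderef path
                $res .= $override{$ch};
            } else {
                $res .= Text::Unidecode::unidecode($ch);
            }
        }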
    }

    $res;
}
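
The function relies on Encode::encode accepting a code reference as its third (CHECK) argument, which is available since Encode 2.12: the coderef is called with the code point of every character the target encoding cannot represent, and its return value is used as the replacement. Here is a minimal sketch of just that mechanism, using only the core Encode module. The %replacement table and the name encode_with_fallback are made up for illustration and are not part of the function above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical replacement table, for illustration only.
my %replacement = (
    "\x{20ac}" => "EUR",   # EURO SIGN
    "\x{2026}" => "...",   # HORIZONTAL ELLIPSIS
);

# Encode $text to $encoding. For each character the target encoding
# cannot represent, Encode calls the coderef with that character's
# code point and inserts whatever the coderef returns.
sub encode_with_fallback {
    my ($text, $encoding) = @_;
    return Encode::encode($encoding, $text, sub {
        my $ch = chr $_[0];
        return exists $replacement{$ch} ? $replacement{$ch} : "?";
    });
}

my $octets = encode_with_fallback("Preis: 5 \x{20ac}\x{2026}", "iso-8859-1");
print "$octets\n";   # prints "Preis: 5 EUR..."
```

Characters that iso-8859-1 can represent (such as the umlauts) never reach the coderef; that is why the override table in unidecode_any only takes effect for target encodings without umlauts, for example ASCII.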
