Unicode::UCD -- Update? (Fragen zu Perl-Modulen)

[thread]20411[/thread]

Unicode::UCD -- Update?

Tags: perl5 Ähnliche Threads

Leser: 12

Articles: hide open all | hide show old branches

+5 replies

rosti

2018-01-02 10:36

User since
2011-03-19
3566 Artikel
BenutzerIn

Alles Gute fürs Neue Jahr Euch allen!

Als Alternative zu CPAN:

Unicode::UCD habe ich da eine eigene API am Wickel wo man bezüglich Update lediglich die Datei UnicodeData.txt austauschen muss.

Demo hier: http://rolfrost.de/ucd.html

Wie läuft denn ein Update für CPAN:

Unicode::UCD wenn das Konsortium eine neue Datei UnicodeData.txt ausgibt?

MfG

http://blog.rolfrost.de/

The art of steam.

+4 replies

Linuxer

2018-01-02 18:44

User since
2006-01-27
3891 Artikel
HausmeisterIn

user image

Frohes Neues auch Dir!

Vorabwarnung: keine Ahnung ;-)

Wenn Du fundierte Aussagen willst, kannst Du hier aufhören zu lesen.

Ab hier mein Raten:
more (1.0kb)

meine Beiträge: I.d.R. alle Angaben ohne Gewähr und auf Linux abgestimmt!
Die Sprache heisst Perl, nicht PERL. - Bitte Crossposts als solche kenntlich machen!

+3 replies

rosti

2018-01-02 22:19

User since
2011-03-19
3566 Artikel
BenutzerIn

Hi Danke Dir ;)

ja, da in dem von Dir genannten Verzeichnis hab ich auch schon geschaut. Aber so wie es bei mir 5.16.3 aussieht, ist die UnicodeData.txt da nicht vorhanden, aber scheinbar deren Inhalt auf eine Reihe anderer Dateien aufgeteilt.

$Unicode::UCD::UnicodeVersion gibt mir ein undef und so komme ich insgesamt zu dem Schluss, daß die ganze lib für ein Update gar nicht vorgesehen ist, denn wenn ich keine Version kriege weiß ich auch nicht, ob es eine Neue gibt ;)

MfG

PS: Die aktuelle Version UnicodeData.txt beschreibt 31618 Zeichen in 1.7MB CSV Format. Da gabs Einiges zu optimieren beim Algorithmus um den Inhalt auf eine hash-Struktur zu kriegen für den wahlfreien Zugriff: 474270 Schlüssel-Werte-Paare ergeben sich da! Beim Einlesen macht es schon einen Unterschied ob die Datei mit "r" im Textmodus oder mit O_BINARY geöffnet wird. Letzeres ist auf XP/Win32 merklich schneller.

Die Antwortzeiten hier http://rolfrost.de/ucd.html?query=angst&gui=1 sind nun einwandfrei.
Last edited: 2018-01-02 22:31:50 +0100 (CET)

http://blog.rolfrost.de/

The art of steam.

+2 replies

renee

2018-01-03 10:16

User since
2003-08-04
14371 Artikel
ModeratorIn

Unicode/UCD.pm
# This function has traditionally mimicked what is in UnicodeData.txt,
# warts and all. This is a re-write that avoids UnicodeData.txt so that
# it can be removed to save disk space. Instead, this assembles
# information gotten by other methods that get data from various other
# files. It uses charnames to get the character name; and various
# mktables tables.

Ich habe aber noch nicht rausgefunden ob/wie man bei einer älteren Perl-Version die Unicodeversion aktualisiert.

OTRS-Erweiterungen (http://feature-addons.de/)
Frankfurt Perlmongers (http://frankfurt.pm/)
--

Unterlagen OTRS-Workshop 2012: http://otrs.perl-services.de/workshop.html
Perl-Entwicklung: http://perl-services.de/

rosti

2018-01-03 10:57

User since
2011-03-19
3566 Artikel
BenutzerIn

Hi Danke auch Dir.

Das Problem liegt einmal daran, daß Unicode in der Datei UnicodeData.txt vielleicht einfach erscheint, in Wirklichkeit jedoch sehr komplex ist. Das heißt, daß man über diese Datei hinausgehend noch eine ganze Menge weitere Informationen braucht, Beispielsweise die Feldnamen siehe hier http://www.unicode.org/reports/tr44/#UnicodeData.t... -- Diese Namen, vom Konsortium publiziert, gehen ja schonmal nicht konform mit den Namen die Unicode::UCD::charinfo() liefert, anstatt Simple_Uppercase_Mapping liefert charinfo() das Feld upper (nur als Beispiel).

Da würde ich schonmal meinen, daß hinsichtlich Publizieren seitens des Konsortium noch Einiges zu verbessern ist. Damit meine ich nicht die ganze technische Dokumentation sondern das was man automatisiert zu Verteilen gedenkt. Und es kann nicht sein, daß das ein Dutzend Dateien sind, wo man sich die Zusammenhänge zwischen diesen erst nach einem mehrjährigen Studium selbst herstellen muss.

Die Anzahl der in Unicode erfassten und dokumentierten Zeichen hat sich in den letzten 3 Jahren verdoppelt! Jedoch enthält die Datei UnicodeData.txt selbst keine Angabe zur Version. Auch die Feldnamen nicht und diese eine Zeile einzufügen ist ganz bestimmt kein Problem.

Da ist es schon stark anzunehmen daß die mit moderner IT und datenbankgestützt herangehen ;) Und von dieser Annahme ausgehend, gäbe es bestimmt Möglichkeiten für einen Export nach Version/Release der automatisiert ausführbar wäre und sei es nur ein DB-Dump. Idealerweise jedoch einen Webervice.

MfG

PS: Für meine API habe ich die Feldnamen und weitere Tabellen von Hand übernommen.

Code (perl): (dl )

# siehe http://www.unicode.org/reports/tr44/#UnicodeData.txt
my @FIELDS = qw(
    Codepoint
    Name
    General_Category
    Canonical_Combining_Class
    Bidi_Class
    Decomposition_Type_Mapping
    IF_Numeric_Type_Decimal
    If_Numeric_Type_Digit
    If_Numeric_Type_Numeric
    Bidi_Mirrored
    Unicode_1_Name
    ISO_Comment
    Simple_Uppercase_Mapping
    Simple_Lowercase_Mapping
    Simple_Titlecase_Mapping
);

my %CANONICAL_COMBINING_CLASS_VALUES = (
    0   => 'Not_Reordered, Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing',
    1   => 'Overlay, Marks which overlay a base letter or symbol',
    7   => 'Nukta, Diacritic nukta marks in Brahmi-derived scripts',
    8   => 'Kana_Voicing, Hiragana/Katakana voicing marks',
    9   => 'Virama, Viramas',
    10  => 'Ccc10, Start of fixed position classes',
    199 => 'End of fixed position classes',
    200 => 'Attached_Below_Left, Marks attached at the bottom left',
    202 => 'Attached_Below,  Marks attached directly below',
    204 => 'Marks attached at the bottom right',
    208 => 'Marks attached to the left',
    210 => 'Marks attached to the right',
    212 => 'Marks attached at the top left',
    214 => 'Attached_Above, Marks attached directly above',
    216 => 'Attached_Above_Right, Marks attached at the top right',
    218 => 'Below_Left, Distinct marks at the bottom left',
    220 => 'Below, Distinct marks directly below',
    222 => 'Below_Right, Distinct marks at the bottom right',
    224 => 'Left, Distinct marks to the left',
    226 => 'Right, Distinct marks to the right',
    228 => 'Above_Left, Distinct marks at the top left',
    230 => 'Above, Distinct marks directly above',
    232 => 'Above_Right, Distinct marks at the top right',
    233 => 'Double_Below, Distinct marks subtending two bases',
    234 => 'Double_Above, Distinct marks extending above two bases',
    240 => 'Iota_Subscript, Greek iota subscript only'
);

my %BIDI_CLASS_VALUES = (
    L   => 'Left_To_Right, any strong left-to-right character',
    R   => 'Right_To_Left, any strong right-to-left (non-Arabic-type) character',
    AL  => 'Arabic_Letter, any strong right-to-left (Arabic-type) character',
    EN  => 'European_Number, any ASCII digit or Eastern Arabic-Indic digit',
    ES  => 'European_Separator, plus and minus signs',
    ET  => 'European_Terminator, a terminator in a numeric format context, includes currency signs',
    AN  => 'Arabic_Number, any Arabic-Indic digit',
    CS  => 'Common_Separator, commas, colons, and slashes',
    NSM => 'Nonspacing_Mark, any nonspacing mark',
    BN  => 'Boundary_Neutral, most format characters, control codes, or noncharacters',
    B   => 'Paragraph_Separator, various newline characters',
    S   => 'Segment_Separator, various segment-related control codes',
    WS  => 'White_Space, spaces',
    ON  => 'Other_Neutral, most other symbols and punctuation marks',
    LRE => 'Left_To_Right_Embedding, U+202A: the LR embedding control',
    LRO => 'Left_To_Right_Override, U+202D: the LR override control',
    RLE => 'Right_To_Left_Embedding, U+202B: the RL embedding control',
    RLO => 'Right_To_Left_Override, U+202E: the RL override control',
    PDF => 'Pop_Directional_Format, U+202C: terminates an embedding or override control',
    LRI => 'Left_To_Right_Isolate, U+2066: the LR isolate control',
    RLI => 'Right_To_Left_Isolate, U+2067: the RL isolate control',
    FSI => 'First_Strong_Isolate, U+2068: the first strong isolate control',
    PDI => 'Pop_Directional_Isolate, U+2069: terminates an isolate control'
);


my %COMPATIBILITY_TAGS = (
    font     => 'Font variant (for example, a blackletter form)',
    noBreak  => 'No-break version of a space or hyphen',
    initial  => 'Initial presentation form (Arabic)',
    medial   => 'Medial presentation form (Arabic)',
    final    => 'Final presentation form (Arabic)',
    isolated => 'Isolated presentation form (Arabic)',
    circle   => 'Encircled form',
    super    => 'Superscript form',
    sub      => 'Subscript form',
    vertical => 'Vertical layout presentation form',
    wide     => 'Wide (or zenkaku) compatibility character',
    narrow   => 'Narrow (or hankaku) compatibility character',
    small    => 'Small variant form (CNS compatibility)',
    square   => 'CJK squared font variant',
    fraction => 'Vulgar fraction form',
    compat   => 'Otherwise unspecified compatibility character'
);

my %GENERAL_CATEGORY_VALUES = (
    Lu  => 'Uppercase_Letter, an uppercase letter',
    Ll  => 'Lowercase_Letter, a lowercase letter',
    Lt  => 'Titlecase_Letter, a digraphic character, with first part uppercase',
    LC  => 'Cased_Letter, Lu | Ll | Lt',
    Lm  => 'Modifier_Letter, a modifier letter',
    Lo  => 'Other_Letter, other letters, including syllables and ideographs',
    L   => 'Letter Lu | Ll | Lt | Lm | Lo',
    Mn  => 'Nonspacing_Mark, a nonspacing combining mark (zero advance width)',
    Mc  => 'Spacing_Mark, a spacing combining mark (positive advance width)',
    Me  => 'Enclosing_Mark, an enclosing combining mark',
    M   => 'Mark, Mn | Mc | Me',
    Nd  => 'Decimal_Number, a decimal digit',
    Nl  => 'Letter_Number, a letterlike numeric character',
    No  => 'Other_Number, a numeric character of other type',
    N   => 'Number  Nd | Nl | No',
    Pc  => 'Connector_Punctuation, a connecting punctuation mark, like a tie',
    Pd  => 'Dash_Punctuation, a dash or hyphen punctuation mark',
    Ps  => 'Open_Punctuation, an opening punctuation mark (of a pair)',
    Pe  => 'Close_Punctuation, a closing punctuation mark (of a pair)',
    Pi  => 'Initial_Punctuation, an initial quotation mark',
    Pf  => 'Final_Punctuation, a final quotation mark',
    Po  => 'Other_Punctuation, a punctuation mark of other type',
    P   => 'Punctuation Pc | Pd | Ps | Pe | Pi | Pf | Po',
    Sm  => 'Math_Symbol, a symbol of mathematical use',
    Sc  => 'Currency_Symbol, a currency sign',
    Sk  => 'Modifier_Symbol, a non-letterlike modifier symbol',
    So  => 'Other_Symbol, a symbol of other type',
    S   => 'Symbol  Sm | Sc | Sk | So',
    Zs  => 'Space_Separator, a space character (of various non-zero widths)',
    Zl  => 'Line_Separator, U+2028 LINE SEPARATOR only',
    Zp  => 'Paragraph_Separator, U+2029 PARAGRAPH SEPARATOR only',
    Z   => 'Separator, Zs | Zl | Zp',
    Cc  => 'Control, a C0 or C1 control code',
    Cf  => 'Format, a format control character',
    Cs  => 'Surrogate, a surrogate code point',
    Co  => 'Private_Use, a private-use character',
    Cn  => 'Unassigned, a reserved unassigned code point or a noncharacter',
    C   => 'Other, Cc | Cf | Cs | Co | Cn',
);

Sowas muss automatisch möglich sein. Und v.a.m.! Die Gefahr, daß sich beim händischen Kopieren Fehler einschleichen ist da.

.
Last edited: 2018-01-03 11:03:30 +0100 (CET)

http://blog.rolfrost.de/

The art of steam.

View all threads created 2018-01-02 10:36.