:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.2.0, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up a character by name.  If a character with the given name is found,
   return the corresponding Unicode character.  If not found, :exc:`KeyError` is
   raised.


.. function:: name(unichr[, default])

   Returns the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.


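As a minimal sketch of how :func:`lookup` and :func:`name` relate, the snippet
below round-trips a character through both functions and shows the two error
paths. The placeholder string ``'UNNAMED'`` and the bogus name
``'NO SUCH CHARACTER NAME'`` are illustrative assumptions, not values defined by
the module.

```python
import unicodedata

# Round trip: the name returned by name() can be passed back to lookup().
ch = unicodedata.lookup('LATIN SMALL LETTER A')
round_trip = unicodedata.name(ch)

# Control characters such as U+0000 have no name, so name() would raise
# ValueError; supplying a default (an arbitrary placeholder here) avoids that.
control_name = unicodedata.name(u'\u0000', 'UNNAMED')

# lookup() takes no default argument, so an unknown name must be handled
# by catching KeyError.
try:
    unicodedata.lookup('NO SUCH CHARACTER NAME')
    found = True
except KeyError:
    found = False
```

Passing a default is usually more convenient than catching :exc:`ValueError`
when scanning arbitrary text, since unnamed code points are common.
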
.. function:: decimal(unichr[, default])

   Returns the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Returns the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Returns the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


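The three numeric accessors differ in which characters they accept. The sketch
below compares them side by side; ``numeric_views`` is a hypothetical helper
written for this illustration, and it passes ``None`` as *default* so that
undefined values do not raise.

```python
import unicodedata

def numeric_views(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))

ascii_nine = numeric_views(u'9')             # (9, 9, 9.0)
superscript_two = numeric_views(u'\u00b2')   # (None, 2, 2.0)
one_half = numeric_views(u'\u00bd')          # (None, None, 0.5)
letter = numeric_views(u'a')                 # (None, None, None)
```

In these samples every character with a decimal value also has a digit and a
numeric value, while SUPERSCRIPT TWO has no decimal value and VULGAR FRACTION
ONE HALF has only a numeric value.
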
.. function:: category(unichr)

   Returns the general category assigned to the Unicode character *unichr* as a
   string.


.. function:: bidirectional(unichr)

   Returns the bidirectional class assigned to the Unicode character *unichr* as
   a string. If no such value is defined, an empty string is returned.


.. function:: combining(unichr)

   Returns the canonical combining class assigned to the Unicode character *unichr*
   as an integer. Returns ``0`` if no combining class is defined.


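To illustrate :func:`combining`, the sketch below reads the combining class of a
base letter and of two combining marks, then uses a non-zero class as a simple
test for removing marks. ``strip_marks`` is a hypothetical helper, and it
assumes the input is already decomposed.

```python
import unicodedata

base = unicodedata.combining(u'C')           # 0: not a combining mark
cedilla = unicodedata.combining(u'\u0327')   # COMBINING CEDILLA -> 202
acute = unicodedata.combining(u'\u0301')     # COMBINING ACUTE ACCENT -> 230

def strip_marks(text):
    """Drop every code point whose canonical combining class is non-zero."""
    return u''.join(c for c in text if unicodedata.combining(c) == 0)

stripped = strip_marks(u'C\u0327a\u0301')    # u'Ca'
```

Precomposed characters such as U+00C7 have combining class ``0``, so they pass
through this helper unchanged unless the text is decomposed first.
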
.. function:: east_asian_width(unichr)

   Returns the East Asian width assigned to the Unicode character *unichr* as a
   string.

   .. versionadded:: 2.4


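One use of :func:`east_asian_width` is estimating how many terminal columns a
string occupies. The following is a rough sketch only: ``display_width`` is a
hypothetical helper that counts wide (``'W'``) and fullwidth (``'F'``)
characters as two columns and everything else as one, ignoring combining marks
and ambiguous-width characters.

```python
import unicodedata

def display_width(text):
    """Approximate column count: 'W' and 'F' characters take two columns."""
    return sum(2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
               for c in text)

narrow = unicodedata.east_asian_width(u'A')          # 'Na'
fullwidth = unicodedata.east_asian_width(u'\uff21')  # 'F' (FULLWIDTH LATIN CAPITAL LETTER A)
wide = unicodedata.east_asian_width(u'\u4e2d')       # 'W' (a CJK ideograph)
width = display_width(u'A\uff21\u4e2d')              # 1 + 2 + 2 = 5
```
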
.. function:: mirrored(unichr)

   Returns the mirrored property assigned to the Unicode character *unichr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


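Because :func:`mirrored` returns ``1`` or ``0``, its result can be used directly
as a truth value. A small sketch, using an arbitrary sample string:

```python
import unicodedata

paren = unicodedata.mirrored(u'(')    # 1: brackets are mirrored in RTL text
letter = unicodedata.mirrored(u'A')   # 0

# Collect the characters of a sample string that have the mirrored property.
mirrored_chars = [c for c in u'a(b)<c>' if unicodedata.mirrored(c)]
```
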
.. function:: decomposition(unichr)

   Returns the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned if no such mapping is
   defined.


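The string returned by :func:`decomposition` holds space-separated hexadecimal
code points, optionally preceded by a tag in angle brackets for compatibility
mappings. The sketch below splits it into its parts; ``parse_decomposition`` is
a hypothetical helper, not part of the module.

```python
import unicodedata

def parse_decomposition(ch):
    """Split the mapping of ch into (tag, list_of_code_points)."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    # A leading field such as '<compat>' marks a compatibility mapping.
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [int(field, 16) for field in fields]

canonical = parse_decomposition(u'\u00c7')   # (None, [0x43, 0x327])
compat = parse_decomposition(u'\u2160')      # ('<compat>', [0x49])
undefined = parse_decomposition(u'A')        # (None, [])
```
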
.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, they may not compare equal if one has combining
   characters and the other does not.

   .. versionadded:: 2.3

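The normalization forms described for :func:`normalize` can be sketched with the
same characters used in that description. The variable names are illustrative
only.

```python
import unicodedata

composed = u'\u00c7'       # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'C\u0327'    # LATIN CAPITAL LETTER C + COMBINING CEDILLA

nfd = unicodedata.normalize('NFD', composed)     # u'C\u0327'
nfc = unicodedata.normalize('NFC', decomposed)   # u'\u00c7'

# The two spellings look identical but differ code point by code point...
raw_equal = composed == decomposed               # False
# ...until both sides are normalized to the same form.
normalized_equal = (unicodedata.normalize('NFC', composed) ==
                    unicodedata.normalize('NFC', decomposed))

roman_one = u'\u2160'      # ROMAN NUMERAL ONE
nfc_roman = unicodedata.normalize('NFC', roman_one)     # unchanged
nfkc_roman = unicodedata.normalize('NFKC', roman_one)   # u'I'
```

Comparing strings only after normalizing both to one form avoids mismatches
between composed and decomposed spellings. The compatibility forms additionally
fold characters such as ROMAN NUMERAL ONE into their plain equivalents, which
loses the distinction between them.
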
In addition, the module exposes the following constants:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5

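A brief sketch of how :data:`ucd_3_2_0` differs from the module-level functions.
It assumes that U+0221 (LATIN SMALL LETTER D WITH CURL) was assigned after
Unicode 3.2, so the 3.2 database treats it as unassigned; the choice of that
code point is an assumption made for this illustration.

```python
import unicodedata

current_version = unicodedata.unidata_version           # depends on the Python build
legacy_version = unicodedata.ucd_3_2_0.unidata_version  # '3.2.0'

ch = u'\u0221'  # assumed to be absent from Unicode 3.2

# The same method call gives different answers against the two databases.
current_name = unicodedata.name(ch, None)
legacy_name = unicodedata.ucd_3_2_0.name(ch, None)      # None: unassigned in 3.2
```
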
Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
   'AN'