.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. c:type:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :c:type:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a 32-bit
   type for :c:type:`Py_UNICODE` and store Unicode data internally as UCS4. On
   platforms where :c:type:`wchar_t` is available and compatible with the chosen
   Python Unicode build variant, :c:type:`Py_UNICODE` is a typedef alias for
   :c:type:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :c:type:`Py_UNICODE` is a typedef alias for either
   :c:type:`unsigned short` (UCS2) or :c:type:`unsigned long` (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
this in mind when writing extensions or interfaces.


.. c:type:: PyUnicodeObject

   This subtype of :c:type:`PyObject` represents a Python Unicode object.


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :c:type:`PyUnicodeObject` (not
   checked).

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :c:type:`Py_UNICODE` buffer of the object. *o*
   has to be a :c:type:`PyUnicodeObject` (not checked).


.. c:function:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :c:type:`PyUnicodeObject` (not checked).


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL* which
   causes the contents to be undefined. It is the user's responsibility to fill in
   the needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, *NULL* if *unicode* is not a Unicode object.
   Note that the resulting :c:type:`Py_UNICODE*` string may contain embedded
   null characters, which would cause the string to be truncated when used in
   most C functions.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type. This might require changes
      in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given encoding and using the error handling defined by errors. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :c:type:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :c:type:`Py_UNICODE` type is identical to
the system's :c:type:`wchar_t`.


wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :c:type:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :c:type:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t` string is 0-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type and used an :c:type:`int`
      type for *size*. This might require changes in your code for properly
      supporting 64-bit systems.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as those of the built-in :func:`unicode` Unicode
object constructor.

Setting *encoding* to *NULL* selects the default encoding, which is ASCII. The
file system calls should use :c:data:`Py_FileSystemDefaultEncoding` as the
encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be *NULL* to select the
default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`~unicode.encode` method. The codec
   to be used is looked up using the Python codec registry. Return *NULL* if
   an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".
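
   As a rough illustration of what decoding a UTF-32 buffer involves, the
   stand-alone sketch below assembles one code point from four bytes. It is plain
   C, not the codec's actual implementation; ``utf32_read_sketch`` is a
   hypothetical helper, and its *byteorder* argument is assumed to use ``-1`` for
   little endian and ``1`` for big endian.

   .. code-block:: c

      #include <assert.h>
      #include <stddef.h>

      /* Assemble one UTF-32 code unit from the four bytes at p.
         byteorder: -1 = little endian, 1 = big endian (hypothetical helper). */
      static unsigned long
      utf32_read_sketch(const unsigned char *p, int byteorder)
      {
          assert(p != NULL);
          assert(byteorder == -1 || byteorder == 1);
          if (byteorder == -1)
              return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
                     ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
          return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
                 ((unsigned long)p[2] << 8) | (unsigned long)p[3];
      }

   Reading the same four bytes with the wrong byte order yields a different
   value, which is why the byte order has to be known or detected before
   decoding.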
534 535 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte 536 order:: 537 538 *byteorder == -1: little endian 539 *byteorder == 0: native order 540 *byteorder == 1: big endian 541 542 If ``*byteorder`` is zero, and the first four bytes of the input data are a 543 byte order mark (BOM), the decoder switches to this byte order and the BOM is 544 not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or 545 ``1``, any byte order mark is copied to the output. 546 547 After completion, *\*byteorder* is set to the current byte order at the end 548 of input data. 549 550 In a narrow build code points outside the BMP will be decoded as surrogate pairs. 551 552 If *byteorder* is *NULL*, the codec starts in native order mode. 553 554 Return *NULL* if an exception was raised by the codec. 555 556 .. versionadded:: 2.6 557 558 559 .. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) 560 561 If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF32`. If 562 *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat 563 trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible 564 by four) as an error. Those bytes will not be decoded and the number of bytes 565 that have been decoded will be stored in *consumed*. 566 567 .. versionadded:: 2.6 568 569 570 .. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) 571 572 Return a Python bytes object holding the UTF-32 encoded value of the Unicode 573 data in *s*. Output is written according to the following byte order:: 574 575 byteorder == -1: little endian 576 byteorder == 0: native byte order (writes a BOM mark) 577 byteorder == 1: big endian 578 579 If byteorder is ``0``, the output string will always start with the Unicode BOM 580 mark (U+FEFF). 
In the other two modes, no BOM mark is prepended. 581 582 If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output 583 as a single code point. 584 585 Return *NULL* if an exception was raised by the codec. 586 587 .. versionadded:: 2.6 588 589 590 .. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode) 591 592 Return a Python string using the UTF-32 encoding in native byte order. The 593 string always starts with a BOM mark. Error handling is "strict". Return 594 *NULL* if an exception was raised by the codec. 595 596 .. versionadded:: 2.6 597 598 599 UTF-16 Codecs 600 """"""""""""" 601 602 These are the UTF-16 codec APIs: 603 604 605 .. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder) 606 607 Decode *size* bytes from a UTF-16 encoded buffer string and return the 608 corresponding Unicode object. *errors* (if non-*NULL*) defines the error 609 handling. It defaults to "strict". 610 611 If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte 612 order:: 613 614 *byteorder == -1: little endian 615 *byteorder == 0: native order 616 *byteorder == 1: big endian 617 618 If ``*byteorder`` is zero, and the first two bytes of the input data are a 619 byte order mark (BOM), the decoder switches to this byte order and the BOM is 620 not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or 621 ``1``, any byte order mark is copied to the output (where it will result in 622 either a ``\ufeff`` or a ``\ufffe`` character). 623 624 After completion, *\*byteorder* is set to the current byte order at the end 625 of input data. 626 627 If *byteorder* is *NULL*, the codec starts in native order mode. 628 629 Return *NULL* if an exception was raised by the codec. 630 631 .. versionchanged:: 2.5 632 This function used an :c:type:`int` type for *size*. This might require 633 changes in your code for properly supporting 64-bit systems. 634 635 636 .. 
c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) 637 638 If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF16`. If 639 *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat 640 trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a 641 split surrogate pair) as an error. Those bytes will not be decoded and the 642 number of bytes that have been decoded will be stored in *consumed*. 643 644 .. versionadded:: 2.4 645 646 .. versionchanged:: 2.5 647 This function used an :c:type:`int` type for *size* and an :c:type:`int *` 648 type for *consumed*. This might require changes in your code for 649 properly supporting 64-bit systems. 650 651 652 .. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) 653 654 Return a Python string object holding the UTF-16 encoded value of the Unicode 655 data in *s*. Output is written according to the following byte order:: 656 657 byteorder == -1: little endian 658 byteorder == 0: native byte order (writes a BOM mark) 659 byteorder == 1: big endian 660 661 If byteorder is ``0``, the output string will always start with the Unicode BOM 662 mark (U+FEFF). In the other two modes, no BOM mark is prepended. 663 664 If *Py_UNICODE_WIDE* is defined, a single :c:type:`Py_UNICODE` value may get 665 represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE` 666 values is interpreted as a UCS-2 character. 667 668 Return *NULL* if an exception was raised by the codec. 669 670 .. versionchanged:: 2.5 671 This function used an :c:type:`int` type for *size*. This might require 672 changes in your code for properly supporting 64-bit systems. 673 674 675 .. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode) 676 677 Return a Python string using the UTF-16 encoding in native byte order. 
The 678 string always starts with a BOM mark. Error handling is "strict". Return 679 *NULL* if an exception was raised by the codec. 680 681 682 UTF-7 Codecs 683 """""""""""" 684 685 These are the UTF-7 codec APIs: 686 687 688 .. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors) 689 690 Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string 691 *s*. Return *NULL* if an exception was raised by the codec. 692 693 694 .. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed) 695 696 If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeUTF7`. If 697 *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not 698 be treated as an error. Those bytes will not be decoded and the number of 699 bytes that have been decoded will be stored in *consumed*. 700 701 702 .. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors) 703 704 Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and 705 return a Python bytes object. Return *NULL* if an exception was raised by 706 the codec. 707 708 If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise 709 special meaning) will be encoded in base-64. If *base64WhiteSpace* is 710 nonzero, whitespace will be encoded in base-64. Both are set to zero for the 711 Python "utf-7" codec. 712 713 714 Unicode-Escape Codecs 715 """"""""""""""""""""" 716 717 These are the "Unicode Escape" codec APIs: 718 719 720 .. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) 721 722 Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded 723 string *s*. Return *NULL* if an exception was raised by the codec. 724 725 .. versionchanged:: 2.5 726 This function used an :c:type:`int` type for *size*. 
This might require 727 changes in your code for properly supporting 64-bit systems. 728 729 730 .. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size) 731 732 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and 733 return a Python string object. Return *NULL* if an exception was raised by the 734 codec. 735 736 .. versionchanged:: 2.5 737 This function used an :c:type:`int` type for *size*. This might require 738 changes in your code for properly supporting 64-bit systems. 739 740 741 .. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode) 742 743 Encode a Unicode object using Unicode-Escape and return the result as Python 744 string object. Error handling is "strict". Return *NULL* if an exception was 745 raised by the codec. 746 747 748 Raw-Unicode-Escape Codecs 749 """"""""""""""""""""""""" 750 751 These are the "Raw Unicode Escape" codec APIs: 752 753 754 .. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) 755 756 Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape 757 encoded string *s*. Return *NULL* if an exception was raised by the codec. 758 759 .. versionchanged:: 2.5 760 This function used an :c:type:`int` type for *size*. This might require 761 changes in your code for properly supporting 64-bit systems. 762 763 764 .. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors) 765 766 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape 767 and return a Python string object. Return *NULL* if an exception was raised by 768 the codec. 769 770 .. versionchanged:: 2.5 771 This function used an :c:type:`int` type for *size*. This might require 772 changes in your code for properly supporting 64-bit systems. 773 774 775 .. 
c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode) 776 777 Encode a Unicode object using Raw-Unicode-Escape and return the result as 778 Python string object. Error handling is "strict". Return *NULL* if an exception 779 was raised by the codec. 780 781 782 Latin-1 Codecs 783 """""""""""""" 784 785 These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode 786 ordinals and only these are accepted by the codecs during encoding. 787 788 789 .. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors) 790 791 Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string 792 *s*. Return *NULL* if an exception was raised by the codec. 793 794 .. versionchanged:: 2.5 795 This function used an :c:type:`int` type for *size*. This might require 796 changes in your code for properly supporting 64-bit systems. 797 798 799 .. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors) 800 801 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and return 802 a Python string object. Return *NULL* if an exception was raised by the codec. 803 804 .. versionchanged:: 2.5 805 This function used an :c:type:`int` type for *size*. This might require 806 changes in your code for properly supporting 64-bit systems. 807 808 809 .. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode) 810 811 Encode a Unicode object using Latin-1 and return the result as Python string 812 object. Error handling is "strict". Return *NULL* if an exception was raised 813 by the codec. 814 815 816 ASCII Codecs 817 """""""""""" 818 819 These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other 820 codes generate errors. 821 822 823 .. 
.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or ``None``
(meaning "undefined mapping" and causing an error).
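
The decoding rule above can be sketched in standalone C. This is only an
illustrative model, not CPython's implementation: the names
``charmap_decode_strict``, ``charmap_init_identity`` and ``CHARMAP_UNDEFINED``
are hypothetical, a fixed 256-entry table stands in for the mapping object, and
a sentinel value stands in for a ``None`` entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for a None entry ("undefined mapping"). */
#define CHARMAP_UNDEFINED 0xFFFFFFFFu

/*
 * Decode `size` bytes from `s` through a 256-entry table of Unicode ordinals.
 * Returns the number of code points written to `out`, or -1 as soon as a byte
 * has no defined mapping (modelling "strict" error handling).
 * `out` must have room for `size` entries.
 */
static long
charmap_decode_strict(const unsigned char *s, size_t size,
                      const uint32_t table[256], uint32_t *out)
{
    size_t i;

    for (i = 0; i < size; i++) {
        uint32_t ordinal = table[s[i]];
        if (ordinal == CHARMAP_UNDEFINED)
            return -1;              /* undefined mapping: report an error */
        out[i] = ordinal;
    }
    return (long)size;
}

/* Fill a table with the identity (Latin-1) mapping as a starting point. */
static void
charmap_init_identity(uint32_t table[256])
{
    unsigned int i;

    for (i = 0; i < 256; i++)
        table[i] = i;
}
```

Starting from the identity table and setting entry ``0x80`` to ``0x20AC`` makes
the byte ``0x80`` decode to U+20AC, while setting an entry to
``CHARMAP_UNDEFINED`` makes decoding of that byte fail.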

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain those
mappings which map characters to different code points.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can be
   either a dictionary that maps bytes or a Unicode string, the latter being
   treated as a lookup table. Byte values greater than the length of the string
   and U+FFFE "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.
   Error handling is "strict". Return *NULL* if an exception was raised by the
   codec.

The following codec API is special in that it maps Unicode to Unicode.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


..
.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


Methods & Slots
"""""""""""""""

.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is *NULL*, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator.
   At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL*, which indicates
   that the default error handling should be used.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.

   ..
   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search). The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :c:type:`int` type and used an :c:type:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :c:type:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


..
.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.
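
The return convention documented for :c:func:`PyUnicode_Find` (a match index,
``-1`` for no match, ``-2`` for an error) differs from the ``-1`` error return of
the neighbouring functions and is easy to mix up. The following standalone C
sketch models that convention on plain arrays of code points. It is not
CPython's implementation: ``model_find`` is a hypothetical name, and treating a
*NULL* argument or an unknown *direction* as the error case is an assumption
made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the documented return convention:
 *   >= 0  index of the match
 *   -1    no match
 *   -2    error (assumed here: NULL argument, negative length, bad direction)
 * `direction` is 1 for a forward search, -1 for a backward search.
 * `start` and `end` are clamped to [0, len].
 */
static long
model_find(const uint32_t *str, long len,
           const uint32_t *sub, long sublen,
           long start, long end, int direction)
{
    long i, j;

    if (str == NULL || sub == NULL || len < 0 || sublen < 0 ||
        (direction != 1 && direction != -1))
        return -2;

    if (start < 0)
        start = 0;
    if (end > len)
        end = len;
    if (end - start < sublen)
        return -1;

    /* Candidate positions run from start to end - sublen inclusive. */
    i = (direction == 1) ? start : end - sublen;
    while (i >= start && i <= end - sublen) {
        for (j = 0; j < sublen; j++) {
            if (str[i + j] != sub[j])
                break;
        }
        if (j == sublen)
            return i;           /* every code point matched at position i */
        i += direction;
    }
    return -1;
}
```

A caller following this convention has to test for ``-2`` before testing for
``-1``, because only ``-2`` means that an exception is pending.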