Lib:Architecture/Base Layer/Text Module
| Library Module | Text Module |
|---|---|
| Layer | Base |
| API Documentation | Reference Manual |
| Source Files | src/base/pdf-text.h, src/base/pdf-text-host-encoding.h, src/base/pdf-text-host-encoding.c, src/base/pdf-text-ucd-combclass.h, src/base/pdf-text-ucd-combclass.c, src/base/pdf-text-ucd-wordbreak.h, src/base/pdf-text-ucd-wordbreak.c |
Overview
This module provides access to a text abstraction as an encoded sequence of characters. Several text encodings are supported and conversion functions to change the encoding of a text are provided.
Text Variables
The Text Module implements a data abstraction called text variables. It is an opaque type: the client doesn't know details about the implementation of text variables.
A text variable (a variable of type pdf_text_t) holds the following information:
- A text (sequence of characters), which is internally stored in UTF32-HE (Host Endian).
- An ISO 639-1 language code (optional).
- An ISO 3166-1 alpha-2 country code (optional).
Text content
Every text variable can contain any sequence of Unicode points, which is internally stored in UTF32-HE (Host Endian) encoding with no Byte Order Mark. This means that every character is stored as a 32-bit value.
The length of the Unicode point sequence is not stored as the number of characters, but as the number of bytes actually allocated for the data (four times the number of characters).
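As a rough illustration of this storage scheme, the sketch below models a buffer that records its size in bytes and derives the character count from it. The `sketch_text` struct and its field names are hypothetical; they are not the library's actual definition of pdf_text_t, which is opaque.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the opaque text type: UTF-32HE data, no BOM. */
struct sketch_text
{
  uint32_t *data;   /* code points in host byte order */
  size_t    size;   /* size of the data in BYTES, not in characters */
};

/* Number of code points: each one occupies exactly four bytes. */
size_t
sketch_text_length (const struct sketch_text *text)
{
  return text->size / sizeof (uint32_t);
}
```

With three code points stored, `size` would be 12 and `sketch_text_length` would return 3.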
In addition to this UTF32-HE content, the data could also be stored in PDFDocEncoding or UTF16-BE, which would speed up encoding conversions at the cost of more memory. Whether to add this extra feature will be decided in the future, once it is clearer which encoding conversions are used most. It would be simple to add, even as an extra option at configure level.
Country/Language information
The ISO 3166-1 alpha-2 Country Code and ISO 639-1 Language Code are used within PDF strings to describe the text being stored, so that no specific function is needed to guess the language of the text. The Country Code and Language Code are either inherited from UTF16-BE encoded PDF strings (where they appear between U+001B escape sequences) or set explicitly using Text Module functions (see the description of Text Module features below).
Note that this information is always linked to the UTF16-BE PDF string representation. No other encoding representation handles this information.
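The following sketch shows how such an escape could be read from the start of UTF16-BE data. It assumes the layout U+001B, a two-letter language code, an optional two-letter country code, and a closing U+001B; the function name `sketch_parse_lang_escape` is hypothetical and is not part of the Text Module API.

```c
#include <stddef.h>

/* Parse a leading language/country escape from UTF-16BE bytes (the BOM, if
   any, must already have been skipped).  On success, fill `lang' and
   `country' (country is left empty when absent) and return the number of
   bytes consumed.  Return 0 if the data does not start with an escape. */
size_t
sketch_parse_lang_escape (const unsigned char *s, size_t len,
                          char lang[3], char country[3])
{
  size_t units = len / 2;   /* complete UTF-16 code units available */
  size_t close;             /* index of the closing U+001B code unit */
  size_t i;

  lang[0] = country[0] = '\0';
  if (units < 4 || s[0] != 0x00 || s[1] != 0x1B)
    return 0;

  if (s[6] == 0x00 && s[7] == 0x1B)
    close = 3;              /* language code only */
  else if (units >= 6 && s[10] == 0x00 && s[11] == 0x1B)
    close = 5;              /* language and country codes */
  else
    return 0;

  /* Every code unit between the two escapes must be an ASCII letter. */
  for (i = 1; i < close; i++)
    {
      unsigned char hi = s[2 * i];
      unsigned char lo = s[2 * i + 1];

      if (hi != 0x00
          || !((lo >= 'a' && lo <= 'z') || (lo >= 'A' && lo <= 'Z')))
        return 0;
    }

  lang[0] = (char) s[3];
  lang[1] = (char) s[5];
  lang[2] = '\0';
  if (close == 5)
    {
      country[0] = (char) s[7];
      country[1] = (char) s[9];
      country[2] = '\0';
    }
  return (close + 1) * 2;
}
```

A caller would skip the returned number of bytes and treat the remainder as the text that carries the parsed language and country codes.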
List of Word Boundaries
As an optional by-product, the text object stores, when needed, a list of word boundaries. This list is mainly used by the casing algorithms, so that a specific 'casing context' is defined for each Unicode point to be case-converted. The word boundaries are computed using the algorithms explained in Unicode Standard Annex #29: Text Boundaries, and they are stored in a pdf_list_t object.
Encodings
The Text Module has support for the following encodings:
- PDF Doc Encoding
- Unicode Encodings
- Host Encodings
PDF Doc Encoding
The PDF specification defines a custom encoding. The PDF Doc Encoding does not cover all the Unicode characters, just a small subset of 256 of them.
Note: PDF Strings can be encoded in either PDF Doc Encoding or UTF16-BE with BOM
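Since a UTF16-BE PDF string always carries its BOM, the two cases can be told apart by looking at the first two bytes. The sketch below illustrates the check; the enum and function names are hypothetical and are not taken from the library.

```c
#include <stddef.h>

enum sketch_pdf_string_encoding
{
  SKETCH_PDF_DOC_ENCODING,
  SKETCH_UTF16BE
};

/* A PDF string starting with the bytes 0xFE 0xFF (the UTF-16BE BOM) is
   UTF-16BE; anything else is treated as PDF Doc Encoding. */
enum sketch_pdf_string_encoding
sketch_detect_pdf_string_encoding (const unsigned char *s, size_t len)
{
  if (len >= 2 && s[0] == 0xFE && s[1] == 0xFF)
    return SKETCH_UTF16BE;
  return SKETCH_PDF_DOC_ENCODING;
}
```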
Unicode
The following encodings of ISO-10646 (Unicode) are supported:
- UTF-8
- UTF-16 Big Endian
- UTF-16 Little Endian
- UTF-32 Big Endian
- UTF-32 Little Endian
The library also detects system endianness, so host-endian mappings are available as well:
- UTF-16 Host Endian (BE or LE)
- UTF-32 Host Endian (BE or LE)
Note: PDF Strings can be encoded in either PDF Doc Encoding or UTF16-BE with BOM
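One common way to detect endianness at run time, and to map a host-endian encoding onto its BE or LE variant, is sketched below. This is only an illustration of the idea; the function names are hypothetical and the library may well detect endianness differently (for example at configure time).

```c
#include <stdint.h>
#include <string.h>

/* Return 1 on a big-endian host, 0 on a little-endian one. */
int
sketch_host_is_big_endian (void)
{
  const uint32_t probe = 0x01020304u;
  unsigned char first;

  memcpy (&first, &probe, 1);   /* first byte of the value in memory */
  return first == 0x01;
}

/* Store one code point as four bytes in the requested byte order. */
void
sketch_put_utf32 (uint32_t cp, int big_endian, unsigned char out[4])
{
  int i;

  for (i = 0; i < 4; i++)
    {
      int shift = big_endian ? 8 * (3 - i) : 8 * i;

      out[i] = (unsigned char) ((cp >> shift) & 0xFF);
    }
}

/* UTF-32 Host Endian is simply whichever of the two variants the host uses. */
void
sketch_put_utf32he (uint32_t cp, unsigned char out[4])
{
  sketch_put_utf32 (cp, sketch_host_is_big_endian (), out);
}
```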
Host Encodings
In addition to the "internal" built-in encodings such as Unicode and PDF Doc Encoding, the Text Module supports conversions to and from several "host encodings".
The supported host encodings depend on the specific host types:
- GNU and Unix encodings
- Windows encodings
- Mac OS X encodings
The Text Module provides a specific data type to handle host encoding information, pdf_text_host_encoding_t, which stores not only the name of the encoding being used, but also the specific ANSI Code Page identifier.
The full list of supported host encodings for each OS is given in Supported Host Encodings
GNU and Unix encodings
The Text Module uses the nl_langinfo() function to get the encoding configured by the user in the locale information (LC_CTYPE, LC_ALL or LANG). For conversions to and from the internal UTF32-HE encoding of the pdf_char_t type, the Text Module uses the "iconv" API, which is part of the Single Unix Specification (SUS) standard.
- Dependencies in GNU/Linux: 'iconv.h' and 'langinfo.h' headers
- Dependencies in UNIX systems: 'iconv.h' and 'langinfo.h' headers, plus specific system-dependent library
In all modern GNU/Linux operating systems, the 'iconv' utility is provided within the GNU C Library.
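A minimal sketch of this scheme, using only the standard nl_langinfo() and iconv calls, could look as follows. The `sketch_` function names are hypothetical, and the choice of an explicit "UTF-32LE" or "UTF-32BE" target (so that iconv emits no BOM) is an assumption of this example rather than a description of the library's code.

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdint.h>

/* Name of the codeset configured in the user's locale, e.g. "UTF-8". */
const char *
sketch_host_codeset (void)
{
  setlocale (LC_ALL, "");   /* adopt the locale set in the environment */
  return nl_langinfo (CODESET);
}

/* Convert `in_len' bytes encoded in `from' to UTF-32 in host byte order.
   Returns the number of code points written, or (size_t) -1 on error. */
size_t
sketch_to_utf32he (const char *from, const char *in, size_t in_len,
                   uint32_t *out, size_t out_capacity)
{
  /* Name the byte order explicitly so that no BOM is produced. */
  const uint16_t probe = 1;
  const char *to = (*(const unsigned char *) &probe == 1)
                   ? "UTF-32LE" : "UTF-32BE";
  char *src = (char *) in;   /* iconv takes char **, input is not modified */
  char *dst = (char *) out;
  size_t src_left = in_len;
  size_t dst_left = out_capacity * sizeof (uint32_t);
  size_t rc;
  iconv_t cd;

  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &src, &src_left, &dst, &dst_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  return out_capacity - dst_left / sizeof (uint32_t);
}
```

A caller would typically pass the result of `sketch_host_codeset ()` as the `from` argument to convert a host-encoded string. On systems where iconv is not part of the C library, an extra link flag such as `-liconv` may be needed.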
Windows encodings
The Text Module uses the Windows API (GetACP) to get the encoding being used by the system. Through the same API (MultiByteToWideChar and WideCharToMultiByte), the Text Module can convert any Windows-specific host encoding (identified by a unique ANSI Code Page) to the internal UTF32-HE representation of the pdf_text_t element.
A full list of ANSI Code Pages supported by Windows OS is given in the following link: Windows host encodings
- Dependencies in Windows (95/98/Me/NT/2000/XP/Vista): 'windows.h' header, 'Kernel32.lib' library
- Dependencies in Windows CE: 'Winnls.h' header, 'Coreloc.lib' library
Mac OS X encodings
Modern Mac OS X operating systems are fully UNIX 03 compliant, so the same encoding conversion scheme based on the 'iconv' utility used for GNU/Linux and UNIX systems can be applied (See iconv man page at Apple DC).
Filters
It is possible to filter the contents of a text variable through several simple filters supported by the Text Module:
- The Identity filter does not perform any transformation in the encoded text. It is a no-op.
- The Line Endings Normalization filter normalizes EOL sequences, converting all types of line endings to the platform-specific EOL type.
- The Upper Case filter makes all text upper case.
- The Lower Case filter makes all text lower case.
- The Title Case filter makes all text title case.
- The Remove Ampersands filter removes all single ampersands, substituting ` &' with a white space character. This filter also transforms ` && ' into ` & '.
- The Normalize with Full-Width characters filter transforms all the Unicode points into the corresponding Full-Width variant (if available).
- The Remove Line Endings filter replaces line endings with space characters.
More than one filter can be applied at the same time (in the same function call), except for the case conversion filters ('Title Case', 'Upper Case', 'Lower Case'), of which only one is applied per function call. 'Remove Line Endings' and 'Line Endings Normalization' could be restricted in the same way, but for these two the resulting string does not depend on the order in which the filters are applied. See Lib:Reference_Manual for more details.
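One way to picture several filters being requested in a single call is a bitmask of flags, as in the sketch below. The flag names and the `sketch_filter` function are hypothetical and do not reproduce the library's API. The sketch only handles ASCII letters, replaces each CR and LF individually, and rejects a call that asks for more than one case filter, which is just one possible way of handling that restriction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter flags, combinable with bitwise OR. */
enum
{
  SKETCH_FILTER_REMOVE_LINE_ENDINGS = 1 << 0,
  SKETCH_FILTER_UPPER_CASE          = 1 << 1,
  SKETCH_FILTER_LOWER_CASE          = 1 << 2
};

/* Apply the requested filters in place to UTF-32HE text.  Returns 0 on
   success, or -1 if more than one case conversion filter is requested. */
int
sketch_filter (uint32_t *text, size_t length, unsigned int flags)
{
  unsigned int cases = flags & (SKETCH_FILTER_UPPER_CASE
                                | SKETCH_FILTER_LOWER_CASE);
  size_t i;

  if (cases == (SKETCH_FILTER_UPPER_CASE | SKETCH_FILTER_LOWER_CASE))
    return -1;

  for (i = 0; i < length; i++)
    {
      uint32_t c = text[i];

      if ((flags & SKETCH_FILTER_REMOVE_LINE_ENDINGS)
          && (c == '\n' || c == '\r'))
        c = ' ';
      if ((flags & SKETCH_FILTER_UPPER_CASE) && c >= 'a' && c <= 'z')
        c -= 'a' - 'A';
      if ((flags & SKETCH_FILTER_LOWER_CASE) && c >= 'A' && c <= 'Z')
        c += 'a' - 'A';
      text[i] = c;
    }
  return 0;
}
```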
Operations
The following operations are supported by the Text Module. See the Lib:Reference_Manual for more details on the API.
Creation and Destruction of Text Variables
- Create a new text variable containing no text.
- Dup a new text variable from an existing one, copying the contents.
- Create a new text variable and initialize it with a given host encoded string.
- Create a new text variable and initialize it with a given PDF string text representation (encoded in either PDF Doc Encoding or UTF16-BE). This function should be used with caution, as an input UTF16-BE PDF string could be converted into more than one pdf_text_t element if country/language code information is included within the data.
- Create a new text variable from a string of Unicode characters in a given Unicode encoding.
- Create a new text variable containing the textual representation of a given integer.
- Destroy a given text variable and its contents.
Manipulation of Text Properties
- Get the country code associated with a text variable.
- Get the language code associated with a text variable.
- Associate a text variable with a country code.
- Associate a text variable with a language code.
- Determine if a given text variable is empty (contains no text).
Manipulation of Text Contents
- Obtain the host encoding configured in the system or user environment.
- Guess the best available host encoding to encode the contents of a given text variable.
- Get the contents of a text variable encoded in a given host encoding.
- Get the contents of a text variable encoded in PDF Doc Encoding.
- Get the contents of a text variable encoded in a Unicode encoding (with or without BOM, with or without Country/Language Codes).
- Set the value of a text variable to a string encoded with some host encoding.
- Set the value of a text variable to a string encoded with PDF Doc Encoding.
- Set the value of a text variable to a string encoded with a Unicode encoding (with or without BOM).
- Concatenate the contents of two text variables.
- Replace a fixed pattern in the content of a given text variable.
- Replace a fixed ASCII pattern in the content of a given text variable.
- Filter the contents of a text variable with a predefined filter (See Filters section above).
Comparison of Text Variables
- Compare the contents (data and country/language information) of two text variables.
Annex
Unicode Character Database
The text module uses several algorithms (casing, normalization, ...) described in the Unicode standard, so it makes extensive use of the Unicode Character Database (UCD). The following table shows the specific files from the UCD that are used in the Text Module.
| UCD file | Contents considered | Link to online UCD |
|---|---|---|
| UnicodeData.txt | Simple Lowercase, Simple Uppercase, Simple Titlecase, General Category, Combining Class | UnicodeData.txt |
| SpecialCasing.txt | All contents of file | SpecialCasing.txt |
| WordBreakProperty.txt | All contents of file | WordBreakProperty.txt |
| PropList.txt | All contents of file, but only the entries with Soft_Dotted are currently used. | PropList.txt |
The internal required set of data coming from the UCD is self-generated using the `pdf-text-download-and-generate-ucd.sh' script available in the `build-aux' directory within the sources. For more information on how to regenerate the internal UCD data set (due to a new version of Unicode, for example) see README.regenerateUCD
Note: See Exhibit 1, Unicode License Agreement for terms of use of the UCD.
Casing algorithms
The text module provides its own implementation of the casing algorithms, based on Unicode standard v5.1. These casing algorithms can be divided into three different types:
- Standard casing
- Special casing
- Special casing with conditions
For each of these casing types, three different low-level operations are available:
- Uppercase a given character
- Lowercase a given character
- Titlecase a given character
The text module will use these internal low-level functions in both filter operations and case-sensitive comparison.
It must be noted that Unicode is still not perfect, and there are quite a lot of locale-dependent rules that could be applied when performing case-conversion algorithms. This library does not cover those special cases (yet).
Standard casing
Standard casing converts a single Unicode code point into another single Unicode code point. This algorithm covers the casing needs of most languages where casing information is needed. If there is no case information for a given Unicode point (e.g. the uppercase of a letter that is already capital), the same Unicode point is returned.
Examples:
- UpperCase( g ) = G
- UpperCase( a2 ) = A2
- UpperCase( gnu pdf ) = GNU PDF
- UpperCase( Gnu PDF ) = GNU PDF
- TitleCase( gnu pdf ) = Gnu Pdf
- TitleCase( GNU PDF ) = Gnu Pdf
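The examples above can be reproduced with a small sketch of one-to-one casing. It is deliberately limited: only ASCII letters have mappings, and words are separated by plain spaces rather than by the Unicode word boundary rules the module actually relies on. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple (one-to-one) casing.  Points with no mapping are returned as-is. */
uint32_t
sketch_simple_upper (uint32_t cp)
{
  return (cp >= 'a' && cp <= 'z') ? cp - ('a' - 'A') : cp;
}

uint32_t
sketch_simple_lower (uint32_t cp)
{
  return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Title-case in place: the first point of each word is uppercased and the
   remaining points are lowercased.  A space starts a new word here. */
void
sketch_title_case (uint32_t *text, size_t length)
{
  int at_word_start = 1;
  size_t i;

  for (i = 0; i < length; i++)
    {
      if (text[i] == ' ')
        {
          at_word_start = 1;
          continue;
        }
      text[i] = at_word_start ? sketch_simple_upper (text[i])
                              : sketch_simple_lower (text[i]);
      at_word_start = 0;
    }
}
```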
Special casing
There are special unicode points where a case operation converts a single unicode point into 0 or more unicode points. The text module interface is designed to allow such operations, by specifying not only the output unicode points, but also the number of points returned.
Examples:
- UpperCase(HEIß) = HEISS
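An interface of this kind, returning both the output points and their count, might be sketched as follows. The function name and the `SKETCH_MAX_CASE_EXPANSION` constant are hypothetical, and only two entries of SpecialCasing.txt (U+00DF and U+FB03) are hard-coded here for illustration; everything else falls back to an ASCII-only one-to-one mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest number of points a single point expands to in this sketch. */
#define SKETCH_MAX_CASE_EXPANSION 3

/* Uppercase one code point.  The result is written to `out' and the number
   of points written is returned, since special casing may yield several. */
size_t
sketch_special_upper (uint32_t cp, uint32_t out[SKETCH_MAX_CASE_EXPANSION])
{
  switch (cp)
    {
    case 0x00DF:              /* LATIN SMALL LETTER SHARP S -> "SS" */
      out[0] = 'S';
      out[1] = 'S';
      return 2;
    case 0xFB03:              /* LATIN SMALL LIGATURE FFI -> "FFI" */
      out[0] = 'F';
      out[1] = 'F';
      out[2] = 'I';
      return 3;
    default:                  /* one-to-one fallback, ASCII only */
      out[0] = (cp >= 'a' && cp <= 'z') ? cp - ('a' - 'A') : cp;
      return 1;
    }
}
```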
Special casing with conditions
The most complicated case algorithm is that involving specific conditions that must be met in order to apply the special case defined in Unicode. The types of conditions are listed below:
a) The case conversion should be done only if the user's locale matches the one specified as a condition. In the GNU PDF implementation, the user's locale is not checked: the language/country information embedded in PDF strings is used instead. This kind of language context check allows the text module to provide correct uppercase and lowercase Turkish variants for dotless 'i', for example.
b) The character can be in a particular Casing Context, as specified below.
- Final_Sigma: The given character is preceded by a sequence consisting of a cased Unicode point (with case information) and a case-ignorable (without case information) sequence of points. The given character must not be followed by a sequence consisting of a case-ignorable sequence and a cased letter.
- After_Soft_Dotted: There is a character with the Soft_Dotted property before the character to convert, with no intervening character of combining class 0 or 230.
- More_Above: The character to convert is followed by a character of combining class 230 (Above) with no intervening character of combining class 0.
- Before_Dot: The character to convert is followed by COMBINING DOT ABOVE (U+0307). Any sequence of characters with a combining class that is neither 0 nor 230 may intervene between the current character and the combining dot above.
- After_I: There is an uppercase I before the character to convert, with no intervening character of combining class 230 (Above) or 0.
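As an illustration of one of these contexts, the sketch below checks the Final_Sigma condition when lowercasing GREEK CAPITAL LETTER SIGMA (U+03A3). The `sketch_is_cased` and `sketch_is_case_ignorable` helpers are hypothetical and cover only a handful of code points; a real implementation would derive these properties from the UCD data.

```c
#include <stddef.h>
#include <stdint.h>

/* Tiny stand-in for the "cased" property: ASCII and basic Greek letters. */
int
sketch_is_cased (uint32_t cp)
{
  return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
         || (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2)
         || (cp >= 0x03B1 && cp <= 0x03C9);
}

/* Tiny stand-in for "case-ignorable": apostrophes, soft hyphen and the
   combining diacritical marks block. */
int
sketch_is_case_ignorable (uint32_t cp)
{
  return cp == 0x0027 || cp == 0x00AD || cp == 0x2019
         || (cp >= 0x0300 && cp <= 0x036F);
}

/* Final_Sigma: a cased point precedes position `i' (case-ignorable points
   may intervene) and no cased point follows it in the same way. */
int
sketch_is_final_sigma (const uint32_t *text, size_t length, size_t i)
{
  size_t k = i;

  while (k > 0 && sketch_is_case_ignorable (text[k - 1]))
    k--;
  if (k == 0 || !sketch_is_cased (text[k - 1]))
    return 0;

  k = i + 1;
  while (k < length && sketch_is_case_ignorable (text[k]))
    k++;
  return !(k < length && sketch_is_cased (text[k]));
}

/* Lowercase the capital sigma at position `i': U+03C2 (final form) when
   the context matches, U+03C3 otherwise. */
uint32_t
sketch_lower_sigma (const uint32_t *text, size_t length, size_t i)
{
  return sketch_is_final_sigma (text, length, i) ? 0x03C2 : 0x03C3;
}
```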
Word boundaries, according to Unicode
The text module provides its own implementation of word boundary detection based on Unicode Standard Annex #29. The algorithm applies 14 different boundary rules to the UTF-32HE text stored within the text object, and it is mainly used by the case conversion algorithms.
It must be noted that Unicode is still not perfect, and there are quite a lot of locale-dependent rules that could be applied when looking for word boundaries. This library does not cover those special cases (yet).
References
1. Default Case Algorithms. Section 3.13, The Unicode Standard 5.0
2. Word boundaries algorithm. Section 4, Unicode Text Segmentation



