Home | History | Annotate | only in /external/hyphenation
Up to higher level directory
NameDateSize
Android.mk15-Nov-2011863
AUTHORS15-Nov-2011772
COPYING15-Nov-2011890
COPYING.MPL15-Nov-201125.2K
csutil.c15-Nov-201176K
csutil.h15-Nov-20111.4K
example.c15-Nov-20115.1K
hnjalloc.c15-Nov-20111.9K
hnjalloc.h15-Nov-20111.5K
hyph_en_US.dic15-Nov-201191.1K
hyphen.c15-Nov-201134.1K
hyphen.h15-Nov-20115.3K
patches/15-Nov-2011
README15-Nov-20113.9K
README.compound15-Nov-20112K
README.hyphen15-Nov-20114.8K
README.nonstandard15-Nov-20114.1K
substrings.c15-Nov-20116.2K
substrings.pl15-Nov-20114.9K

README

      1 Hyphen - hyphenation library to use converted TeX hyphenation patterns
      2  
      3 (C) 1998 Raph Levien
      4 (C) 2001 ALTLinux, Moscow
      5 (C) 2006, 2007, 2008 Lszl Nmeth
      6  
      7 This was part of libHnj library by Raph Levien.
      8  
      9 Peter Novodvorsky from ALTLinux cut hyphenation part from libHnj
     10 to use it in OpenOffice.org.
     11  
     12 Compound word and non-standard hyphenation support by Lszl Nmeth.
     13   
     14 License is the original LibHnj license:
     15 LibHnj is dual licensed under LGPL and MPL (see also README.libhnj).
     16 
     17 Because LGPL allows GPL relicensing, COPYING contains now 
     18 LGPL/GPL/MPL tri-license for explicit Mozilla source compatibility.
     19 
     20 Original Libhnj source with OOo's patches are managed by Rene Engelhard
     21 and Chris Halls at Debian:
     22 
     23 http://packages.debian.org/stable/libdevel/libhnj-dev
     24 and http://packages.debian.org/unstable/source/libhnj
     25 
     26 
     27 OTHER FILES
     28 
     29 This distribution is the source of the en_US hyphenation patterns
     30 "hyph_en_US.dic", too. See README_hyph_en_US.txt.
     31 
     32 Source files of hyph_en_US.dic in the distribution:
     33 
     34 hyphen.tex (en_US hyphenation patterns from plain TeX)
     35 
     36   Source: http://tug.ctan.org/text-archive/macros/plain/base/hyphen.tex
     37 
     38 tbhyphext.tex: hyphenation exception log from TugBoat archive
     39 
     40   Source of the hyphenation exception list: 
     41   http://www.ctan.org/tex-archive/info/digests/tugboat/tb0hyf.tex
     42 
     43   Generated with the hyphenex script
     44   (http://www.ctan.org/tex-archive/info/digests/tugboat/hyphenex.sh)
     45 
     46   sh hyphenex.sh <tb0hyf.tex >tbhyphext.tex
     47 
     48 
     49 INSTALLATION
     50 
     51 ./configure
     52 make
     53 make install
     54 
     55 UNIT TESTS (WITH VALGRIND DEBUGGER)
     56 
     57 make check
     58 VALGRIND=memcheck make check
     59 
     60 USAGE
     61 
     62 ./example hyph_en_US.dic mywords.txt
     63 
     64 or (under Linux)
     65 
     66 echo example | ./example hyph_en_US.dic /dev/stdin
     67 
     68 NOTE: In the case of Unicode encoded input, convert your words
     69 to lowercase before hyphenation (under UTF-8 console environment):
     70 
     71 cat mywords.txt | awk '{print tolower($0)}' >mywordslow.txt
     72 
     73 DEVELOPMENT
     74 
     75 See README.hyphen for hyphenation algorithm, README.nonstandard
     76 and doc/tb87nemeth.pdf for non-standard hyphenation,
     77 README.compound for compound word hyphenation, and tests/*.
     78 
     79 Description of the dictionary format:
     80 
     81 First line contains the character encoding (ISO8859-x, UTF-8).
     82 
     83 Possible options in the following lines:
     84 
     85 LEFTHYPHENMIN num          minimal hyphenation distance from the left word end
     86 RIGHTHYPHENMIN num         minimal hyphation distance from the right word end
     87 COMPOUNDLEFTHYPHENMIN num  min. hyph. dist. from the left compound word boundary
     88 COMPOUNDRIGHTHYPHENMIN num min. hyph. dist. from the right comp. word boundary
     89 
     90 hyphenation patterns       see README.* files
     91 
     92 NEXTWORD                   separate the two compound sets (see README.compound)
     93 
     94 Default values:
     95 Without explicite declarations, hyphenmin fields of dict struct
     96 are zeroes, but in this case the lefthyphenmin and righthyphenmin
     97 will be the default 2 under the hyphenation (for backward compatibility).
     98 
     99 Comments
    100 
    101 Use percent sign at the beginning of the lines to add comments to your
    102 hpyhenation patterns (after the character encoding in the first line):
    103 
    104 % comment
    105 
    106 *****************************************************************************
    107 * Warning! Correct working of Libhnj *needs* prepared hyphenation patterns. *
    108 
    109 For example, generating hyph_en_US.dic from "hyphen.us" TeX patterns:
    110     
    111 perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1
    112 
    113 or with default LEFTHYPHENMIN and RIGHTHYPHENMIN values:
    114 
    115 perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1 2 3
    116 perl substrings.pl hyphen.gb hyph_en_GB.dic ISO8859-1 3 3
    117 ****************************************************************************
    118 
    119 OTHERS
    120 
    121 Java hyphenation: Peter B. West (Folio project) implements a hyphenator with
    122 non standard hyphenation facilities based on extended Libhnj. The HyFo module
    123 is released in binary form as jar files and in source form as zip files.
    124 See http://sourceforge.net/project/showfiles.php?group_id=119136
    125 
    126 Lszl Nmeth
    127 <nemeth (at) openoffice (dot) org>
    128 

README.compound

      1 Compound word hyphenation
      2 
      3 Hyphen library supports better compound word hyphenation and special
      4 rules of compound word hyphenation of German languages and other
      5 languages with arbitrary number of compound words. The new options,
      6 COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN help to set the right
      7 style for the hyphenation of compound words.
      8 
      9 Algorithm
     10 
     11 The algorithm is an extension of the original pattern based hyphenation
     12 algorithm. It uses two hyphenation pattern sets, defined in the same
     13 pattern file and separated by the NEXTLEVEL keyword. First pattern
     14 set is for hyphenation only at compound word boundaries, the second one
     15 is for hyphenation within words or word parts.
     16 
     17 Recursive compound level hyphenation
     18 
     19 The algorithm is recursive: every word parts of a successful 
     20 first (compound) level hyphenation will be rehyphenated
     21 by the same (first) pattern set.
     22 
     23 Finally, when first level hyphenation is not possible, Hyphen uses
     24 the second level hyphenation for the word or the word parts.
     25 
     26 Word endings and word parts
     27 
     28 Patterns for word endings (patterns with ellipses) match the
     29 word parts, too.
     30 
     31 Options
     32 
     33 COMPOUNDLEFTHYPHENMIN: min. hyph. dist. from the left compound word boundary
     34 COMPOUNDRIGHTHYPHENMIN: min. hyph. dist. from the right comp. word boundary
     35 NEXTLEVEL: sign second level hyphenation patterns
     36 
     37 Default hyphenmin values
     38 
     39 Default values of COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN are 0,
     40 and 0 under the hyphenation, too. ("0" values of
     41 LEFTHYPHENMIN and RIGHTHYPHENMIN mean the default "2" under the hyphenation.)
     42 
     43 Examples
     44 
     45 See tests/compound* test files.
     46 
     47 Preparation of hyphenation patterns
     48 
     49 It hasn't been special pattern generator tool for compound hyphenation
     50 patterns, yet. It is possible to use PATGEN to generate both of
     51 pattern sets, concatenate it manually and set the requested HYPHENMIN values.
     52 (But don't forget the preprocessing steps by substrings.pl before
     53 concatenation.) One of the disadvantage of this method, that PATGEN
     54 doesn't know recursive compound hyphenation of Hyphen.
     55 
     56 Lszl Nmeth
     57 <nemeth (at) openoffice.org>
     58 

README.hyphen

      1 Brief explanation of the hyphenation algorithm herein.[1]
      2 
      3 Raph Levien <raph (a] acm.org>
      4 4 Aug 1998
      5 
      6    The hyphenation algorithm is basically the same as Knuth's TeX
      7 algorithm. However, the implementation is quite a bit faster.
      8 
      9    The hyphenation files from TeX can almost be used directly. There
     10 is a preprocessing step, however. If you don't do the preprocessing
     11 step, you'll get bad hyphenations (i.e. a silent failure).
     12 
     13    Start with a file such as hyphen.us. This is the TeX ushyph1.tex
     14 file, with the exception dictionary encoded using the same rules as
     15 the main portion of the file. Any line beginning with % is a comment.
     16 Each other line should contain exactly one rule.
     17 
     18    Then, do the preprocessing - "perl substrings.pl hyphen.us". The
     19 resulting file is hyphen.mashed. It's in Perl, and it's fairly slow
     20 (it uses brute force algorithms; about 17 seconds on a P100), but it
     21 could probably be redone in C with clever algorithms. This would be
     22 valuable, for example, if it was handle user-supplied exception
     23 dictionaries by integrating them into the rule table.[2]
     24 
     25    Once the rules are preprocessed, loading them is quite quick -
     26 about 200ms on a P100. It then hyphenates at about 40,000 words per
     27 second on a P100. I haven't benchmarked it against other
     28 implementations (both TeX and groff contain essentially the same
     29 algorithm), but expect that it runs quite a bit faster than any of
     30 them.
     31 
     32 Knuth's algorithm
     33 
     34    This section contains a brief explanation of Knuth's algorithm, in
     35 case you missed it from the TeX books. We'll use the semi-word
     36 "example" as our running example.
     37 
     38    Since the beginning and end of a word are special, the algorithm is
     39 actually run over the prepared word (prep_word in the source)
     40 ".example.". Knuths algorithm basically just does pattern matches from
     41 the rule set, then applies the matches. The patterns in this case that
     42 match are "xa", "xam", "mp", and "pl". These are actually stored as
     43 "x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between
     44 the letters, they are added in. If two (or more) patterns have numbers
     45 in the same place, the highest number wins. Here's the example:
     46 
     47  . e x a m p l e .
     48      x1a
     49      x a m3
     50         4m1p
     51           1p2l2
     52  -----------------
     53  . e x1a4m3p2l2e .
     54 
     55    Finally, hyphens are placed wherever odd numbers appear. They are,
     56 however, suppressed after the first letter and before the last letter
     57 of the word (TeX actually suppresses them before the next-to-last, as
     58 well). So, it's "ex-am-ple", which is correct.
     59 
     60    Knuth uses a trie to implement this. I.e. he stores each rule in a
     61 trie structure. For each position in the word, he searches the trie,
     62 searching for a match. Most patterns are short, so efficiency should
     63 be quite good.
     64 
     65 Theory of the algorithm
     66 
     67    The algorithm works as a slightly modified finite state machine.
     68 There are two kinds of transitions: those that consume one letter of
     69 input (which work just like your regular finite state machine), and
     70 "fallback" transitions, which don't consume any input. If no
     71 transition matching the next letter is found, the fallback is used.
     72 One way of looking at this is a form of compression of the transition
     73 tables - i.e. it behaves the same as a completely vanilla state
     74 machine in which the actual transition table of a node is made up of
     75 the union of transition tables of the node itself, plus its fallbacks.
     76 
     77    Each state is represented by a string. Thus, if the current state
     78 is "am" and the next letter is "p", then the next state is "amp".
     79 Fallback transitions go to states which chop off one or (sometimes)
     80 more letters from the beginning. For example, if none of the
     81 transitions from "amp" match the next letter, then it will fall back
     82 to "mp". Similarly, if none of the transitions from "mp" match the
     83 next letter, it will fall back to "m".
     84 
     85    Each state is also associated with a (possibly null) "match"
     86 string. This represents the union of all patterns which are
     87 right-justified substrings of the match string. I.e. the pattern "mp"
     88 is a right-justified substring of the state "amp", so it's numbers get
     89 added in. The actual calculation of this union is done by the
     90 Perl preprocessing script, but could probably be done in C just about
     91 as easily.
     92 
     93    Because each state transition either consumes one input character
     94 or shortens the state string by one character, the total number of
     95 state transitions is linear in the length of the word.
     96 
     97 [1] Documentations:
     98 
     99 Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
    100 Stanford University, 1983. http://www.tug.org/docs/liang.
    101 
    102 Lszl Nmeth: Automatic non-standard hyphenation in OpenOffice.org,
    103 TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf
    104 
    105 [2] There is the C version of pattern converter "substrings.c"
    106 in the distribution written by Nanning Buitenhuis. Unfortunatelly,
    107 this version hasn't handled the non standard extension of the
    108 algorithm, yet.
    109 

README.nonstandard

      1 Non-standard hyphenation
      2 ------------------------
      3 
      4 Some languages use non-standard hyphenation; `discretionary'
      5 character changes at hyphenation points. For example,
      6 Catalan: parallel -> paral-lel,
      7 Dutch: omaatje -> oma-tje,
      8 German (before the new orthography): Schiffahrt -> Schiff-fahrt,
      9 Hungarian: asszonnyal -> asz-szony-nyal (multiple occurance!)
     10 Swedish: tillata -> till-lata.
     11 
     12 Using this extended library, you can define 
     13 non-standard hyphenation patterns. For example:
     14 
     15 l1l/l=l
     16 a1atje./a=t,1,3
     17 .schif1fahrt/ff=f,5,2
     18 .as3szon/sz=sz,2,3
     19 n1nyal./ny=ny,1,3
     20 .til1lata./ll=l,3,2
     21 
     22 or with narrow boundaries:
     23 
     24 l1l/l=,1,2
     25 a1atje./a=,1,1
     26 .schif1fahrt/ff=,5,1
     27 .as3szon/sz=,2,1
     28 n1nyal./ny=,1,1
     29 .til1lata./ll=,3,1
     30 
     31 Note: Libhnj uses modified patterns by preparing substrings.pl.
     32 Unfortunatelly, now the conversion step can generate bad non-standard
     33 patterns (non-standard -> standard pattern conversion), so using
     34 narrow boundaries may be better for recent Libhnj. For example,
     35 substrings.pl generates a few bad patterns for Hungarian hyphenation
     36 patterns resulting bad non-standard hyphenation in a few cases. Using narrow
     37 boundaries solves this problem. Java HyFo module can check this problem.
     38 
     39 Syntax of the non-standard hyphenation patterns
     40 ------------------------------------------------
     41 
     42 pat1tern/change[,start,cut]
     43 
     44 If this pattern matches the word, and this pattern win (see README.hyphen)
     45 in the change region of the pattern, then pattern[start, start + cut - 1]
     46 substring will be replaced with the "change".
     47 
     48 For example, a German ff -> ff-f hyphenation:
     49 
     50 f1f/ff=f 
     51 
     52 or with expansion
     53 
     54 f1f/ff=f,1,2
     55 
     56 will change every "ff" with "ff=f" at hyphenation.
     57 
     58 A more real example:
     59 
     60 % simple ff -> f-f hyphenation
     61 f1f
     62 % Schiffahrt -> Schiff-fahrt hyphenation
     63 % 
     64 schif3fahrt/ff=f,5,2
     65 
     66 Specification
     67 
     68 - Pattern: matching patterns of the original Liang's algorithm
     69   - patterns must contain only one hyphenation point at change region
     70     signed with an one-digit odd number (1, 3, 5, 7 or 9).
     71     These point may be at subregion boundaries: schif3fahrt/ff=,5,1
     72   - only the greater value guarantees the win (don't mix non-standard and
     73     non-standard patterns with the same value, for example
     74     instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)
     75 
     76 - Change: new characters.
     77   Arbitrary character sequence. Equal sign (=) signs hyphenation points
     78   for OpenOffice.org (like in the example). (In a possible German LaTeX
     79   preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz
     80   with `ssz, according to the German and Hungarian Babel settings.)
     81 
     82 - Start: starting position of the change region.
     83   - begins with 1 (not 0): schif3fahrt/ff=f,5,2
     84   - start dot doesn't matter: .schif3fahrt/ff=f,5,2
     85   - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
     86   - In UTF-8 encoding, use Unicode character positions: ssze/sz=sz,2,3
     87     ("ssze" looks "ssze" in an ISO 8859-1 8-bit editor). 
     88 
     89 - Cut: length of the removed character sequence in the original word.
     90   - In UTF-8 encoding, use Unicode character length: paral1lel/l=l,5,3
     91     ("parallel" looks "paral1lel" in an ISO 8859-1 8-bit editor).
     92 
     93 Dictionary developing
     94 ---------------------
     95 
     96 There hasn't been extended PatGen pattern generator for non-standard
     97 hyphenation patterns, yet.
     98 
     99 Fortunatelly, non-standard hyphenation points are forbidden in the PatGen
    100 generated hyphenation patterns, so with a little patch can be develop
    101 non-standard hyphenation patterns also in this case.
    102 
    103 Warning: If you use UTF-8 Unicode encoding in your patterns, call
    104 substrings.pl with UTF-8 parameter to calculate right
    105 character positions for non-standard hyphenation:
    106 
    107 ./substrings.pl input output UTF-8
    108 
    109 Programming
    110 -----------
    111 
    112 Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
    113 See hyphen.h for the documentation of the hyphenate*() functions.
    114 See example.c for processing the output of the hyphenate*() functions.
    115 
    116 Warning: change characters are lower cased in the source, so you may need
    117 case conversion of the change characters based on input word case detection.
    118 For example, see OpenOffice.org source
    119 (lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
    120 
    121 Lszl Nmeth
    122 <nemeth (at) openoffice.org>
    123