                       ==============================
                       The Google URL Parsing Library
                       ==============================

This is the Google URL Parsing Library which parses and canonicalizes URLs.
Please see the LICENSE.txt file for licensing information.

Features
========

   * Easily embeddable: This library was written with a variety of client and
     server programs in mind, so unlike most implementations of URL parsing
     and canonicalization, it can be easily embedded.

   * Fast: Hundreds of thousands of typical URLs can be parsed and
     canonicalized per second on a modern CPU. It is much faster than, for
     example, calling WinInet's corresponding functions.

   * Compatible: Where possible, this library strives for IE7 compatibility,
     both for general web compatibility and so that IE add-ons or other
     applications that communicate with or embed IE will work properly.

     It supports Unix-style file URLs, as well as the more complex rules for
     Windows file URLs. Note that total compatibility is not possible (for
     example, IE6 and IE7 disagree about how to parse certain IP addresses),
     and that this library is stricter than IE about certain illegal, rarely
     used, and potentially dangerous constructs, such as escaped control
     characters in host names, which IE allows. It is typically a little less
     strict than Firefox.


Example
=======

An example implementation of a URL object that uses this library is provided
in src/gurl.*. This implementation uses the "application integration" layer
discussed below to interface with the low-level parsing and canonicalization
functions.


Building
========

The canonicalization files require ICU for some UTF-8 and UTF-16 conversion
macros. If your project does not use ICU, it should be straightforward to
factor out the ICU macros and functions; only a few well-isolated things are
used.

TODO(brettw) ADD INSTRUCTIONS FOR GETTING ICU HERE!

logging.h and logging.cc are Windows-only because the corresponding Unix
logging system has many dependencies. This library uses few of the logging
macros, and a dummy header can easily be written that defines the
appropriate things for Unix.
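
As a rough illustration of such a dummy header, the sketch below maps two
assertion-style macros onto the standard assert(). It assumes that DCHECK and
NOTREACHED are among the macros the sources use; that set is an assumption, so
check the files that include logging.h for the real list. The file name is
hypothetical, and this version does not support streaming a message into the
macros with <<.

```cpp
// dummy_logging.h (hypothetical file name): minimal stand-ins for logging
// macros, built on the standard assert(). The macro names are assumptions
// about what the library sources need; adjust them to match.
#ifndef DUMMY_LOGGING_H_
#define DUMMY_LOGGING_H_

#include <cassert>

// Debug-only check: aborts in debug builds, compiles away under NDEBUG.
#define DCHECK(condition) assert(condition)

// Marks code paths that should never execute.
#define NOTREACHED() assert(false)

#endif  // DUMMY_LOGGING_H_
```

Because both macros expand to assert(), they disappear entirely when NDEBUG is
defined, which is usually what a release build wants.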


Definitions
===========

"Standard URL": A URL with an "authority", which is a hostname and optionally
   a port, username, and password. Most URLs, such as HTTP and FTP URLs, are
   standard.

"File URL": A URL that references a file on disk. There are special rules for
   this type of URL. Note that it may have a hostname! "localhost" is allowed;
   for example, "file://localhost/foo" is the same as "file:///foo".

"Path URL": This is everything else. There is no standard on how to treat these
   URLs, or even what they are called. This library decomposes them into a
   scheme and a path. The path is everything following the scheme. This type of
   URL includes "javascript", "data", and even "mailto" (although "mailto"
   might look like a standard scheme in some respects, it is not).


Design
======

The library is divided into four layers. They are listed here from the lowest
to the highest; you can use any portion of the library as long as you embed the
layers below it.

1. Parsing
----------
At the lowest level is the parsing code. The files encompassing this are
url_parse.*, and the main include file is src/url_parse.h. Given an input
string, this code will parse it into the most likely form of a URL.

Parsing cannot fail and does no validation. The exception is the port number,
which it currently validates, but this is a bug. Given crazy input, the parser
will do its best to find the various URL components according to its rules (see
url_parse_unittest.cc for some examples).

To use this, an application will typically use "ExtractScheme" to determine the
type of a given input URL, and then call one of the initialization functions:
"ParseStandardURL", "ParsePathURL", or "ParseFileURL". This will result in
a "Parsed" structure which identifies the substrings of each identified
component.
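
To make the idea of "identifying substrings" concrete, here is a minimal,
self-contained sketch. It is not the library's code: the Component struct,
FindScheme, and Slice are hypothetical names chosen for illustration, and the
real declarations in src/url_parse.h may differ. The point is only that a
parsed component can be stored as an offset and a length into the original
input, with no copying.

```cpp
#include <string>

// Hypothetical stand-in for a parsed component: a substring of the input
// identified by offset and length. len == -1 means "not present".
struct Component {
  int begin = 0;
  int len = -1;
  bool is_valid() const { return len != -1; }
};

// Simplified scheme finder for this sketch: skips leading spaces and control
// characters, then treats everything up to the first ':' as the scheme.
// Returns false when there is no ':'. It does not validate the scheme text.
bool FindScheme(const std::string& url, Component* scheme) {
  size_t begin = 0;
  while (begin < url.size() && static_cast<unsigned char>(url[begin]) <= ' ')
    ++begin;
  size_t colon = url.find(':', begin);
  if (colon == std::string::npos) {
    *scheme = Component();  // Reset to "not present".
    return false;
  }
  scheme->begin = static_cast<int>(begin);
  scheme->len = static_cast<int>(colon - begin);
  return true;
}

// Returns the text a component refers to, or "" if it is not present.
std::string Slice(const std::string& url, const Component& c) {
  return c.is_valid() ? url.substr(c.begin, c.len) : std::string();
}
```

For the input "  http://example.com/", this sketch reports a scheme component
with begin 2 and length 4, and Slice returns "http".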

2. Canonicalization
-------------------
At the next highest level is canonicalization. The files encompassing this are
url_canon.*, and the main include file is src/url_canon.h. This code will
validate an already-parsed URL and convert it to a canonical form. For
example, this will convert host names to lowercase, convert IP addresses
into dotted-decimal notation, handle encoding issues, etc.

This layer will always do its best to produce a reasonable output string, but
it may report that the string is invalid. For example, if there are invalid
characters in the host name, it will escape them or replace them with the
Unicode "invalid character" character, but it will still report failure. This
way, the program can display error messages to the user with the output, log
it, etc., and the string will have some meaning.
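
The "produce output even on failure" contract can be sketched as follows. This
is an illustration only: CanonicalizeHostSketch is a hypothetical function, and
its allowed-character set (ASCII letters, digits, '-' and '.') is deliberately
simplified and does not reflect the library's actual host rules.

```cpp
#include <string>

// Hypothetical sketch: lowercases ASCII letters, copies digits, '-' and '.',
// and percent-escapes every other byte. Output is always appended to *out;
// the return value is false if any byte had to be escaped.
bool CanonicalizeHostSketch(const std::string& host, std::string* out) {
  static const char kHex[] = "0123456789ABCDEF";
  bool valid = true;
  for (unsigned char c : host) {
    if (c >= 'A' && c <= 'Z') {
      out->push_back(static_cast<char>(c - 'A' + 'a'));
    } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
               c == '-' || c == '.') {
      out->push_back(static_cast<char>(c));
    } else {
      valid = false;  // Remember the failure but keep producing output.
      out->push_back('%');
      out->push_back(kHex[c >> 4]);
      out->push_back(kHex[c & 0xF]);
    }
  }
  return valid;
}
```

A caller can therefore show or log the output string even when the function
returns false, which is the behavior the paragraph above describes.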

Canonicalized output is written to a CanonOutput object, which is a simple
wrapper around an expanding buffer. An implementation called RawCanonOutput is
provided that writes to a raw buffer with a fixed amount statically allocated
(for performance). Applications using STL can use StdStringCanonOutput, defined
in url_canon_stdstring.h, which writes into a std::string.
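
The idea of an expanding buffer with a fixed inline allocation can be sketched
like this. InlineOutput is a hypothetical class written for this example; it
is not the library's CanonOutput or RawCanonOutput, whose interfaces are in
src/url_canon.h. The sketch assumes a doubling growth policy, which is a
common choice but not something this document specifies.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch of an output buffer that starts in fixed inline storage
// and moves to the heap only when the inline capacity is exceeded.
template <size_t kInlineSize>
class InlineOutput {
  static_assert(kInlineSize > 0, "inline storage must not be empty");

 public:
  InlineOutput() : data_(inline_), capacity_(kInlineSize), length_(0) {}
  ~InlineOutput() {
    if (data_ != inline_) delete[] data_;
  }
  InlineOutput(const InlineOutput&) = delete;
  InlineOutput& operator=(const InlineOutput&) = delete;

  void push_back(char c) {
    if (length_ == capacity_) Grow();
    data_[length_++] = c;
  }
  void Append(const char* s, size_t n) {
    for (size_t i = 0; i < n; ++i) push_back(s[i]);
  }
  size_t length() const { return length_; }
  bool on_heap() const { return data_ != inline_; }
  std::string str() const { return std::string(data_, length_); }

 private:
  // Doubles the capacity and copies the existing contents across.
  void Grow() {
    size_t new_capacity = capacity_ * 2;
    char* bigger = new char[new_capacity];
    std::memcpy(bigger, data_, length_);
    if (data_ != inline_) delete[] data_;
    data_ = bigger;
    capacity_ = new_capacity;
  }

  char inline_[kInlineSize];
  char* data_;
  size_t capacity_;
  size_t length_;
};
```

Typical URLs fit in the inline storage, so the common case performs no heap
allocation; only unusually long output pays for a copy.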

A normal application would call one of the three high-level functions
"CanonicalizeStandardURL", "CanonicalizeFileURL", and "CanonicalizePathURL",
depending on the type of URL in question. Lower-level functions are also
provided which will canonicalize individual parts of a URL (for example,
"CanonicalizeHost").

Part of this layer is the integration with the host system for IDN and encoding
conversion. An implementation that provides integration with ICU
(http://www-306.ibm.com/software/globalization/icu/index.jsp) is provided in
src/url_canon_icu.cc. Embedders that do not use ICU may wish to replace this
file with implementations of the functions for their own IDN library.

3. Application integration
--------------------------
The canonicalization and parsing layers do not know anything about the URI
schemes supported by your application. The parsing and canonicalization
functions are very low-level, and you must call the correct function to do the
work (for example, "CanonicalizeFileURL").

The application integration in url_util.* provides wrappers around the
low-level parsing and canonicalization to call the correct versions for
different identified schemes. Embedders will want to modify this file if
necessary to suit the needs of their application.
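
The kind of scheme-based dispatch this layer performs might look roughly like
the sketch below. UrlKind, ClassifyScheme, and ParserNameFor are hypothetical
names, and the list of standard schemes is an example that an embedder would
adjust; none of this is taken from url_util.*. The function names returned as
strings are the parsing entry points named earlier in this document.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// The three URL categories from the Definitions section.
enum class UrlKind { kStandard, kFile, kPath };

// Classifies a scheme, ignoring ASCII case. Unknown schemes fall through to
// kPath, matching the "everything else" definition of a path URL.
UrlKind ClassifyScheme(std::string scheme) {
  std::transform(
      scheme.begin(), scheme.end(), scheme.begin(),
      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (scheme == "file") return UrlKind::kFile;
  // Example list only; an application registers the schemes it supports.
  static const char* const kStandardSchemes[] = {"http", "https", "ftp"};
  for (const char* s : kStandardSchemes) {
    if (scheme == s) return UrlKind::kStandard;
  }
  return UrlKind::kPath;
}

// Names the parsing function a wrapper would call for each category.
const char* ParserNameFor(UrlKind kind) {
  switch (kind) {
    case UrlKind::kStandard: return "ParseStandardURL";
    case UrlKind::kFile: return "ParseFileURL";
    case UrlKind::kPath: return "ParsePathURL";
  }
  return "ParsePathURL";  // Not reached; keeps compilers quiet.
}
```

With this sketch, "HTTP" is classified as standard, "file" as a file URL, and
"mailto" or "javascript" as path URLs.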

4. URL object
-------------
The highest level is the "URL" object that a C++ application would use to
encapsulate a URL. Embedders will typically want to provide their own URL
object that meets the requirements of their system. A reasonably complete
example implementation is provided in src/gurl.*. You may wish to use this
object, extend or modify it, or write your own.

Whitespace
----------
Sometimes, you may want to remove linefeeds and tabs from the content of a URL.
Some web pages, for example, expect that a URL spanning two lines should be
treated as one with the newline removed. Depending on the source of the URLs
you are canonicalizing, these newlines may or may not be trimmed off.

If you want this behavior, call RemoveURLWhitespace before parsing. This will
remove CR, LF, and TAB from the input. Note that it preserves spaces. On typical
URLs, this function produces a 10-15% speed reduction, so it is optional and
not done automatically. The example GURL object and the url_util wrappers do
this for you.
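
The filtering itself is simple. The sketch below shows the behavior described
above (drop CR, LF, and TAB; keep spaces) using a hypothetical helper named
StripUrlWhitespace. It is not the library's RemoveURLWhitespace, whose actual
signature is declared in the library headers.

```cpp
#include <string>

// Hypothetical sketch: returns a copy of the input with every CR, LF, and TAB
// removed. Spaces and all other characters are kept unchanged.
std::string StripUrlWhitespace(const std::string& input) {
  std::string out;
  out.reserve(input.size());
  for (char c : input) {
    if (c != '\r' && c != '\n' && c != '\t') out.push_back(c);
  }
  return out;
}
```

This version always copies the string, which is part of why such a pass costs
time on URLs that contain nothing to remove.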

Tests
=====

There are a number of *_unittest.cc and *_perftest.cc files. These files are
not currently compilable as they rely on a unit testing framework that is not
included. Tests are declared like this:
  TEST(TestCaseName, TestName) {
    ASSERT_TRUE(a);
    EXPECT_EQ(a, b);
  }
If you would like to compile them, it should be straightforward to define
the TEST macro (which would declare a function by combining the two arguments)
and the other macros, whose behavior should be self-explanatory (EXPECT is like
an ASSERT, but does not stop the test; if you are doing this, you probably
don't care about this difference). Then you would define a .cc file that
calls all of these functions.
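
A minimal version of those macros might look like the sketch below. It is an
assumption-laden stand-in, not the framework the tests were written against:
the failure counter, the message format, and the RunAllTests function are all
invented for this example, and only the macros shown in the declaration above
are covered. SchemeTest_FindsColon is a made-up test, not one from the
*_unittest.cc files.

```cpp
#include <cstdio>
#include <cstring>

// Number of failed checks across all tests.
static int g_failures = 0;

// Declares a test as a plain function named TestCaseName_TestName.
#define TEST(test_case, test_name) void test_case##_##test_name()

// Records a failure and keeps running the current test.
#define EXPECT_TRUE(condition)                                        \
  do {                                                                \
    if (!(condition)) {                                               \
      ++g_failures;                                                   \
      std::fprintf(stderr, "%s:%d: check failed: %s\n", __FILE__,     \
                   __LINE__, #condition);                             \
    }                                                                 \
  } while (0)

#define EXPECT_EQ(expected, actual) EXPECT_TRUE((expected) == (actual))

// Records a failure and returns from the test function, ending the test.
#define ASSERT_TRUE(condition)                                        \
  do {                                                                \
    if (!(condition)) {                                               \
      ++g_failures;                                                   \
      std::fprintf(stderr, "%s:%d: assertion failed: %s\n", __FILE__, \
                   __LINE__, #condition);                             \
      return;                                                         \
    }                                                                 \
  } while (0)

// A made-up test written in the declaration style shown above.
TEST(SchemeTest, FindsColon) {
  const char url[] = "http://example.com/";
  const char* colon = std::strchr(url, ':');
  ASSERT_TRUE(colon != nullptr);  // Stops the test if there is no colon.
  EXPECT_EQ(4, static_cast<int>(colon - url));
}

// The hand-written runner: calls every test function and returns the number
// of failed checks.
int RunAllTests() {
  SchemeTest_FindsColon();
  return g_failures;
}
```

A main() that returns RunAllTests() == 0 ? 0 : 1 would then turn this into a
test binary.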