//* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Mozilla TLD Service
 *
 * The Initial Developer of the Original Code is
 * Google Inc.
 * Portions created by the Initial Developer are Copyright (C) 2006
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Pamela Greene <pamg.bugs (at) gmail.com> (original author)
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */

// NB: Modelled after Mozilla's code (originally written by Pamela Greene,
// later modified by others), but almost entirely rewritten for Chrome.

/*
  (Documentation based on the Mozilla documentation currently at
  http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same
  author.)

  The RegistryControlledDomainService examines the hostname of a GURL passed to
  it and determines the longest portion that is controlled by a registrar.
  Although technically the top-level domain (TLD) for a hostname is the last
  dot-portion of the name (such as .com or .org), many domains (such as co.uk)
  function as though they were TLDs, allocating any number of more specific,
  essentially unrelated names beneath them.  For example, .uk is a TLD, but
  nobody is allowed to register a domain directly under .uk; the "effective"
  TLDs are ac.uk, co.uk, and so on.  We wouldn't want to allow any site in
  *.co.uk to set a cookie for the entire co.uk domain, so it's important to be
  able to identify which higher-level domains function as effective TLDs and
  which can be registered.

  The service obtains its information about effective TLDs from a text resource
  that must be in the following format:

  * It should use plain ASCII.
  * It should contain one domain rule per line, terminated with \n, with nothing
    else on the line.  (The last rule in the file may omit the ending \n.)
  * Rules should have been normalized using the same canonicalization that GURL
    applies.  For ASCII, that means they're not case-sensitive, among other
    things; other normalizations are applied for other characters.
  * Each rule should list the entire TLD-like domain name, with any subdomain
    portions separated by dots (.) as usual.
  * Rules should neither begin nor end with a dot.
  * If a hostname matches more than one rule, the most specific rule (that is,
    the one with more dot-levels) will be used.
  * Other than in the case of wildcards (see below), rules do not implicitly
    include their subcomponents.  For example, "bar.baz.uk" does not imply
    "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk"
    will match, but "baz.uk" and "qux.baz.uk" won't.
  * The wildcard character '*' will match any valid sequence of characters.
  * Wildcards may only appear as the entire most specific level of a rule.  That
    is, a wildcard must come at the beginning of a line and must be followed by
    a dot.  (You may not use a wildcard as the entire rule.)
  * A wildcard rule implies a rule for the entire non-wildcard portion.  For
    example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule
    "bar").  This is typically important in the case of exceptions (see below).
  * The exception character '!' before a rule marks an exception to a wildcard
    rule.  If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then
    "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp"
    has an effective TLD of "tokyo.jp" (the exception prevents the wildcard
    match, and we thus fall through to matching on the implied "tokyo.jp" rule
    from the wildcard).
  * If you use an exception rule without a corresponding wildcard rule, the
    behavior is undefined.
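
  The matching rules above can be illustrated with a minimal sketch.  It is
  not the lookup this service performs (which goes through a perfect hash map);
  EffectiveTldSketch and its std::set rule table are hypothetical names used
  only for illustration, and the host is assumed to be canonicalized already,
  with no leading or trailing dots.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper, not part of this header: returns the effective TLD of
// |host| according to |rules|, or an empty string if no rule matches.
// Assumes |host| is already canonicalized and has no leading or trailing dots.
std::string EffectiveTldSketch(const std::string& host,
                               const std::set<std::string>& rules) {
  // Walk the suffixes of |host| from longest to shortest, so that the first
  // hit comes from the most specific rule.
  for (std::size_t start = 0; start != std::string::npos;) {
    const std::string suffix = host.substr(start);
    const std::size_t dot = suffix.find('.');
    const std::string parent =
        dot == std::string::npos ? std::string() : suffix.substr(dot + 1);

    // Exception rule: the wildcard match is suppressed, so fall through to
    // the rule implied by the wildcard's non-wildcard portion.
    if (rules.count("!" + suffix))
      return parent;
    // Exact rule, or the rule implied by a wildcard one level below.
    if (rules.count(suffix) || rules.count("*." + suffix))
      return suffix;
    // Wildcard rule covering the most specific label of |suffix|.
    if (!parent.empty() && rules.count("*." + parent))
      return suffix;

    const std::size_t next = host.find('.', start);
    start = next == std::string::npos ? next : next + 1;
  }
  return std::string();
}
```

  With the rules "*.tokyo.jp" and "!pref.tokyo.jp", this sketch yields
  "b.tokyo.jp" for "a.b.tokyo.jp" and "tokyo.jp" for "a.pref.tokyo.jp",
  matching the example in the list.  It deliberately ignores the cases the
  real service handles separately (IP addresses, trailing dots, hosts that are
  themselves registries, and unknown registries).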

  Firefox has a very similar service, and it's their data file we use to
  construct our resource.  However, the data expected by this implementation
  differs from the Mozilla file in several important ways:
   (1) We require that all single-level TLDs (com, edu, etc.) be explicitly
       listed.  As of this writing, Mozilla's file includes the single-level
       TLDs too, but that might change.
   (2) Our data is expected to be in pure ASCII: all UTF-8 or otherwise encoded
       items must already have been normalized.
   (3) We do not allow comments, rule notes, blank lines, or line endings other
       than LF.
  Rules are also expected to be syntactically valid.

  The utility application tld_cleanup.exe converts a Mozilla-style file into a
  Chrome one, making sure that single-level TLDs are explicitly listed, using
  GURL to normalize rules, and validating the rules.
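
  As a rough illustration of that conversion (not the actual tld_cleanup
  code), the sketch below strips comments, rule notes, blank lines, and CR
  characters, and adds the single-level TLD of every rule.  CleanRulesSketch is
  a hypothetical name; it assumes Mozilla-style comment lines start with "//"
  and that a rule ends at the first whitespace, it uses ASCII lowercasing only
  as a stand-in for GURL canonicalization, and it performs no validation.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for the cleanup step; not the real tld_cleanup code.
// Lowercasing ASCII is only a placeholder for GURL canonicalization.
std::set<std::string> CleanRulesSketch(const std::string& mozilla_text) {
  std::set<std::string> rules;
  std::istringstream lines(mozilla_text);
  std::string line;
  while (std::getline(lines, line)) {
    // Keep only the text before the first whitespace; this drops the CR of
    // CRLF line endings as well as any trailing rule notes.
    const std::size_t end = line.find_first_of(" \t\r");
    if (end != std::string::npos)
      line.erase(end);
    // Skip blank lines and comment lines.
    if (line.empty() || line.compare(0, 2, "//") == 0)
      continue;
    for (std::size_t i = 0; i < line.size(); ++i) {
      line[i] = static_cast<char>(
          std::tolower(static_cast<unsigned char>(line[i])));
    }
    rules.insert(line);
    // Make sure the single-level TLD of every rule is listed explicitly.
    const std::size_t last_dot = line.rfind('.');
    rules.insert(last_dot == std::string::npos ? line
                                               : line.substr(last_dot + 1));
  }
  return rules;
}
```

  Writing the returned rules out one per line, separated by \n, would give a
  file in the format described above.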
*/

#ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_
#define NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_
#pragma once

#include <string>

#include "base/basictypes.h"

class GURL;

template <typename T>
struct DefaultSingletonTraits;
struct DomainRule;

namespace net {

struct RegistryControlledDomainServiceSingletonTraits;

// This class is a singleton.
class RegistryControlledDomainService {
 public:
  ~RegistryControlledDomainService() { }

  // Returns the registered, organization-identifying host and all its registry
  // information, but no subdomains, from the given GURL.  Returns an empty
  // string if the GURL is invalid, has no host (e.g. a file: URL), has multiple
  // trailing dots, is an IP address, has only one subcomponent (i.e. no dots
  // other than leading/trailing ones), or is itself a recognized registry
  // identifier.  If no matching rule is found in the effective-TLD data (or in
  // the default data, if the resource failed to load), the last subcomponent of
  // the host is assumed to be the registry.
  //
  // Examples:
  //   http://www.google.com/file.html -> "google.com"  (com)
  //   http://..google.com/file.html   -> "google.com"  (com)
  //   http://google.com./file.html    -> "google.com." (com)
  //   http://a.b.co.uk/file.html      -> "b.co.uk"     (co.uk)
  //   file:///C:/bar.html             -> ""            (no host)
  //   http://foo.com../file.html      -> ""            (multiple trailing dots)
  //   http://192.168.0.1/file.html    -> ""            (IP address)
  //   http://bar/file.html            -> ""            (no subcomponents)
  //   http://co.uk/file.html          -> ""            (host is a registry)
  //   http://foo.bar/file.html        -> "foo.bar"     (no rule; assume bar)
  static std::string GetDomainAndRegistry(const GURL& gurl);

  // Like the GURL version, but takes a host (which is canonicalized internally)
  // instead of a full GURL.
  static std::string GetDomainAndRegistry(const std::string& host);
  static std::string GetDomainAndRegistry(const std::wstring& host);

  // This convenience function returns true if the two GURLs both have hosts
  // and one of the following is true:
  // * They each have a known domain and registry, and it is the same for both
  //   URLs.  Note that this means the trailing dot, if any, must match too.
  // * They don't have known domains/registries, but the hosts are identical.
  // Effectively, callers can use this function to check whether the input URLs
  // represent hosts "on the same site".
  static bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2);

  // Finds the length in bytes of the registrar portion of the host in the
  // given GURL.  Returns std::string::npos if the GURL is invalid or has no
  // host (e.g. a file: URL).  Returns 0 if the GURL has multiple trailing dots,
  // is an IP address, has no subcomponents, or is itself a recognized registry
  // identifier.  If no matching rule is found in the effective-TLD data (or in
  // the default data, if the resource failed to load), returns 0 if
  // |allow_unknown_registries| is false, or the length of the last subcomponent
  // if |allow_unknown_registries| is true.
  //
  // Examples:
  //   http://www.google.com/file.html -> 3                 (com)
  //   http://..google.com/file.html   -> 3                 (com)
  //   http://google.com./file.html    -> 4                 (com)
  //   http://a.b.co.uk/file.html      -> 5                 (co.uk)
  //   file:///C:/bar.html             -> std::string::npos (no host)
  //   http://foo.com../file.html      -> 0                 (multiple trailing
  //                                                         dots)
  //   http://192.168.0.1/file.html    -> 0                 (IP address)
  //   http://bar/file.html            -> 0                 (no subcomponents)
  //   http://co.uk/file.html          -> 0                 (host is a registry)
  //   http://foo.bar/file.html        -> 0 or 3, depending (no rule; assume
  //                                                         bar)
  static size_t GetRegistryLength(const GURL& gurl,
                                  bool allow_unknown_registries);

  // Like the GURL version, but takes a host (which is canonicalized internally)
  // instead of a full GURL.
  static size_t GetRegistryLength(const std::string& host,
                                  bool allow_unknown_registries);
  static size_t GetRegistryLength(const std::wstring& host,
                                  bool allow_unknown_registries);

  // Returns the singleton instance, after attempting to initialize it.
  // NOTE that if the effective-TLD data resource can't be found, the instance
  // will be initialized and continue operation with simple default TLD data.
  static RegistryControlledDomainService* GetInstance();

    205 
    206  protected:
    207   typedef const struct DomainRule* (*FindDomainPtr)(const char *, unsigned int);
    208 
    209   // The entire protected API is only for unit testing.  I mean it.  Don't make
    210   // me come over there!
    211   RegistryControlledDomainService();
    212 
    213   // Set the RegistryControledDomainService instance to be used internally.
    214   // |instance| will supersede the singleton instance normally used.  If
    215   // |instance| is NULL, normal behavior is restored, and internal operations
    216   // will return to using the singleton.  This function always returns the
    217   // instance set by the most recent call to SetInstance.
    218   static RegistryControlledDomainService* SetInstance(
    219       RegistryControlledDomainService* instance);
    220 
    221   // Used for unit tests, so that a different perfect hash map from the full
    222   // list is used.
    223   static void UseFindDomainFunction(FindDomainPtr function);
    224 
    225  private:
    226   // To allow construction of the internal singleton instance.
    227   friend struct DefaultSingletonTraits<RegistryControlledDomainService>;
    228 
    229   // Internal workings of the static public methods.  See above.
    230   static std::string GetDomainAndRegistryImpl(const std::string& host);
    231   size_t GetRegistryLengthImpl(const std::string& host,
    232                                bool allow_unknown_registries);
    233 
    234   // Function that returns a DomainRule given a domain.
    235   FindDomainPtr find_domain_function_;
    236 
    237   DISALLOW_COPY_AND_ASSIGN(RegistryControlledDomainService);
    238 };
    239 
    240 }  // namespace net
    241 
    242 #endif  // NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_
    243