// Copyright (c) 2012 The Chromium Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// NB: Modelled after Mozilla's code (originally written by Pamela Greene,
// later modified by others), but almost entirely rewritten for Chrome.
//   (netwerk/dns/src/nsEffectiveTLDService.h)
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Mozilla TLD Service
 *
 * The Initial Developer of the Original Code is
 * Google Inc.
 * Portions created by the Initial Developer are Copyright (C) 2006
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Pamela Greene <pamg.bugs (at) gmail.com> (original author)
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */

/*
  (Documentation based on the Mozilla documentation currently at
  http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same
  author.)

  The RegistryControlledDomainService examines the hostname of a GURL passed to
  it and determines the longest portion that is controlled by a registrar.
  Although technically the top-level domain (TLD) for a hostname is the last
  dot-portion of the name (such as .com or .org), many domains (such as co.uk)
  function as though they were TLDs, allocating any number of more specific,
  essentially unrelated names beneath them.  For example, .uk is a TLD, but
  nobody is allowed to register a domain directly under .uk; the "effective"
  TLDs are ac.uk, co.uk, and so on.  We wouldn't want to allow any site in
  *.co.uk to set a cookie for the entire co.uk domain, so it's important to be
  able to identify which higher-level domains function as effective TLDs and
  which can be registered.

  The service obtains its information about effective TLDs from a text resource
  that must be in the following format:

  * It should use plain ASCII.
  * It should contain one domain rule per line, terminated with \n, with nothing
    else on the line.  (The last rule in the file may omit the ending \n.)
  * Rules should have been normalized using the same canonicalization that GURL
    applies.  For ASCII, that means they're not case-sensitive, among other
    things; other normalizations are applied for other characters.
  * Each rule should list the entire TLD-like domain name, with any subdomain
    portions separated by dots (.) as usual.
  * Rules should neither begin nor end with a dot.
  * If a hostname matches more than one rule, the most specific rule (that is,
    the one with more dot-levels) will be used.
  * Other than in the case of wildcards (see below), rules do not implicitly
    include their subcomponents.  For example, "bar.baz.uk" does not imply
    "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk"
    will match, but "baz.uk" and "qux.baz.uk" won't.
  * The wildcard character '*' will match any valid sequence of characters.
  * Wildcards may only appear as the entire most specific level of a rule.  That
    is, a wildcard must come at the beginning of a line and must be followed by
    a dot.  (You may not use a wildcard as the entire rule.)
  * A wildcard rule implies a rule for the entire non-wildcard portion.  For
    example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule
    "bar").  This is typically important in the case of exceptions (see below).
  * The exception character '!' before a rule marks an exception to a wildcard
    rule.  If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then
    "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp"
    has an effective TLD of "tokyo.jp" (the exception prevents the wildcard
    match, and we thus fall through to matching on the implied "tokyo.jp" rule
    from the wildcard).
  * If you use an exception rule without a corresponding wildcard rule, the
    behavior is undefined.
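
  As a minimal sketch of how the matching rules above could be applied, the
  following walks the suffixes of a host from most to least specific.  It is
  not the lookup this service actually performs: MatchEffectiveTld and
  CheckMatchExamples are hypothetical names, and holding the rules in a
  std::set exactly as they appear in the resource is an assumption made for
  illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: returns the effective TLD of |host| under |rules|, or
// an empty string if no rule matches.
std::string MatchEffectiveTld(const std::set<std::string>& rules,
                              const std::string& host) {
  std::size_t start = 0;
  // Try each suffix that begins at a label boundary, longest first, so that
  // the most specific rule wins.
  while (start != std::string::npos) {
    const std::string suffix = host.substr(start);
    const std::size_t dot = suffix.find('.');
    const std::string parent =
        dot == std::string::npos ? std::string() : suffix.substr(dot + 1);

    // An exception cancels the wildcard match; fall through to the rule that
    // the wildcard implies for its non-wildcard portion.
    if (rules.count("!" + suffix))
      return parent;
    // An explicit rule, or the rule implied by a wildcard one level below.
    if (rules.count(suffix) || rules.count("*." + suffix))
      return suffix;
    // A wildcard standing in for the most specific level of this suffix.
    if (!parent.empty() && rules.count("*." + parent))
      return suffix;

    const std::size_t next = host.find('.', start);
    start = next == std::string::npos ? next : next + 1;
  }
  return std::string();
}

// Replays the examples given in the rule list above.
void CheckMatchExamples() {
  const std::set<std::string> rules = {"*.tokyo.jp", "!pref.tokyo.jp",
                                       "bar.baz.uk"};
  assert(MatchEffectiveTld(rules, "a.b.tokyo.jp") == "b.tokyo.jp");
  assert(MatchEffectiveTld(rules, "a.pref.tokyo.jp") == "tokyo.jp");
  assert(MatchEffectiveTld(rules, "foo.bar.baz.uk") == "bar.baz.uk");
  assert(MatchEffectiveTld(rules, "qux.baz.uk").empty());
}
```

  Checking the exception before the wildcard at each level is what lets
  "a.pref.tokyo.jp" fall through to "tokyo.jp" instead of matching
  "*.tokyo.jp".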

  Firefox has a very similar service, and it's their data file we use to
  construct our resource.  However, the data expected by this implementation
  differs from the Mozilla file in several important ways:
   (1) We require that all single-level TLDs (com, edu, etc.) be explicitly
       listed.  As of this writing, Mozilla's file includes the single-level
       TLDs too, but that might change.
   (2) Our data is expected to be in pure ASCII: all UTF-8 or otherwise encoded
       items must already have been normalized.
   (3) We do not allow comments, rule notes, blank lines, or line endings other
       than LF.
  Rules are also expected to be syntactically valid.

  The utility application tld_cleanup.exe converts a Mozilla-style file into a
  Chrome one, making sure that single-level TLDs are explicitly listed, using
  GURL to normalize rules, and validating the rules.
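
  The format requirements above can be checked one line at a time.  The
  following is only a sketch of such a check, not the validation that
  tld_cleanup performs: IsValidRuleLine and CheckRuleLineExamples are
  hypothetical names, and rejecting '/' (to catch comment lines carried over
  from a Mozilla-style file) is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical check for one line of the resource, passed without its line
// terminator.
bool IsValidRuleLine(const std::string& line) {
  std::string rule = line;
  // Strip at most one marker: '!' for an exception, or a leading "*."
  // wildcard.
  if (!rule.empty() && rule[0] == '!')
    rule.erase(0, 1);
  else if (rule.compare(0, 2, "*.") == 0)
    rule.erase(0, 2);

  // Blank lines, a bare marker, and rules that begin or end with a dot are
  // all rejected.
  if (rule.empty() || rule.front() == '.' || rule.back() == '.')
    return false;

  for (unsigned char c : rule) {
    // Printable ASCII only; this also rejects spaces, tabs and CR.
    if (c <= 0x20 || c >= 0x7F)
      return false;
    // No markers past the start of the rule, and no comment characters.
    if (c == '*' || c == '!' || c == '/')
      return false;
  }
  return true;
}

void CheckRuleLineExamples() {
  assert(IsValidRuleLine("co.uk"));
  assert(IsValidRuleLine("*.tokyo.jp"));
  assert(IsValidRuleLine("!pref.tokyo.jp"));
  assert(!IsValidRuleLine(""));          // blank line
  assert(!IsValidRuleLine("*"));         // wildcard as the entire rule
  assert(!IsValidRuleLine(".uk"));       // begins with a dot
  assert(!IsValidRuleLine("uk."));       // ends with a dot
  assert(!IsValidRuleLine("foo.*.jp"));  // wildcard not at the beginning
  assert(!IsValidRuleLine("co.uk\r"));   // line ending other than LF
}
```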
*/

#ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_
#define NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_

#include <string>

#include "base/basictypes.h"
#include "net/base/net_export.h"

class GURL;

struct DomainRule;

namespace net {
namespace registry_controlled_domains {

// This enum is a required parameter to all public methods declared for this
// service. The Public Suffix List (http://publicsuffix.org/) this service
// uses as a data source splits all effective-TLDs into two groups. The main
// group describes registries that are acknowledged by ICANN. The second group
// contains a list of private additions for domains that enable external users
// to create subdomains, such as appspot.com.
// The PrivateRegistryFilter enum lets you choose whether you want to include
// the private additions in your lookup.
// See this for example use cases:
// https://wiki.mozilla.org/Public_Suffix_List/Use_Cases
enum NET_EXPORT PrivateRegistryFilter {
  EXCLUDE_PRIVATE_REGISTRIES = 0,
  INCLUDE_PRIVATE_REGISTRIES
};

// This enum is a required parameter to the GetRegistryLength functions
// declared for this service. Whenever there is no matching rule in the
// effective-TLD data (or in the default data, if the resource failed to
// load), the result will be dependent on which enum value was passed in.
// If EXCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting registry length
// will be 0. If INCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting
// registry length will be the length of the last subcomponent (e.g. 3 for
// foobar.baz).
enum NET_EXPORT UnknownRegistryFilter {
  EXCLUDE_UNKNOWN_REGISTRIES = 0,
  INCLUDE_UNKNOWN_REGISTRIES
};

// Returns the registered, organization-identifying host and all its registry
// information, but no subdomains, from the given GURL.  Returns an empty
// string if the GURL is invalid, has no host (e.g. a file: URL), has multiple
// trailing dots, is an IP address, has only one subcomponent (i.e. no dots
// other than leading/trailing ones), or is itself a recognized registry
// identifier.  If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), the last subcomponent of
// the host is assumed to be the registry.
//
// Examples:
//   http://www.google.com/file.html -> "google.com"  (com)
//   http://..google.com/file.html   -> "google.com"  (com)
//   http://google.com./file.html    -> "google.com." (com)
//   http://a.b.co.uk/file.html      -> "b.co.uk"     (co.uk)
//   file:///C:/bar.html             -> ""            (no host)
//   http://foo.com../file.html      -> ""            (multiple trailing dots)
//   http://192.168.0.1/file.html    -> ""            (IP address)
//   http://bar/file.html            -> ""            (no subcomponents)
//   http://co.uk/file.html          -> ""            (host is a registry)
//   http://foo.bar/file.html        -> "foo.bar"     (no rule; assume bar)
NET_EXPORT std::string GetDomainAndRegistry(const GURL& gurl,
                                            PrivateRegistryFilter filter);

// Like the GURL version, but takes a host (which is canonicalized internally)
// instead of a full GURL.
NET_EXPORT std::string GetDomainAndRegistry(const std::string& host,
                                            PrivateRegistryFilter filter);

// This convenience function returns true if the two GURLs both have hosts
// and one of the following is true:
// * They each have a known domain and registry, and it is the same for both
//   URLs.  Note that this means the trailing dot, if any, must match too.
// * They don't have known domains/registries, but the hosts are identical.
// Effectively, callers can use this function to check whether the input URLs
// represent hosts "on the same site".
NET_EXPORT bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2,
                                 PrivateRegistryFilter filter);
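
/*
  A minimal sketch of the comparison described for SameDomainOrHost, written
  against plain strings so that it stands alone.  SameDomainOrHostSketch and
  CheckSameSiteExamples are hypothetical names; each |domain| argument is
  assumed to hold what GetDomainAndRegistry would return for the matching
  host, with an empty string meaning "no known domain/registry".  This
  illustrates the documented contract and is not the actual implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in operating on already-extracted hosts and domains.
bool SameDomainOrHostSketch(const std::string& host1,
                            const std::string& domain1,
                            const std::string& host2,
                            const std::string& domain2) {
  // Both inputs must have a host at all.
  if (host1.empty() || host2.empty())
    return false;
  // Both have a known domain and registry: they must be the same, including
  // any trailing dot.
  if (!domain1.empty() && !domain2.empty())
    return domain1 == domain2;
  // Neither has a known domain/registry: the hosts must be identical.
  if (domain1.empty() && domain2.empty())
    return host1 == host2;
  return false;
}

void CheckSameSiteExamples() {
  assert(SameDomainOrHostSketch("www.google.com", "google.com",
                                "mail.google.com", "google.com"));
  // The trailing dot is part of the domain, so these differ.
  assert(!SameDomainOrHostSketch("google.com", "google.com",
                                 "google.com.", "google.com."));
  // No known domain for either input: compare the hosts themselves.
  assert(SameDomainOrHostSketch("bar", "", "bar", ""));
  assert(!SameDomainOrHostSketch("192.168.0.1", "", "192.168.0.2", ""));
}
```
*/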

// Finds the length in bytes of the registry portion of the host in the
// given GURL.  Returns std::string::npos if the GURL is invalid or has no
// host (e.g. a file: URL).  Returns 0 if the GURL has multiple trailing dots,
// is an IP address, has no subcomponents, or is itself a recognized registry
// identifier.  The result is also dependent on the UnknownRegistryFilter.
// If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), returns 0 if
// |unknown_filter| is EXCLUDE_UNKNOWN_REGISTRIES, or the length of the last
// subcomponent if |unknown_filter| is INCLUDE_UNKNOWN_REGISTRIES.
//
// Examples:
//   http://www.google.com/file.html -> 3                 (com)
//   http://..google.com/file.html   -> 3                 (com)
//   http://google.com./file.html    -> 4                 (com)
//   http://a.b.co.uk/file.html      -> 5                 (co.uk)
//   file:///C:/bar.html             -> std::string::npos (no host)
//   http://foo.com../file.html      -> 0                 (multiple trailing
//                                                         dots)
//   http://192.168.0.1/file.html    -> 0                 (IP address)
//   http://bar/file.html            -> 0                 (no subcomponents)
//   http://co.uk/file.html          -> 0                 (host is a registry)
//   http://foo.bar/file.html        -> 0 or 3, depending (no rule; assume
//                                                         bar)
NET_EXPORT size_t GetRegistryLength(const GURL& gurl,
                                    UnknownRegistryFilter unknown_filter,
                                    PrivateRegistryFilter private_filter);

// Like the GURL version, but takes a host (which is canonicalized internally)
// instead of a full GURL.
NET_EXPORT size_t GetRegistryLength(const std::string& host,
                                    UnknownRegistryFilter unknown_filter,
                                    PrivateRegistryFilter private_filter);
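
/*
  The examples listed for GetRegistryLength and GetDomainAndRegistry line up
  with one another: the domain is the registry plus the single label in front
  of it.  The sketch below shows that relationship on plain strings.
  DomainFromRegistryLength and CheckDomainExamples are hypothetical names, and
  this is an illustration of the documented results, not necessarily how the
  service computes them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: given a host and the registry length reported for it,
// returns the registered domain plus registry, or "" if there is none.
std::string DomainFromRegistryLength(const std::string& host,
                                     std::size_t registry_length) {
  if (registry_length == std::string::npos || registry_length == 0)
    return std::string();
  // There must be room for at least one character and a dot in front of the
  // registry.
  if (host.size() < registry_length + 2)
    return std::string();

  // Index of the dot that separates the registered label from the registry.
  const std::size_t registry_dot = host.size() - registry_length - 1;
  // Look for the dot in front of the registered label; anything before that
  // dot is a subdomain (or stray leading dots) and is dropped.
  const std::size_t dot = host.rfind('.', registry_dot - 1);
  if (dot == std::string::npos)
    return host;
  return host.substr(dot + 1);
}

// Replays the host/length pairs from the examples above.
void CheckDomainExamples() {
  assert(DomainFromRegistryLength("www.google.com", 3) == "google.com");
  assert(DomainFromRegistryLength("..google.com", 3) == "google.com");
  assert(DomainFromRegistryLength("google.com.", 4) == "google.com.");
  assert(DomainFromRegistryLength("a.b.co.uk", 5) == "b.co.uk");
  assert(DomainFromRegistryLength("co.uk", 0).empty());
  assert(DomainFromRegistryLength("foo.bar", 3) == "foo.bar");
}
```

  The "foo.bar" case corresponds to INCLUDE_UNKNOWN_REGISTRIES, where the
  last subcomponent is assumed to be the registry.
*/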

typedef const struct DomainRule* (*FindDomainPtr)(const char *, unsigned int);

// Used for unit tests, so that the default list of domains is used.
NET_EXPORT_PRIVATE void SetFindDomainGraph();

// Used for unit tests, so that a frozen list of domains is used.
NET_EXPORT_PRIVATE void SetFindDomainGraph(const unsigned char* domains,
                                           size_t length);
}  // namespace registry_controlled_domains
}  // namespace net

#endif  // NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_