# -*- coding: utf-8 -*-
# Copyright 2014 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about filename encoding and interoperability problems."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  To minimize the chance of `filename encoding interoperability problems
  <https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability>`_,
  gsutil requires use of the `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_
  character encoding when uploading and downloading files. Because UTF-8 is in
  widespread (and growing) use, most users need to do nothing to use UTF-8.
  Users with filenames stored in other encodings (such as
  `Latin 1 <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>`_) must convert those
  filenames to UTF-8 before attempting to upload the files.

  Users whose filenames use some other encoding most commonly encounter a
  gsutil error while uploading files using the recursive (-R) option on the
  gsutil cp, mv, or rsync commands. When this happens you'll get an error like
  this:

      CommandException: Invalid Unicode path encountered
      ('dir1/dir2/file_name_with_\\xf6n_bad_chars').
      gsutil cannot proceed with such files present.
      Please remove or rename this file and try again.

  Note that the invalid Unicode characters have been hex-encoded in this error
  message because otherwise trying to print them would result in another
  error.

  If you encounter such an error, you can either remove the problematic
  file(s) or rename them and re-run the command. If you have only a modest
  number of such files, the simplest fix is to choose a different name for
  each file and rename it manually (using local filesystem tools). If you have
  too many files for that to be practical, you can use a tool to convert the
  filenames from the old character encoding to UTF-8. One such tool is
  `native2ascii
  <http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html>`_.
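
  As a rough illustration of converting filenames in bulk, the following
  Python 3 sketch renames every file and directory under a root directory
  whose name is not valid UTF-8. It assumes the old names are Latin 1 and that
  the filesystem stores names as raw bytes (as on Linux). The function names,
  the source_encoding default, and the dry_run flag are hypothetical and are
  not part of gsutil:

```python
import os


def utf8_name(raw_name, source_encoding='latin-1'):
    # Return raw_name unchanged if its bytes are already valid UTF-8.
    # Otherwise reinterpret the bytes using source_encoding (an assumption
    # about how the name was created) and re-encode the result as UTF-8.
    try:
        raw_name.decode('utf-8')
        return raw_name
    except UnicodeDecodeError:
        return raw_name.decode(source_encoding).encode('utf-8')


def rename_tree_to_utf8(root, source_encoding='latin-1', dry_run=True):
    # Walk bottom-up with byte paths so that each directory is renamed only
    # after everything inside it has been handled.
    changes = []
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root),
                                                topdown=False):
        for name in dirnames + filenames:
            new_name = utf8_name(name, source_encoding)
            if new_name == name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if os.path.lexists(new_path):
                # Never overwrite an existing entry; leave it for manual
                # review instead.
                continue
            changes.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return changes
```

  With the default dry_run=True, rename_tree_to_utf8('some_dir') only returns
  the (old, new) path pairs, so they can be reviewed before renaming for real.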

  Note also that there's no restriction on the character encoding used in file
  content: it can be UTF-8, a different encoding, or non-character
  data (like audio or video content). The gsutil UTF-8 character encoding
  requirement applies only to filenames.


<B>CROSS-PLATFORM ENCODING PROBLEMS OF WHICH TO BE AWARE</B>
  Using UTF-8 for all object names and filenames will ensure that gsutil doesn't
  encounter character encoding errors while operating on the files.
  Unfortunately, files uploaded or downloaded this way can still have
  interoperability problems, for a number of reasons unrelated to gsutil. For
  example:

    - Windows filenames are case-insensitive, while GCS, Linux and Mac OS are
      not. Thus, for example, if you have two filenames on Linux differing only
      in case and upload both to GCS and then subsequently download them to
      Windows, you will end up with just one file whose contents came from the
      last of these files to be written to the filesystem. Moreover, case
      translation is handled by tables that change across OS versions.
    - Mac OS performs character encoding decomposition based on tables stored in
      the OS, and the tables change between Unicode versions. Thus the encoding
      used by an external library may not match that performed by the OS.
    - Windows console support for Unicode is difficult to use correctly.
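
  The first two items can be made concrete with a short Python 3 sketch. It
  shows that one visible name can be stored as two different UTF-8 byte
  sequences (composed versus decomposed), and it flags names that differ only
  in case. The helper names are hypothetical, and str.casefold() is used only
  as an approximation of the case tables a real OS applies:

```python
import unicodedata
from collections import defaultdict


def utf8_hex(name):
    # Hex dump of the UTF-8 bytes that would be stored for this name.
    return name.encode('utf-8').hex()


def case_collisions(names):
    # Group names that differ only in case. casefold() is an approximation;
    # the tables used by a particular OS version may differ.
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


composed = chr(0xE9)  # U+00E9: "e with acute" as a single code point
decomposed = unicodedata.normalize('NFD', composed)  # "e" plus U+0301

print(utf8_hex(composed))      # c3a9
print(utf8_hex(decomposed))    # 65cc81
print(composed == decomposed)  # False
print(case_collisions(['Report.txt', 'report.txt', 'notes.txt']))
# [['Report.txt', 'report.txt']]
```

  Normalizing both names to the same form (for example with
  unicodedata.normalize('NFC', name)) before comparing them avoids the first
  mismatch, although the result still depends on the Unicode version in use.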

  For a more thorough list of such issues see `this presentation
  <http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf>`_.

  These problems mostly arise when sharing data across platforms (e.g.,
  uploading data from a Windows machine to GCS, and then downloading from GCS
  to a machine running Mac OS). Unfortunately, these problems are a consequence
  of the lack of a filename encoding standard, and users need to be aware of the
  kinds of problems that can arise when copying filenames across platforms.

  There is one precaution users can exercise to prevent some of these problems:
  when using the Windows console, specify wildcards or folders (using the -R
  option) rather than explicitly named individual files.
""")


class CommandOptions(HelpProvider):
  """Additional help about filename encoding and interoperability problems."""

  # Help specification. See help_provider.py for documentation.
  help_spec = HelpProvider.HelpSpec(
      help_name='encoding',
      help_name_aliases=['encodings', 'utf8', 'utf-8', 'latin1', 'unicode',
                         'interoperability'],
      help_type='additional_help',
      help_one_line_summary='Filename encoding and interoperability problems',
      help_text=_DETAILED_HELP_TEXT,
      subcommand_help_text={},
  )