# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  If you use gsutil in large production tasks (such as uploading or
  downloading many GiBs of data each night), there are a number of things
  you can do to help ensure success. Specifically, this section discusses
  how to script large production tasks around gsutil's resumable transfer
  mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
  First, it's helpful to understand gsutil's resumable transfer mechanism,
  and how your script needs to be implemented around this mechanism to work
  reliably. gsutil uses resumable transfer support when you attempt to upload
  or download a file larger than a configurable threshold (by default, this
  threshold is 2 MiB). When a transfer fails partway through (e.g., because of
  an intermittent network problem), gsutil uses a truncated randomized binary
  exponential backoff-and-retry strategy that by default will retry transfers up
  to 23 times over a 10 minute period of time (see "gsutil help retries" for
  details).
  If the transfer fails each of these attempts with no intervening
  progress, gsutil gives up on the transfer, but keeps a "tracker" file for
  it in a configurable location (the default location is ~/.gsutil/, in a file
  named by a combination of the SHA1 hash of the name of the bucket and object
  being transferred and the last 16 characters of the file name). When transfers
  fail in this fashion, you can rerun gsutil at some later time (e.g., after
  the networking problem has been resolved), and the resumable transfer picks
  up where it left off.


<B>SCRIPTING DATA TRANSFER TASKS</B>
  To script large production data transfer tasks around this mechanism,
  you can implement a script that runs periodically, determines which file
  transfers have not yet succeeded, and runs gsutil to copy them. Below,
  we offer a number of suggestions about how this type of scripting should
  be implemented:

  1. When resumable transfers fail without any progress 23 times in a row
     over the course of up to 10 minutes, it probably won't work to simply
     retry the transfer immediately. A more successful strategy would be to
     have a cron job that runs every 30 minutes, determines which transfers
     need to be run, and runs them. If the network experiences intermittent
     problems, the script picks up where it left off and will eventually
     succeed (once the network problem has been resolved).

  2. If your business depends on timely data transfer, you should consider
     implementing some network monitoring. For example, you can implement
     a task that attempts a small download every few minutes and raises an
     alert if several consecutive attempts fail (adjust the frequency and
     the failure count to suit your requirements), so that your IT staff can
     investigate problems promptly.
     As usual with monitoring implementations,
     you should experiment with the alerting thresholds, to avoid false
     positive alerts that cause your staff to begin ignoring the alerts.

  3. There are a variety of ways you can determine what files remain to be
     transferred. We recommend that you avoid attempting to get a complete
     listing of a bucket containing many objects (e.g., tens of thousands
     or more). One strategy is to structure your object names in a way that
     represents your transfer process, and use gsutil prefix wildcards to
     request partial bucket listings. For example, if your periodic process
     involves downloading the current day's objects, you could name objects
     using a year-month-day-object-ID format and then find today's objects by
     using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
     is more efficient to have a non-wildcard prefix like this than to use
     something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
     actually requests a complete bucket listing and then filters in gsutil,
     while the former asks Google Cloud Storage to return the subset of
     objects whose names start with everything up to the "*".

     For data uploads, another technique would be to move local files from a
     "to be processed" area to a "done" area as your script successfully
     copies files to the cloud. You can do this in parallel batches by using a
     command like:

       gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i

     where i is a shell loop variable. Make sure to check that the shell exit
     status ($status in csh, $? in sh and bash) is 0 after each gsutil cp
     command, to detect whether some of the copies failed, and rerun the
     affected copies.

     With this strategy, the file system keeps track of all remaining work to
     be done.

  4. If you have really large numbers of objects in a single bucket
     (say hundreds of thousands or more), you should consider tracking your
     objects in a database instead of using bucket listings to enumerate
     the objects. For example, this database could track the state of your
     downloads, so you can determine what objects need to be downloaded by
     your periodic download script by querying the database locally instead
     of performing a bucket listing.

  5. Make sure you don't delete partially downloaded temporary files after a
     transfer fails: gsutil picks up where it left off (and performs a hash
     of the final downloaded content to ensure data integrity), so deleting
     partially transferred files will cause you to lose progress and make
     more wasteful use of your network.

  6. If you have a fast network connection, you can speed up the transfer of
     large numbers of files by using the gsutil -m (multi-threading /
     multi-processing) option. Be aware, however, that gsutil doesn't attempt to
     keep track of which files were downloaded successfully in cases where some
     files failed to download. For example, if you use multi-threaded transfers
     to download 100 files and 3 failed to download, it is up to your scripting
     process to determine which transfers didn't succeed, and retry them. A
     periodic check-and-run approach like the one outlined earlier would handle
     this case.

     If you use parallel transfers (gsutil -m) you might want to experiment with
     the number of threads being used (via the parallel_thread_count setting
     in the .boto config file). By default, gsutil uses 10 threads for Linux
     and 24 threads for other operating systems. Depending on your network
     speed, available memory, CPU load, and other conditions, this may or may
     not be optimal. Try experimenting with higher or lower numbers of threads
     to find the best number of threads for your environment.
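
  The periodic check-and-run approach described in the suggestions above
  (move each file from a "to be processed" area to a "done" area only after
  its copy succeeds) could be sketched in Python roughly as follows. This is
  a minimal sketch under stated assumptions, not part of gsutil: the names
  upload_pending and gsutil_cp, the directory layout, and the bucket URL are
  hypothetical. The only gsutil behavior it relies on is that gsutil cp
  exits with a nonzero status when a copy fails.

```python
import shutil
import subprocess
from pathlib import Path


def gsutil_cp(local_path, dest_url):
    # Hypothetical helper: run "gsutil cp" and report success based on the
    # process exit status (0 means the copy succeeded).
    result = subprocess.run(['gsutil', 'cp', str(local_path), dest_url])
    return result.returncode == 0


def upload_pending(pending_dir, done_dir, bucket_url, copy=gsutil_cp):
    # Try every file still waiting in pending_dir. A file is moved to
    # done_dir only after its copy succeeds, so anything that fails stays
    # in place and is retried on the next scheduled run.
    pending_dir, done_dir = Path(pending_dir), Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for path in sorted(pending_dir.iterdir()):
        if not path.is_file():
            continue
        dest_url = bucket_url.rstrip('/') + '/' + path.name
        if copy(path, dest_url):
            shutil.move(str(path), str(done_dir / path.name))
        else:
            failed.append(path.name)
    return failed
```

  A cron job could call something like
  upload_pending('to_upload', 'done', 'gs://bucket/incoming') every 30
  minutes and raise an alert if the returned list stays non-empty across
  several runs. Because the local file system records what remains to be
  done, no bucket listing is needed to decide what to retry.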
""")


class CommandOptions(HelpProvider):
  """Additional help about using gsutil for production tasks."""

  # Help specification. See help_provider.py for documentation.
  help_spec = HelpProvider.HelpSpec(
      help_name='prod',
      help_name_aliases=[
          'production', 'resumable', 'resumable upload', 'resumable transfer',
          'resumable download', 'scripts', 'scripting'],
      help_type='additional_help',
      help_one_line_summary='Scripting Production Transfers',
      help_text=_DETAILED_HELP_TEXT,
      subcommand_help_text={},
  )