Infra Trooper Documentation
===========================

### Contents ###

*   [What does an Infra trooper do?](#what_is_a_trooper)
*   [View current and upcoming troopers](#view_current_upcoming_troopers)
*   [How to swap trooper shifts](#how_to_swap)
*   [Tips for troopers](#tips)


<a name="what_is_a_trooper"></a>
What does an Infra trooper do?
------------------------------

The trooper has two main jobs:

1) Keep an eye on Infra alert emails (sent to infra-alerts@skia.org). The alerts are also available [here](https://alerts.skia.org/infra).

2) Resolve these alerts as they come in.

<a name="view_current_upcoming_troopers"></a>
View current and upcoming troopers
----------------------------------

The list of troopers is specified in the [skia-tree-status web app](http://skia-tree-status.appspot.com/trooper). The current trooper is highlighted in green.
The banner at the top of the [status page](https://status.skia.org) also displays the current trooper.


<a name="how_to_swap"></a>
How to swap trooper shifts
--------------------------

If you need to swap shifts with someone (for example, because you are out sick or on vacation), please get approval from the person you want to swap with. Then send an email to skiabot@google.com and cc rmistry@.


<a name="tips"></a>
Tips for troopers
-----------------

- Make sure you are a member of
  [MDB group chrome-skia-ninja](https://ganpati.corp.google.com/#Group_Info?name=chrome-skia-ninja@prod.google.com).
  Valentine passwords and Chrome Golo access are based on membership in this
  group.

- These alerts generally auto-dismiss once the criteria for the alert are no
  longer met:
  - Monitoring alerts, including prober, collectd, and others
  - Disconnected build slaves

- These alerts generally do not auto-dismiss ([issue here](https://bug.skia.org/4292)):
  - Build slaves that failed a step
  - Disconnected devices (these are detected as the "wait for device" step failing)

- A "Failed to execute query" alert may show a different query than the one
  that is currently failing; dismiss the alert to get a new alert showing the
  query that is actually failing. (All "failed to execute query" alerts are
  lumped into a single alert, so the query that initially triggered the alert
  may no longer be failing while the alert stays active because another query
  is failing.)

- Where machines are located:
  - Machine name like "skia-vm-NNN" -> GCE
  - Machine name ends with "a3", "a4", "m3" -> Chrome Golo
  - Machine name starts with "skiabot-" -> Chapel Hill lab
  - Machine name starts with "win8" -> Chapel Hill lab (Windows machine
    names can't be very long, so the "skiabot-shuttle-" prefix is dropped.)
  - slave11-c3 is a Chrome infra GCE machine (not to be confused with the Skia
    Buildbots GCE, which we refer to as simply "GCE")
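
The naming rules above can be sketched as a small shell function. This is only an illustration: `machine_location` is a hypothetical helper name, not an existing Skia tool, and the order in which the patterns are checked (exact name, then prefixes, then suffixes) is an assumption for names that could match more than one rule.

```shell
# Map a bot's machine name to its location, following the naming rules above.
# machine_location is a hypothetical helper, not part of any Skia tooling.
machine_location() {
  case "$1" in
    slave11-c3)      echo "Chrome infra GCE" ;;   # the one special case listed
    skia-vm-*)       echo "GCE" ;;                # Skia Buildbots GCE
    skiabot-*|win8*) echo "Chapel Hill lab" ;;
    *a3|*a4|*m3)     echo "Chrome Golo" ;;
    *)               echo "unknown"; return 1 ;;  # name matches no listed rule
  esac
}

machine_location skia-vm-001   # prints "GCE"
machine_location build5-m3     # prints "Chrome Golo"
```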

- The [chrome-infra IRC channel](https://comlink.googleplex.com/chrome-infra) is
  useful for questions regarding bots managed by the Chrome Infra team and to
  get visibility into upstream failures that cause problems for us.

- To log in to a Linux buildbot in GCE, use
  `gcloud compute ssh default@<machine name>`. Choose the zone listed for the
  [GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
  (or specify it using the `--zone` command-line flag).
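
As a rough sketch of the Linux GCE login above, the function below only prints the command so it can be checked before running. `gce_ssh_command` is a hypothetical helper name, and the zone `us-central1-f` in the usage line is just an example value; use whatever zone the GCE VM listing shows for the bot.

```shell
# Print (rather than run) the gcloud command for logging in to a Linux GCE bot.
# gce_ssh_command is a hypothetical helper name used for illustration.
gce_ssh_command() {
  machine="$1"
  zone="$2"   # optional; taken from the GCE VM listing
  if [ -n "$zone" ]; then
    echo "gcloud compute ssh default@${machine} --zone=${zone}"
  else
    # No zone given: the zone is chosen interactively, as described above.
    echo "gcloud compute ssh default@${machine}"
  fi
}

gce_ssh_command skia-vm-001 us-central1-f
# prints "gcloud compute ssh default@skia-vm-001 --zone=us-central1-f"
```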

- To log in to a Windows buildbot in GCE, use the
  [Chrome RDP Extension](https://chrome.google.com/webstore/detail/chrome-rdp/cbkkbcmdlboombapidmoeolnmdacpkch?hl=en-US)
  with the
  [IP address of the GCE VM](https://pantheon.corp.google.com/project/31977622648/compute/instances)
  shown on the [host info page](https://status.skia.org/hosts) for that bot. The
  username is chrome-bot and the password can be found on
  [Valentine](https://valentine.corp.google.com/) as "chrome-bot (Win GCE)".

- If there is a problem with a bot in the Chrome Golo or Chrome infra GCE, the
  best course of action is to
  [file a bug](https://code.google.com/p/chromium/issues/entry?template=Build%20Infrastructure)
  with the Chrome infra team. But if you know what you're doing:
  - To access bots in the Chrome Golo,
    [follow these instructions](https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md).
    - Machine name ends with "a3" or "a4" -> ssh command looks like
      `ssh build3-a3.chrome`
    - Machine name ends with "m3" -> ssh command looks like `ssh build5-m3.golo`
    - For MacOS and Windows bots, you will be prompted for a password, which is
      stored on [Valentine](https://valentine.corp.google.com/) as "Chrome Golo,
      Perf, GPU bots - chrome-bot".
  - To access bots in the Chrome infra GCE, the command looks like
    `gcutil --project=google.com:chromecompute ssh --ssh_user=default slave11-c3`
    (or use the ccompute ssh script from the infra_internal repo).
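
The suffix rules for Chrome Golo bots above could be captured in a helper like the following sketch. `golo_ssh_command` is a hypothetical name, and the function only prints the ssh command implied by the two examples; it assumes the linked ssh setup instructions have already been followed.

```shell
# Print the ssh command for a Chrome Golo bot, based on its name suffix.
# golo_ssh_command is a hypothetical helper name used for illustration.
golo_ssh_command() {
  case "$1" in
    *a3|*a4) echo "ssh $1.chrome" ;;
    *m3)     echo "ssh $1.golo" ;;
    *)       echo "not a Chrome Golo machine name: $1" >&2; return 1 ;;
  esac
}

golo_ssh_command build3-a3   # prints "ssh build3-a3.chrome"
golo_ssh_command build5-m3   # prints "ssh build5-m3.golo"
```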

- Read over the [SkiaLab documentation](../testing/skialab) for more detail on
  dealing with device alerts.

- To stop a buildslave for a device, log in to the host for that device and run
  `cd ~/buildbot/<slave name>/build/slave; make stop`. To start it again, run
  `TESTING_SLAVENAME=<slave name> make start`.
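
A minimal sketch of the stop and start commands above, written as a function that prints the command for a given slave name. `buildslave_command` and the slave name `example-slave` are hypothetical. The sketch also assumes that `make start` is run from the same directory as `make stop`, and it uses `&&` instead of `;` so that `make` does not run if the `cd` fails.

```shell
# Print the command to stop or start the buildslave for a device.
# buildslave_command is a hypothetical helper name used for illustration.
buildslave_command() {
  action="$1"
  slave="$2"
  # The tilde is kept literal here; it expands when the printed command is run.
  dir="~/buildbot/${slave}/build/slave"
  case "$action" in
    stop)  echo "cd ${dir} && make stop" ;;
    # Assumption: make start is run from the same directory as make stop.
    start) echo "cd ${dir} && TESTING_SLAVENAME=${slave} make start" ;;
    *)     echo "usage: buildslave_command stop|start <slave name>" >&2; return 1 ;;
  esac
}

buildslave_command stop example-slave
# prints "cd ~/buildbot/example-slave/build/slave && make stop"
```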

- Buildslaves can be slow to come up after reboot, but if the buildslave remains
  disconnected, you may need to start it manually. On Mac and Linux, check using
  `ps aux | grep python` that neither buildbot nor gclient is running, then run
  `~/skiabot-slave-start-on-boot.sh`.
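
The manual check in the last tip can be sketched as a filter over `ps aux` output. `buildslave_not_running` is a hypothetical helper name, and the process listing in the usage example is made up for illustration. The bracketed patterns such as `[p]ython` are a common trick to keep the filter from matching its own command line in the process list.

```shell
# Succeed (exit 0) when the process listing on stdin shows no Python process
# mentioning buildbot or gclient; fail (exit 1) when one is found.
# buildslave_not_running is a hypothetical helper name used for illustration.
buildslave_not_running() {
  awk 'BEGIN { found = 0 }
       /[p]ython/ && /[b]uildbot|[g]client/ { found = 1 }
       END { exit found }'
}

# Made-up listing for illustration; on a real bot, pipe in `ps aux` instead.
sample='chrome-bot 1234 python run_buildbot_example.py
chrome-bot 1300 bash'

if printf '%s\n' "$sample" | buildslave_not_running; then
  echo "nothing running: start with ~/skiabot-slave-start-on-boot.sh"
else
  echo "buildbot or gclient is already running"
fi
# prints "buildbot or gclient is already running"
```

On a real host the equivalent check would be `ps aux | buildslave_not_running`, followed by the start script only when the check succeeds.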