gRPC Connectivity Semantics and API
===================================

This document describes the connectivity semantics for gRPC channels and the
corresponding impact on RPCs. We then discuss an API.

States of Connectivity
----------------------

gRPC Channels provide the abstraction over which clients can communicate with
servers. The client-side channel object can be constructed using little more
than a DNS name. Channels encapsulate a range of functionality including name
resolution, establishing a TCP connection (with retries and backoff) and TLS
handshakes. Channels can also handle errors on established connections and
reconnect, or in the case of HTTP/2 GOAWAY, re-resolve the name and reconnect.

To hide the details of all this activity from the user of the gRPC API (i.e.,
application code) while exposing meaningful information about the state of a
channel, we use a state machine with five states, defined below:

CONNECTING: The channel is trying to establish a connection and is waiting to
make progress on one of the steps involved in name resolution, TCP connection
establishment or TLS handshake. This may be used as the initial state for
channels upon creation.

READY: The channel has successfully established a connection all the way
through TLS handshake (or equivalent) and all subsequent attempts to communicate
have succeeded (or are pending without any known failure).

TRANSIENT_FAILURE: There has been some transient failure (such as a TCP 3-way
handshake timing out or a socket error). Channels in this state will eventually
switch to the CONNECTING state and try to establish a connection again. Since
retries are done with exponential backoff, channels that fail to connect will
start out spending very little time in this state but, as the attempts fail
repeatedly, will spend increasingly large amounts of time in it. This is
typical of many non-fatal failures (e.g., TCP connection attempts timing out
because the server is not yet available).

IDLE: This is the state where the channel is not even trying to create a
connection because of a lack of new or pending RPCs. New RPCs MAY be created
in this state. Any attempt to start an RPC on the channel will push the channel
out of this state to CONNECTING. When there has been no RPC activity on a channel
for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this
period, channels that are READY or CONNECTING switch to IDLE. Additionally,
channels that receive a GOAWAY when there are no active or pending RPCs should
also switch to IDLE to avoid connection overload at servers that are attempting
to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).

SHUTDOWN: This channel has started shutting down. Any new RPCs should fail
immediately. Pending RPCs may continue running until the application cancels them.
Channels may enter this state either because the application explicitly requested
a shutdown or because a non-recoverable error has happened during attempts to
connect or communicate. (As of 6/12/2015, there are no known errors, while
connecting or communicating, that are classified as non-recoverable.)
Channels that enter this state never leave this state.
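
The growing time spent in TRANSIENT_FAILURE can be illustrated with a small
sketch. The parameter values (1 second initial wait, 1.6 multiplier, 120 second
cap) and the names `BackoffParams` and `BackoffSchedule` are assumptions chosen
for illustration; this document does not specify them. Jitter is left out so
that the schedule is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical backoff parameters; not defined by this document.
struct BackoffParams {
  std::chrono::milliseconds initial{1000};
  double multiplier = 1.6;
  std::chrono::milliseconds max{120000};
};

// Returns the wait spent in TRANSIENT_FAILURE before each of the first
// `attempts` reconnection attempts, assuming every attempt fails.
std::vector<std::chrono::milliseconds> BackoffSchedule(const BackoffParams& p,
                                                       int attempts) {
  assert(attempts >= 0 && p.multiplier >= 1.0);
  std::vector<std::chrono::milliseconds> waits;
  const double max_ms = static_cast<double>(p.max.count());
  double current_ms = std::min(static_cast<double>(p.initial.count()), max_ms);
  for (int i = 0; i < attempts; ++i) {
    waits.push_back(
        std::chrono::milliseconds(static_cast<long long>(current_ms)));
    // Grow the wait geometrically, but never beyond the cap.
    current_ms = std::min(current_ms * p.multiplier, max_ms);
  }
  return waits;
}
```

With these assumed values the waits start at one second and settle at the cap
after roughly a dozen consecutive failures.
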

The following table lists the legal transitions from one state to another and
corresponding reasons. Empty cells denote disallowed transitions.

<table style='border: 1px solid black'>
  <tr>
    <th>From/To</th>
    <th>CONNECTING</th>
    <th>READY</th>
    <th>TRANSIENT_FAILURE</th>
    <th>IDLE</th>
    <th>SHUTDOWN</th>
  </tr>
  <tr>
    <th>CONNECTING</th>
    <td>Incremental progress during connection establishment</td>
    <td>All steps needed to establish a connection succeeded</td>
    <td>Any failure in any of the steps needed to establish connection</td>
    <td>No RPC activity on channel for IDLE_TIMEOUT</td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>READY</th>
    <td></td>
    <td>Incremental successful communication on established channel.</td>
    <td>Any failure encountered while expecting successful communication on
        established channel.</td>
    <td>No RPC activity on channel for IDLE_TIMEOUT <br>OR<br>upon receiving a GOAWAY while there are no pending RPCs.</td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>TRANSIENT_FAILURE</th>
    <td>Wait time required to implement (exponential) backoff is over.</td>
    <td></td>
    <td></td>
    <td></td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>IDLE</th>
    <td>Any new RPC activity on the channel</td>
    <td></td>
    <td></td>
    <td></td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>SHUTDOWN</th>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>
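
The table above can be encoded directly as a lookup matrix, which is handy for
checking transitions in tests. This is a minimal sketch: the `State` enum,
`kLegal`, `IsLegalTransition` and `Transition` are hypothetical names for this
example and are not part of any gRPC library.

```cpp
#include <cassert>

// Stand-in enum for this sketch; gRPC libraries define their own state type.
enum class State { kConnecting = 0, kReady, kTransientFailure, kIdle, kShutdown };

constexpr int kNumStates = 5;

// kLegal[from][to] is true exactly where the table has a non-empty cell.
// Column order: CONNECTING, READY, TRANSIENT_FAILURE, IDLE, SHUTDOWN.
constexpr bool kLegal[kNumStates][kNumStates] = {
    /* CONNECTING        */ {true,  true,  true,  true,  true},
    /* READY             */ {false, true,  true,  true,  true},
    /* TRANSIENT_FAILURE */ {true,  false, false, false, true},
    /* IDLE              */ {true,  false, false, false, true},
    /* SHUTDOWN          */ {false, false, false, false, false},
};

constexpr bool IsLegalTransition(State from, State to) {
  return kLegal[static_cast<int>(from)][static_cast<int>(to)];
}

// Applies a transition, asserting (in debug builds) that the table allows it.
inline State Transition(State from, State to) {
  assert(IsLegalTransition(from, to));
  return to;
}
```

Note that the only way out of TRANSIENT_FAILURE or IDLE, short of shutdown, is
through CONNECTING, and that the SHUTDOWN row is empty.
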


Channel State API
-----------------

All gRPC libraries will expose a channel-level API method to poll the current
state of a channel. In C++, this method is called GetState and returns an enum
for one of the five legal states. It also accepts a boolean `try_to_connect` to
transition to CONNECTING if the channel is currently IDLE. When set, the boolean
should act as if an RPC occurred, so it should also reset the IDLE_TIMEOUT timer.

```cpp
grpc_connectivity_state GetState(bool try_to_connect);
```

All libraries should also expose an API that enables the application (user of
the gRPC API) to be notified when the channel state changes. Since state
changes can be rapid and race with any such notification, the notification
should just inform the user that some state change has happened, leaving it to
the user to poll the channel for the current state.

The synchronous version of this API is:

```cpp
bool WaitForStateChange(grpc_connectivity_state source_state, gpr_timespec deadline);
```

which returns `true` when the state is something other than the
`source_state` and `false` if the deadline expires. Asynchronous and futures-based
APIs should have a corresponding method that allows the application to be
notified when the state of a channel changes.
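
The two calls combine naturally into a "wait until connected" loop: poll the
state, and if it is not READY, block until it changes or the deadline passes.
The sketch below is self-contained, so it uses a stand-in `ConnState` enum
rather than `grpc_connectivity_state`; `WaitUntilReady` and `ScriptedChannel`
are hypothetical names, and the fake channel only replays a fixed sequence of
states for demonstration. Returning early on SHUTDOWN is a choice made for this
sketch.

```cpp
#include <cassert>
#include <deque>
#include <utility>

// Stand-in for grpc_connectivity_state so the sketch compiles on its own.
enum class ConnState { kIdle, kConnecting, kReady, kTransientFailure, kShutdown };

// Blocks until `channel` reports READY or `deadline` passes. Works with any
// type exposing GetState(bool) and WaitForStateChange(state, deadline).
template <typename Channel, typename Deadline>
bool WaitUntilReady(Channel& channel, Deadline deadline) {
  ConnState state;
  // GetState(true) also asks an IDLE channel to start connecting.
  while ((state = channel.GetState(/*try_to_connect=*/true)) != ConnState::kReady) {
    if (state == ConnState::kShutdown) return false;  // SHUTDOWN is terminal.
    if (!channel.WaitForStateChange(state, deadline)) return false;  // Timed out.
  }
  return true;
}

// Hypothetical fake that replays a scripted sequence of states.
class ScriptedChannel {
 public:
  explicit ScriptedChannel(std::deque<ConnState> script)
      : script_(std::move(script)) {
    assert(!script_.empty());
  }

  ConnState GetState(bool /*try_to_connect*/) const { return script_.front(); }

  // Advances to the next scripted state; reports a timeout once the script
  // has no further states to offer.
  template <typename Deadline>
  bool WaitForStateChange(ConnState /*source_state*/, Deadline /*deadline*/) {
    if (script_.size() <= 1) return false;
    script_.pop_front();
    return true;
  }

 private:
  std::deque<ConnState> script_;
};
```

Because the loop re-polls after every wake-up, it does not matter how many
intermediate states the channel passed through while the caller was blocked.
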

Note that a notification is delivered every time there is a transition from any
state to any *other* state. On the other hand, the rules for legal state
transitions require a transition from CONNECTING to TRANSIENT_FAILURE and back
to CONNECTING for every recoverable failure, even if the corresponding
exponential backoff requires no wait before retry. The combined effect is that
the application may receive state change notifications that appear spurious.
For example, an application waiting for state changes on a channel that is
CONNECTING may receive a state change notification but find the channel in the
same CONNECTING state on polling for the current state, because the channel may
have spent an infinitesimally small amount of time in the TRANSIENT_FAILURE state.
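
One way for an application to cope with such apparently spurious notifications
is to compare the freshly polled state with the last state it acted on and do
nothing when they match. The following is a minimal sketch of that idea;
`ChannelState` and `StateChangeFilter` are hypothetical names introduced for
this example, not part of the gRPC API.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Stand-in enum for this sketch.
enum class ChannelState { kIdle, kConnecting, kReady, kTransientFailure, kShutdown };

// Forwards a notification to `on_change` only when the polled state differs
// from the last state the application observed.
class StateChangeFilter {
 public:
  StateChangeFilter(ChannelState initial,
                    std::function<void(ChannelState)> on_change)
      : last_seen_(initial), on_change_(std::move(on_change)) {
    assert(on_change_ != nullptr);
  }

  // Call on every notification, passing the state obtained by polling the
  // channel afterwards. Returns true if the callback fired.
  bool OnNotification(ChannelState polled) {
    if (polled == last_seen_) return false;  // Looks spurious; nothing to do.
    last_seen_ = polled;
    on_change_(polled);
    return true;
  }

 private:
  ChannelState last_seen_;
  std::function<void(ChannelState)> on_change_;
};
```

An application that needs to count reconnection attempts cannot rely on this
filter, since the brief visit to TRANSIENT_FAILURE is exactly what it hides.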