## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents were consulted while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C,
  June 27, 2012;
- Driver Writer's Guide for UEFI 2.3.1, Version 1.01, 03/08/2012;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, for example, DHCP configuration, TCP transfers with edk2
StdLib applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function that implements it, with that
function's important callees indented beneath it.

                               |  ^
                               |  |
   [DriverBinding.c]           |  | [DriverBinding.c]
   VirtioNetDriverBindingStart |  | VirtioNetDriverBindingStop
     VirtioNetSnpPopulate      |  |   VirtioNetSnpEvacuate
       VirtioNetGetFeatures    |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                               |  ^
                [SnpStart.c]   |  | [SnpStop.c]
                VirtioNetStart |  | VirtioNetStop
                               |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                               |  ^
  [SnpInitialize.c]            |  | [SnpShutdown.c]
  VirtioNetInitialize          |  | VirtioNetShutdown
    VirtioNetInitRing {Rx, Tx} |  |   VirtioNetShutdownRx [SnpSharedHelpers.c]
      VirtioRingInit           |  |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtioNetInitTx            |  |   VirtioRingUninit {Tx, Rx}
    VirtioNetInitRx            |  |
                               v  |
                  +-----------------------------+
                  | EfiSimpleNetworkInitialized |
                  +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance doesn't actually exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
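
The source-state validation described above can be illustrated with a small,
standalone C sketch. It is not the driver's code: the type names, the status
codes and the Sketch* functions are hypothetical stand-ins, chosen only to show
the pattern of checking the parameter and the expected source state before
moving to the target state.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the three SNP states shown in the diagram. */
typedef enum { SnpStopped, SnpStarted, SnpInitialized } SNP_STATE;

/* Illustrative status codes, not the real EFI_STATUS values. */
typedef enum { StatusOk, StatusBadParameter, StatusWrongState } STATUS;

/* Hypothetical, heavily reduced driver instance. */
typedef struct { SNP_STATE State; } DEVICE;

/* Shared helper: validate the parameter and the expected source state, then
   move to the target state. */
static STATUS Transition(DEVICE *Dev, SNP_STATE From, SNP_STATE To)
{
  if (Dev == NULL) {
    return StatusBadParameter;
  }
  if (Dev->State != From) {
    return StatusWrongState;
  }
  Dev->State = To;
  return StatusOk;
}

/* One function per arrow between the three lower states. */
STATUS SketchStart(DEVICE *Dev)
{
  return Transition(Dev, SnpStopped, SnpStarted);
}

STATUS SketchStop(DEVICE *Dev)
{
  return Transition(Dev, SnpStarted, SnpStopped);
}

STATUS SketchInitialize(DEVICE *Dev)
{
  return Transition(Dev, SnpStarted, SnpInitialized);
}

STATUS SketchShutdown(DEVICE *Dev)
{
  return Transition(Dev, SnpInitialized, SnpStarted);
}

/* Unbinding is only acceptable from the Stopped state. */
int SketchMayUnbind(const DEVICE *Dev)
{
  return Dev != NULL && Dev->State == SnpStopped;
}
```

In this model a caller must walk the states in order; for instance,
SketchInitialize on a freshly bound (Stopped) instance fails until SketchStart
has succeeded.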


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device is (has been)
reset. No resources are allocated for networking / traffic purposes. The MAC
address and other device attributes have been retrieved from the device (this
is necessary for completing the VirtioNetDriverBindingStart transition).

The EfiSimpleNetworkStarted state is completely identical to the
EfiSimpleNetworkStopped state for virtio-net, in the functional and
resource-usage sense. This state is mandated / provided by the Simple Network
Protocol for flexibility that the virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration (not their actual effect -- a more liberal
  filter setting than requested is allowed by the UEFI specification).
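
To make the multicast address transformation concrete, here is a minimal sketch
of the standard IP-to-Ethernet multicast mapping (RFC 1112 for IPv4, RFC 2464
for IPv6). The function names are hypothetical, and it is an assumption that
VirtioNetMcastIpToMac follows exactly this mapping; the sketch also skips the
check that the input really is a multicast address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* IPv4 (RFC 1112): 01:00:5E, then the low 23 bits of the group address. */
void SketchMcastIp4ToMac(const uint8_t Ip[4], uint8_t Mac[6])
{
  Mac[0] = 0x01;
  Mac[1] = 0x00;
  Mac[2] = 0x5E;
  Mac[3] = Ip[1] & 0x7F;   /* drop the top bit: only 23 bits are mapped */
  Mac[4] = Ip[2];
  Mac[5] = Ip[3];
}

/* IPv6 (RFC 2464): 33:33, then the low 32 bits of the group address. */
void SketchMcastIp6ToMac(const uint8_t Ip[16], uint8_t Mac[6])
{
  Mac[0] = 0x33;
  Mac[1] = 0x33;
  memcpy(&Mac[2], &Ip[12], 4);
}
```

For example, 239.255.255.250 maps to 01:00:5E:7F:FF:FA, and ff02::1 maps to
33:33:00:00:00:01.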

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is important to implement critical sections (atomic or
exclusive access to data or a device). This notion is defined as Task Priority
Levels.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blocking by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The task priority level for the WaitForPacket callback is also set by the
  driver; the choice is TPL_CALLBACK again. This in effect serializes the
  WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal"
  parts of the driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, any network driver has
  to stop any in-flight DMA transfers, lest it corrupt OS memory. For this
  reason EXIT_BOOT_SERVICES is emitted and the network driver must abort
  in-flight DMA transfers.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.
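
The decision logic of the two callbacks can be sketched in standalone C as
follows. This is a simplified model under stated assumptions: the DEVICE
fields, the boolean flags standing in for "signal the event" and "reset the
NIC", and the Sketch* function names are all hypothetical. The TPL_CALLBACK
serialization itself is not modeled; the sketch only shows what each callback
checks once it runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SnpStopped, SnpStarted, SnpInitialized } SNP_STATE;

/* Hypothetical, reduced driver instance. */
typedef struct {
  SNP_STATE State;
  uint16_t  RxLastUsed;     /* Used Index last processed by the guest */
  uint16_t  RxUsedIdx;      /* Used Index as last published by the host */
  bool      PacketSignaled; /* stands in for signaling WaitForPacket */
  bool      DeviceReset;    /* stands in for resetting the virtio NIC */
} DEVICE;

/* Models the WaitForPacket notification: only look at the Rx queue in the
   Initialized state, and signal the event if the host has used any chain that
   the guest has not consumed yet. */
void SketchIsPacketAvailable(DEVICE *Dev)
{
  assert(Dev != NULL);
  if (Dev->State != SnpInitialized) {
    return;
  }
  if (Dev->RxLastUsed != Dev->RxUsedIdx) {
    Dev->PacketSignaled = true;
  }
}

/* Models the EXIT_BOOT_SERVICES notification: a device that is live gets
   reset so that no transfer can touch memory later owned by the OS. */
void SketchExitBoot(DEVICE *Dev)
{
  assert(Dev != NULL);
  if (Dev->State == SnpInitialized) {
    Dev->DeviceReset = true;
  }
}
```

In the Stopped and Started states both functions are no-ops, which matches the
statement that no networking resources exist in those states.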


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                     Available Index       Available Index
                     last processed          incremented
                       by the host           by the guest
                           v       ------->        v
Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+
                              =D6     =D2

       D2         D3          D4         D5          D6         D7
Descr. +----------+----------++----------+----------++----------+----------+
Table  |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
       +----------+----------++----------+----------++----------+----------+
        =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


            A2        A3     A4       A5     A6       A7
Receive     +---------------+---------------+---------------+
Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area        +---------------+---------------+---------------+

                Used Index                               Used Index incremented
        last processed by the guest                            by the host
                    v                    ------->                   v
Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+
                                     =D4

In VirtioNetInitRx, the guest allocates the fixed-size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(guest-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed-size (1514 byte)
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.
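
The static layout described above can be sketched in standalone C. This is an
illustration under assumptions, not the body of VirtioNetInitRx: the 10-byte
header size assumes the legacy virtio-net header without mergeable Rx buffers,
RX_CHAINS is an arbitrary small number, and RX_RING and SketchInitRx are
hypothetical names. The descriptor fields and the NEXT / WRITE flag values
follow the Virtio specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT   1u     /* chain continues at the Next index */
#define VRING_DESC_F_WRITE  2u     /* buffer is written by the host */

#define VNET_HDR_SIZE  10u         /* assumption: legacy virtio-net header */
#define PKT_SIZE       1514u       /* packet sub-slice size named in the text */
#define RX_CHAINS      4u          /* hypothetical number of Rx chains */
#define QUEUE_SIZE     (2u * RX_CHAINS)

typedef struct {
  uint64_t Addr;
  uint32_t Len;
  uint16_t Flags;
  uint16_t Next;
} VRING_DESC;

typedef struct {
  VRING_DESC Desc[QUEUE_SIZE];
  uint16_t   AvailRing[QUEUE_SIZE];
  uint16_t   AvailIdx;
} RX_RING;

/* AreaBase is the (guest-physical) start of the Receive Destination Area. */
void SketchInitRx(RX_RING *Ring, uint64_t AreaBase)
{
  uint16_t N;

  assert(Ring != NULL);
  Ring->AvailIdx = 0;
  for (N = 0; N < RX_CHAINS; N++) {
    uint16_t Head  = (uint16_t)(2u * N);      /* even index */
    uint16_t Tail  = (uint16_t)(Head + 1u);   /* odd index */
    uint64_t Slice = AreaBase + (uint64_t)N * (VNET_HDR_SIZE + PKT_SIZE);

    /* Head descriptor: virtio-net request header, linked to the tail. */
    Ring->Desc[Head].Addr  = Slice;
    Ring->Desc[Head].Len   = VNET_HDR_SIZE;
    Ring->Desc[Head].Flags = VRING_DESC_F_WRITE | VRING_DESC_F_NEXT;
    Ring->Desc[Head].Next  = Tail;

    /* Tail descriptor: packet data, end of the chain. */
    Ring->Desc[Tail].Addr  = Slice + VNET_HDR_SIZE;
    Ring->Desc[Tail].Len   = PKT_SIZE;
    Ring->Desc[Tail].Flags = VRING_DESC_F_WRITE;
    Ring->Desc[Tail].Next  = 0;

    /* Offer the chain to the host by publishing its head index. */
    Ring->AvailRing[Ring->AvailIdx % QUEUE_SIZE] = Head;
    Ring->AvailIdx++;
  }
}
```

With these numbers each slice is 1524 bytes long, so A(2*N) is the area base
plus N * 1524 and A(2*N+1) lies 10 bytes after it.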

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and -- following the Next link there
  -- D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. This enables the guest to
  identify the length of Rx packets.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (i.e. 2*N) to the Available Ring.

- Because the host can theoretically process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is virtually random. (Except right after the initial population in
  VirtioNetInitRx, when the Available Ring is full and increasing, and the Used
  Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).
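
The guest side of these steps can be sketched as follows. This is a simplified
standalone model, not VirtioNetReceive itself: RX_RING, the Slice lookup table
and SketchReceive are hypothetical, the 10-byte header size assumes the legacy
virtio-net header, and memory barriers and host notification are left out.
Since Len covers the whole chain, the sketch derives the packet length by
subtracting the header size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE     8u
#define VNET_HDR_SIZE  10u      /* assumption: legacy virtio-net header */
#define PKT_SIZE       1514u

typedef struct { uint32_t Id; uint32_t Len; } VRING_USED_ELEM;

typedef struct {
  uint16_t        AvailRing[QUEUE_SIZE];
  uint16_t        AvailIdx;            /* advanced by the guest */
  VRING_USED_ELEM UsedRing[QUEUE_SIZE];
  uint16_t        UsedIdx;             /* advanced by the host */
  uint16_t        LastUsed;            /* guest's private read cursor */
  const uint8_t  *Slice[QUEUE_SIZE];   /* hypothetical: slice per head index */
} RX_RING;

/* Returns the packet length, -1 if the Used Ring holds nothing new (the
   EFI_NOT_READY case), or -2 if the host reported an implausible element.
   Buf must have room for PKT_SIZE bytes. */
int SketchReceive(RX_RING *Ring, uint8_t *Buf)
{
  const VRING_USED_ELEM *Elem;
  uint32_t              PktLen;
  uint16_t              Head;

  assert(Ring != NULL && Buf != NULL);
  if (Ring->LastUsed == Ring->UsedIdx) {
    return -1;
  }
  Elem = &Ring->UsedRing[Ring->LastUsed % QUEUE_SIZE];
  if (Elem->Id >= QUEUE_SIZE ||
      Elem->Len < VNET_HDR_SIZE ||
      Elem->Len > VNET_HDR_SIZE + PKT_SIZE) {
    return -2;
  }
  Head   = (uint16_t)Elem->Id;
  PktLen = Elem->Len - VNET_HDR_SIZE;

  /* Copy the packet out, skipping the virtio-net request header. */
  memcpy(Buf, Ring->Slice[Head] + VNET_HDR_SIZE, PktLen);
  Ring->LastUsed++;

  /* Recycle the chain: put its head index back on the Available Ring. */
  Ring->AvailRing[Ring->AvailIdx % QUEUE_SIZE] = Head;
  Ring->AvailIdx++;
  return (int)PktLen;
}
```

Returning -2 without consuming the element is a choice made for this sketch
only; the text above does not say how the driver treats a malformed element.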


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
in the following:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the caller-supplied packet buffer
  whenever VirtioNetTransmit places the corresponding head descriptor on the
  Available Ring. The caller is responsible for hanging on to the unmodified
  buffer until it is reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

- If the stack is empty (that is, each descriptor chain, in isolation, is
  either pending transmission, or has been processed by the host but not
  yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
  EFI_NOT_READY.

- Otherwise the index of a free chain's head descriptor is popped from the
  stack. The linked tail descriptor is re-pointed as discussed above. The head
  descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is fetched from the tail descriptor
  (where it has been stored at VirtioNetTransmit time) and returned to the
  caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet -- VirtioNetTransmit had limited it to at
  most 1514 bytes. The Virtio specification suggests this packet size is always
  accepted (and a lower MTU could be encountered on any later hop as well).
  Additionally, there's no good way to report a short transmit via
  VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious judging from the
  specification, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
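
The bookkeeping in the steps above can be sketched in standalone C. This is a
reduced model, not the driver's code: TX_RING and the Sketch* functions are
hypothetical, the descriptor table is reduced to the tail descriptors' address
and length, the Used Ring keeps only the head index, and memory barriers and
host notification are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_CHAINS   2u                 /* hypothetical number of Tx chains */
#define QUEUE_SIZE  (2u * TX_CHAINS)

typedef struct {
  uintptr_t TailAddr[QUEUE_SIZE];   /* models Desc[Head + 1].Addr */
  uint32_t  TailLen[QUEUE_SIZE];    /* models Desc[Head + 1].Len */
  uint16_t  AvailRing[QUEUE_SIZE];
  uint16_t  AvailIdx;               /* advanced by the guest */
  uint16_t  UsedRing[QUEUE_SIZE];   /* head indices only; Len is ignored */
  uint16_t  UsedIdx;                /* advanced by the host */
  uint16_t  LastUsed;               /* guest's private read cursor */
  uint16_t  FreeStack[TX_CHAINS];   /* head indices of free chains */
  uint16_t  FreeCount;
} TX_RING;

/* All chains start out free; every stack element is an even head index. */
void SketchInitTx(TX_RING *Ring)
{
  uint16_t N;

  memset(Ring, 0, sizeof *Ring);
  for (N = 0; N < TX_CHAINS; N++) {
    Ring->FreeStack[Ring->FreeCount++] = (uint16_t)(2u * N);
  }
}

/* Returns 0 on success, -1 if no chain is free (the EFI_NOT_READY case). */
int SketchTransmit(TX_RING *Ring, const void *Buf, uint32_t Len)
{
  uint16_t Head;

  if (Ring->FreeCount == 0) {
    return -1;
  }
  Head = Ring->FreeStack[--Ring->FreeCount];

  /* Re-point the tail descriptor at the caller's buffer. */
  Ring->TailAddr[Head + 1] = (uintptr_t)Buf;
  Ring->TailLen[Head + 1]  = Len;

  Ring->AvailRing[Ring->AvailIdx % QUEUE_SIZE] = Head;
  Ring->AvailIdx++;
  return 0;
}

/* Returns one transmitted buffer, or NULL if the Used Ring has nothing new. */
const void *SketchGetTxBuf(TX_RING *Ring)
{
  uint16_t Head;

  if (Ring->LastUsed == Ring->UsedIdx) {
    return NULL;
  }
  Head = Ring->UsedRing[Ring->LastUsed % QUEUE_SIZE];
  Ring->LastUsed++;

  assert(Ring->FreeCount < TX_CHAINS);
  Ring->FreeStack[Ring->FreeCount++] = Head;   /* chain is free again */
  return (const void *)Ring->TailAddr[Head + 1];
}
```

Because free chains are kept on a stack, the most recently recycled chain is
reused first; together with possible host reordering this is why the head
indices on either ring follow no fixed order.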