MAC(9E) | Driver Entry Points | MAC(9E) |
mac, GLDv3 - MAC networking device driver overview
#include <sys/mac_provider.h>
#include <sys/mac_ether.h>
illumos DDI specific
The MAC framework provides a means for implementing high-performance networking device drivers. It is the successor to the GLD interfaces and is sometimes referred to as the GLDv3. The remainder of this manual introduces the aspects of writing device drivers that leverage the MAC framework. While both GLDv3 and the MAC framework refer to the same thing, this manual page uses the term MAC framework to refer to the device driver interface.
MAC device drivers are character devices. They define the standard _init(9E), _fini(9E), and _info(9E) entry points to initialize the module, as well as dev_ops(9S) and cb_ops(9S) structures.
The main interface with MAC is through a series of callbacks defined in a mac_callbacks(9S) structure. These callbacks control all aspects of the device, including sending data, getting and setting properties, controlling MAC address filters, and managing promiscuous mode.
The MAC framework takes care of many aspects of the device driver's management. A device that uses the MAC framework does not have to worry about creating device nodes or implementing open(9E) or close(9E) routines. In addition, all of the work to interact with dlpi(4P) is taken care of automatically and transparently.
At a high level, a device driver is chiefly concerned with three general operations: sending frames, receiving frames, and managing device configuration and state.
When sending frames, the MAC framework always calls functions registered in the mac_callbacks(9S) structure to have the driver transmit frames on hardware. When receiving frames, the driver will generally receive an interrupt which will cause it to check for incoming data and deliver it to the MAC framework.
Configuration of a device, such as whether auto-negotiation should be enabled, the speeds that the device supports, the MTU (maximum transmission unit), and the generation of pause frames are all driven by properties. The functions to get, set, and obtain information about properties are defined through callback functions specified in the mac_callbacks(9S) structure. The full list of properties and a description of the relevant callbacks can be found in the PROPERTIES section.
The MAC framework is designed to take advantage of various modern features provided by hardware, such as checksumming, segmentation offload, and hardware filtering. The MAC framework assumes none of these advanced features are present and allows device drivers to negotiate them through a capability system. Drivers can declare that they support various capabilities by implementing the optional mc_getcapab(9E) entry point. Each capability has its associated entry points and structures to fill out. The capabilities are detailed in the CAPABILITIES section.
The following sections describe the flow of a basic device driver. For advanced device drivers, the flow is generally the same. The primary distinction is in how frames are sent and received.
For a device to be used by the MAC framework, it must register with the framework and take specific actions during _init(9E), attach(9E), detach(9E), and _fini(9E).
All device drivers have to define a dev_ops(9S) structure which is pointed to by a modldrv(9S) structure and the corresponding NULL-terminated modlinkage(9S) structure. The dev_ops(9S) structure should have a cb_ops(9S) structure defined for it; however, it does not need to implement any of the standard cb_ops(9S) entry points unless it also exposes a custom set of device nodes not otherwise managed by the MAC framework. See the Custom Device Nodes section for more details.
Normally, in a driver's _init(9E) entry point, it passes its modlinkage(9S) structure directly to mod_install(9F). To properly register with MAC, the driver must call mac_init_ops(9F) before it calls mod_install(9F). If for some reason the mod_install(9F) function fails, then the driver must undo that registration by calling mac_fini_ops(9F).
Conversely, in the driver's _fini(9E) routine, it should call mac_fini_ops(9F) after it successfully calls mod_remove(9F). For an example of how to use the mac_init_ops(9F) and mac_fini_ops(9F) functions, see the examples section in mac_init_ops(9F).
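The ordering described above can be modeled in user space. The following is only a sketch of the call order: the stub_* functions and the example_* names are hypothetical stand-ins for mac_init_ops(9F), mac_fini_ops(9F), mod_install(9F), mod_remove(9F), and a driver's _init(9E) and _fini(9E) routines, and they do not take the real arguments. The examples section of mac_init_ops(9F) remains the reference for real driver code.

```c
#include <stdbool.h>

/*
 * User-space model of the call ordering only. The stub_* functions are
 * hypothetical stand-ins for the kernel routines named in the text.
 */
bool mac_ops_active;	/* true between the init_ops and fini_ops stubs */
int install_result;	/* value the stubbed mod_install returns */
int remove_result;	/* value the stubbed mod_remove returns */

void stub_mac_init_ops(void) { mac_ops_active = true; }
void stub_mac_fini_ops(void) { mac_ops_active = false; }
int stub_mod_install(void) { return (install_result); }
int stub_mod_remove(void) { return (remove_result); }

/* Mirrors the _init(9E) ordering: init_ops first, undo on failure. */
int
example_init(void)
{
	int ret;

	stub_mac_init_ops();
	ret = stub_mod_install();
	if (ret != 0)
		stub_mac_fini_ops();
	return (ret);
}

/* Mirrors the _fini(9E) ordering: fini_ops only after a successful remove. */
int
example_fini(void)
{
	int ret;

	ret = stub_mod_remove();
	if (ret == 0)
		stub_mac_fini_ops();
	return (ret);
}
```

In this model, a failed remove leaves the MAC operations in place, which matches the requirement that mac_fini_ops(9F) is called only after mod_remove(9F) succeeds.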
A device may want to provide its own minor nodes as simple character or block devices backed by the usual cb_ops(9S) routines. The MAC framework allows for this by leaving a portion of the minor number space available for private driver use. mac_private_minor(9F) returns the first minor number a driver may use for its own purposes, e.g., to pass to ddi_create_minor_node(9F).
A driver making use of this ability must provide its own getinfo(9E) implementation that is aware of any such minor nodes. It must also delegate back to the MAC framework as appropriate via calls to either mac_getinfo(9F) or mac_devt_to_instance(9F) for MAC reserved minor nodes. It should also take care not to affect MAC reserved minors; for example, it should not remove all minor nodes associated with a device with a call such as:
ddi_remove_minor_node(dip, NULL);
Every instance of a device should register separately with MAC. To register with MAC, a driver must allocate a mac_register(9S) structure, fill it in, and then call mac_register(9F). The mac_register_t structure contains information about the device and all of the required function pointers that will be used as callbacks by the framework.
These steps should all be taken during a device's attach(9E) entry point. It is recommended that the driver perform this sequence of steps after the device has finished its initialization of the chipset and interrupts, though interrupts should not be enabled at that point. After it calls mac_register(9F) it will start receiving callbacks from the MAC framework.
To allocate the registration structure, the driver should call mac_alloc(9F). Device drivers should generally always pass the symbol MAC_VERSION as the argument to mac_alloc(9F). Upon successful completion, the driver will receive a mac_register_t structure which it should fill in. The structure and its members are documented in mac_register(9S).
The mac_callbacks(9S) structure is not allocated as a part of the mac_register(9S) structure. In general, device drivers declare this statically. See the MAC Callbacks section for more information on how to fill it out.
Once the structure has been filled in, the driver should call mac_register(9F) to register itself with MAC. The handle that it uses to register with should be part of the driver's soft state. It will be used in various other support functions and callbacks.
If the call is successful, then the device driver should enable interrupts and finish any other initialization required. If the call to mac_register(9F) failed, then it should unwind its initialization and should return DDI_FAILURE from its attach(9E) routine.
The driver does not need to hold onto an allocated mac_register(9S) structure after it has called the mac_register(9F) function. Whether the mac_register(9F) function returns successfully or not, the driver may free its mac_register(9S) structure by calling the mac_free(9F) function.
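The registration sequence above can be sketched as a user-space model. Everything here is an assumption made for illustration: the ex_* types and the stub_mac_alloc, stub_mac_register, and stub_mac_free functions are hypothetical stand-ins with simplified signatures, not the real mac_alloc(9F), mac_register(9F), and mac_free(9F) interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for mac_register_t and the driver soft state. */
typedef struct ex_reg {
	void *er_driver;	/* pointer back to the soft state */
} ex_reg_t;

typedef struct ex_soft {
	int es_handle;		/* nonzero once registered */
	bool es_intr_on;	/* interrupts enabled */
} ex_soft_t;

int ex_reg_outstanding;		/* live registration structures */
int ex_register_result;		/* value the stubbed register returns */

ex_reg_t *
stub_mac_alloc(void)
{
	ex_reg_t *r = calloc(1, sizeof (*r));

	if (r != NULL)
		ex_reg_outstanding++;
	return (r);
}

void
stub_mac_free(ex_reg_t *r)
{
	free(r);
	ex_reg_outstanding--;
}

int
stub_mac_register(ex_reg_t *r, int *handlep)
{
	(void) r;
	if (ex_register_result == 0)
		*handlep = 1;
	return (ex_register_result);
}

/* Returns 0 on success and -1 on failure, modeling attach(9E). */
int
ex_attach(ex_soft_t *sc)
{
	ex_reg_t *reg;
	int ret;

	/* Chipset and interrupt setup is assumed done, interrupts still off. */
	if ((reg = stub_mac_alloc()) == NULL)
		return (-1);
	reg->er_driver = sc;
	ret = stub_mac_register(reg, &sc->es_handle);
	stub_mac_free(reg);	/* freed whether or not registration worked */
	if (ret != 0)
		return (-1);	/* the caller unwinds its initialization */
	sc->es_intr_on = true;	/* interrupts are enabled only after success */
	return (0);
}
```

The model keeps the handle in the soft state and frees the registration structure on both paths, so no allocation outlives the attach attempt.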
The MAC framework interacts with a device driver through a series of callbacks. These callbacks are described in their individual manual pages and the collection of callbacks is indicated in the mac_callbacks(9S) manual page. This section does not focus on the specific functions, but rather on interactions between them and the rest of the device driver framework.
A device driver should make no assumptions about when the various callbacks will be called and whether or not they will be called simultaneously. For example, a device driver may be asked to transmit data through a call to its mc_tx(9E) entry point while it is being asked to get a device property through a call to its mc_getprop(9E) entry point. As such, while some calls may be serialized to the device, such as setting properties, the device driver should always presume that all of its data needs to be protected with locks. While the device is holding locks, it is safe for it to call the following MAC routines:
Any other MAC related routines should not be called with locks held, such as mac_link_update(9F) or mac_rx(9F). Other routines in the DDI may be called while locks are held; however, device driver writers should be careful about calling blocking routines while locks are held or in interrupt context, even when it is legal to do so as this may cause all other callers that need a given lock to back up behind such an operation.
A device driver will often receive data through the means of an interrupt or by being asked to poll for frames. When this occurs, zero or more frames, each with optional metadata, may be ready for the device driver to consume. Often each frame has a corresponding descriptor which has information about whether or not there were errors or whether or not the device successfully checksummed the packet. In addition to the per-packet flow described below, there are certain requirements that drivers must adhere to when programming the hardware to receive data. See the section RECEIVE DESCRIPTOR LAYOUT for more information.
During a single interrupt or poll request, a device driver should process a fixed number of frames. For each frame the device driver should:
Once all the frames have been processed and assembled, the device driver should deliver them to the rest of the operating system by calling mac_rx(9F). The device driver should try to give as many mblk_t structures to the system at once as it can. It should not call mac_rx(9F) once for every assembled mblk_t.
The device driver must not hold any locks across the call to mac_rx(9F). When this function is called, received data will be pushed through the networking stack and some replies may be generated and given to the driver to send out.
It is not the device driver's responsibility to determine whether or not the system can keep up with a driver's delivery rate of frames. The rest of the networking stack will handle issues related to keeping up appropriately and ensure that kernel memory is not exhausted by packets that are not being processed.
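The batching behavior described above can be sketched as a user-space model. The ex_frame_t type is a hypothetical stand-in for mblk_t that models only the b_next linkage, and stub_mac_rx is a hypothetical stand-in for mac_rx(9F) that simply counts what it is given. The budget parameter represents the fixed number of frames processed per interrupt or poll request.

```c
#include <stddef.h>

/* Hypothetical stand-in for mblk_t; only the b_next linkage is modeled. */
typedef struct ex_frame {
	struct ex_frame *ef_next;
	int ef_id;
} ex_frame_t;

int ex_deliveries;		/* number of stubbed mac_rx calls */
int ex_frames_delivered;	/* total frames handed over */

/* Stub for mac_rx(9F): counts calls and walks the chain. */
void
stub_mac_rx(ex_frame_t *chain)
{
	ex_deliveries++;
	for (; chain != NULL; chain = chain->ef_next)
		ex_frames_delivered++;
}

/*
 * Process at most 'budget' of the 'nready' frames in 'ring', link them
 * into a single chain, and deliver the chain with one call. Returns the
 * number of frames processed.
 */
int
ex_rx_process(ex_frame_t *ring, int nready, int budget)
{
	ex_frame_t *head = NULL;
	ex_frame_t **tailp = &head;
	int n = (nready < budget) ? nready : budget;
	int i;

	for (i = 0; i < n; i++) {
		ring[i].ef_next = NULL;
		*tailp = &ring[i];		/* append to the chain */
		tailp = &ring[i].ef_next;
	}

	/* A real driver would have dropped its locks before this call. */
	if (head != NULL)
		stub_mac_rx(head);
	return (n);
}
```

Keeping a pointer to the tail's link field makes each append constant time, so the whole batch is assembled in one pass and delivered once.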
If the device driver has negotiated the MAC_CAPAB_RINGS capability (discussed in mac_capab_rings(9E)) then it should call mac_rx_ring(9F) and not mac_rx(9F). A given interrupt may correspond to more than one ring that needs to be checked. The set of rings is likely to span different groups that were registered with MAC through the mr_gget(9E) interface. In those cases, the driver should follow the above procedure independently for each ring. That means it will call mac_rx_ring(9F) once for each ring using the handle that it received when MAC called the driver's mr_rget(9E) entry point. When it is looking at the rings, the driver will need to make sure that the ring has not had interrupts disabled (due to a pending change to polling mode). This is discussed in greater detail in the mac_capab_rings(9E) and mri_poll(9E) manual pages.
Finally, the device driver should make sure that any other housekeeping activities required for the ring are taken care of such that more data can be received.
A device driver will be asked to transmit a message block chain by having its mc_tx(9E) entry point called. While the driver is processing the message blocks, it may run out of resources. For example, a transmit descriptor ring may become full. At that point, the device driver should return the remaining unprocessed frames. The act of returning frames indicates that the device has asserted flow control. Once this has been done, no additional calls will be made to the driver's transmit entry point and the back pressure will be propagated throughout the rest of the networking stack.
At some point in the future when resources have become available again, for example after an interrupt indicating that some portion of the transmit ring has been sent, then the device driver must notify the system that it can continue transmission. To do this, the driver should call mac_tx_update(9F). After that point, the driver will receive calls to its mc_tx(9E) entry point again. As mentioned in the section on callbacks, the device driver should avoid holding any particular locks across the call to mac_tx_update(9F).
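The transmit flow control handshake described above can be sketched as a user-space model. The ex_pkt_t and ex_txring_t types are hypothetical, ex_tx models an mc_tx(9E) style routine, and the et_updates counter stands in for a call to mac_tx_update(9F). Real entry points take different arguments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mblk_t; only the b_next linkage is modeled. */
typedef struct ex_pkt {
	struct ex_pkt *ep_next;
} ex_pkt_t;

typedef struct ex_txring {
	int et_free;		/* free transmit descriptors */
	bool et_blocked;	/* true after frames were returned to MAC */
	int et_updates;		/* count of stubbed mac_tx_update calls */
} ex_txring_t;

/*
 * Consume frames while descriptors remain and return the unprocessed
 * remainder of the chain, or NULL when every frame was queued.
 */
ex_pkt_t *
ex_tx(ex_txring_t *r, ex_pkt_t *chain)
{
	while (chain != NULL && r->et_free > 0) {
		ex_pkt_t *next = chain->ep_next;

		chain->ep_next = NULL;	/* frame is now owned by the ring */
		r->et_free--;
		chain = next;
	}
	if (chain != NULL)
		r->et_blocked = true;	/* returning frames asserts flow control */
	return (chain);
}

/*
 * Model of a transmit completion interrupt: reclaim descriptors and, if
 * the ring had been blocked, notify the framework.
 */
void
ex_tx_reclaim(ex_txring_t *r, int ndone)
{
	bool notify;

	r->et_free += ndone;
	notify = r->et_blocked && r->et_free > 0;
	if (notify)
		r->et_blocked = false;

	/* A real driver would drop its locks before notifying MAC. */
	if (notify)
		r->et_updates++;
}
```

The blocked flag ensures the notification happens only when the driver previously returned frames, rather than on every completion interrupt.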
For devices operating at higher data rates, interrupt coalescing is an important part of a well functioning device and may impact the performance of the device. Not all devices support interrupt coalescing. If interrupt coalescing is supported on the device, it is recommended that device driver writers provide private properties for their device to control the interrupt coalescing rate. This will make it much easier to perform experiments and observe the impact of different interrupt rates on the rest of the system.
Even with interrupt coalescing, when there is a certain incoming packet rate it can make more sense to just actively poll the device, asking for more packets rather than constantly taking an interrupt. When a device driver supports the mac_capab_rings(9E) capability and therefore polling on receive rings, the MAC framework will ask the driver to disable interrupts, with its mi_disable(9E) entry point, and then subsequently call its polling entry point, mri_poll(9E).
As long as a device driver implements the needed entry points, then there is nothing else that it needs to do to take advantage of polling. A driver should not attempt to spin up its own threads, task queues, or creatively use timeouts, to try to simulate polling for received packets.
The MAC framework will attempt to use as many MAC address filters as a device has. To program a multicast address filter, the driver's mc_multicst(9E) entry point will be called. If the device driver runs out of filters, it should not take any special action and just return the appropriate error as documented in the corresponding manual pages for the entry points. The framework will ensure that the device is placed in promiscuous mode if it needs to.
If the hardware supports more than one unicast filter then the device driver should consider implementing the MAC_CAPAB_RINGS capability, which exposes a means for multiple unicast MAC address filters to be used by the broader system. It is still useful to implement this on hardware which only has a single ring. See mac_capab_rings(9E) for more information.
Receive side scaling is where a hardware device supports multiple, independent queues of frames that can be received. Each of these queues is generally associated with an independent interrupt and the hardware usually performs some form of hash across the queues. Drivers for hardware which supports this should consider implementing the MAC_CAPAB_RINGS capability; see mac_capab_rings(9E) for more information.
It is the responsibility of the device driver to keep track of the data link's state. Many devices provide a means of receiving an interrupt when the state of the link changes. When such a change happens, the driver should update its internal data structures and then call mac_link_update(9F) to inform the MAC layer that this has occurred. If the device driver does not properly inform the system about link changes, then various features like link aggregations and other mechanisms that leverage the link state will not work correctly.
Many networking devices support more than one possible speed that they can operate at. The selection of a speed is often performed through auto-negotiation, though some devices allow the user to control what speeds are advertised and used.
Logically, there are two different sets of things that the device driver needs to keep track of while it's operating:
By default, when a link first comes up, the device driver should generally configure the link to support the common set of speeds and perform auto-negotiation.
A user can control what speeds a device advertises via auto-negotiation and whether or not it performs auto-negotiation at all by using a series of properties that have _EN_ in the name. These are read/write properties and there is one for each speed supported in the operating system. For a full list of them, see the PROPERTIES section.
In addition to these properties, there is a corresponding set of properties with _ADV_ in the name. These are similar to the _EN_ family of properties, but they are read-only and indicate what the device has actually negotiated. While they are generally similar to the _EN_ family of properties, they may change depending on power settings. See the Ethernet Link Properties section in dladm(8) for more information.
It's worth discussing how these different values get used throughout the different entry points. The first entry point to consider is the mc_propinfo(9E) entry point. For a given speed, the driver should consult whether or not the hardware supports this speed. If it does, it should fill in the default value that the hardware takes and indicate whether or not the property is writable. This holds for both the _EN_ and _ADV_ families of properties.
The next entry point is mc_getprop(9E). Here, the device should first consult whether the given speed is supported. If it is not, then the driver should return ENOTSUP. If it is, then it should return the current value of the property.
The last property entry point is mc_setprop(9E). Here, the same logic applies. Before the driver considers whether or not the property is writable, it should first check whether or not it's a supported property. If it's not, then it should return ENOTSUP. Otherwise, it should proceed to check whether the property is writable. If it is writable and the new value is valid, then the driver should update the property and restart the link's negotiation.
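The order of checks described for mc_getprop(9E) and mc_setprop(9E) can be sketched as a user-space model. The ex_speed_prop_t type and the ex_* functions are hypothetical and take simplified arguments. Only ENOTSUP comes from the text above; the use of EPERM for a read-only property and EINVAL for an invalid value is an assumption made for this sketch, so consult mc_setprop(9E) for the documented errors.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-speed state for one _EN_ property. */
typedef struct ex_speed_prop {
	bool esp_supported;		/* hardware supports this speed */
	bool esp_writable;		/* advertisement can be toggled */
	unsigned char esp_value;	/* current value: 0 or 1 */
} ex_speed_prop_t;

int ex_link_restarts;	/* counts stubbed link renegotiations */

/* Model of the mc_getprop(9E) checks for one speed property. */
int
ex_getprop(const ex_speed_prop_t *p, unsigned char *valp)
{
	if (!p->esp_supported)
		return (ENOTSUP);
	*valp = p->esp_value;
	return (0);
}

/* Model of the mc_setprop(9E) checks, in the order described above. */
int
ex_setprop(ex_speed_prop_t *p, unsigned char val)
{
	if (!p->esp_supported)
		return (ENOTSUP);	/* support is checked first */
	if (!p->esp_writable)
		return (EPERM);		/* assumed errno for read-only */
	if (val > 1)
		return (EINVAL);	/* assumed errno for a bad value */
	p->esp_value = val;
	ex_link_restarts++;		/* stands in for restarting negotiation */
	return (0);
}
```

Checking support before writability means an unsupported speed always reports ENOTSUP, regardless of whether the property would otherwise be read-only.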
Finally, there is the mc_getstat(9E) entry point. Several of the statistics that are queried relate to auto-negotiation and hardware capabilities. When a statistic relates to the hardware supporting a given speed, the _EN_ properties should be ignored. The only thing that should be consulted is what the hardware itself supports. Otherwise, the statistics should look at what is currently being advertised by the device.
During a driver's detach(9E) routine, it should unregister the device instance from MAC by calling mac_unregister(9F) with the handle that it obtained when it registered. If the call to mac_unregister(9F) fails, then the device is likely still in use and the driver should fail the call to detach(9E).
Administrators always interact with devices through the dladm(8) command line interface. The state of devices such as whether the link is considered up or down, various link properties such as the MTU, auto-negotiation state, and flow control state, are all exposed. It is also the preferred way that these properties are set and configured.
While device tunables may be presented in a driver.conf(5) file, it is recommended instead to expose such things through dladm(8) private properties, whether explicitly documented or not.
Capabilities in the MAC framework are optional hardware features that a device driver can advertise to the system. The two current capabilities that the system supports are the ability for hardware to perform large send offload (LSO), often also known as TCP segmentation offload, and the ability for hardware to calculate and verify the checksums present in IPv4, IPv6, and protocol headers such as TCP and UDP.
The MAC framework will query a device for support of a capability through the mc_getcapab(9E) function. Each capability has its own constant and may have corresponding data that goes along with it and a specific structure that the device is required to fill in. Note, the set of capabilities changes over time and there are also private capabilities in the system. Several of the capabilities are used in the implementation of the MAC framework. Others, like MAC_CAPAB_RINGS, represent features that have not been stabilized and thus both API and binary compatibility for them is not guaranteed. It is important that the device driver handles unknown capabilities correctly. For more information, see mc_getcapab(9E).
The following capabilities are stable and defined in the system:
MAC_CAPAB_HCKSUM
The MAC_CAPAB_HCKSUM capability indicates to the system that the device driver supports some amount of checksumming. The specific data for this capability is a pointer to a uint32_t. To indicate no support for any kind of checksumming, the driver should either set this value to zero or simply return that it doesn't support the capability.
Note, the values that the driver declares in this capability indicate what it can do when it transmits data. If the driver can only verify checksums when receiving data, then it should not indicate that it supports this capability. The following set of flags may be combined through a bitwise inclusive OR:
HCKSUM_INET_PARTIAL
HCKSUM_INET_FULL_V4
HCKSUM_INET_FULL_V6
HCKSUM_IPHDRCKSUM
When in a driver's transmit function, the driver will be processing a single frame. It should call mac_hcksum_get(9F) to see what checksum flags are set on it. Note that the flags that are set on it are different from the ones described above and are documented in its manual page. These flags indicate how the driver is expected to program the hardware and what checksumming is required. Not all frames will require hardware checksumming or will ask the hardware to checksum it.
If a driver supports offloading the receive checksum and verification, it should check to see what the hardware indicated was verified. The driver should then call mac_hcksum_set(9F). The flags used are different from the ones above and are discussed in detail in the mac_hcksum_set(9F) manual page. If there is no checksum information available or the driver does not support checksumming, then it should simply not call mac_hcksum_set(9F).
Note that the checksum flags should be set on the first mblk_t that makes up a given message. In other words, if multiple mblk_t structures are linked together by the b_cont member to describe a single frame, then mac_hcksum_set(9F) should only be called on the first mblk_t of that set. However, each distinct message should have the checksum bits set on it, if applicable. In other words, each mblk_t that is linked together by the b_next pointer may have checksum flags set.
It is recommended that device drivers provide a private property or driver.conf(5) property to control whether or not checksumming is enabled for both rx and tx; however, the default disposition is recommended to be enabled for both. This way if hardware bugs are found in the checksumming implementation, they can be disabled without requiring software updates. The transmit property should be checked when determining how to reply to mc_getcapab(9E) and the receive property should be checked in the context of the receive function.
MAC_CAPAB_LSO
The MAC_CAPAB_LSO capability indicates that the driver supports various forms of large send offload (LSO). The private data is a pointer to a mac_capab_lso_t structure. The system currently supports offloading TCP packets over both IPv4 and IPv6. This structure has the following members which are used to indicate various types of LSO support.
t_uscalar_t lso_flags;
lso_basic_tcp_ipv4_t lso_basic_tcp_ipv4;
lso_basic_tcp_ipv6_t lso_basic_tcp_ipv6;
The lso_flags member is used to indicate which members are valid and should be considered. Each flag represents a different form of LSO. The member should be set to the bitwise inclusive OR of the following values:
LSO_TX_BASIC_TCP_IPV4
LSO_TX_BASIC_TCP_IPV6
The lso_basic_tcp_ipv4 member is a structure with the following members:
t_uscalar_t lso_max
The lso_basic_tcp_ipv6 member is a structure with the following members:
t_uscalar_t lso_max
As with checksumming, it is recommended that driver writers provide a means of disabling LSO support even if it is enabled by default. This way, issues that arise with LSO can be worked around without requiring additional driver work.
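The way a driver might answer capability queries for the two stable capabilities, while honoring tunables that disable them, can be sketched as a user-space model. Every identifier here is hypothetical: the EX_* constants use placeholder values rather than the real values from the system headers, the ex_capab_lso_t type flattens the nested lso_max members, and ex_getcapab has a simplified signature compared with mc_getcapab(9E).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder capability identifiers and flag values for illustration. */
enum ex_capab { EX_CAPAB_HCKSUM, EX_CAPAB_LSO, EX_CAPAB_OTHER };

#define	EX_HCKSUM_INET_FULL_V4		0x1u
#define	EX_HCKSUM_INET_FULL_V6		0x2u
#define	EX_HCKSUM_IPHDRCKSUM		0x4u
#define	EX_LSO_TX_BASIC_TCP_IPV4	0x1u
#define	EX_LSO_TX_BASIC_TCP_IPV6	0x2u

typedef struct ex_capab_lso {
	uint32_t ecl_flags;	/* which members below are valid */
	uint32_t ecl_ipv4_max;	/* models lso_basic_tcp_ipv4.lso_max */
	uint32_t ecl_ipv6_max;	/* models lso_basic_tcp_ipv6.lso_max */
} ex_capab_lso_t;

typedef struct ex_softc {
	bool es_tx_cksum_enable;	/* hypothetical tunable */
	bool es_lso_enable;		/* hypothetical tunable */
	uint32_t es_lso_max;		/* hypothetical hardware limit */
} ex_softc_t;

/* Model of an mc_getcapab(9E) style routine. */
bool
ex_getcapab(const ex_softc_t *sc, enum ex_capab cap, void *data)
{
	switch (cap) {
	case EX_CAPAB_HCKSUM: {
		uint32_t *flags = data;

		if (!sc->es_tx_cksum_enable)
			return (false);
		*flags = EX_HCKSUM_INET_FULL_V4 | EX_HCKSUM_INET_FULL_V6 |
		    EX_HCKSUM_IPHDRCKSUM;
		return (true);
	}
	case EX_CAPAB_LSO: {
		ex_capab_lso_t *lso = data;

		if (!sc->es_lso_enable)
			return (false);
		lso->ecl_flags = EX_LSO_TX_BASIC_TCP_IPV4 |
		    EX_LSO_TX_BASIC_TCP_IPV6;
		lso->ecl_ipv4_max = sc->es_lso_max;
		lso->ecl_ipv6_max = sc->es_lso_max;
		return (true);
	}
	default:
		/* Unknown capabilities are reported as unsupported. */
		return (false);
	}
}
```

The default case is what lets this model cope with capabilities added to the system later: anything it does not recognize is declined without touching the data pointer.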
The following capabilities are still evolving in the operating system. They are documented such that device driver writers may experiment with them. However, if such drivers are not present inside the core operating system repository, they may be subject to API and ABI breakage.
MAC_CAPAB_RINGS
The MAC_CAPAB_RINGS capability is very important for implementing a high-performing device driver. Networking hardware structures the queues of packets to be sent and received into a ring. Each entry in this ring has a descriptor, which describes the address and options for a packet which is going to be transmitted or received. While simple networking devices only have a single ring, most high-speed networking devices have support for many rings.
Rings are used for two important purposes. The first is receive side scaling (RSS), which is the ability to have the hardware hash the contents of a packet based on some of the protocol headers, and send it to one of several rings. These different rings may each have their own interrupt associated with them, allowing the card to receive traffic in parallel. Similar logic can be performed when sending traffic, to leverage multiple hardware resources, thus increasing capacity.
The second use of rings is to group them together and apply filtering rules. For example, if a packet matches a specific VLAN or MAC address, then it can be sent to a specific ring or a specific group of rings. This is especially useful when there are multiple different virtual NICs or zones in play as the operating system will be able to use the hardware classification features to already know where a given packet needs to be delivered internally rather than having to determine that for each packet.
From the MAC framework's perspective, a driver can have one or more groups. A group consists of the following:
The details around how a device driver changes when rings are employed, the data structures that a driver must implement, and more are available in mac_capab_rings(9E).
MAC_CAPAB_TRANSCEIVER
Many networking devices leverage external transceivers that adhere to standards such as SFP, QSFP, QSFP-DD, etc., which often contain standardized information in an EEPROM on the device. The MAC_CAPAB_TRANSCEIVER capability provides a means of discovering the number of transceivers, their types, and reading the data from a transceiver. This allows administrators and users to determine if devices are present, if the hardware can use them, and in many cases, detailed information about the device ranging from its manufacturer and serial numbers to specific information about its health. Implementing this capability will lead to the operating system being able to discover and display transceivers as part of its fault management topology.
See mac_capab_transceiver(9E) for more details on the capability structure and the various function entry points that come along with it.
MAC_CAPAB_LED
The MAC_CAPAB_LED capability provides a means to access and control the LEDs on a network interface card. This is then made available to the broader operating system and consumed by facilities such as the Fault Management Architecture. See mac_capab_led(9E) for more details on the structure and requirements of the capability.
Properties in the MAC framework represent aspects of a link. These include things like the link's current state and MTU. Many of the properties in the system are focused around auto-negotiation and controlling what link speeds are advertised. Information about properties is covered by three different device entry points. The mc_propinfo(9E) entry point obtains metadata about the property. The mc_getprop(9E) entry point obtains the property. The mc_setprop(9E) entry point updates the property to a new value.
Many of the properties listed below are read-only. Each property indicates whether it's read-only or read/write. However, driver writers may not implement the ability to set all writable properties. Many of these depend on the card itself. In particular, all properties that relate to auto-negotiation and are read/write may not be updated if the hardware in question does not support toggling what link speeds are auto-negotiated. While copper Ethernet often does not have this restriction, it often exists with various fiber standards and PHYs.
The following properties are the subset of MAC framework properties that driver writers should be aware of and handle. While other properties exist in the system, driver writers should always return an error when a property not listed below is encountered. See mc_getprop(9E) and mc_setprop(9E) for more information on how to handle them.
MAC_PROP_DUPLEX
The MAC_PROP_DUPLEX property is used to indicate whether or not the link is duplex. A duplex link may have traffic flowing in both directions at the same time. The link_duplex_t is an enumeration which may be set to any of the following values:
LINK_DUPLEX_UNKNOWN
LINK_DUPLEX_HALF
LINK_DUPLEX_FULL
MAC_PROP_SPEED
The MAC_PROP_SPEED property stores the current link speed in bits per second. A link that is running at 100 Mbit/s would store the value 100000000ULL. A link that is running at 40 Gbit/s would store the value 40000000000ULL.
MAC_PROP_STATUS
The MAC_PROP_STATUS property is used to indicate the current state of the link. It indicates whether the link is up or down. The link_state_t is an enumeration which may be set to any of the following values:
LINK_STATE_UNKNOWN
LINK_STATE_DOWN
LINK_STATE_UP
MAC_PROP_MEDIA
The MAC_PROP_MEDIA property indicates the current type of media on the link. The type of media is class-specific and determined based on the m_type_ident field in the mac_register_t structure used when calling mac_register(9F). The media is always read-only. This property is not used to control how auto-negotiation should be performed; the existing speed-based properties are used instead. This property should be updated after auto-negotiation has completed. If device hardware and firmware do not provide a way to accurately determine this, then it is much better to return that the media is unknown rather than to lie or guess. A common case where this comes up is when a network card uses an SFP-based device. If the underlying negotiated type of the link isn't made available and therefore the driver can't distinguish between, say, 40GBASE-SR4 and 40GBASE-LR4, then drivers should return that the media is unknown.
Similarly many types here represent an electrical interface that is often used between a MAC and a PHY, but also for chip-to-chip connectivity or on a backplane. When connecting to a PHY these shouldn't generally be used as the user is concerned with what is actually on the link they plug in, not the internals of the device.
Currently media values are defined for Ethernet-based devices and use the enumeration mac_ether_media_t. These are defined in <sys/mac_ether.h> and generally follow the IEEE standardized physical medium dependent (PMD) layer in 802.3.
ETHER_MEDIA_UNKNOWN
ETHER_MEDIA_NONE
ETHER_MEDIA_10BASE_T
ETHER_MEDIA_10BASE_T1
ETHER_MEDIA_100BASE_TX
ETHER_MEDIA_100BASE_FX
ETHER_MEDIA_100BASE_X
ETHER_MEDIA_100BASE_T4
ETHER_MEDIA_100BASE_T2
ETHER_MEDIA_100BASE_T1
ETHER_MEDIA_100_SGMII
ETHER_MEDIA_1000BASE_X
ETHER_MEDIA_1000BASE_T
ETHER_MEDIA_1000BASE_T1
ETHER_MEDIA_1000BASE_KX
ETHER_MEDIA_1000BASE_CX
ETHER_MEDIA_1000BASE_SX
ETHER_MEDIA_1000BASE_LX
ETHER_MEDIA_1000BASE_BX
ETHER_MEDIA_1000_SGMII
ETHER_MEDIA_2500BASE_T
ETHER_MEDIA_2500BASE_KX
ETHER_MEDIA_2500BASE_X
ETHER_MEDIA_5000BASE_T
ETHER_MEDIA_5000BASE_KR
ETHER_MEDIA_10GBASE_T
ETHER_MEDIA_10GBASE_SR
ETHER_MEDIA_10GBASE_LR
ETHER_MEDIA_10GBASE_ER
ETHER_MEDIA_10GBASE_LRM
ETHER_MEDIA_10GBASE_KR
ETHER_MEDIA_10GBASE_CX4
ETHER_MEDIA_10GBASE_KX4
ETHER_MEDIA_10GBASE_CR
ETHER_MEDIA_10GBASE_AOC
ETHER_MEDIA_10GBASE_ACC
ETHER_MEDIA_10G_XAUI
ETHER_MEDIA_10G_SFI
ETHER_MEDIA_10G_XFI
ETHER_MEDIA_25GBASE_T
ETHER_MEDIA_25GBASE_SR
ETHER_MEDIA_25GBASE_LR
ETHER_MEDIA_25GBASE_ER
ETHER_MEDIA_25GBASE_KR
ETHER_MEDIA_25GBASE_CR
ETHER_MEDIA_25GBASE_AOC
ETHER_MEDIA_25GBASE_ACC
ETHER_MEDIA_25G_AUI
ETHER_MEDIA_40GBASE_T
ETHER_MEDIA_40GBASE_CR4
ETHER_MEDIA_40GBASE_KR4
ETHER_MEDIA_40GBASE_SR4
ETHER_MEDIA_40GBASE_LR4
ETHER_MEDIA_40GBASE_ER4
ETHER_MEDIA_40GBASE_LM4
ETHER_MEDIA_40GBASE_AOC4
ETHER_MEDIA_40GBASE_ACC4
ETHER_MEDIA_40G_XLAUI
ETHER_MEDIA_40G_XLPPI
ETHER_MEDIA_50GBASE_KR2
ETHER_MEDIA_50GBASE_CR2
ETHER_MEDIA_50GBASE_SR2
ETHER_MEDIA_50GBASE_LR2
ETHER_MEDIA_50GBASE_AOC2
ETHER_MEDIA_50GBASE_ACC2
ETHER_MEDIA_50GBASE_KR
ETHER_MEDIA_50GBASE_CR
ETHER_MEDIA_50GBASE_SR
ETHER_MEDIA_50GBASE_LR
ETHER_MEDIA_50GBASE_ER
ETHER_MEDIA_50GBASE_FR
ETHER_MEDIA_50GBASE_AOC
ETHER_MEDIA_50GBASE_ACC
ETHER_MEDIA_100GBASE_CR10
ETHER_MEDIA_100GBASE_SR10
ETHER_MEDIA_100GBASE_SR4
ETHER_MEDIA_100GBASE_LR4
ETHER_MEDIA_100GBASE_ER4
ETHER_MEDIA_100GBASE_KR4
ETHER_MEDIA_100GBASE_CAUI4
ETHER_MEDIA_100GBASE_CR4
ETHER_MEDIA_100GBASE_AOC4
ETHER_MEDIA_100GBASE_ACC4
ETHER_MEDIA_100GBASE_KR2
ETHER_MEDIA_100GBASE_CR2
ETHER_MEDIA_100GBASE_SR2
ETHER_MEDIA_100GBASE_KR
ETHER_MEDIA_100GBASE_CR
ETHER_MEDIA_100GBASE_SR
ETHER_MEDIA_100GBASE_DR
ETHER_MEDIA_100GBASE_LR
ETHER_MEDIA_100GBASE_FR
ETHER_MEDIA_200GBASE_CR4
ETHER_MEDIA_200GBASE_KR4
ETHER_MEDIA_200GBASE_SR4
ETHER_MEDIA_200GBASE_DR4
ETHER_MEDIA_200GBASE_FR4
ETHER_MEDIA_200GBASE_LR4
ETHER_MEDIA_200GBASE_ER4
ETHER_MEDIA_200GAUI_4
ETHER_MEDIA_200GBASE_KR2
ETHER_MEDIA_200GBASE_CR2
ETHER_MEDIA_200GBASE_SR2
ETHER_MEDIA_200GAUI_2
ETHER_MEDIA_400GBASE_KR8
ETHER_MEDIA_400GBASE_FR8
ETHER_MEDIA_400GBASE_LR8
ETHER_MEDIA_400GBASE_ER8
ETHER_MEDIA_400GAUI_8
ETHER_MEDIA_400GBASE_KR4
ETHER_MEDIA_400GBASE_CR4
ETHER_MEDIA_400GBASE_SR4
ETHER_MEDIA_400GBASE_DR4
ETHER_MEDIA_400GBASE_FR4
ETHER_MEDIA_400GAUI_4
MAC_PROP_AUTONEG
The MAC_PROP_AUTONEG property indicates whether or not the device is currently configured to perform auto-negotiation. A value of 0 indicates that auto-negotiation is disabled. A non-zero value indicates that auto-negotiation is enabled. Devices should generally default to enabling auto-negotiation.
When getting this property, the device driver should return the current state. When setting this property, if the device supports operating in the requested mode, then the device driver should reset the link to negotiate to the new speed after updating any internal registers.
MAC_PROP_MTU
The MAC_PROP_MTU property determines the maximum transmission unit (MTU). This indicates the maximum size packet that the device can transmit, ignoring its own headers. For an Ethernet device, this would exclude the size of the Ethernet header and any VLAN headers that would be placed. It is up to the driver to ensure that any MTU value that it accepts, when added to its margin and header sizes, does not exceed its maximum frame size.
By default, drivers for Ethernet should initialize this value and the MTU to 1500. When getting this property, the driver should return its current recorded MTU. When setting this property, the driver should first validate that it is within the device's valid range and then it must call mac_maxsdu_update(9F). Note that the call may fail. If the call completes successfully, the driver should update the hardware with the new value of the MTU and perform any other work needed to handle it.
If the device does not support changing the MTU after the device's mc_start(9E) entry point has been called, then driver writers should return EBUSY.
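The range check described above can be sketched in plain C as follows. This is only an illustration: the ex_ names and constants are hypothetical, and it assumes the margin is a single VLAN tag and that the device's maximum frame size includes the FCS. A real mc_setprop(9E) implementation would follow a successful check with a call to mac_maxsdu_update(9F).

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, not taken from any MAC framework header. */
#define EX_ETHER_HDR_LEN    14  /* destination + source + ethertype */
#define EX_VLAN_TAG_LEN     4   /* one 802.1Q tag (the assumed margin) */
#define EX_FCS_LEN          4   /* frame check sequence */

/*
 * Return true if the requested MTU, once the Ethernet header, VLAN margin
 * and FCS are added, still fits in the hardware's maximum frame size.
 */
static bool
ex_mtu_fits(uint32_t mtu, uint32_t hw_max_frame)
{
    const uint32_t overhead =
        EX_ETHER_HDR_LEN + EX_VLAN_TAG_LEN + EX_FCS_LEN;

    if (mtu == 0 || hw_max_frame < overhead)
        return (false);
    return (mtu <= hw_max_frame - overhead);
}
```

With these assumptions, a device whose maximum frame size is 1522 bytes accepts an MTU of 1500 and rejects 1501.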
MAC_PROP_FLOWCTRL
The MAC_PROP_FLOWCTRL property manages the configuration of pause frames as part of Ethernet flow control. Note, this only describes what this device will advertise. What is actually enabled may be different and is subject to the rules of auto-negotiation. The link_flowctrl_t is an enumeration that may be set to one of the following values:
LINK_FLOWCTRL_NONE
LINK_FLOWCTRL_RX
LINK_FLOWCTRL_TX
LINK_FLOWCTRL_BI
When getting this property, the device driver should return the way that it has configured the device, not what the device has actually negotiated. When setting the property, it should update the hardware and allow the link to potentially perform auto-negotiation again.
MAC_PROP_EN_FEC_CAP
The MAC_PROP_EN_FEC_CAP property indicates which Forward Error Correction (FEC) code is advertised by the device.
The link_fec_t is an enumeration that may be a combination of the following bit values:
LINK_FEC_NONE
LINK_FEC_AUTO
LINK_FEC_AUTO cannot be set along with any of the other values. This is the default setting the device driver should use.
LINK_FEC_RS
LINK_FEC_BASE_R
When setting the property, the device driver should update the hardware with the requested coding or combination of codings. If a particular combination of codings is not supported by the hardware, the device driver should return EINVAL. When retrieving this property, the device driver should return the current value of the property.
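The validation rules above might be sketched as follows. The EX_FEC_ bit values are stand-ins chosen for this example; a driver would use the link_fec_t values from the system headers. The hw_supported mask is likewise a hypothetical description of what a given device can do.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in bit values for illustration only. */
#define EX_FEC_NONE     (1u << 0)
#define EX_FEC_AUTO     (1u << 1)
#define EX_FEC_RS       (1u << 2)
#define EX_FEC_BASE_R   (1u << 3)
#define EX_FEC_ALL \
    (EX_FEC_NONE | EX_FEC_AUTO | EX_FEC_RS | EX_FEC_BASE_R)

/*
 * Validate a requested FEC setting against what the hardware supports.
 * Returns 0 if the request is acceptable and EINVAL otherwise.
 */
static int
ex_fec_check(uint32_t req, uint32_t hw_supported)
{
    /* Reject empty requests and unknown bits. */
    if (req == 0 || (req & ~EX_FEC_ALL) != 0)
        return (EINVAL);
    /* AUTO cannot be combined with any other value. */
    if ((req & EX_FEC_AUTO) != 0 && req != EX_FEC_AUTO)
        return (EINVAL);
    /* Every requested coding must be supported by the hardware. */
    if ((req & ~hw_supported) != 0)
        return (EINVAL);
    return (0);
}
```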
MAC_PROP_ADV_FEC_CAP
The MAC_PROP_ADV_FEC_CAP property has the same values as MAC_PROP_EN_FEC_CAP. The property indicates which Forward Error Correction (FEC) code has been negotiated over the link.
The remaining properties are all about various auto-negotiation link speeds. They fall into two different buckets: properties with _ADV_ in the name and properties with _EN_ in the name. For any given supported speed, there is one of each. The _EN_ set of properties are read/write properties that control what should be advertised by the device. When these are retrieved, they should return the current value of the property. When they are set, they should change how the hardware advertises the specific speed and trigger any kind of link reset and auto-negotiation, if enabled, to occur.
The _ADV_ set of properties are read-only properties. They are meant to reflect what has actually been negotiated. These may be different from the _EN_ family of properties, especially when different power management settings are at play.
See the Link Speed and Auto-negotiation section for more information.
The properties are ordered in increasing link speed:
MAC_PROP_ADV_10HDX_CAP
The MAC_PROP_ADV_10HDX_CAP property describes whether or not 10 Mbit/s half-duplex support is advertised.
MAC_PROP_EN_10HDX_CAP
The MAC_PROP_EN_10HDX_CAP property describes whether or not 10 Mbit/s half-duplex support is enabled.
MAC_PROP_ADV_10FDX_CAP
The MAC_PROP_ADV_10FDX_CAP property describes whether or not 10 Mbit/s full-duplex support is advertised.
MAC_PROP_EN_10FDX_CAP
The MAC_PROP_EN_10FDX_CAP property describes whether or not 10 Mbit/s full-duplex support is enabled.
MAC_PROP_ADV_100HDX_CAP
The MAC_PROP_ADV_100HDX_CAP property describes whether or not 100 Mbit/s half-duplex support is advertised.
MAC_PROP_EN_100HDX_CAP
The MAC_PROP_EN_100HDX_CAP property describes whether or not 100 Mbit/s half-duplex support is enabled.
MAC_PROP_ADV_100FDX_CAP
The MAC_PROP_ADV_100FDX_CAP property describes whether or not 100 Mbit/s full-duplex support is advertised.
MAC_PROP_EN_100FDX_CAP
The MAC_PROP_EN_100FDX_CAP property describes whether or not 100 Mbit/s full-duplex support is enabled.
MAC_PROP_ADV_100T4_CAP
The MAC_PROP_ADV_100T4_CAP property describes whether or not 100 Mbit/s Ethernet using the 100BASE-T4 standard is advertised.
MAC_PROP_EN_100T4_CAP
The MAC_PROP_EN_100T4_CAP property describes whether or not 100 Mbit/s Ethernet using the 100BASE-T4 standard is enabled.
MAC_PROP_ADV_1000HDX_CAP
The MAC_PROP_ADV_1000HDX_CAP property describes whether or not 1 Gbit/s half-duplex support is advertised.
MAC_PROP_EN_1000HDX_CAP
The MAC_PROP_EN_1000HDX_CAP property describes whether or not 1 Gbit/s half-duplex support is enabled.
MAC_PROP_ADV_1000FDX_CAP
The MAC_PROP_ADV_1000FDX_CAP property describes whether or not 1 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_1000FDX_CAP
The MAC_PROP_EN_1000FDX_CAP property describes whether or not 1 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_2500FDX_CAP
The MAC_PROP_ADV_2500FDX_CAP property describes whether or not 2.5 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_2500FDX_CAP
The MAC_PROP_EN_2500FDX_CAP property describes whether or not 2.5 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_5000FDX_CAP
The MAC_PROP_ADV_5000FDX_CAP property describes whether or not 5.0 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_5000FDX_CAP
The MAC_PROP_EN_5000FDX_CAP property describes whether or not 5.0 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_10GFDX_CAP
The MAC_PROP_ADV_10GFDX_CAP property describes whether or not 10 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_10GFDX_CAP
The MAC_PROP_EN_10GFDX_CAP property describes whether or not 10 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_40GFDX_CAP
The MAC_PROP_ADV_40GFDX_CAP property describes whether or not 40 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_40GFDX_CAP
The MAC_PROP_EN_40GFDX_CAP property describes whether or not 40 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_100GFDX_CAP
The MAC_PROP_ADV_100GFDX_CAP property describes whether or not 100 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_100GFDX_CAP
The MAC_PROP_EN_100GFDX_CAP property describes whether or not 100 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_200GFDX_CAP
The MAC_PROP_ADV_200GFDX_CAP property describes whether or not 200 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_200GFDX_CAP
The MAC_PROP_EN_200GFDX_CAP property describes whether or not 200 Gbit/s full-duplex support is enabled.
MAC_PROP_ADV_400GFDX_CAP
The MAC_PROP_ADV_400GFDX_CAP property describes whether or not 400 Gbit/s full-duplex support is advertised.
MAC_PROP_EN_400GFDX_CAP
The MAC_PROP_EN_400GFDX_CAP property describes whether or not 400 Gbit/s full-duplex support is enabled.
In addition to the defined properties above, drivers are allowed to define private properties. These private properties are device-specific properties. All private properties share the same constant, MAC_PROP_PRIVATE. Properties are distinguished by a name, which is a character string. The list of such private properties is defined when registering with mac in the m_priv_props member of the mac_register(9S) structure.
The driver may define whatever semantics it wants for these private properties. They will not be listed when running dladm(8), unless explicitly requested by name. All such properties should start with a leading underscore character and then consist of alphanumeric ASCII characters and additional underscores or hyphens.
Properties of type MAC_PROP_PRIVATE
may
show up in all three property related entry points:
mc_propinfo(9E),
mc_getprop(9E), and
mc_setprop(9E). Device drivers
should tell the different properties apart by using the
strcmp(9F) function to compare it to
the set of properties that it knows about. When encountering properties that
it doesn't know, it should treat them like all other unknown properties.
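The naming convention and the name-based dispatch described above can be sketched as follows. This is a userland illustration using the standard C library; in a driver the comparison would use strcmp(9F). The property names _rx_copy_thresh and _tx_copy_thresh, and all ex_ identifiers, are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Check the convention: a leading '_', then only [A-Za-z0-9_-]. */
static bool
ex_priv_name_ok(const char *name)
{
    if (name == NULL || name[0] != '_')
        return (false);
    for (const char *p = name + 1; *p != '\0'; p++) {
        if (!isalnum((unsigned char)*p) && *p != '_' && *p != '-')
            return (false);
    }
    return (true);
}

/* Hypothetical private properties known to this example driver. */
enum ex_priv_prop { EX_PRIV_UNKNOWN, EX_PRIV_RX_COPY, EX_PRIV_TX_COPY };

/* Map a property name to an internal identifier by string comparison. */
static enum ex_priv_prop
ex_priv_lookup(const char *name)
{
    if (strcmp(name, "_rx_copy_thresh") == 0)
        return (EX_PRIV_RX_COPY);
    if (strcmp(name, "_tx_copy_thresh") == 0)
        return (EX_PRIV_TX_COPY);
    /* Treat anything else like any other unknown property. */
    return (EX_PRIV_UNKNOWN);
}
```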
The MAC framework defines a couple different sets of statistics which are based on various standards for devices to implement. Statistics are retrieved through the mc_getstat(9E) entry point. There are both statistics that are required for all devices and then there is a separate set of Ethernet specific statistics. Not all devices will support every statistic. In many cases, several device registers will need to be combined to create the proper stat.
In general, if the device is not keeping track of these statistics, then it is recommended that the driver store these values as a uint64_t to ensure that overflow does not occur.
If a device does not support a specific statistic, then it is fine to return that it is not supported. The same should be used for unrecognized statistics. See mc_getstat(9E) for more information on the proper way to handle these.
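One way to keep a 64-bit software total and to combine several counters into a single statistic is sketched below. It assumes a free-running 32-bit hardware counter that wraps around rather than clearing on read, and that the counter is sampled often enough that it wraps at most once between readings. All ex_ names are hypothetical.

```c
#include <stdint.h>

/* Software copy of one statistic backed by a 32-bit hardware counter. */
typedef struct ex_stat {
    uint64_t    es_total;   /* running 64-bit total */
    uint32_t    es_last;    /* last raw register value seen */
} ex_stat_t;

/*
 * Fold a new raw reading into the 64-bit total. Unsigned 32-bit
 * subtraction yields the correct delta even if the register wrapped
 * once since the previous reading.
 */
static void
ex_stat_update(ex_stat_t *st, uint32_t raw)
{
    st->es_total += (uint32_t)(raw - st->es_last);
    st->es_last = raw;
}

/* Combine two per-cause counters into the single value that is reported. */
static uint64_t
ex_stat_sum(const ex_stat_t *a, const ex_stat_t *b)
{
    return (a->es_total + b->es_total);
}
```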
The following statistics are based on MIB-II statistics from both RFC 1213 and RFC 1573.
MAC_STAT_IFSPEED
MAC_STAT_MULTIRCV
MAC_STAT_BRDCSTRCV
MAC_STAT_MULTIXMT
MAC_STAT_BRDCSTXMT
MAC_STAT_NORCVBUF
MAC_STAT_IERRORS
MAC_STAT_UNKNOWNS
MAC_STAT_NOXMTBUF
MAC_STAT_OERRORS
MAC_STAT_COLLISIONS
MAC_STAT_RBYTES
MAC_STAT_IPACKETS
MAC_STAT_OBYTES
MAC_STAT_OPACKETS
MAC_STAT_UNDERFLOWS
MAC_STAT_OVERFLOWS
The following statistics are specific to Ethernet devices. They refer to values from RFC 1643 and include various MII/GMII specific stats. Many of these are also defined in IEEE 802.3.
ETHER_STAT_ADV_CAP_1000FDX
ETHER_STAT_ADV_CAP_1000HDX
ETHER_STAT_ADV_CAP_100FDX
ETHER_STAT_ADV_CAP_100GFDX
ETHER_STAT_ADV_CAP_100HDX
ETHER_STAT_ADV_CAP_100T4
ETHER_STAT_ADV_CAP_10FDX
ETHER_STAT_ADV_CAP_10GFDX
ETHER_STAT_ADV_CAP_10HDX
ETHER_STAT_ADV_CAP_2500FDX
ETHER_STAT_ADV_CAP_40GFDX
ETHER_STAT_ADV_CAP_5000FDX
ETHER_STAT_ADV_CAP_ASMPAUSE
ETHER_STAT_ADV_CAP_AUTONEG
ETHER_STAT_ADV_CAP_PAUSE
ETHER_STAT_ADV_REMFAULT
ETHER_STAT_ALIGN_ERRORS
ETHER_STAT_CAP_1000FDX
ETHER_STAT_CAP_1000HDX
ETHER_STAT_CAP_100FDX
ETHER_STAT_CAP_100GFDX
ETHER_STAT_CAP_100HDX
ETHER_STAT_CAP_100T4
ETHER_STAT_CAP_10FDX
ETHER_STAT_CAP_10GFDX
ETHER_STAT_CAP_10HDX
ETHER_STAT_CAP_2500FDX
ETHER_STAT_CAP_40GFDX
ETHER_STAT_CAP_5000FDX
ETHER_STAT_CAP_ASMPAUSE
ETHER_STAT_CAP_AUTONEG
ETHER_STAT_CAP_PAUSE
ETHER_STAT_CAP_REMFAULT
ETHER_STAT_CARRIER_ERRORS
ETHER_STAT_DEFER_XMTS
ETHER_STAT_EX_COLLISIONS
ETHER_STAT_FCS_ERRORS
ETHER_STAT_FIRST_COLLISIONS
ETHER_STAT_JABBER_ERRORS
ETHER_STAT_LINK_ASMPAUSE
ETHER_STAT_LINK_AUTONEG
ETHER_STAT_LINK_DUPLEX
See MAC_PROP_DUPLEX.
ETHER_STAT_LINK_PAUSE
ETHER_STAT_LP_CAP_1000FDX
ETHER_STAT_LP_CAP_1000HDX
ETHER_STAT_LP_CAP_100FDX
ETHER_STAT_LP_CAP_100GFDX
ETHER_STAT_LP_CAP_100HDX
ETHER_STAT_LP_CAP_100T4
ETHER_STAT_LP_CAP_10FDX
ETHER_STAT_LP_CAP_10GFDX
ETHER_STAT_LP_CAP_10HDX
ETHER_STAT_LP_CAP_2500FDX
ETHER_STAT_LP_CAP_40GFDX
ETHER_STAT_LP_CAP_5000FDX
ETHER_STAT_LP_CAP_ASMPAUSE
ETHER_STAT_LP_CAP_AUTONEG
ETHER_STAT_LP_CAP_PAUSE
ETHER_STAT_LP_CAP_REMFAULT
ETHER_STAT_MACRCV_ERRORS
ETHER_STAT_MACXMT_ERRORS
ETHER_STAT_MULTI_COLLISIONS
ETHER_STAT_SQE_ERRORS
ETHER_STAT_TOOLONG_ERRORS
ETHER_STAT_TOOSHORT_ERRORS
ETHER_STAT_TX_LATE_COLLISIONS
ETHER_STAT_XCVR_ADDR
ETHER_STAT_XCVR_ID
ETHER_STAT_XCVR_INUSE
See MAC_PROP_MEDIA above. These definitions are compatible with the older subset of XCVR_* macros.
In addition to the defined statistics above, if the device driver maintains additional statistics or the device provides additional statistics, it should create its own kstats through the kstat_create(9F) function to allow operators to observe them.
One of the important things that a device driver must do is lay out DMA memory, generally in a ring of descriptors, into which received Ethernet frames will be placed. When performing this, there are a few things that drivers should generally do:
As a solution to this, the driver should program the device to start placing the received Ethernet frame at two bytes off of the start of the DMA buffer. This will make sure that no matter whether or not VLAN tags are present, that the IP header will be 4-byte aligned.
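The arithmetic behind the two-byte offset can be checked with a short sketch. It assumes the standard 14-byte Ethernet header, a single 4-byte VLAN tag, and a DMA buffer that itself starts on at least a 4-byte boundary. The ex_ names are hypothetical.

```c
#include <stdint.h>

#define EX_ETHER_HDR_LEN    14  /* untagged Ethernet header */
#define EX_VLAN_TAG_LEN     4   /* one 802.1Q tag */
#define EX_RX_ALIGN_OFFSET  2   /* offset programmed into the device */

/* Offset of the IP header from the start of the DMA buffer. */
static uint32_t
ex_ip_hdr_offset(uint32_t buf_offset, int has_vlan)
{
    return (buf_offset + EX_ETHER_HDR_LEN +
        (has_vlan ? EX_VLAN_TAG_LEN : 0));
}
```

With the offset applied, the IP header lands at byte 16 for untagged frames and byte 20 for tagged frames, both multiples of four. Without it, the header would land at byte 14 or 18.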
Device drivers are the first line of defense for dealing with broken devices and bugs in their firmware. While most devices will rarely fail, it is important that when designing and implementing the device driver that particular attention is paid in the design with respect to RAS (Reliability, Availability, and Serviceability). While everything described in this section is optional, it is highly recommended that all new device drivers follow these guidelines.
The Fault Management Architecture (FMA) provides facilities for detecting and reporting various classes of defects and faults. Specifically for networking device drivers, issues that should be detected and reported include:
All such errors fall into three primary categories:
Drivers should initialize support for the fault management framework by calling ddi_fm_init(9F) from their attach(9E) routine. By registering with the fault management framework, a device driver is given the chance to detect and notice transport errors as well as report other errors that exist. While a device driver does not need to indicate that it is capable of all of the capabilities described in ddi_fm_init(9F), we suggest that device drivers at least register the DDI_FM_EREPORT_CAPABLE capability so as to allow the driver to report issues that it detects.
If the driver registers with the fault management framework during its attach(9E) entry point, it must call ddi_fm_fini(9F) during its detach(9E) entry point.
Many modern networking devices leverage PCI or PCI Express. As such, there are two primary ways that device drivers access data: they either memory map device registers and use routines like ddi_get8(9F) and ddi_put8(9F) or they use direct memory access (DMA). New device drivers should always enable checking of the transport layer by marking their support in the ddi_device_acc_attr(9S) structure and using routines like ddi_fm_acc_err_get(9F) and ddi_fm_dma_err_get(9F) to detect if errors have occurred.
Many devices have capabilities to announce to a device driver that a fatal correctable error or uncorrectable error has occurred. Other devices have the ability to indicate that various physical issues have occurred such as a fan failing or a temperature sensor having fired.
Drivers should wire themselves to receive notifications when these events occur. The means and capabilities will vary from device to device. For example, some devices will generate information about these notifications through special interrupts. Other devices may have a register that software can poll. In the cases where polling is required, driver writers should try not to poll too frequently and should generally only poll when the device is actively being used, e.g. between calls to the mc_start(9E) and mc_stop(9E) entry points.
One of the primary responsibilities of a hardened device driver is to perform transmit stall detection. The core idea behind transmit stall detection is that the driver should record the last time it saw evidence that data was successfully transmitted. Most devices should be transmitting data on a regular basis as long as the link is up. If a device is not, then this may indicate that the device is stuck and needs to be reset. At this time, the MAC framework does not provide any resources for performing these checks; however, using routines such as timeout(9F) to periodically check each transmit ring's last completion time while something is actively being transmitted may be a reasonable starting point.
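The bookkeeping for such a check might look like the following sketch. It models only the per-ring state and the decision. The ex_ names are hypothetical, timestamps are assumed to come from a monotonic clock, and the periodic invocation (for example from a timeout(9F) callback) and any locking are left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct ex_tx_ring {
    uint32_t    etr_pending;    /* descriptors given to hardware, not done */
    uint64_t    etr_last_done;  /* time of last completion or first enqueue */
} ex_tx_ring_t;

/* Called when a frame is handed to the hardware. */
static void
ex_tx_ring_enqueue(ex_tx_ring_t *r, uint64_t now)
{
    /* Restart the clock when the ring goes from idle to busy. */
    if (r->etr_pending == 0)
        r->etr_last_done = now;
    r->etr_pending++;
}

/* Called when the hardware reports ndone completed descriptors. */
static void
ex_tx_ring_complete(ex_tx_ring_t *r, uint32_t ndone, uint64_t now)
{
    r->etr_pending -= (ndone < r->etr_pending) ? ndone : r->etr_pending;
    r->etr_last_done = now;
}

/* Periodic check: work is outstanding but nothing finished for too long. */
static bool
ex_tx_ring_stalled(const ex_tx_ring_t *r, uint64_t now, uint64_t limit)
{
    if (r->etr_pending == 0)
        return (false);
    return (now - r->etr_last_done > limit);
}
```

Resetting the timestamp on the idle-to-busy transition keeps a long idle period from being mistaken for a stall as soon as the next frame is queued.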
Each device is programmed in different ways. Some devices are programmed through asynchronous commands while others are programmed by writing directly to memory mapped registers. If a device receives asynchronous replies to commands, then the device driver should set reasonable timeouts for all such commands and plan on detecting them. If a timeout occurs, the driver should presume that there is an issue with the hardware and proceed to abort the command or reset the device.
Many devices do not have such a communication mechanism. However, whenever there is some activity where the device driver must wait, then it should be prepared for the fact that the device may never get back to it and react appropriately by performing some kind of device reset.
When any of the above categories of errors has been triggered, the behavior that the device driver should take depends on the kind of error. If a fatal error has occurred, for example, a transport error, a detected transmit stall, or an uncorrectable error indicated by the device, then it is important that the driver take the following steps:
DDI_SERVICE_LOST
DDI_SERVICE_RESTORED
When a non-fatal error occurs, then the device driver should submit an ereport and should optionally mark the device degraded using ddi_fm_service_impact(9F) with the DDI_SERVICE_DEGRADED value depending on the nature of the problem that has occurred.
Device drivers should never make the decision to remove a device from service based on errors that have occurred nor should they panic the system. Rather, the device driver should always try to notify the operating system with various ereports and allow its policy decisions to occur. The decision to retire a device lies in the hands of the fault management architecture. It knows more about the operator's intent and the surrounding system's state than the device driver itself does and it will make the call to offline and retire the device if it is required.
When resetting a device, a device driver must exercise caution. If a device driver has not been written to plan for a device reset, then it may not correctly restore the device's state after such a reset. Such state should be stored in the instance's private state data as the MAC framework does not know about device resets and will not inform the device again about the expected, programmed state.
One wrinkle with device resets is that many networking cards show up as multiple PCI functions on a single device. For example, each port may show up as a separate function and thus have a separate instance of the device driver attached. When resetting a function, device driver writers should carefully read the device programming manuals and verify whether a reset impacts only the stalled function or all functions across the device.
If the only way to reset a given function is through the device, then this may require more coordination and work on the part of the device driver to ensure that all the other instances are correctly restored. In cases where this occurs, some devices offer ways of injecting interrupts onto those other functions to notify them that this is occurring.
The networking stack manages framed data through the use of the mblk(9S) structure. The mblk allows for a single message to be made up of individual blocks. Each part is linked together through its b_cont member. However, it also allows for multiple messages to be chained together through the use of the b_next member. While the networking stack works with these structures, device drivers generally work with DMA regions. There are two different strategies that device drivers use for handling these two different cases: copying and binding.
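The two kinds of linkage can be illustrated with a simplified stand-in for the message block structure. The ex_mblk_t type below is hypothetical and carries only the members discussed here plus the read pointer; real drivers use mblk_t as described in mblk(9S).

```c
#include <stddef.h>

/* Simplified stand-in for mblk_t with only a few members. */
typedef struct ex_mblk {
    struct ex_mblk  *b_next;    /* next message in the chain */
    struct ex_mblk  *b_cont;    /* next block of this message */
    unsigned char   *b_rptr;    /* first valid byte of data */
    unsigned char   *b_wptr;    /* one past the last valid byte */
} ex_mblk_t;

/* Total data bytes in a single message: follow b_cont. */
static size_t
ex_msg_len(const ex_mblk_t *mp)
{
    size_t len = 0;

    for (; mp != NULL; mp = mp->b_cont)
        len += (size_t)(mp->b_wptr - mp->b_rptr);
    return (len);
}

/* Number of messages in a chain: follow b_next. */
static size_t
ex_chain_count(const ex_mblk_t *mp)
{
    size_t n = 0;

    for (; mp != NULL; mp = mp->b_next)
        n++;
    return (n);
}
```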
The first way that device drivers handle interfacing between the two is by having two separate regions of memory. One part is memory which has been allocated for DMA through a call to ddi_dma_mem_alloc(9F) and the other is memory associated with the memory block.
In this case, a driver will use bcopy(9F) to copy memory between the two distinct regions. When transmitting a packet, it will copy the memory from the mblk_t to the DMA region. When receiving data, it will allocate a mblk_t through the allocb(9F) routine, copy the memory across with bcopy(9F), and then increment the mblk_t's b_wptr member by the number of bytes copied.
If, when receiving, memory is not available for a new message block, then the frame should be skipped and effectively dropped. A kstat should be bumped when such an occasion occurs.
An alternative approach to copying data is to use DMA binding. When using DMA binding, the OS takes care of mapping between DMA memory and normal device memory. The exact process is a bit different between transmit and receive.
When transmitting, a device driver has an mblk_t and needs to call the ddi_dma_addr_bind_handle(9F) function to bind it to an already existing DMA handle. At that point, it will receive various DMA cookies that it can use to obtain the addresses to program the device with for transmitting data. Once the transmit is done, the driver must then make sure to call freemsg(9F) to release the data. It must not call freemsg(9F) before it receives an interrupt from the device indicating that the data has been transmitted, otherwise it risks sending arbitrary kernel memory.
When receiving data, the device can perform a similar operation. First, it must bind the DMA memory into the kernel's virtual memory address space through a call to the ddi_dma_addr_bind_handle(9F) function if it has not already. Once it has, it must then call desballoc(9F) to try and create a new mblk_t which leverages the associated memory. It can then pass that mblk_t up to the stack.
When deciding which of these options to use, there are many different considerations that must be made. The answer as to whether to bind memory or to copy data is not always simple.
The first thing to remember is that DMA resources may be finite on a given platform. Consider the case of receiving data. A device driver that binds one of its receive descriptors may not get it back for quite some time as it may be used by the kernel until an application actually consumes it. Device drivers that try to bind memory for receive often work with the constraint that they must be able to replace that DMA memory with another DMA descriptor. If it were not replaced, then eventually the device would not be able to receive additional data into the ring.
On the other hand, particularly for larger frames, copying every packet from one buffer to another can be a source of additional latency and memory waste in the system. For larger copies, the cost of copying may dwarf any potential cost of performing DMA binding.
For device driver authors that are unsure of what to do, they should first employ the copying method to simplify the act of writing the device driver. The copying method is simpler and also allows the device driver author not to worry about allocated DMA memory that is still outstanding when it is asked to unload.
If device driver writers are worried about the cost, it is recommended to make the decision as to whether or not to copy or bind DMA data a separate private property for both transmitting and receiving. That private property should indicate the size of the received frame at which to switch from one format to the other. This way, data can be gathered to determine what the impact of each method is on a given platform.
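A size-based switch of this kind can be sketched as follows. The ex_ names are hypothetical, and copy_thresh stands for the value of whatever private property the driver exposes for this purpose.

```c
#include <stddef.h>

typedef enum ex_dma_method {
    EX_DMA_COPY,    /* copy between the mblk_t and a DMA buffer */
    EX_DMA_BIND     /* bind the memory for DMA directly */
} ex_dma_method_t;

/*
 * Frames at or below the tunable threshold are copied; larger frames
 * are bound. Setting the threshold to the largest possible value
 * results in always copying.
 */
static ex_dma_method_t
ex_dma_choose(size_t frame_len, size_t copy_thresh)
{
    return (frame_len <= copy_thresh ? EX_DMA_COPY : EX_DMA_BIND);
}
```

Keeping separate thresholds for transmit and receive lets each direction be tuned independently while measurements are gathered.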
dlpi(4P), driver.conf(5), ieee802.3(7), dladm(8), _fini(9E), _info(9E), _init(9E), attach(9E), close(9E), detach(9E), mac_capab_led(9E), mac_capab_rings(9E), mac_capab_transceiver(9E), mc_close(9E), mc_getcapab(9E), mc_getprop(9E), mc_getstat(9E), mc_multicst(9E), mc_open(9E), mc_propinfo(9E), mc_setpromisc(9E), mc_setprop(9E), mc_start(9E), mc_stop(9E), mc_tx(9E), mc_unicst(9E), open(9E), allocb(9F), bcopy(9F), ddi_dma_addr_bind_handle(9F), ddi_dma_mem_alloc(9F), ddi_fm_acc_err_get(9F), ddi_fm_dma_err_get(9F), ddi_fm_ereport_post(9F), ddi_fm_fini(9F), ddi_fm_init(9F), ddi_fm_service_impact(9F), ddi_get8(9F), ddi_put8(9F), desballoc(9F), freemsg(9F), kstat_create(9F), mac_alloc(9F), mac_devt_to_instance(9F), mac_fini_ops(9F), mac_free(9F), mac_getinfo(9F), mac_hcksum_get(9F), mac_hcksum_set(9F), mac_init_ops(9F), mac_link_update(9F), mac_lso_get(9F), mac_maxsdu_update(9F), mac_private_minor(9F), mac_prop_info_set_default_link_flowctrl(9F), mac_prop_info_set_default_str(9F), mac_prop_info_set_default_uint32(9F), mac_prop_info_set_default_uint64(9F), mac_prop_info_set_default_uint8(9F), mac_prop_info_set_perm(9F), mac_prop_info_set_range_uint32(9F), mac_register(9F), mac_rx(9F), mac_unregister(9F), mod_install(9F), mod_remove(9F), strcmp(9F), timeout(9F), cb_ops(9S), ddi_device_acc_attr(9S), dev_ops(9S), mac_callbacks(9S), mac_register(9S), mblk(9S), modldrv(9S), modlinkage(9S)
McCloghrie, K. and Rose, M., RFC 1213, Management Information Base for Network Management of TCP/IP-based internets: MIB-II, March 1991.
McCloghrie, K. and Kastenholz, F., RFC 1573, Evolution of the Interfaces Group of MIB-II, January 1994.
Kastenholz, F., RFC 1643, Definitions of Managed Objects for the Ethernet-like Interface Types.
IEEE Computer Standard, IEEE 802.3, IEEE Standard for Ethernet, 2022.
July 17, 2023 | OmniOS |