|MAC(9E)||Driver Entry Points||MAC(9E)|
The main interface with MAC is through a series of callbacks defined in a mac_callbacks(9S) structure. These callbacks control all aspects of the device, including sending data, getting and setting properties, controlling MAC address filters, and managing promiscuous mode.
The MAC framework takes care of many aspects of the device driver's management. A device that uses the MAC framework does not have to worry about creating device nodes or implementing open(9E) or close(9E) routines. In addition, all of the work to interact with dlpi(7P) is taken care of automatically and transparently.
All device drivers have to define a dev_ops(9S) structure which is pointed to by a modldrv(9S) structure and the corresponding NULL-terminated modlinkage(9S) structure. The dev_ops(9S) structure should have a cb_ops(9S) structure defined for it; however, it does not need to implement any of the standard cb_ops(9S) entry points.
Normally, in a driver's _init(9E) entry point, it passes its modlinkage structure directly to mod_install(9F). To properly register with MAC, the driver must call mac_init_ops(9F) before it calls mod_install(9F). If for some reason the mod_install(9F) function fails, then the driver must be removed by a call to mac_fini_ops(9F).
Conversely, in the driver's _fini(9E) routine, it should call mac_fini_ops(9F) after it successfully calls mod_remove(9F). For an example of how to use the mac_init_ops(9F) and mac_fini_ops(9F) functions, see the examples section in mac_init_ops(9F).
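The ordering described above can be sketched as follows. This is a minimal user-space model, not the example from mac_init_ops(9F): the routines named mac_init_ops, mac_fini_ops, mod_install, and mod_remove are local stand-ins that only mimic the kernel signatures, and the example_* names and fake_* knobs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins so the sketch compiles outside the kernel. */
struct dev_ops { int unused; };
struct modlinkage { int unused; };

static int mac_ops_registered;	/* 1 while mac_init_ops() is in effect */
static int fake_install_result;	/* value the stub mod_install() returns */
static int fake_remove_result;	/* value the stub mod_remove() returns */

static void
mac_init_ops(struct dev_ops *ops, const char *name)
{
	(void) ops;
	(void) name;
	mac_ops_registered = 1;
}

static void
mac_fini_ops(struct dev_ops *ops)
{
	(void) ops;
	mac_ops_registered = 0;
}

static int
mod_install(struct modlinkage *ml)
{
	(void) ml;
	return (fake_install_result);
}

static int
mod_remove(struct modlinkage *ml)
{
	(void) ml;
	return (fake_remove_result);
}

static struct dev_ops example_dev_ops;
static struct modlinkage example_modlinkage;

/* Models _init(9E): register with MAC first, undo it if install fails. */
int
example_init(void)
{
	int ret;

	mac_init_ops(&example_dev_ops, "example");
	if ((ret = mod_install(&example_modlinkage)) != 0)
		mac_fini_ops(&example_dev_ops);
	return (ret);
}

/* Models _fini(9E): only tear down MAC state once removal succeeded. */
int
example_fini(void)
{
	int ret;

	if ((ret = mod_remove(&example_modlinkage)) == 0)
		mac_fini_ops(&example_dev_ops);
	return (ret);
}
```

Note that a failed mod_remove() leaves the MAC registration in place, since the module is still loaded at that point.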
These steps should all be taken during a device's attach(9E) entry point. It is recommended that the driver perform this sequence of steps after the device has finished its initialization of the chipset and interrupts, though interrupts should not be enabled at that point. After it calls mac_register(9F) it will start receiving callbacks from the MAC framework.
To allocate the registration structure, the driver should call mac_alloc(9F). Device drivers should generally always pass the symbol MAC_VERSION as the argument to mac_alloc(9F). Upon successful completion, the driver will receive a mac_register_t structure which it should fill in. The structure and its members are documented in mac_register(9S).
The mac_callbacks(9S) structure is not allocated as a part of the mac_register(9S) structure. In general, device drivers declare this statically. See the MAC Callbacks section for more information on how to fill it out.
Once the structure has been filled in, the driver should call mac_register(9F) to register itself with MAC. The handle that it uses to register with should be part of the driver's soft state. It will be used in various other support functions and callbacks.
If the call is successful, then the device driver should enable interrupts and finish any other initialization required. If the call to mac_register(9F) failed, then it should unwind its initialization and should return DDI_FAILURE from its attach(9E) routine.
The driver does not need to hold onto an allocated mac_register(9S) structure after it has called the mac_register(9F) function. Whether the mac_register(9F) function returns successfully or not, the driver may free its mac_register(9S) structure by calling the mac_free(9F) function.
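The registration sequence can be sketched roughly as below. This is a user-space model under stated assumptions: mac_alloc, mac_register, and mac_free are local stubs that only mimic the kernel signatures, the MAC_VERSION value is a placeholder, the mac_register_t shown carries a single member where the real structure has more, and example_softc_t is a hypothetical soft state.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space stand-ins so the sketch compiles outside the kernel. */
#define	MAC_VERSION	1	/* placeholder value, not the kernel's */
#define	DDI_SUCCESS	0
#define	DDI_FAILURE	(-1)

typedef struct mac_register {
	void *m_driver;		/* the real structure has more members */
} mac_register_t;
typedef void *mac_handle_t;

static int fake_register_result;	/* what the stub mac_register() returns */
static int outstanding_allocs;		/* mac_alloc() results not yet freed */

static mac_register_t *
mac_alloc(unsigned int version)
{
	mac_register_t *regp;

	(void) version;
	if ((regp = calloc(1, sizeof (*regp))) != NULL)
		outstanding_allocs++;
	return (regp);
}

static int
mac_register(mac_register_t *regp, mac_handle_t *mhp)
{
	if (fake_register_result == 0)
		*mhp = regp->m_driver;	/* any non-NULL token will do here */
	return (fake_register_result);
}

static void
mac_free(mac_register_t *regp)
{
	free(regp);
	outstanding_allocs--;
}

/* Hypothetical per-instance soft state. */
typedef struct example_softc {
	mac_handle_t es_mac_handle;	/* kept for later MAC calls */
	int es_intr_enabled;
} example_softc_t;

/* The MAC-related tail end of a hypothetical attach(9E) routine. */
int
example_register(example_softc_t *sc)
{
	mac_register_t *regp;
	int ret;

	if ((regp = mac_alloc(MAC_VERSION)) == NULL)
		return (DDI_FAILURE);

	regp->m_driver = sc;
	/* ... fill in the remaining members per mac_register(9S) ... */

	ret = mac_register(regp, &sc->es_mac_handle);
	mac_free(regp);		/* safe whether or not registration worked */
	if (ret != 0)
		return (DDI_FAILURE);	/* caller unwinds its initialization */

	sc->es_intr_enabled = 1;	/* only enable interrupts on success */
	return (DDI_SUCCESS);
}
```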
A device driver should make no assumptions about when the various callbacks will be called and whether or not they will be called simultaneously. For example, a device driver may be asked to transmit data through a call to its mc_tx(9E) entry point while it is being asked to get a device property through a call to its mc_getprop(9E) entry point. As such, while some calls may be serialized to the device, such as setting properties, the device driver should always presume that all of its data needs to be protected with locks. While the device is holding locks, it is safe for it to call the following MAC routines:
Any other MAC related routines should not be called with locks held, such as mac_link_update(9F) or mac_rx(9F). Other routines in the DDI may be called while locks are held; however, device driver writers should be careful about calling blocking routines while locks are held or in interrupt context, though it is generally legal to do so.
During a single interrupt, a device driver should process a fixed number of frames. For each frame the device driver should:
Once all the frames have been processed and assembled, the device driver should deliver them to the rest of the operating system by calling mac_rx(9F). The device driver should try to give as many mblk_t structures to the system at once as it can. It should not call mac_rx(9F) once for every assembled mblk_t.
The device driver must not hold any locks across the call to mac_rx(9F). When this function is called, received data will be pushed through the networking stack and some replies may be generated and given to the driver to send out.
It is not the device driver's responsibility to determine whether or not the system can keep up with a driver's delivery rate of frames. The rest of the networking stack will handle issues related to keeping up appropriately and ensure that kernel memory is not exhausted by packets that are not being processed.
Finally, the device driver should make sure that any other housekeeping activities required for the ring are taken care of such that more data can be received.
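The receive pattern described above might look roughly like the following user-space model. It assumes a reduced mblk_t with only the linkage members, a stub named mac_rx that mimics the kernel's three-argument form and simply consumes the chain, an integer flag in place of a real mutex, and a hypothetical per-interrupt budget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reduced stand-in for the STREAMS message block. */
typedef struct msgb {
	struct msgb *b_next;	/* next frame in the chain */
	struct msgb *b_cont;	/* next fragment of the same frame */
} mblk_t;

#define	EXAMPLE_RX_BUDGET	4	/* hypothetical frames per interrupt */

static int ring_lock_held;	/* models the driver's ring mutex */
static int rx_calls;		/* how often the stub mac_rx() ran */
static int rx_frames_delivered;

/* Stub: checks the no-locks rule, then consumes the whole chain. */
static void
mac_rx(void *mh, void *mrh, mblk_t *chain)
{
	(void) mh;
	(void) mrh;
	assert(!ring_lock_held);
	rx_calls++;
	while (chain != NULL) {
		mblk_t *next = chain->b_next;

		free(chain);
		rx_frames_delivered++;
		chain = next;
	}
}

/* Returns the number of frames taken off the ring in this pass. */
int
example_rx_intr(int frames_pending)
{
	mblk_t *head = NULL;
	mblk_t **tailp = &head;
	int taken = 0;

	ring_lock_held = 1;
	while (taken < frames_pending && taken < EXAMPLE_RX_BUDGET) {
		mblk_t *mp = calloc(1, sizeof (*mp));

		if (mp == NULL)
			break;		/* drop: no memory for this frame */
		*tailp = mp;		/* append to the b_next chain */
		tailp = &mp->b_next;
		taken++;
	}
	ring_lock_held = 0;		/* drop the lock before delivery */

	if (head != NULL)
		mac_rx(NULL, NULL, head);	/* one call for the batch */
	return (taken);
}
```

Building the chain through a tail pointer keeps the frames in arrival order without walking the list on every append.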
At some point in the future when resources have become available again, for example after an interrupt indicating that some portion of the transmit ring has been sent, then the device driver must notify the system that it can continue transmission. To do this, the driver should call mac_tx_update(9F). After that point, the driver will receive calls to its mc_tx(9E) entry point again. As mentioned in the section on callbacks, the device driver should avoid holding any particular locks across the call to mac_tx_update(9F).
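One way to track this is a flag that records when the transmit path ran out of descriptors, checked when descriptors are reclaimed. The sketch below is a user-space model: mac_tx_update is a local stub with the kernel's one-argument form, the lock is an integer flag, and the example_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void *mac_handle_t;

static int tx_lock_held;	/* models the driver's transmit mutex */
static int tx_update_calls;	/* how often the stub was invoked */

/* Stub: verifies that the caller dropped its lock first. */
static void
mac_tx_update(mac_handle_t mh)
{
	(void) mh;
	assert(!tx_lock_held);
	tx_update_calls++;
}

typedef struct example_tx_ring {
	mac_handle_t etr_mac_handle;
	unsigned int etr_free;	/* free transmit descriptors */
	int etr_blocked;	/* set when transmit ran out of descriptors */
} example_tx_ring_t;

/* Returns 1 if a frame was queued, 0 if the ring was full. */
int
example_tx_one(example_tx_ring_t *r)
{
	int queued = 0;

	tx_lock_held = 1;
	if (r->etr_free == 0) {
		r->etr_blocked = 1;	/* remember to notify MAC later */
	} else {
		r->etr_free--;
		queued = 1;
	}
	tx_lock_held = 0;
	return (queued);
}

/* Called from the transmit-completion interrupt path. */
void
example_tx_reclaim(example_tx_ring_t *r, unsigned int completed)
{
	int notify;

	tx_lock_held = 1;
	r->etr_free += completed;
	notify = (r->etr_blocked && r->etr_free > 0);
	if (notify)
		r->etr_blocked = 0;
	tx_lock_held = 0;

	if (notify)
		mac_tx_update(r->etr_mac_handle);	/* lock not held */
}
```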
Logically, there are two different sets of things that the device driver needs to keep track of while it's operating:
By default, when a link first comes up, the device driver should generally configure the link to support the common set of speeds and perform auto-negotiation.
A user can control what speeds a device advertises via auto-negotiation and whether or not it performs auto-negotiation at all by using a series of properties that have _EN_ in the name. These are read/write properties and there is one for each speed supported in the operating system. For a full list of them, see the PROPERTIES section.
In addition to these properties, there is a corresponding set of properties with _ADV_ in the name. These are similar to the _EN_ family of properties, but they are read-only and indicate what the device has actually negotiated. While they are generally similar to the _EN_ family of properties, they may change depending on power settings. See the Ethernet Link Properties section in dladm(1M) for more information.
It's worth discussing how these different values get used throughout the different entry points. The first entry point to consider is the mc_propinfo(9E) entry point. For a given speed, the driver should consult whether or not the hardware supports this speed. If it does, it should fill in the default value that the hardware takes and indicate whether or not the property is writable. This holds for both the _EN_ and _ADV_ family of properties.
The next entry point is mc_getprop(9E). Here, the device driver should first consult whether the given speed is supported. If it is not, then the driver should return ENOTSUP. If it is, then it should return the current value of the property.
The last property entry point is the mc_setprop(9E) entry point. Here, the same logic applies. Before the driver considers whether or not the property is writable, it should first check whether or not it's a supported property. If it's not, then it should return ENOTSUP. Otherwise, it should proceed to check whether the property is writable, and if it is and the value is valid, then it should update the property and restart the link's negotiation.
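The order of checks for the get and set paths can be modeled as below. This is a simplified sketch, not the mc_getprop(9E) or mc_setprop(9E) signatures: example_speed_prop_t is a hypothetical record for one _EN_ speed property, and the error values chosen for a read-only property (ENOTSUP) and for an invalid value (EINVAL) are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical driver-side record for one _EN_ speed property. */
typedef struct example_speed_prop {
	int esp_supported;		/* hardware can run at this speed */
	int esp_writable;		/* advertisement can be toggled */
	unsigned char esp_value;	/* 0 = not advertised, 1 = advertised */
} example_speed_prop_t;

/* Get path: check support first, then report the current value. */
int
example_getprop(const example_speed_prop_t *p, unsigned char *out)
{
	if (!p->esp_supported)
		return (ENOTSUP);
	*out = p->esp_value;
	return (0);
}

/*
 * Set path: support, then writability, then value validity. On success
 * *restart is set so the caller knows to restart link negotiation.
 */
int
example_setprop(example_speed_prop_t *p, unsigned char val, int *restart)
{
	if (!p->esp_supported)
		return (ENOTSUP);
	if (!p->esp_writable)
		return (ENOTSUP);	/* assumption: error for read-only */
	if (val > 1)
		return (EINVAL);	/* assumption: error for bad value */

	p->esp_value = val;
	*restart = 1;
	return (0);
}
```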
Finally, there is the mc_getstat(9E) entry point. Several of the statistics that are queried relate to auto-negotiation and hardware capabilities. When a statistic relates to the hardware supporting a given speed, the _EN_ properties should be ignored. The only thing that should be consulted is what the hardware itself supports. Otherwise, the statistics should look at what is currently being advertised by the device.
The MAC framework will query a device for support of a capability through the mc_getcapab(9E) function. Each capability has its own constant and may have corresponding data that goes along with it and a specific structure that the device is required to fill in. Note, the set of capabilities changes over time and there are also private capabilities in the system. Several of the capabilities are used in the implementation of the MAC framework. Others, like MAC_CAPAB_RINGS, represent features that have not been stabilized and thus both API and binary compatibility for them is not guaranteed. It is important that the device driver handles unknown capabilities correctly. For more information, see mc_getcapab(9E).
The following capabilities are stable and defined in the system:
Note, the values that the driver declares in this capability indicate what it can do when it transmits data. If the driver can only verify checksums when receiving data, then it should not indicate that it supports this capability. The following set of flags may be combined through a bitwise inclusive OR:
When in a driver's transmit function, the driver will be processing a single frame. It should call mac_hcksum_get(9F) to see what checksum flags are set on it. Note that the flags that are set on it are different from the ones described above and are documented in its manual page. These flags indicate how the driver is expected to program the hardware and what checksumming is required. Not all frames will require hardware checksumming or will ask the hardware to checksum it.
If a driver supports offloading the receive checksum and verification, it should check to see what the hardware indicated was verified. The driver should then call mac_hcksum_set(9F). The flags used are different from the ones above and are discussed in detail in the mac_hcksum_set(9F) manual page. If there is no checksum information available or the driver does not support checksumming, then it should simply not call mac_hcksum_set(9F).
Note that the checksum flags should be set on the first mblk_t that makes up a given message. In other words, if multiple mblk_t structures are linked together by the b_cont member to describe a single frame, then it should only be called on the first mblk_t of that set. However, each distinct message should have the checksum bits set on it, if applicable. In other words, each mblk_t that is linked together by the b_next pointer may have checksum flags set.
It is recommended that device drivers provide a private property or driver.conf(4) property to control whether or not checksumming is enabled for both rx and tx; however, the default disposition is recommended to be enabled for both. This way if hardware bugs are found in the checksumming implementation, they can be disabled without requiring software updates. The transmit property should be checked when determining how to reply to mc_getcapab(9E) and the receive property should be checked in the context of the receive function.
t_uscalar_t             lso_flags;
lso_basic_tcp_ipv4_t    lso_basic_tcp_ipv4;
lso_basic_tcp_ipv6_t    lso_basic_tcp_ipv6;
The lso_flags member is used to indicate which members are valid and should be considered. Each flag represents a different form of LSO. The member should be set to the bitwise inclusive OR of the following values:
The lso_basic_tcp_ipv4 member is a structure with the following members:
The lso_basic_tcp_ipv6 member is a structure with the following members:
Like with checksumming, it is recommended that driver writers provide a means for disabling the support of LSO even if it is enabled by default. This deals with the case where issues that pop up for LSO may be worked around without requiring additional driver work.
Many of the properties listed below are read-only. Each property indicates whether it's read-only or it's read/write. However, driver writers may not implement the ability to set all writable properties. Many of these depend on the card itself. In particular, all properties that relate to auto-negotiation and are read/write may not be updated if the hardware in question does not support toggling what link speeds are auto-negotiated. While copper Ethernet often does not have this restriction, it often exists with various fiber standards and phys.
The following properties are the subset of MAC framework properties that driver writers should be aware of and handle. While other properties exist in the system, driver writers should always return an error when a property not listed below is encountered. See mc_getprop(9E) and mc_setprop(9E) for more information on how to handle them.
The MAC_PROP_DUPLEX property is used to indicate whether or not the link is duplex. A duplex link may have traffic flowing in both directions at the same time. The link_duplex_t is an enumeration which may be set to any of the following values:
The MAC_PROP_SPEED property stores the current link speed in bits per second. A link that is running at 100 Mbit/s would store the value 100000000ULL. A link that is running at 40 Gbit/s would store the value 40000000000ULL.
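The conversion is plain arithmetic, but a 64-bit type is needed because speeds above roughly 4.29 Gbit/s do not fit in 32 bits. A small sketch, with example_mbits_to_bps being a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a link speed in Mbit/s to the bits-per-second value stored. */
uint64_t
example_mbits_to_bps(uint64_t mbits)
{
	return (mbits * 1000000ULL);
}
```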
The MAC_PROP_STATUS property is used to indicate the current state of the link. It indicates whether the link is up or down. The link_state_t is an enumeration which may be set to any of the following values:
The MAC_PROP_AUTONEG property indicates whether or not the device is currently configured to perform auto-negotiation. A value of 0 indicates that auto-negotiation is disabled. A non-zero value indicates that auto-negotiation is enabled. Devices should generally default to enabling auto-negotiation.
When getting this property, the device driver should return the current state. When setting this property, if the device supports operating in the requested mode, then the device driver should reset the link to negotiate to the new speed after updating any internal registers.
The MAC_PROP_MTU property determines the maximum transmission unit (MTU). This indicates the maximum size packet that the device can transmit, ignoring its own headers. For an Ethernet device, this would exclude the size of the Ethernet header and any VLAN headers that would be placed. It is up to the driver to ensure that any MTU value that it accepts, once its margin and header sizes are added in, does not exceed its maximum frame size.
By default, drivers for Ethernet should initialize this value and the MTU to 1500. When getting this property, the driver should return its current recorded MTU. When setting this property, the driver should first validate that it is within the device's valid range and then it must call mac_maxsdu_update(9F). Note that the call may fail. If the call completes successfully, the driver should update the hardware with the new value of the MTU and perform any other work needed to handle it.
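The set path for the MTU can be sketched as follows. This is a user-space model: mac_maxsdu_update is a local stub that mimics the kernel's two-argument form, and the size limits, the EINVAL return for an out-of-range value, and the example_* names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef void *mac_handle_t;

/* Hypothetical device limits. */
#define	EXAMPLE_MIN_MTU		60
#define	EXAMPLE_HDR_SIZE	14	/* Ethernet header */
#define	EXAMPLE_MARGIN		4	/* room for one VLAN tag */
#define	EXAMPLE_MAX_FRAME	9216	/* largest frame the device takes */

static int fake_sdu_result;	/* what the stub returns */

/* Stub standing in for the framework call, which may fail. */
static int
mac_maxsdu_update(mac_handle_t mh, unsigned int sdu)
{
	(void) mh;
	(void) sdu;
	return (fake_sdu_result);
}

typedef struct example_dev {
	mac_handle_t ed_mac_handle;
	uint32_t ed_hw_mtu;	/* value last programmed into hardware */
} example_dev_t;

int
example_set_mtu(example_dev_t *d, uint32_t mtu)
{
	int ret;

	/* Validate against the device's range, headers included. */
	if (mtu < EXAMPLE_MIN_MTU ||
	    mtu + EXAMPLE_HDR_SIZE + EXAMPLE_MARGIN > EXAMPLE_MAX_FRAME)
		return (EINVAL);

	/* Tell the framework first; touch hardware only if it agrees. */
	if ((ret = mac_maxsdu_update(d->ed_mac_handle, mtu)) != 0)
		return (ret);

	d->ed_hw_mtu = mtu;	/* stands in for programming the device */
	return (0);
}
```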
If the device does not support changing the MTU after the device's mc_start(9E) entry point has been called, then driver writers should return
The MAC_PROP_FLOWCTRL property manages the configuration of pause frames as part of Ethernet flow control. Note, this only describes what this device will advertise. What is actually enabled may be different and is subject to the rules of auto-negotiation. The link_flowctrl_t is an enumeration that may be set to one of the following values:
When getting this property, the device driver should return the way that it has configured the device, not what the device has actually negotiated. When setting the property, it should update the hardware and allow the link to potentially perform auto-negotiation again.
The MAC_PROP_EN_FEC_CAP property indicates which Forward Error Correction (FEC) code is advertised by the device.
The link_fec_t is an enumeration that may be a combination of the following bit values:
When setting the property, it should update the hardware with the requested, or combination of requested codings. If a particular combination of codings is not supported by the hardware, the device driver should return EINVAL. When retrieving this property, the device driver should return the current value of the
The MAC_PROP_ADV_FEC_CAP has the same values as MAC_PROP_EN_FEC_CAP. The property indicates which Forward Error Correction (FEC) code has been negotiated over the link.
The remaining properties are all about various auto-negotiation link speeds. They fall into two different buckets: properties with _ADV_ in the name and properties with _EN_ in the name. For any given supported speed, there is one of each. The _EN_ set of properties are read/write properties that control what should be advertised by the device. When these are retrieved, they should return the current value of the property. When they are set, they should change how the hardware advertises the specific speed and trigger any kind of link reset and auto-negotiation, if enabled, to occur.
The _ADV_ set of properties are read-only properties. They are meant to reflect what has actually been negotiated. These may be different from the _EN_ family of properties, especially when different power management settings are at play.
See the Link Speed and Auto-negotiation section for more information.
The properties are ordered in increasing link speed:
The MAC_PROP_ADV_10HDX_CAP property describes whether or not 10 Mbit/s half-duplex support is advertised.
The MAC_PROP_EN_10HDX_CAP property describes whether or not 10 Mbit/s half-duplex support is enabled.
The MAC_PROP_ADV_10FDX_CAP property describes whether or not 10 Mbit/s full-duplex support is advertised.
The MAC_PROP_EN_10FDX_CAP property describes whether or not 10 Mbit/s full-duplex support is enabled.
The MAC_PROP_ADV_100HDX_CAP property describes whether or not 100 Mbit/s half-duplex support is advertised.
The MAC_PROP_EN_100HDX_CAP property describes whether or not 100 Mbit/s half-duplex support is enabled.
The MAC_PROP_ADV_100FDX_CAP property describes whether or not 100 Mbit/s full-duplex support is advertised.
The MAC_PROP_EN_100FDX_CAP property describes whether or not 100 Mbit/s full-duplex support is enabled.
The MAC_PROP_ADV_100T4_CAP property describes whether or not 100 Mbit/s Ethernet using the 100BASE-T4 standard is advertised.
The MAC_PROP_EN_100T4_CAP property describes whether or not 100 Mbit/s Ethernet using the 100BASE-T4 standard is enabled.
The MAC_PROP_ADV_1000HDX_CAP property describes whether or not 1 Gbit/s half-duplex support is advertised.
The MAC_PROP_EN_1000HDX_CAP property describes whether or not 1 Gbit/s half-duplex support is enabled.
The MAC_PROP_ADV_1000FDX_CAP property describes whether or not 1 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_1000FDX_CAP property describes whether or not 1 Gbit/s full-duplex support is enabled.
The MAC_PROP_ADV_2500FDX_CAP property describes whether or not 2.5 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_2500FDX_CAP property describes whether or not 2.5 Gbit/s full-duplex support is enabled.
The MAC_PROP_ADV_5000FDX_CAP property describes whether or not 5.0 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_5000FDX_CAP property describes whether or not 5.0 Gbit/s full-duplex support is enabled.
The MAC_PROP_ADV_10GFDX_CAP property describes whether or not 10 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_10GFDX_CAP property describes whether or not 10 Gbit/s full-duplex support is enabled.
The MAC_PROP_ADV_40GFDX_CAP property describes whether or not 40 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_40GFDX_CAP property describes whether or not 40 Gbit/s full-duplex support is enabled.
The MAC_PROP_ADV_100GFDX_CAP property describes whether or not 100 Gbit/s full-duplex support is advertised.
The MAC_PROP_EN_100GFDX_CAP property describes whether or not 100 Gbit/s full-duplex support is enabled.
The driver may define whatever semantics it wants for these private properties. They will not be listed when running dladm(1M), unless explicitly requested by name. All such properties should start with a leading underscore character and then consist of alphanumeric ASCII characters and additional underscores or hyphens.
Properties of type MAC_PROP_PRIVATE may show up in all three property related entry points: mc_propinfo(9E), mc_getprop(9E), and mc_setprop(9E). Device drivers should tell the different properties apart by using the strcmp(9F) function to compare it to the set of properties that it knows about. When encountering properties that it doesn't know, it should treat them like all other unknown properties.
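Telling private properties apart by name might look like the sketch below. It is a user-space model that uses the C library's strcmp() in place of strcmp(9F); the property names, the example_priv_t structure, and the reduced function signature are all hypothetical, and ENOTSUP is used for unknown names on the assumption that they are treated like any other unsupported property.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical private tunables kept in the driver's soft state. */
typedef struct example_priv {
	uint32_t ep_rx_copy_thresh;
	uint32_t ep_tx_cksum_enable;
} example_priv_t;

/* Reduced get path for properties of type MAC_PROP_PRIVATE. */
int
example_get_private(const example_priv_t *p, const char *name, uint32_t *out)
{
	if (strcmp(name, "_rx_copy_threshold") == 0) {
		*out = p->ep_rx_copy_thresh;
		return (0);
	}
	if (strcmp(name, "_tx_cksum_enable") == 0) {
		*out = p->ep_tx_cksum_enable;
		return (0);
	}
	return (ENOTSUP);	/* unknown name: not a property we have */
}
```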
In general, if the device is not keeping track of these statistics, then it is recommended that the driver store these values as a uint64_t to ensure that overflow does not occur.
If a device does not support a specific statistic, then it is fine to return that it is not supported. The same should be used for unrecognized statistics. See mc_getstat(9E) for more information on the proper way to handle these.
As a solution to this, the driver should program the device to start placing the received Ethernet frame at two bytes off of the start of the DMA buffer. This will make sure that the IP header is 4-byte aligned whether or not VLAN tags are present.
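The arithmetic behind the two-byte offset can be checked with a short sketch. It assumes the usual 14-byte Ethernet header and 4-byte VLAN tag, a DMA buffer that itself starts on a 4-byte boundary, and hypothetical EXAMPLE_* names.

```c
#include <assert.h>
#include <stddef.h>

#define	EXAMPLE_ETHER_HDR_LEN	14	/* dst + src + ethertype */
#define	EXAMPLE_VLAN_TAG_LEN	4	/* one 802.1Q tag */
#define	EXAMPLE_RX_OFFSET	2	/* where the device starts writing */

/* Offset of the IP header from the start of the DMA buffer. */
size_t
example_ip_hdr_offset(size_t dma_offset, int tagged)
{
	return (dma_offset + EXAMPLE_ETHER_HDR_LEN +
	    (tagged ? EXAMPLE_VLAN_TAG_LEN : 0));
}
```

With the offset, the IP header lands at byte 16 untagged or byte 20 tagged, both multiples of four. Without it, the header would sit at byte 14 or 18, misaligned in either case.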
The Fault Management Architecture (FMA) provides facilities for detecting and reporting various classes of defects and faults. Specifically for networking device drivers, issues that should be detected and reported include:
All such errors fall into three primary categories:
Drivers should wire themselves to receive notifications when these events occur. The means and capabilities will vary from device to device. For example, some devices will generate information about these notifications through special interrupts. Other devices may have a register that software can poll. In the cases where polling is required, driver writers should try not to poll too frequently and should generally only poll when the device is actively being used, e.g. between calls to the mc_start(9E) and mc_stop(9E) entry points.
Many devices do not have such a communication mechanism. However, whenever there is some activity where the device driver must wait, then it should be prepared for the fact that the device may never get back to it and react appropriately by performing some kind of device reset.
When a non-fatal error occurs, then the device driver should submit an ereport and should optionally mark the device degraded using ddi_fm_service_impact(9F) with the DDI_SERVICE_DEGRADED value depending on the nature of the problem that has occurred.
Device drivers should never make the decision to remove a device from service based on errors that have occurred nor should they panic the system. Rather, the device driver should always try to notify the operating system with various ereports and allow its policy decisions to occur. The decision to retire a device lies in the hands of the fault management architecture. It knows more about the operator's intent and the surrounding system's state than the device driver itself does and it will make the call to offline and retire the device if it is required.
One wrinkle with device resets is that many networking cards show up as multiple PCI functions on a single device. For example, each port may show up as a separate function and thus have a separate instance of the device driver attached. When resetting a function, device driver writers should carefully read the device programming manuals and verify whether a reset impacts only the stalled function or all functions across the device.
If the only way to reset a given function is through the device, then this may require more coordination and work on the part of the device driver to ensure that all the other instances are correctly restored. In cases where this occurs, some devices offer ways of injecting interrupts onto those other functions to notify them that this is occurring.
In this case, a driver will use bcopy(9F) to copy memory between the two distinct regions. When transmitting a packet, it will copy the memory from the mblk_t to the DMA region. When receiving memory, it will allocate a mblk_t through the allocb(9F) routine, copy the memory across with bcopy(9F), and then increment the mblk_t's b_wptr member.
If, when receiving, memory is not available for a new message block, then the frame should be skipped and effectively dropped. A kstat should be bumped when such an occasion occurs.
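The copying receive path might be sketched as follows. This is a user-space model: mblk_t is reduced to the members used here, allocb and freemsg are local stubs built on malloc() and free() that mimic the kernel signatures, memcpy() stands in for bcopy(9F), and a plain counter stands in for the kstat.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reduced stand-in for the STREAMS message block. */
typedef struct msgb {
	unsigned char *b_rptr;	/* first valid byte */
	unsigned char *b_wptr;	/* one past the last valid byte */
	struct msgb *b_next;
} mblk_t;

static int fake_alloc_fail;	/* force the stub allocb() to fail */

static mblk_t *
allocb(size_t size, unsigned int pri)
{
	mblk_t *mp;

	(void) pri;
	if (fake_alloc_fail || (mp = malloc(sizeof (*mp) + size)) == NULL)
		return (NULL);
	mp->b_rptr = mp->b_wptr = (unsigned char *)(mp + 1);
	mp->b_next = NULL;
	return (mp);
}

static void
freemsg(mblk_t *mp)
{
	free(mp);
}

/*
 * Copy one received frame out of the DMA buffer. On allocation
 * failure the frame is dropped and the drop counter is bumped.
 */
mblk_t *
example_rx_copy(const unsigned char *dma_buf, size_t len, uint64_t *drops)
{
	mblk_t *mp;

	if ((mp = allocb(len, 0)) == NULL) {
		(*drops)++;
		return (NULL);
	}
	memcpy(mp->b_wptr, dma_buf, len);	/* bcopy(9F) in a driver */
	mp->b_wptr += len;			/* mark the data as valid */
	return (mp);
}
```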
When transmitting, a device driver has an mblk_t and needs to call the ddi_dma_addr_bind_handle(9F) function to bind it to an already existing DMA handle. At that point, it will receive various DMA cookies that it can use to obtain the addresses to program the device with for transmitting data. Once the transmit is done, the driver must then make sure to call freemsg(9F) to release the data. It must not call freemsg(9F) before it receives an interrupt from the device indicating that the data has been transmitted, otherwise it risks sending arbitrary kernel memory.
When receiving data, the device can perform a similar operation. First, it must bind the DMA memory into the kernel's virtual memory address space through a call to the ddi_dma_addr_bind_handle(9F) function if it has not already. Once it has, it must then call desballoc(9F) to try and create a new mblk_t which leverages the associated memory. It can then pass that mblk_t up to the stack.
The first thing to remember is that DMA resources may be finite on a given platform. Consider the case of receiving data. A device driver that binds one of its receive descriptors may not get it back for quite some time as it may be used by the kernel until an application actually consumes it. Device drivers that try to bind memory for receive often work with the constraint that they must be able to replace that DMA memory with another DMA descriptor. If it were not replaced, then eventually the device would not be able to receive additional data into the ring.
On the other hand, particularly for larger frames, copying every packet from one buffer to another can be a source of additional latency and memory waste in the system. For larger copies, the cost of copying may dwarf any potential cost of performing DMA binding.
Device driver authors who are unsure of what to do should first employ the copying method to simplify the act of writing the device driver. The copying method is simpler and also allows the device driver author not to worry about allocated DMA memory that is still outstanding when it is asked to unload.
If device driver writers are worried about the cost, it is recommended to make the decision as to whether or not to copy or bind DMA data a separate private property for both transmitting and receiving. That private property should indicate the size of the received frame at which to switch from one format to the other. This way, data can be gathered to determine what the impact of each method is on a given platform.
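The threshold decision itself is small. A minimal sketch, assuming the threshold comes from a hypothetical private property and that frames at or above it are bound rather than copied:

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
	EXAMPLE_RX_COPY,	/* copy the frame into a fresh mblk_t */
	EXAMPLE_RX_BIND		/* loan the DMA buffer up the stack */
} example_rx_mode_t;

/* Pick the receive strategy for one frame of the given length. */
example_rx_mode_t
example_rx_mode(size_t frame_len, size_t threshold)
{
	return (frame_len < threshold ? EXAMPLE_RX_COPY : EXAMPLE_RX_BIND);
}
```

Keeping the threshold tunable at runtime allows the crossover point to be measured per platform instead of being fixed at build time.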
McCloghrie, K. and Rose, M., RFC 1213 Management Information Base for Network Management of TCP/IP-based internets: MIB-II, March 1991.
McCloghrie, K. and Kastenholz, F., RFC 1573 Evolution of the Interfaces Group of MIB-II, January 1994.
Kastenholz, F., RFC 1643 Definitions of Managed Objects for the Ethernet-like Interface Types.
|February 13, 2021||OmniOS|