iommu information (and other hypervisor api doc issues)
I have sent mail to hypervisor@sun.com, but have yet to receive a response.
I am working on porting FreeBSD to the sun4v, but the documentation for the IOMMU is lacking. There is no description of what an iotte is, what it looks like, what its bits mean, or how many bits are in an iotte. If it's a tte, then just say tte instead of iotte.
The HyperVisor doc references in 20.1.1:
[1] sun4v Bus Binding to Open Firmware
[2] VPCI Bus Binding to Open Firmware
which aren't available for download anywhere, and in section 21.2 it references doc [4] without stating what that doc is.
By therealjmg at 2007-11-26 6:20:24

# 1
Well, since Sun doesn't seem to be answering this question, and we wanted to continue w/ the project, a friend did some experiments, and found that the IOMMU documentation is incorrect.
For the iommu_map call, the docs describe io_page_list_p like this: "The page mapping addresses are described in the io_page_list defined by the argument io_page_list_p, which is a pointer to the io_page_list. The first entry in the io_page_list is the address for the first iotte, the 2nd entry for the 2nd iotte, and so on." That is not how it works.
There is no such thing as an iotte, just an r_addr, and this isn't a double array. Notice how they say the first entry is the address for the first iotte, instead of the first entry IS the first iotte, or even better, the first entry IS the r_addr the entry will be mapped to. The io_page_list_p is an r_addr pointing to an array of r_addrs, each of which is the r_addr that the respective TSB entry will be mapped to.
io_page_list_p r_addr -> dma dest r_addr 0
                         dma dest r_addr 1
                         ...
                         dma dest r_addr n
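To make that layout concrete, here is a rough C sketch of building such a list for one buffer. The names r_addr_t and build_io_page_list, the 8 KB page size, and the assumption that the buffer is contiguous in real address space are all mine for illustration, not from the spec. The hypercall itself is not shown; io_page_list_p would be the r_addr of the array this fills in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t r_addr_t;        /* hypothetical name for a real address */

#define IO_PAGE_SIZE 8192ULL      /* assumed IOMMU page size */

/*
 * Fill io_page_list with one page-aligned r_addr per IOMMU page covering
 * the buffer [buf_ra, buf_ra + len). Entry i is the r_addr that the i-th
 * TSB entry of the mapping would point at. Returns the number of entries
 * written (at most max_entries).
 */
static size_t
build_io_page_list(r_addr_t *io_page_list, size_t max_entries,
    r_addr_t buf_ra, uint64_t len)
{
	size_t n = 0;
	r_addr_t page = buf_ra & ~(IO_PAGE_SIZE - 1);
	r_addr_t end = buf_ra + len;

	assert(io_page_list != NULL);
	while (page < end && n < max_entries) {
		io_page_list[n++] = page;
		page += IO_PAGE_SIZE;
	}
	return n;
}
```

Note that it is a flat array of r_addrs, with no second level of indirection.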
Also in the docs for dma_sync it says: using the direction(s) defined by the argument io_sync_direction.
and io_sync_direction is defined as:
io_sync_direction "direction" definition for pci_dma_sync
A value specifying the direction for a memory/io sync operation, The
direction value is a flag, one or both directions may be specified by the
caller.
0x01 - For device (device read from memory)
0x02 - For cpu (device write to memory)
which explicitly says one or both directions may be specified... Well, sorry, you can't; you can only specify one direction at a time.
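A workaround sketch in C, assuming the behaviour we observed (one direction per call): split a combined flag value into single-direction values and issue one sync call for each. The macro names and the helper split_sync_directions are hypothetical; only the 0x01 and 0x02 values come from the doc.

```c
#include <assert.h>
#include <stddef.h>

#define IO_SYNC_DEVICE 0x01u   /* for device (device read from memory) */
#define IO_SYNC_CPU    0x02u   /* for cpu (device write to memory) */

/*
 * Break a possibly combined direction mask into single-direction values.
 * The caller then makes one dma_sync call per entry in out[].
 * Returns the number of entries written (0, 1, or 2).
 */
static size_t
split_sync_directions(unsigned dirs, unsigned out[2])
{
	size_t n = 0;

	assert(out != NULL);
	if (dirs & IO_SYNC_DEVICE)
		out[n++] = IO_SYNC_DEVICE;
	if (dirs & IO_SYNC_CPU)
		out[n++] = IO_SYNC_CPU;
	return n;
}
```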
It's one thing for documentation to be incomplete (no definition of what an iotte is) or incorrect (io_page_list_p), but it's another thing for Sun to completely ignore requests for help. If Sun wants these forums to be successful, they need to have engineers watching and responding to these requests.
# 2
Thanks for your comments; having re-read section 20 of the Hypervisor API spec
I agree it probably could use a narrative section to provide more informative
detail. We'll do that in the forthcoming update. I hope the following will
confirm and explain what you have been seeing:
The Niagara-1 system re-uses a PCI-Express interface chip ("Fire") originally
designed for an earlier family of processors. The J-bus interface on Niagara-1
is specifically to be able to connect to this interface chip.
The Fire chip has two PCI-Express interfaces, each of which is supported by
its own IOMMU to translate both 32bit and 64bit PCI bus addresses into host
system memory addresses for J-bus transactions.
Each of the two IOMMUs in Fire contains a TLB for caching frequently used
PCI-bus to host memory address translations. These translations are loaded
on demand by Fire hardware from a lookup table specified in
Niagara's main memory.
The lookup table itself is very simplistic; the (virtual) address presented
by a PCI device is simply shifted by a specified IOMMU page size, and used as
an index into the translation table. A basic linear table lookup is fine
since it is the operating system that is assigning the PCI bus address space -
making it possible to arrange for a compact linear address map.
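As a rough illustration of that lookup, the C sketch below models the table as a plain array of host page addresses, with 0 meaning "no mapping". This is a simplification for illustration only: the real IO TTE format is hypervisor-private, and the 8 KB page shift and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_PAGE_SHIFT 13u   /* assumed 8 KB IOMMU pages */

/* Table index for a PCI bus address: just the address shifted down. */
static size_t
iotsb_index(uint64_t pci_addr)
{
	return (size_t)(pci_addr >> IOMMU_PAGE_SHIFT);
}

/*
 * Translate a PCI bus address through a simplified linear table.
 * Returns 0 and stores the host address on success, or -1 if the index
 * is out of range or the entry is empty.
 */
static int
iotsb_translate(const uint64_t *table, size_t entries, uint64_t pci_addr,
    uint64_t *host_addr)
{
	size_t idx = iotsb_index(pci_addr);
	uint64_t off = pci_addr & ((1ULL << IOMMU_PAGE_SHIFT) - 1);

	assert(table != NULL && host_addr != NULL);
	if (idx >= entries || table[idx] == 0)
		return -1;
	*host_addr = table[idx] | off;
	return 0;
}
```

Because the guest chooses the PCI bus addresses, it can keep them dense so the table stays small.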
However, because Fire was originally architected without virtualization in mind, each in-memory translation table contains IO translation table
entries (IO TTE's) that specify physical bus addresses.
Therefore we could not allow the guest operating system to
own and manage these tables directly - instead the Hypervisor must
check and validate the specified address mapping for each IO TTE to ensure
that the guest has not specified an illegal mapping - in order, for example,
to DMA to/from hypervisor memory, or other guest OS' memory.
Furthermore, there is no hardware mechanism to interrupt the Niagara
processor and allow us to "fault-in" a missing translation, (as we do with
the CPU's virtual memory translations), so all the PCI->host memory
address translations have to be fully specified in the in-memory translation
table before IO DMA can begin. This prevented a "shadow" page-table design.
Consequently, these IOMMU translation tables are defined in hypervisor-private
memory and the hypervisor API is used to cause the Hypervisor to fill
in each IO TTE entry. This in turn means that we keep the IO TTE format
hypervisor private also. The guest OS doesn't need to know the TTE format, which leaves the hardware designers free to change the format and layout of the
translation table in the future and the Hypervisor can maintain compatibility
in software. The Hypervisor API is optimised for the kind of behaviour
Solaris wants to perform for each DMA operation - namely map a list of
pages with the same permissions.
So to setup the IO translations required for a DMA transfer, the Hypervisor
API allows the guest OS to identify which translation table it wishes to use
(the abstraction allows for more than one table, and more than one IO TLB).
The remaining arguments specify the starting index in the IO translation
table at which to start populating the mappings, the permissions to
apply to each of the mappings, and a pointer to the list of pages
to use for each of the mappings.
Section 20.3 describes each of the parameters in more detail; but the tsbid
field really is a tuple of the identifier for the translation table in the
upper 32bits, and the index within the table in the lowest 32 bits.
(by the way TSB = Translation Storage Buffer, another name for translation
table. The distinction is simply that these tables are linear rather than
with a hierarchical walk like x86 virtual-memory page tables).
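The tsbid layout described above can be sketched in C as follows. The helper names are hypothetical; only the split into an upper 32-bit table identifier and a lower 32-bit index is taken from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a table identifier and an entry index into a 64-bit tsbid. */
static uint64_t
tsbid_make(uint32_t tsb_num, uint32_t index)
{
	return ((uint64_t)tsb_num << 32) | index;
}

/* Upper 32 bits: which translation table. */
static uint32_t
tsbid_table(uint64_t tsbid)
{
	return (uint32_t)(tsbid >> 32);
}

/* Lower 32 bits: index of the entry within that table. */
static uint32_t
tsbid_index(uint64_t tsbid)
{
	return (uint32_t)(tsbid & 0xffffffffu);
}

/* Step the index forward by n entries without touching the table id. */
static uint64_t
tsbid_advance(uint64_t tsbid, uint32_t n)
{
	uint32_t idx = tsbid_index(tsbid);

	assert(n <= UINT32_MAX - idx);   /* index must not wrap into the table id */
	return tsbid_make(tsbid_table(tsbid), idx + n);
}
```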
For the map API the Hypervisor will try to create and insert as many
IO TTEs as possible into the identified IO TSB, using the information
given. However, to avoid
problems like priority inversions while in the Hypervisor, the API
may return before building all the IO TTE mappings. In this event
the guest OS may simply call the API again for the remaining mappings.
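A guest-side retry loop for that partial-completion behaviour might look roughly like the C sketch below. The callback signature iommu_map_fn is a hypothetical stand-in for however the guest wraps the hypercall, not the real API, and fake_map exists only so the loop can be exercised without a hypervisor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical wrapper signature: try to map `count` pages starting at
 * `tsbid`, store how many were actually mapped in *mapped, and return 0
 * on success or nonzero on error.
 */
typedef int (*iommu_map_fn)(uint64_t tsbid, unsigned attrs, size_t count,
    const uint64_t *pages, size_t *mapped);

/* Keep calling the map operation until every page has been mapped. */
static int
map_all(iommu_map_fn map, uint64_t tsbid, unsigned attrs,
    const uint64_t *pages, size_t count)
{
	size_t done = 0;

	assert(map != NULL && pages != NULL);
	while (done < count) {
		size_t mapped = 0;
		/* The index lives in the low 32 bits, so adding advances it. */
		int err = map(tsbid + done, attrs, count - done, pages + done,
		    &mapped);

		if (err != 0)
			return err;
		if (mapped == 0)
			return -1;   /* no progress; do not spin forever */
		done += mapped;
	}
	return 0;
}

/* Test stand-in: pretends the hypervisor maps at most two pages per call. */
static size_t fake_calls;

static int
fake_map(uint64_t tsbid, unsigned attrs, size_t count,
    const uint64_t *pages, size_t *mapped)
{
	(void)tsbid; (void)attrs; (void)pages;
	fake_calls++;
	*mapped = count < 2 ? count : 2;
	return 0;
}
```

With five pages and the two-per-call stand-in, map_all ends up making three calls.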
The dma_sync API is intended to allow for possible future hardware that
may not be cache-coherent for DMA transactions. For example, an application
may write data into a software buffer which sits as dirty lines, for example,
in a write-back level 2 cache. With non-coherent DMA a device
read of the buffer memory may return the incorrect contents of main memory
without the L2 cache snooping and returning the correct data for each
memory read. On Niagara this is essentially a no-op since DMA is coherent,
but the API is provided to enable binary compatibility of kernels
from one sun4v platform to another.
With that in mind, the flag argument was originally intended to enable the
sync to be specified for both directions simultaneously. However, in practice
the operation typically needs to be invoked at different times: the sync for
the device prior to a DMA read, and the sync for the cpu after a DMA write
completes. So in practice we have had no need to specify both operations
in one sync API call - which is why the issue you raise was not fixed
with the Niagara-1 hypervisor. We'll correct this in the next release
of the HV specification.