DMA Mapping Stripped Bare

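Last time we said there are two ways to solve the memory coherency problem between the CPU and DMA: coherent mappings and streaming mappings. Both come down to keeping the cache from getting in the way. A coherent mapping goes all the way and leaves the buffer uncached for its whole lifetime (on hardware that is not cache-coherent); a streaming mapping keeps the buffer cacheable and only deals with the cache around each DMA transfer, by cleaning or invalidating the affected lines. The former obviously costs performance, because every CPU access to the buffer bypasses the cache.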

Today let's look in detail at how these two approaches are actually implemented.

Coherent DMA mapping

dma_addr_t dma_handle;
cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);

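This function returns two values. cpu_addr is a kernel virtual address that the CPU uses to access the buffer. dma_handle is the DMA (bus) address of the same buffer, which is what you pass to the DMA engine; without an IOMMU it usually corresponds to the physical address.
The allocation size is rounded up to a multiple of PAGE_SIZE.
The function may also call alloc_page() to allocate physical pages, which can sleep, so in interrupt context do not call it with flags that allow blocking such as GFP_KERNEL. If you must allocate there, pass GFP_ATOMIC; as the code below shows, a non-coherent device is then served from the coherent pool instead.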
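To make the PAGE_SIZE granularity concrete, here is a small user-space sketch of the rounding that PAGE_ALIGN performs. It is not kernel code: the 4 KiB page size and the names SKETCH_PAGE_SIZE, sketch_page_align and sketch_pages_needed are assumptions made up for illustration, and it ignores overflow for sizes close to SIZE_MAX.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; the kernel defines PAGE_SIZE per architecture. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Round size up to the next multiple of the page size, like PAGE_ALIGN(size). */
size_t sketch_page_align(size_t size)
{
    /* The mask trick below is only valid for a power-of-two page size. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
    return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Number of whole pages a request of `size` bytes ends up consuming. */
size_t sketch_pages_needed(size_t size)
{
    return sketch_page_align(size) / SKETCH_PAGE_SIZE;
}
```

With these assumptions a 100-byte request still consumes one full page, and a 4097-byte request consumes two, which is why dma_alloc_coherent is a poor fit for many tiny buffers.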

The call path is: dma_alloc_coherent -> dma_alloc_attrs -> ops->alloc -> __dma_alloc

static void *__dma_alloc(struct device *dev, size_t size,
    dma_addr_t *dma_handle, gfp_t flags,
    unsigned long attrs)
{
 struct page *page;
 void *ptr, *coherent_ptr;
 bool coherent = is_device_dma_coherent(dev);
 pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, false);

 size = PAGE_ALIGN(size);

 if (!coherent && !gfpflags_allow_blocking(flags)) { // gfpflags_allow_blocking: may the caller block (direct reclaim allowed)?
  struct page *page = NULL;
  void *addr = __alloc_from_pool(size, &page, flags); //coherent_pool

  if (addr)
   *dma_handle = phys_to_dma(dev, page_to_phys(page));

  return addr;
 }

 ptr = __dma_alloc_coherent(dev, size, dma_handle, flags, attrs);
 if (!ptr)
  goto no_mem;

 /* no need for non-cacheable mapping if coherent */
 if (coherent)
  return ptr;

 ......
}

__alloc_from_pool allocates from the coherent pool (coherent_pool), whose size can be set on the kernel command line with coherent_pool=.

If the conditions for the coherent pool path are not met, the allocation goes on to __dma_alloc_coherent.

static void *__dma_alloc_coherent(struct device *dev, size_t size,
      dma_addr_t *dma_handle, gfp_t flags,
      unsigned long attrs)
{
 if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
     dev->coherent_dma_mask <= DMA_BIT_MASK(32))
  flags |= GFP_DMA32;
 if (dev_get_cma_area(dev) && gfpflags_allow_blocking(flags)) { // gfpflags_allow_blocking: may the caller block (direct reclaim allowed)?
  struct page *page;
  void *addr;

  page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT, //CMA
       get_order(size), flags);
  if (!page)
   return NULL;

  *dma_handle = phys_to_dma(dev, page_to_phys(page));
  addr = page_address(page);
  memset(addr, 0, size);
  return addr;
 } else {
  return swiotlb_alloc_coherent(dev, size, dma_handle, flags);//buddy or swiotlb
 }
}

dma_alloc_from_contiguous allocates the memory from CMA.
swiotlb_alloc_coherent allocates the memory from the buddy allocator or from swiotlb.
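Putting the two quoted functions together, the choice of back end can be modeled as a small decision function. This is a user-space sketch of the branching only, not kernel code: the enum, the struct and every sketch_ name are hypothetical, and the three booleans stand in for is_device_dma_coherent(dev), gfpflags_allow_blocking(flags) and a non-NULL dev_get_cma_area(dev).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three back ends discussed above. */
enum sketch_backend {
    SKETCH_ATOMIC_POOL,      /* __alloc_from_pool (coherent_pool) */
    SKETCH_CMA,              /* dma_alloc_from_contiguous */
    SKETCH_BUDDY_OR_SWIOTLB  /* swiotlb_alloc_coherent */
};

/* The conditions the quoted code tests; field names are made up for this sketch. */
struct sketch_request {
    bool dev_coherent;  /* stands in for is_device_dma_coherent(dev) */
    bool may_block;     /* stands in for gfpflags_allow_blocking(flags) */
    bool has_cma_area;  /* stands in for dev_get_cma_area(dev) != NULL */
};

/* Mirror the order of the checks in __dma_alloc and __dma_alloc_coherent. */
enum sketch_backend sketch_pick_backend(const struct sketch_request *req)
{
    assert(req != NULL);

    /* Non-coherent device that cannot block: take the preallocated pool. */
    if (!req->dev_coherent && !req->may_block)
        return SKETCH_ATOMIC_POOL;

    /* CMA is only tried when the caller is allowed to block. */
    if (req->has_cma_area && req->may_block)
        return SKETCH_CMA;

    /* Everything else falls through to the buddy allocator or swiotlb. */
    return SKETCH_BUDDY_OR_SWIOTLB;
}
```

One consequence the model makes visible: a coherent device that allocates with non-blocking flags never reaches the pool and never reaches CMA, so it lands on the buddy/swiotlb path even when a CMA area exists.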

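So dma_alloc_coherent is only the front-end API for requesting coherent DMA memory. Where the memory comes from, whether it is physically contiguous, and whether it is cached are all decided by the back end.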

As shown in the figure below:

Streaming DMA mapping

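The approach above has an obvious drawback: the buffer stays uncached the whole time, so performance is poor. For example, once a DMA transfer completes and the CPU goes to fetch the data from the DMA buffer, every read and write goes straight to memory because caching is off for that buffer, and that is slow.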

Here is a method that keeps DMA transfers coherent and still performs well: streaming DMA mapping.

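「DMA_TO_DEVICE」: As the figure shows, the CPU wants a DMA write, that is, to move buffer A in memory into the device's FIFO A. Some of the data may still sit in the cache and not yet have been written back to buffer A in memory. If DMA is started at this point, what reaches FIFO A is not what the CPU meant to write, because part of the data is still lurking in cache A and has not been synced to memory. The cache lines covering buffer A therefore have to be cleaned (written back) before the transfer starts.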

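「DMA_FROM_DEVICE」: Now the DMA read case: the CPU wants the data in the device's FIFO B read into buffer B in memory. If the cache lines covering buffer B are not invalidated when the DMA transfer is started, then after DMA has moved the data from FIFO B into buffer B, a CPU read of buffer B hits the stale cache lines and returns the old contents, so the CPU never sees the latest data from FIFO B.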
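Both failure modes can be reproduced with a toy model in user space. The sketch below is an illustration under strong assumptions, not how any real cache or the kernel works internally: it models a single write-back cache line that shadows one 16-byte buffer, and all sketch_ names are hypothetical. The DMA helpers touch only RAM, which is exactly what makes the cache maintenance necessary.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_BUF_SIZE 16

/* One buffer in RAM plus one write-back cache line that shadows it. */
struct sketch_system {
    unsigned char ram[SKETCH_BUF_SIZE];   /* what the DMA engine sees */
    unsigned char cache[SKETCH_BUF_SIZE]; /* what the CPU sees while valid */
    bool valid;                           /* cache line holds a copy */
    bool dirty;                           /* copy is newer than RAM */
};

/* CPU store: write-allocate, so the data lands in the cache only. */
void sketch_cpu_write(struct sketch_system *s, const unsigned char *src)
{
    assert(s != NULL && src != NULL);
    memcpy(s->cache, src, SKETCH_BUF_SIZE);
    s->valid = true;
    s->dirty = true;
}

/* CPU load: a hit returns the cached copy, a miss refills from RAM first. */
void sketch_cpu_read(struct sketch_system *s, unsigned char *dst)
{
    if (!s->valid) {
        memcpy(s->cache, s->ram, SKETCH_BUF_SIZE);
        s->valid = true;
        s->dirty = false;
    }
    memcpy(dst, s->cache, SKETCH_BUF_SIZE);
}

/* Clean: write a dirty line back to RAM (needed before DMA_TO_DEVICE). */
void sketch_cache_clean(struct sketch_system *s)
{
    if (s->valid && s->dirty) {
        memcpy(s->ram, s->cache, SKETCH_BUF_SIZE);
        s->dirty = false;
    }
}

/* Invalidate: drop the cached copy (needed for DMA_FROM_DEVICE). */
void sketch_cache_invalidate(struct sketch_system *s)
{
    s->valid = false;
    s->dirty = false;
}

/* The device reads RAM directly and never looks at the CPU cache. */
void sketch_dma_to_device(const struct sketch_system *s, unsigned char *fifo)
{
    memcpy(fifo, s->ram, SKETCH_BUF_SIZE);
}

/* The device writes RAM directly and never updates the CPU cache. */
void sketch_dma_from_device(struct sketch_system *s, const unsigned char *fifo)
{
    memcpy(s->ram, fifo, SKETCH_BUF_SIZE);
}
```

In this model, calling sketch_cpu_write and then sketch_dma_to_device without a clean hands the device the old RAM contents, and calling sketch_dma_from_device and then sketch_cpu_read without an invalidate returns the old cached contents. Inserting sketch_cache_clean before the first transfer and sketch_cache_invalidate around the second makes both directions see the expected data.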

The dma_map_single call path is: dma_map_single -> dma_map_single_attrs -> ops->map_page

Without an IOMMU, this ends up in __swiotlb_map_page.

The overall flow is as follows: