Understanding Movable in Memory Management

Date: 2020-04-26
Zones in the kernel

The kernel defines the following memory zones:

enum zone_type {
#ifdef CONFIG_ZONE_DMA
    /*
     * ZONE_DMA is used when there are devices that are not able
     * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
     * carve out the portion of memory that is needed for these devices.
     * The range is arch specific.
     *
     * Some examples
     *
     * Architecture     Limit
     * ---------------------------
     * parisc, ia64, sparc  <4G
     * s390         <2G
     * arm          Various
     * alpha        Unlimited or 0-16MB.
     *
     * i386, x86_64 and multiple other arches
     *          <16M.
     */
    ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
    /*
     * x86_64 needs two ZONE_DMAs because it supports devices that are
     * only able to do DMA to the lower 16M but also 32 bit devices that
     * can only do DMA areas below 4G.
     */
    ZONE_DMA32,
#endif
    /*
     * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
     * performed on pages in ZONE_NORMAL if the DMA devices support
     * transfers to all addressable memory.
     */
    ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
    /*
     * A memory area that is only addressable by the kernel through
     * mapping portions into its own address space. This is for example
     * used by i386 to allow the kernel to address the memory beyond
     * 900MB. The kernel will set up special mappings (page
     * table entries on i386) for each page that the kernel needs to
     * access.
     */
    ZONE_HIGHMEM,
#endif
    ZONE_MOVABLE,
    __MAX_NR_ZONES
};

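  • ZONE_DMA
    Some devices cannot perform DMA to the whole address range, so this zone is carved out specifically for allocations that such devices can reach. On x86, for example, it covers 0-16M.
  • ZONE_NORMAL
    The NORMAL zone is the directly mapped region.
  • ZONE_HIGHMEM
    The high-memory zone. Memory allocated here must be mapped by the kernel before it can be accessed. 64-bit architectures generally do not need a high-memory zone, because their address space is large enough to map all physical memory.
  • ZONE_MOVABLE
    This zone is a special case. It exists mainly to support memory hotplug, so MOVABLE means "removable", but it also means "migratable".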

In short, migratable pages are not necessarily all in ZONE_MOVABLE, but every page in ZONE_MOVABLE must be migratable. /proc/pagetypeinfo gives a concrete example:

xie:/proc # cat pagetypeinfo                                                 
Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable     76     50     24     20     27     25     19      3      1      2      0 
Node    0, zone      DMA, type      Movable    117     35     28    172    281     93     49     21      7      4      4 
Node    0, zone      DMA, type  Reclaimable      0      3      1      0      0      0      0      1      0      1      0 
Node    0, zone      DMA, type          CMA   3380   1798    856    386    152     55     21      8      4      0      0 
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable    521    654    531    286    132     52     15      2      1      4      0 
Node    0, zone   Normal, type      Movable      1      8     21     21      1      1      5      3      1      0      0 
Node    0, zone   Normal, type  Reclaimable     18     24      1      1      0      0      1      0      1      0      0 
Node    0, zone   Normal, type          CMA      9      0      1      6      2      0      1      0      0      0      0 
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone  Movable, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone  Movable, type      Movable    963    649    188     48     24    112     49     21      8      3     50 
Node    0, zone  Movable, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone  Movable, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone  Movable, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone  Movable, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Number of blocks type     Unmovable      Movable  Reclaimable          CMA   HighAtomic      Isolate 
Node 0, zone      DMA          123          310           18           61            0            0 
Node 0, zone   Normal          406          310           43            9            0            0 
Node 0, zone  Movable            0          256            0            0            0            0 

Number of mixed blocks    Unmovable      Movable  Reclaimable          CMA   HighAtomic      Isolate 
Node 0, zone      DMA            0           61            0            0            0            0 
Node 0, zone   Normal            0           11            3            0            0            0 
Node 0, zone  Movable            0            0            0            0            0            0 

As the output shows, the Movable zone contains no Unmovable pages, only Movable ones.
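
To turn one row of the "Free pages count" table into a single number, each column has to be weighted by its order, since a free block of order n covers 2^n base pages. The following is a minimal user-space sketch under that assumption; the function name free_pages_in_row and the 16-byte name buffers are illustrative choices, not kernel interfaces.

```c
#include <stdio.h>

#define NR_ORDERS 11  /* orders 0..10, matching the columns shown above */

/*
 * Parse one "Node N, zone Z, type T c0 c1 ... c10" row.
 * zone and type must each point to a buffer of at least 16 bytes.
 * Returns the number of free base pages the row represents
 * (sum of count[order] << order), or -1 if the row does not parse.
 */
long free_pages_in_row(const char *row, char *zone, char *type)
{
    int node;
    int consumed = 0;
    long total = 0;

    if (sscanf(row, " Node %d, zone %15[^,], type %15s%n",
               &node, zone, type, &consumed) != 3)
        return -1;

    const char *p = row + consumed;
    for (int order = 0; order < NR_ORDERS; order++) {
        long count;
        int n = 0;

        if (sscanf(p, "%ld%n", &count, &n) != 1)
            return -1;
        total += count << order;  /* one order-n block = 2^n pages */
        p += n;
    }
    return total;
}
```

Feeding it the "zone Normal, type Movable" row from the output above gives the total number of free movable pages in ZONE_NORMAL, which makes the point that movable pages also live outside ZONE_MOVABLE.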

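The ZONE_MOVABLE zone

This zone is mainly tied to the memory hotplug feature. Memory hotplug was designed with the following considerations in mind:
1. Logical memory hotplug: support for virtual machines, so that usable memory can be assigned to a VM on demand.
2. Physical memory hotplug: support for NUMA servers, where memory that is not needed can be set offline to reduce power consumption.
3. Reducing memory fragmentation.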

Every page in this zone is migratable, and the zone can only be used by allocation requests that carry both the __GFP_HIGHMEM and __GFP_MOVABLE flags, for example:

#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)

#define GFP_USER    (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)

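Take care not to confuse the allocation flag __GFP_MOVABLE with the zone ZONE_MOVABLE; the two do not correspond one-to-one.

  • __GFP_MOVABLE is a property of an allocation: it says the page can be migrated. Some pages are migratable even though they are not in ZONE_MOVABLE, page cache pages for example.
  • ZONE_MOVABLE is a zone and is tied to memory hotplug. The pages in it must of course be migratable for hotplug to work.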

The allocation flag __GFP_MOVABLE

#define __GFP_DMA   ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE   ((__force gfp_t)___GFP_MOVABLE)  /* Page is movable */
#define GFP_ZONEMASK    (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

These allocation flags are called zone modifiers. They indicate which zone an allocation should preferably come from.

bit       result
=================
0x0    => NORMAL
0x1    => DMA or NORMAL
0x2    => HIGHMEM or NORMAL
0x3    => BAD (DMA+HIGHMEM)
0x4    => DMA32 or DMA or NORMAL
0x5    => BAD (DMA+DMA32)
0x6    => BAD (HIGHMEM+DMA32)
0x7    => BAD (HIGHMEM+DMA32+DMA)
0x8    => NORMAL (MOVABLE+0)
0x9    => DMA or NORMAL (MOVABLE+DMA)
0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
0xb    => BAD (MOVABLE+HIGHMEM+DMA)
0xc    => DMA32 (MOVABLE+DMA32)
0xd    => BAD (MOVABLE+DMA32+DMA)
0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)

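Four bits encode the combination. Of the low three bits (__GFP_DMA/__GFP_HIGHMEM/__GFP_DMA32) at most one may be set, while __GFP_MOVABLE can be combined with any one of those three, or with none. Four bits give 16 bit patterns in total, and the target zone for each pattern is stored in a long-typed table at an offset derived from the pattern.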

GFP_ZONE_TABLE:

|BAD|BAD|BAD|DMA32|BAD|MOVABLE|......|NORMAL|

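Each result is stored in the zone table at an offset derived from the bit combination above, so the zone to use can be looked up quickly from the combination. As this shows, __GFP_MOVABLE describes an allocation policy and does not map directly to ZONE_MOVABLE. As the previous section explained, memory is allocated from ZONE_MOVABLE only when __GFP_HIGHMEM and __GFP_MOVABLE are both set.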
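
The table lookup can be modelled in user space. This is only a sketch of the idea, assuming all five zones are configured (so 0x1 resolves to DMA rather than falling to NORMAL); the names model_gfp_zone, zone_table and bad_mask are made up for the illustration, while the four bit values mirror the zone modifier bits listed above.

```c
#include <stdint.h>

/* Zone modifier bits, lowest four bits of the GFP mask */
#define GFP_DMA_BIT     0x01u
#define GFP_HIGHMEM_BIT 0x02u
#define GFP_DMA32_BIT   0x04u
#define GFP_MOVABLE_BIT 0x08u
#define ZONE_BITS_MASK  0x0fu

enum zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE };

#define ZONES_SHIFT 3  /* bits needed to store one of five zone values */

/* Place zone z in the table slot that belongs to bit pattern bits */
#define ENTRY(z, bits) ((uint64_t)(z) << ((bits) * ZONES_SHIFT))

static const uint64_t zone_table =
      ENTRY(Z_NORMAL,  0)
    | ENTRY(Z_DMA,     GFP_DMA_BIT)
    | ENTRY(Z_HIGHMEM, GFP_HIGHMEM_BIT)
    | ENTRY(Z_DMA32,   GFP_DMA32_BIT)
    | ENTRY(Z_NORMAL,  GFP_MOVABLE_BIT)
    | ENTRY(Z_DMA,     GFP_MOVABLE_BIT | GFP_DMA_BIT)
    | ENTRY(Z_MOVABLE, GFP_MOVABLE_BIT | GFP_HIGHMEM_BIT)
    | ENTRY(Z_DMA32,   GFP_MOVABLE_BIT | GFP_DMA32_BIT);

/* One bit per pattern that the table above marks as BAD */
static const unsigned int bad_mask =
      (1u << 0x3) | (1u << 0x5) | (1u << 0x6) | (1u << 0x7)
    | (1u << 0xb) | (1u << 0xd) | (1u << 0xe) | (1u << 0xf);

/* Returns the selected zone, or -1 for a BAD combination. */
int model_gfp_zone(unsigned int flags)
{
    unsigned int bits = flags & ZONE_BITS_MASK;

    if (bad_mask & (1u << bits))
        return -1;
    return (int)((zone_table >> (bits * ZONES_SHIFT))
                 & ((1u << ZONES_SHIFT) - 1));
}
```

In this model, __GFP_MOVABLE alone (0x8) still resolves to the NORMAL zone, and only the 0xa pattern (MOVABLE plus HIGHMEM) resolves to the MOVABLE zone.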

The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA

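So an allocation is not necessarily satisfied from the zone that the passed-in flags select: if that zone has no memory that meets the request, the allocator falls back through the remaining zones in the order above until it finds memory that does.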
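
The fallback order can be illustrated with a deliberately simplified model. The real allocator walks a zonelist and checks watermarks; the sketch below only assumes a per-zone free page count and a hypothetical helper pick_zone that starts at the preferred zone and steps down the order given above.

```c
/* Zones listed from lowest to highest, so that fallback is a walk downwards:
 * MOVABLE => HIGHMEM => NORMAL => DMA32 => DMA */
enum zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE, NR_ZONES };

/*
 * Return the first zone, starting at preferred and falling back to lower
 * zones, that has at least want free pages; -1 if no zone qualifies.
 */
int pick_zone(enum zone preferred,
              const unsigned long free_pages[NR_ZONES],
              unsigned long want)
{
    for (int z = (int)preferred; z >= (int)Z_DMA; z--) {
        if (free_pages[z] >= want)
            return z;
    }
    return -1;
}
```

Note that the walk never goes upwards: a request whose flags select ZONE_NORMAL is never served from ZONE_MOVABLE in this model.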

How to enable ZONE_MOVABLE

- For all memory hotplug
    Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
    Allow for memory hot-add       (CONFIG_MEMORY_HOTPLUG)

- To enable memory removal, the followings are also necessary
    Allow for memory hot remove    (CONFIG_MEMORY_HOTREMOVE)
    Page Migration                 (CONFIG_MIGRATION)

- For ACPI memory hotplug, the followings are also necessary
    Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
    This option can be kernel module.

- As a related configuration, if your box has a feature of NUMA-node hotplug
  via ACPI, then this option is necessary too.
    ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
    (CONFIG_ACPI_CONTAINER).
    This option can be kernel module too.

1) When kernelcore=YYYY boot option is used,
   Size of memory not for movable pages (not for offline) is YYYY.
   Size of memory for movable pages (for offline) is TOTAL-YYYY.

2) When movablecore=ZZZZ boot option is used,
   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
   Size of memory for movable pages (for offline) is ZZZZ.
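
The arithmetic behind the two boot options can be written out as a small sketch. It ignores everything the kernel does beyond the subtraction (distribution across nodes, alignment), and clamping a request larger than TOTAL is an assumption of this model; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct mem_split {
    uint64_t kernel_bytes;   /* memory not for movable pages */
    uint64_t movable_bytes;  /* memory for movable pages (can be offlined) */
};

/* kernelcore=YYYY: YYYY stays non-movable, the rest becomes movable. */
struct mem_split split_by_kernelcore(uint64_t total, uint64_t kernelcore)
{
    struct mem_split s;

    s.kernel_bytes = kernelcore < total ? kernelcore : total;  /* clamp (assumed) */
    s.movable_bytes = total - s.kernel_bytes;
    return s;
}

/* movablecore=ZZZZ: ZZZZ becomes movable, the rest stays non-movable. */
struct mem_split split_by_movablecore(uint64_t total, uint64_t movablecore)
{
    struct mem_split s;

    s.movable_bytes = movablecore < total ? movablecore : total;  /* clamp (assumed) */
    s.kernel_bytes = total - s.movable_bytes;
    return s;
}
```

For a machine with 8 GiB in total, kernelcore=2G and movablecore=6G describe the same split in this model.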

The kernel exposes sysfs nodes for controlling memory hotplug:

% echo online > /sys/devices/system/memory/memoryXXX/state

Bring the memory block online.

% echo online_movable > /sys/devices/system/memory/memoryXXX/state

Bring the memory block online in ZONE_MOVABLE.

% echo online_kernel > /sys/devices/system/memory/memoryXXX/state

Bring the memory block online in ZONE_NORMAL.
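
The same writes can be issued from a program. The sketch below is a plain file write with a small whitelist covering only the three values shown above; the helper name set_memory_block_state is hypothetical, and the path is passed in by the caller because the memoryXXX block number depends on the system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write state ("online", "online_movable" or "online_kernel") to the given
 * sysfs state file, e.g. /sys/devices/system/memory/memoryXXX/state.
 * Returns 0 on success, -1 on an unknown state or an I/O error.
 */
int set_memory_block_state(const char *state_path, const char *state)
{
    static const char *const allowed[] = {
        "online", "online_movable", "online_kernel",
    };
    int ok = 0;

    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (strcmp(state, allowed[i]) == 0)
            ok = 1;
    }
    if (!ok)
        return -1;

    FILE *f = fopen(state_path, "w");
    if (f == NULL)
        return -1;

    int write_failed = (fputs(state, f) == EOF);

    /* The write may only be reported as failed when the stream is flushed,
     * so the result of fclose() matters as well. */
    if (fclose(f) != 0 || write_failed)
        return -1;
    return 0;
}
```

Writing to the real sysfs file requires root privileges, and the kernel may still reject the request, in which case the function returns -1.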

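How the size of ZONE_MOVABLE is determined

Let us first look at what happens when the memory zones are initialized.
On a NUMA-enabled system the processing is as follows: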

zone_sizes_init->free_area_init_nodes->find_zone_movable_pfns_for_nodes:
/*
 * If movable_node is specified, ignore kernelcore and movablecore
 * options.
 */
if (movable_node_is_enabled()) {
    for_each_memblock(memory, r) {
        if (!memblock_is_hotpluggable(r))
            continue;

        nid = r->nid;

        usable_startpfn = PFN_DOWN(r->base);
        zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
            min(usable_startpfn, zone_movable_pfn[nid]) :
            usable_startpfn;
    }

    goto out2;
}
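
The effect of the loop above is that, on each node, ZONE_MOVABLE starts at the lowest page frame of any hotpluggable region. The following user-space model shows that computation; struct region, MAX_NODES and a PAGE_SHIFT of 12 are assumptions of the sketch, not the kernel's memblock types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages, assumed for the example */
#define MAX_NODES  4   /* arbitrary limit for the model */

struct region {
    uint64_t base;      /* physical start address */
    uint64_t size;      /* length in bytes */
    int      nid;       /* NUMA node id */
    bool     hotpluggable;
};

/*
 * For every node, record the lowest start PFN among its hotpluggable
 * regions. movable_pfn[] must be zeroed by the caller; as in the kernel
 * code above, 0 means "no movable start recorded yet".
 */
void find_movable_start(const struct region *r, size_t n,
                        uint64_t movable_pfn[MAX_NODES])
{
    for (size_t i = 0; i < n; i++) {
        if (!r[i].hotpluggable)
            continue;
        if (r[i].nid < 0 || r[i].nid >= MAX_NODES)
            continue;

        uint64_t start = r[i].base >> PAGE_SHIFT;  /* PFN_DOWN(base) */
        uint64_t *slot = &movable_pfn[r[i].nid];

        *slot = *slot ? (start < *slot ? start : *slot) : start;
    }
}
```

Regions that are not hotpluggable never influence the result, so a node without hotpluggable memory keeps the value 0 and gets no ZONE_MOVABLE from this path.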

When the corresponding property is set in the device tree (dts), the matching memblock flag is set:

int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
                     int depth, void *data)
{
    /*
     * Excerpt: the declarations of reg/endp and the lookup of the
     * node's "reg" property are omitted here.
     */
    bool hotpluggable;

    hotpluggable = of_get_flat_dt_prop(node, "hotpluggable", NULL);

    while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
        u64 base, size;

        base = dt_mem_next_cell(dt_root_addr_cells, &reg);
        size = dt_mem_next_cell(dt_root_size_cells, &reg);

        if (size == 0)
            continue;
        pr_debug(" - %llx ,  %llx\n", (unsigned long long)base,
            (unsigned long long)size);

        early_init_dt_add_memory_arch(base, size);

        /* Only ranges of a node marked "hotpluggable" get the flag. */
        if (!hotpluggable)
            continue;

        if (early_init_dt_mark_hotplug_memory_arch(base, size))
            pr_warn("failed to mark hotplug range 0x%llx - 0x%llx\n",
                base, base + size);
    }

    return 0;
}

int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
{
    return memblock_mark_hotplug(base, size);
}

int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
{
    return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
}  

from: https://blog.csdn.net/rikeyone/article/details/86498298

Original article: https://www.cnblogs.com/aspirs/p/12781693.html