Detecting memory reordering

Test description

The test environment is x86-64, CentOS 7.2, gcc 4.8.5.

The code starts two threads that each perform a store followed by a load:
thread1:

  a <- 1   # A
  // compiler fence
  r1 <- b  # B

thread2:

  b <- 1   # C
  // compiler fence
  r2 <- a  # D

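Ignoring out-of-order effects in the CPU, the possible execution orders are ABCD, ACBD, ACDB, CADB, CDAB and CABD. The final value of r1r2 can be 10, 01 or 11, but never 00: whichever store executes first, the other thread's load comes after it and reads 1, so at least one of r1 and r2 is 1.
However, x86 may reorder a store followed by an independent load so that the load takes effect first (store-load reordering). B and D can then execute before A and C become globally visible, and r1r2 can end up as 00.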
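
The argument above can be checked by brute force. The following C++ sketch enumerates every ordering of the four operations, labeled so that B is thread1's load of b into r1 and D is thread2's load of a into r2 (as in the test program), and collects the resulting r1r2 values. The function name outcomes and its flag are made up for this illustration, and letting B run before A (or D before C) is only a crude stand-in for store-load reordering, not a model of real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: returns every "r1r2" result over the orderings of
// A (a=1), B (r1=b), C (b=1), D (r2=a).
// allow_store_load_reorder=false keeps only orders where A precedes B and
// C precedes D (program order); true also admits the swapped orders.
std::set<std::string> outcomes(bool allow_store_load_reorder) {
    std::set<std::string> seen;
    std::string order = "ABCD";  // sorted, so next_permutation visits all 24 orders
    do {
        bool program_order = order.find('A') < order.find('B') &&
                             order.find('C') < order.find('D');
        if (!program_order && !allow_store_load_reorder) continue;

        int a = 0, b = 0, r1 = -1, r2 = -1;
        for (char op : order) {
            switch (op) {
                case 'A': a = 1;  break;
                case 'B': r1 = b; break;
                case 'C': b = 1;  break;
                case 'D': r2 = a; break;
            }
        }
        assert(r1 >= 0 && r2 >= 0);  // both loads ran exactly once
        seen.insert(std::to_string(r1) + std::to_string(r2));
    } while (std::next_permutation(order.begin(), order.end()));
    return seen;
}
```

Calling outcomes(false) yields only 01, 10 and 11. Calling outcomes(true) additionally yields 00, for example from the order BDAC.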

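To keep compiler optimizations from affecting the test, a compiler fence stops the compiler from reordering the statements above, and volatile stops it from caching values in registers.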

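Because reordering is somewhat random, the core problem of detection is how to retry many times until a reorder is triggered. This code uses three counters (loop, cnt1, cnt2) to synchronize the two writer threads with the observer thread (main). On each round the observer resets the variables, then uses the loop counter to tell both writer threads to start at the same time. It waits for both writers to finish, inspects r1 and r2, and counts how many times both are 0.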

Code

#include <stdio.h>
#include <unistd.h>
#include <thread>
#include <stdlib.h>
#include <atomic>

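// volatile keeps the compiler from caching these in registers or moving the accesses
volatile int a=0;
volatile int b=0;
volatile int r1=10;
volatile int r2=10;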

volatile int loop=0;
volatile int cnt1=0;
volatile int cnt2=0;

#define COMPILER_FENCE asm volatile("" ::: "memory")

//#define CPU_FENCE asm volatile("mfence" ::: "memory")
#define CPU_FENCE

void f1(){
        while(true){
                cnt1++;
                while(loop!=cnt1);
                a=1;
                COMPILER_FENCE;
                CPU_FENCE;
                r1=b;
        }
}


void f2(){
        while(true){
                cnt2++;
                while(loop!=cnt2);
                b=1;
                COMPILER_FENCE;
                CPU_FENCE;
                r2=a;
        }
}

int main(int argc, char *argv[])
{
        int errcount=0;
        std::thread t1(f1);
        std::thread t2(f2);
        while(loop<1000)
        {
                a=0;
                b=0;
                r1=10;
                r2=10;
                usleep(100); // give the resets above time to reach the other threads; mfence is deliberately avoided here because mfence is what this test examines
                loop++;
                while(loop>=cnt1 || loop>=cnt2);

                if(r1==0 && r2==0)
                        errcount++;
        }

        printf("tried %d error %d\n", loop, errcount);
        // f1 and f2 loop forever, so join() would never return; detach and let process exit end them
        t1.detach();
        t2.detach();
        return 0;
}

Assembly analysis

f1:

400b60:       8b 05 fa 15 20 00       mov    eax,DWORD PTR [rip+0x2015fa]        # 602160 <cnt1>
400b66:       83 c0 01                add    eax,0x1
400b69:       89 05 f1 15 20 00       mov    DWORD PTR [rip+0x2015f1],eax        # 602160 <cnt1>
400b6f:       90                      nop
400b70:       8b 15 ee 15 20 00       mov    edx,DWORD PTR [rip+0x2015ee]        # 602164 <loop>
400b76:       8b 05 e4 15 20 00       mov    eax,DWORD PTR [rip+0x2015e4]        # 602160 <cnt1>
400b7c:       39 c2                   cmp    edx,eax
400b7e:       75 f0                   jne    400b70 <_Z2f1v+0x10>
400b80:       c7 05 e2 15 20 00 01    mov    DWORD PTR [rip+0x2015e2],0x1        # 60216c <a>
400b87:       00 00 00 
400b8a:       8b 05 d8 15 20 00       mov    eax,DWORD PTR [rip+0x2015d8]        # 602168 <b>
400b90:       89 05 f2 14 20 00       mov    DWORD PTR [rip+0x2014f2],eax        # 602088 <r1>
400b96:       eb c8                   jmp    400b60 <_Z2f1v>
400b98:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
400b9f:       00

As the listing shows, a=1; r1=b; actually compiles to three instructions:

store 1 to a
load b to eax
store eax to r1

Because x86 allows store-load reordering, the actual execution order may be:

thread1                       thread2
store 1 to a
                              store 1 to b
load b to eax # get 0
                              load a to eax  # get 0
store becomes globally visible
                              store becomes globally visible
store eax to r1
                              store eax to r2

Adding mfence

After switching the CPU_FENCE macro to the mfence definition, f1 compiles to:

400b60:       8b 05 fa 15 20 00       mov    eax,DWORD PTR [rip+0x2015fa]        # 602160 <cnt1>
400b66:       83 c0 01                add    eax,0x1
400b69:       89 05 f1 15 20 00       mov    DWORD PTR [rip+0x2015f1],eax        # 602160 <cnt1>
400b6f:       90                      nop
400b70:       8b 15 ee 15 20 00       mov    edx,DWORD PTR [rip+0x2015ee]        # 602164 <loop>
400b76:       8b 05 e4 15 20 00       mov    eax,DWORD PTR [rip+0x2015e4]        # 602160 <cnt1>
400b7c:       39 c2                   cmp    edx,eax
400b7e:       75 f0                   jne    400b70 <_Z2f1v+0x10>
400b80:       c7 05 e2 15 20 00 01    mov    DWORD PTR [rip+0x2015e2],0x1        # 60216c <a>
400b87:       00 00 00 
400b8a:       0f ae f0                mfence 
400b8d:       8b 05 d5 15 20 00       mov    eax,DWORD PTR [rip+0x2015d5]        # 602168 <b>
400b93:       89 05 ef 14 20 00       mov    DWORD PTR [rip+0x2014ef],eax        # 602088 <r1>
400b99:       eb c5                   jmp    400b60 <_Z2f1v>
400b9b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

An mfence instruction now sits between store(a) and load(b). This instruction:

guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction

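In this example, it guarantees that each thread's store is globally visible before that thread's load executes. At least one store is therefore globally visible before either load, and whichever load runs later always sees the other thread's store, so 00 cannot occur.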

thread1                               thread2
store 1 to a
                                      store 1 to b
wait until store is globally visible
                                      wait until store is globally visible
load b to eax # get 1
                                      load a to eax  # get 1
store eax to r1
                                      store eax to r2
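
The two timelines (without and with mfence) can be replayed in a small deterministic model. The sketch below is a toy, not a description of how any real CPU is built: it assumes each core has a FIFO store buffer, that a load reads the core's own pending stores first and shared memory otherwise, and that mfence simply drains the buffer. The names Core, drain and run_timeline are invented for this illustration.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <utility>

using Memory = std::map<std::string, int>;

// Toy core: stores wait in a FIFO store buffer until drained to shared memory.
struct Core {
    std::deque<std::pair<std::string, int>> store_buffer;

    void store(const std::string& addr, int value) {
        store_buffer.emplace_back(addr, value);  // not yet globally visible
    }

    int load(const std::string& addr, const Memory& mem) const {
        // A core sees its own pending stores first (newest wins).
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
            if (it->first == addr) return it->second;
        return mem.at(addr);
    }

    // Stands in for mfence in this toy: publish every pending store.
    void drain(Memory& mem) {
        for (const auto& s : store_buffer) mem[s.first] = s.second;
        store_buffer.clear();
    }
};

// Replays one fixed timeline and returns {r1, r2}.
std::pair<int, int> run_timeline(bool use_fence) {
    Memory mem{{"a", 0}, {"b", 0}};
    Core t1, t2;
    t1.store("a", 1);                 // thread1: store 1 to a
    t2.store("b", 1);                 // thread2: store 1 to b
    if (use_fence) {                  // wait until the stores are globally visible
        t1.drain(mem);
        t2.drain(mem);
    }
    int r1 = t1.load("b", mem);       // thread1: load b
    int r2 = t2.load("a", mem);       // thread2: load a
    t1.drain(mem);                    // without a fence, the stores become
    t2.drain(mem);                    // visible only now
    assert(mem.at("a") == 1 && mem.at("b") == 1);
    return {r1, r2};
}
```

In this model run_timeline(false) returns {0, 0}, matching the reordered timeline, and run_timeline(true) returns {1, 1}, matching the timeline with mfence.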

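Summary

mfence splits load/store instructions into two sets according to program order: every load/store before the mfence must become globally visible earlier than every load/store after it.
If an mfence sits between every store and the load that follows it, store-load reordering is no longer allowed, and x86 (which only permits store-load reordering) then behaves like the strongest memory model, sequential consistency.
As I understand it, a CPU fence does not restrict out-of-order execution itself. Out-of-order execution can also cause memory-ordering problems, but they exist even without it, for example because stores sit in a store buffer before becoming visible. This is the complexity that comes with chasing higher performance.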
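
For comparison, the same store, fence, load pattern can be written with the C++11 <atomic> facilities instead of inline assembly. This is only a sketch under assumptions: the helper name count_both_zero is invented, threads are created fresh on every round (unlike the counter-based handshake in the test program, so it is a poor reorder detector), and which instruction the compiler emits for std::atomic_thread_fence(std::memory_order_seq_cst) on x86 is left to the compiler; it is typically mfence or a locked instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store/fence/load pair `iterations` times and
// returns how often both loads observed 0. With a sequentially consistent
// fence between each store and load, the C++ memory model forbids that result.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a(0), b(0);
        int r1 = -1, r2 = -1;  // each written by one thread, read after join()

        std::thread t1([&] {
            a.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // CPU fence
            r1 = b.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            b.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // CPU fence
            r2 = a.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

A call such as count_both_zero(1000) is expected to return 0. Removing the two fences makes the 00 result legal again, although this simple harness may rarely or never trigger it. Depending on the toolchain, the program may need to be built with -pthread.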

These are my own views; corrections are welcome.

Original article: https://www.cnblogs.com/shouzhuo/p/12725110.html