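1. dmidecode

Purpose: extracts the hardware information recorded in the system BIOS (the DMI/SMBIOS tables).

Usage:

    dmidecode | grep "Configured Memory Speed"

This example shows the speed the memory is actually running at, which determines how fast the memory can move data.

In real projects you will be told whether to test 1DPC or 2DPC. DPC (DIMMs Per Channel) is the number of DIMMs installed on each memory channel that extends from the CPU's IMC (Integrated Memory Controller).

2. mlc

Purpose: Memory Latency Checker.

- Measures memory latency:
    - Idle latency: the fastest memory access time when the system is under no load.
    - Loaded latency: how latency rises as memory bandwidth is progressively saturated. This reflects how stable the system is under heavy pressure.
- Measures memory bandwidth, i.e. the amount of data transferred per unit of time.

Setup: download the mlc tool, then run it and save the output to a log file:

    ./mlc > mlc_snc_test.log

Log output

Idle latency test:

    Measuring idle latencies for sequential access (in ns)...
                    Numa node
    Numa node            0       1
           0         120.3   147.0
           1         137.9   120.4

On an idle system, this measures the latency (in ns) of the CPUs on each node sequentially reading data from the memory of each node.

Peak bandwidth test:

    Measuring Peak Injection Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using traffic with the following read-write ratios
    ALL Reads        :      379143.9
    3:1 Reads-Writes :      325338.0
    2:1 Reads-Writes :      313722.6
    1:1 Reads-Writes :      308136.8
    Stream-triad like:      315334.9
    All NT writes    :      293041.7
    1:1 Read-NT write:      301390.1

This is the maximum data rate (in MB/sec) the system can reach under various read/write mixes. In this test MLC has every CPU core issue read and write requests to its own local memory controller at full speed, all at the same time. The reported result is the peak bandwidth of the whole system.

Node-to-node bandwidth test:

    Measuring Memory Bandwidths between nodes within system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
                    Numa node
    Numa node            0          1
           0        189298.1   189711.3
           1        189472.2   189588.3

This measures the bandwidth of the CPUs on each node reading from the memory of each node while fully loaded.

Loaded latency test:

    Measuring Loaded Latencies for the system
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
    Inject  Latency  Bandwidth
    Delay   (ns)     MB/sec
     00000  377.93   377594.6
     00002  400.58   379020.5
     00008  399.81   378989.4
     00015  383.97   378821.7
     00050  287.89   374683.6
     00100  250.89   363595.3
     00200  213.68   347540.1
     00300  190.41   263701.6
     00400  181.01   216367.5
     00500  170.79   185032.9
     00700  148.27   136411.5
     01000  135.84    97726.8
     01300  130.50    76133.7
     01700  126.76    58842.1
     02500  124.18    40543.5
     03500  123.06    29268.1
     05000  122.44    20727.9
     09000  121.85    11798.5
     20000  121.16     5449.7

This simulates how latency rises as throughput (bandwidth) increases. Reading the table from bottom to top: as the inject delay shrinks, bandwidth grows from 5449.7 MB/sec to roughly 379000 MB/sec, while latency climbs from 121.16 ns to about 400 ns.

Cross-core cache-to-cache transfer test, output:

    Measuring cache-to-cache transfer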
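    latency (in ns)...
    Local Socket L2->L2 HIT  latency    80.6    # core A got clean data directly from core B's L2 cache
    Local Socket L2->L2 HITM latency    81.1    # core A got data that core B had modified (Modified state), which involves the cache coherence protocol

This measures the latency of passing data between individual CPU cores. No arguments were given on the command line here, so the default (CPU 0) is used.

3. lmbench

Purpose: quantifies the peak performance and baseline overhead of the OS kernel working together with the underlying hardware, by measuring the latency and bandwidth of basic system operations.

Setup:

    # Unpack and enter the source directory
    tar -zxvf lmbench-3.0-a9.tgz
    cd lmbench-3.0-a9/src
    # Remove every reference to the missing rpc/rpc.h header
    sed -i '/#include <rpc\/rpc.h>/d' *.c *.h
    # Build. Ignore the error at the end, but make sure ../bin/x86_64-linux-gnu/lmbench.a
    # has been produced, because every tool is linked against it.
    make results
    # Remove the duplicate socklen_t definition from bench.h
    sed -i '/typedef int socklen_t;/d' bench.h
    # Link the tool by hand (timing, memory, and result-handling modules come from lmbench.a), plus the math library
    gcc -O -DHAVE_socklen_t -o ../bin/x86_64-linux-gnu/lat_mem_rd \
        lat_mem_rd.c ../bin/x86_64-linux-gnu/lmbench.a -lm

lat_mem_rd

Purpose: measures the real capacity and response time of the CPU caches (L1/L2/L3) and of main memory.

Usage:

    numactl -m 0 -N 0 ./lat_mem_rd -P 1 -N 3 2048M 1024
    # -m 0  (numactl)    : force memory allocation onto NUMA node 0
    # -N 0  (numactl)    : pin the process to the CPU cores of NUMA node 0
    # -P 1  (lat_mem_rd) : parallelism, used to simulate highly concurrent load
    # -N 3  (lat_mem_rd) : repeat the measurement 3 times
    # 2048M              : test data size of 2048 MB
    # 1024               : stride of 1024 bytes

Log output (first column: array size in MB; second column: load latency in ns):

    stride=1024
       0.00098    1.390
       0.00195    1.390
       0.00293    1.390
       0.00391    1.390
       0.00586    1.390
       0.00781    1.390
       0.01172    1.390
       0.01562    1.390
       0.02344    1.390
       0.03125    1.390
       0.04688    1.699
       0.06250    3.117
       0.09375    1.754
       0.12500    4.448
       0.18750    4.448
       0.25000    4.448
       0.37500    4.448
       0.50000    4.447
       0.75000    4.448
       1.00000    4.448
       1.50000    4.448
       2.00000    5.048
       3.00000   31.986
       4.00000   34.494
       6.00000   32.341
       8.00000   33.517
      12.00000   33.021
      16.00000   33.152
      24.00000   32.050
      32.00000   34.090
      48.00000   33.797
      64.00000   33.154
      96.00000   34.172
     128.00000   22.547
     192.00000   49.937
     256.00000   68.443
     384.00000   79.057
     512.00000   74.141
     768.00000   54.388
    1024.00000   43.767
    1536.00000   46.355
    2048.00000   44.476

The latency plateaus hint at where each level of the hierarchy ends: about 1.39 ns up to roughly 0.03 MB, about 4.45 ns up to 2 MB, about 33 ns up to around 96 MB, and higher, noisier values beyond that. These steps likely correspond to L1, L2, L3, and main memory.

bw_mem

Purpose: measures the real throughput (bandwidth) a CPU core achieves when it accesses memory through the multi-level cache hierarchy, following the normal architectural rules.

Difference from mlc: unlike bw_mem, the mlc test bypasses the CPU caches (L1/L2/L3) and measures the throughput (bandwidth) between the motherboard, the memory controller (IMC), and the physical DIMMs.

Usage:

    numactl -m 0 -N 0 ./bw_mem -P 1 -N 3 1024M rd
    numactl -m 1 -N 0 ./bw_mem -P 1 -N 3 1024M rd
    # 1024M : test data size of 1024 MB
    # rd    : operation type Read (read only); other common types: wr (write), cp (copy), rdwr (mixed read/write)

Log output (first column: data size in MB; second column: bandwidth in MB/s):

    # The CPU on node 0 reads node 0's memory faster than it reads across nodes
    1024.00 22633.34
    1024.00 15589.79

4. sysbench

Purpose: puts heavy load on the CPU, memory, disk I/O,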
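and even databases, so you can observe how stable the system stays under high pressure. It covers:

- CPU compute performance
- Disk I/O performance
- Scheduler performance
- Memory allocation and transfer speed. The focus is the usable bandwidth the application layer actually gets after the overhead of OS memory allocation, the C standard library, and multithreaded scheduling.
- POSIX thread performance (mutex benchmark)
- Database performance (OLTP benchmark)

Setup:

    ./autogen.sh
    ./configure --without-mysql
    make -j
    make install
    sysbench --version

Usage:

    numactl -N 0 -m 0 sysbench memory --memory-block-size=16K --memory-scope=local --memory-total-size=100G --memory-access-mode=rnd run
    # memory               : the module to test
    # --memory-block-size  : size of each memory block
    # --memory-scope       : memory scope
    # --memory-total-size  : total amount of data to transfer
    # --memory-access-mode : access mode
    # run                  : actually start the test

Log output:

    sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)

    Running the test with following options:
    Number of threads: 1
    Initializing random number generator from current time

    # configuration check
    Running memory speed test with the following options:
      block size: 16KiB
      total size: 102400MiB
      operation: write
      scope: local

    Initializing worker threads...

    Threads started!

    Total operations: 2120337 (212033.17 per second)    # operation throughput

    33130.27 MiB transferred (3313.02 MiB/sec)          # bandwidth throughput

    Throughput:
        events/s (eps):            212033.1678
        time elapsed:              10.0000s
        total number of events:    2120337

    # latency statistics
    Latency (ms):
        min:                 0.00
        avg:                 0.00
        max:                 0.01
        95th percentile:     0.00
        sum:              9794.14

    Threads fairness:
        events (avg/stddev):           2120337.0000/0.00
        execution time (avg/stddev):   9.7941/0.00

Note that the run ended at the 10-second mark after transferring 33130.27 MiB, well short of the requested 100G total. This is consistent with sysbench's default 10-second time limit.

5. stream

Purpose: a stress test dedicated to measuring the peak memory bandwidth of a machine.

Difference from mlc: both avoid the CPU caches, but in different ways. mlc bypasses the caches (L1/L2/L3) physically, using instructions that explicitly tell the hardware to write straight to the IMC. stream instead allocates arrays at least 4 times the size of the L3 cache, so the data naturally ends up in main memory.

Usage:

    gcc -O stream.c -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=30 -mcmodel=medium -o stream.o
    numactl -m 0 -N 0 ./stream.o 20

Log output:

    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 200000000 (elements), Offset = 0 (elements)
    Memory per array = 1525.9 MiB (= 1.5 GiB).
    Total memory required = 4577.6 MiB (= 4.5 GiB).
    Each kernel will be executed 30 times.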
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 140319 microseconds.
       (= 140319 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    # peak bandwidth of each operation
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:           23597.3     0.135865     0.135609     0.136305
    Scale:          21441.3     0.149674     0.149245     0.150994
    Add:            25256.7     0.190642     0.190049     0.194377
    Triad:          25042.1     0.192444     0.191677     0.193558
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
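
To make the "Best Rate" column easier to sanity-check, here is a minimal Python sketch that recomputes it from the array size and the "Min time" column. It assumes STREAM's usual byte counting: Copy and Scale touch two arrays per iteration, Add and Triad touch three, and 1 MB = 1,000,000 bytes. The names `ARRAYS_TOUCHED`, `stream_rate_mb_s`, and `array_size_mib` are illustrative helpers, not part of STREAM.

```python
# Number of arrays each STREAM kernel reads or writes per element
# (assumption stated above: Copy/Scale = 2 arrays, Add/Triad = 3 arrays).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, array_elements, min_time_s, bytes_per_element=8):
    """Bandwidth in MB/s (1 MB = 1e6 bytes) from the best (minimum) kernel time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * bytes_per_element * array_elements
    return bytes_moved / min_time_s / 1e6


def array_size_mib(array_elements, bytes_per_element=8):
    """Memory used by one array in MiB (1 MiB = 2**20 bytes)."""
    return array_elements * bytes_per_element / 2**20


if __name__ == "__main__":
    n = 200_000_000  # -DSTREAM_ARRAY_SIZE from the compile command
    # "Min time" values copied from the log above
    min_times = {
        "Copy": 0.135609,
        "Scale": 0.149245,
        "Add": 0.190049,
        "Triad": 0.191677,
    }
    print(f"Memory per array: {array_size_mib(n):.1f} MiB")
    for kernel, t in min_times.items():
        print(f"{kernel:6s} {stream_rate_mb_s(kernel, n, t):10.1f} MB/s")
```

The recomputed rates should agree with the log to within rounding of the printed times. Comparing them is a quick way to confirm that the binary was really built with the array size you intended.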