Tools
nvidia-smi
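- nvidia-smi -L: lists the devices by name (brief output)
- nvidia-smi -q -i 0: shows the detailed status of device 0
- nvidia-smi -q -i 0 -d MEMORY: extracts one category from the detailed status (for example MEMORY, COMPUTE, or UTILIZATION)
Meaning of selected fields:
- GPU-Util: for a given sample period, the percentage of time during which one or more GPU kernels were active (i.e. running).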
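The GPU-Util definition above can be made concrete with a small calculation. The sketch below is not how the driver computes the value; it only assumes a hypothetical list of kernel start/end times and measures the share of a window in which at least one kernel was running. It shows why a single small kernel can report 100% even if most of the GPU is idle.

```python
def gpu_util_percent(kernel_intervals, window_start, window_end):
    """Percentage of the window during which at least one kernel was running."""
    # Clip every interval to the window and drop the ones that fall outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in kernel_intervals
        if min(end, window_end) > max(start, window_start)
    )
    busy = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Disjoint from the merged interval so far: bank it, start a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping kernels count only once.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 100.0 * busy / (window_end - window_start)


# Hypothetical timeline in milliseconds: two overlapping kernels plus a later one.
intervals = [(0, 10), (5, 20), (50, 60)]
print(gpu_util_percent(intervals, 0, 100))  # 30.0
```

The overlapping kernels contribute 20 ms rather than 25 ms, because the figure counts time, not the amount of work or the number of busy SMs.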
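Nsight Systems (nsys)
nsys profile ./hello_world
Common options (placed before the application path):
--trace cuda: traces CUDA API calls and kernel launches
--gpu-metrics-device 0: samples GPU metrics on device 0
--stats true: prints summary statistics once collection finishes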
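As a rough illustration of how the options above combine into one invocation, the following sketch assembles the command line in Python. The helper name `nsys_profile_command` is made up for this example, and the option spellings are copied from these notes rather than checked against a particular nsys version.

```python
import shlex


def nsys_profile_command(app, app_args=(), device=0):
    """Build the argv list for an nsys run; options come before the application."""
    return [
        "nsys", "profile",
        "--trace", "cuda",
        "--gpu-metrics-device", str(device),
        "--stats", "true",
        app, *app_args,
    ]


cmd = nsys_profile_command("./hello_world")
print(shlex.join(cmd))
# To actually launch it (needs nsys and a GPU on the machine):
#   import subprocess; subprocess.run(cmd, check=True)
```

Anything placed after the application path is passed to the application itself, which is why the profiler options are emitted first.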
How to measure achieved_occupancy (the ratio of active warps) and gld_throughput (global memory load throughput):
https://face2ai.com/CUDA-F-3-3-%E5%B9%B6%E8%A1%8C%E6%80%A7%E8%A1%A8%E7%8E%B0/
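To clarify what the two metrics express, here is a simplified arithmetic sketch. It assumes achieved occupancy is the average number of active warps per cycle divided by the maximum number of warps an SM supports, and that load throughput is bytes loaded divided by elapsed time. The input numbers are invented; the maximum warp count depends on the GPU architecture, so it is passed in as a parameter.

```python
def achieved_occupancy(avg_active_warps_per_cycle, max_warps_per_sm):
    """Ratio of average active warps to the maximum warps an SM can hold."""
    return avg_active_warps_per_cycle / max_warps_per_sm


def gld_throughput(bytes_loaded, elapsed_seconds):
    """Global load throughput in bytes per second."""
    return bytes_loaded / elapsed_seconds


# Hypothetical kernel: 24 warps active on average, on an SM that supports 64.
print(achieved_occupancy(24, 64))     # 0.375
# Hypothetical kernel: 1 GiB loaded from global memory in half a second.
print(gld_throughput(2**30, 0.5))     # 2147483648.0
```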
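nvprof is the legacy profiler. It does not support GPUs with compute capability 8.0 or higher, and its functionality has been split between nsys and ncu:
- nsys nvprof: reports the time spent in CUDA API calls and CUDA kernels, together with the related launch configuration
- ncu --metrics: collects the requested metrics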
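Nsight Compute (ncu)
- Common usage patterns (sudo privileges are very likely required):
  - Profile a few selected metrics: ncu --metrics
    - ncu --query-metrics lists the metrics that can be profiled
  - Collect the full profile: ncu --set full --import-source yes --target-processes all -o profile_file
    - --set full: collects all profile information; ncu --list-sets shows the available sets and the sections they contain, and each section is a collection of metrics
    - --import-source yes: imports the source files into the report, so the profile can be generated on the server and then copied to a local GUI for viewing
    - -o <output_file_name>: name of the output report file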
metrics
A metric is a performance counter: a quantity used for performance measurement.
- peak rate: every counter has two kinds of peak rate
  - burst rate: the maximum rate reportable in a single clock cycle
  - sustained rate: the maximum rate achievable over an infinitely long measurement period
- metrics entities
  - counter: a raw quantity read directly from the GPU
    - every counter has four sub-metrics, called roll-ups: sum, avg, min, max
    - some further sub-metrics are derived from the counter roll-ups, for example .peak_sustained
  - ratio: has three sub-metrics: pct, ratio, max_rate
  - throughput: indicates how close a portion of the GPU came to its peak rate; has four sub-metrics, for example .pct_of_peak_sustained_active
- ncu metrics differ from nvprof metrics, but there is a correspondence between them. Commonly used mappings:

| nvprof | ncu | Notes |
| --- | --- | --- |
| achieved_occupancy | sm__warps_active.avg.pct_of_peak_sustained_active | |
| gld_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second | |
| gst_throughput | l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second | |
| gld_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct | |
| gst_efficiency | smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct | |
| gld_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum | |
| gst_transactions | l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum | |
| gld_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio | |
| gst_transactions_per_request | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio | |
| shared_efficiency | smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct | |
| shared_load_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum.per_second | |
| shared_load_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum | |
| shared_store_throughput | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum.per_second | |
| shared_store_transactions | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum | |
| inst_per_warp | smsp__average_inst_executed_per_warp.ratio | For example, after optimizing if-branches there are fewer branches, and inst_per_warp drops considerably |
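The nvprof-to-ncu correspondence above lends itself to a small lookup helper. The sketch below copies a few rows of the mapping into a dictionary and builds an `ncu --metrics` command from nvprof names; the function name `ncu_metrics_command` is hypothetical, and only a subset of the mappings is included to keep the example short.

```python
# A subset of the nvprof -> ncu mappings listed in these notes.
NVPROF_TO_NCU = {
    "achieved_occupancy": "sm__warps_active.avg.pct_of_peak_sustained_active",
    "gld_throughput": "l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second",
    "gst_throughput": "l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second",
    "gld_efficiency": "smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct",
    "inst_per_warp": "smsp__average_inst_executed_per_warp.ratio",
}


def ncu_metrics_command(nvprof_names, app):
    """Translate nvprof metric names and build the matching ncu argv list."""
    unknown = [name for name in nvprof_names if name not in NVPROF_TO_NCU]
    if unknown:
        raise KeyError(f"no ncu mapping recorded for: {', '.join(unknown)}")
    # ncu takes the requested metrics as one comma-separated argument.
    metrics = ",".join(NVPROF_TO_NCU[name] for name in nvprof_names)
    return ["ncu", "--metrics", metrics, app]


cmd = ncu_metrics_command(["achieved_occupancy", "gld_throughput"], "./hello_world")
print(" ".join(cmd))
```

Raising on unknown names, rather than silently skipping them, makes it obvious when a metric from an older nvprof script has no recorded equivalent.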
Metric naming conventions and the corresponding entities
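To make the naming scheme and the roll-ups more tangible, the sketch below splits a metric name into the part before the double underscore (the hardware unit), the counter name, and the dotted suffixes, and then computes the four roll-ups over a list of per-instance values. The splitting rule is a simplification inferred from the metric names in these notes, and the per-SM values are invented.

```python
def split_metric_name(name):
    """Split 'unit__counter.rollup.submetric' into unit, counter, and suffixes."""
    base, *suffixes = name.split(".")
    unit, _, counter = base.partition("__")
    return {"unit": unit, "counter": counter, "suffixes": suffixes}


def roll_ups(per_instance_values):
    """Compute the four roll-ups of a counter over all unit instances."""
    values = list(per_instance_values)
    return {
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


print(split_metric_name("sm__warps_active.avg.pct_of_peak_sustained_active"))
# Hypothetical counter values read from four SMs.
print(roll_ups([100, 120, 80, 100]))
```

For the example name, the unit is `sm`, the counter is `warps_active`, `avg` is the roll-up, and `pct_of_peak_sustained_active` is the derived sub-metric.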
Other
- Common error: Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device.
  - Official fix: NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters
  - Manual fix: in /etc/modprobe.d, create nvidia-restrict-profiling.conf (touch nvidia-restrict-profiling.conf), write options nvidia NVreg_RestrictProfilingToAdminUsers=0 into it, then reboot
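The manual fix above amounts to writing a one-line configuration file. The following sketch does that from Python; the helper name `write_profiling_conf` is made up, and the demo writes into a temporary directory so it can run without root. Pointing it at /etc/modprobe.d would require root privileges, and the setting only takes effect after a reboot, as noted above.

```python
from pathlib import Path
import tempfile

CONF_NAME = "nvidia-restrict-profiling.conf"
CONF_LINE = "options nvidia NVreg_RestrictProfilingToAdminUsers=0\n"


def write_profiling_conf(modprobe_dir):
    """Write the module option that lets non-admin users read GPU counters."""
    path = Path(modprobe_dir) / CONF_NAME
    path.write_text(CONF_LINE)
    return path


# Dry run into a temporary directory instead of the real /etc/modprobe.d.
with tempfile.TemporaryDirectory() as tmp:
    conf = write_profiling_conf(tmp)
    print(conf.name, "->", conf.read_text().strip())
```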
References
cuda-memcheck
CUDA provides the CUDA-MEMCHECK tool suite, which includes memcheck, racecheck, initcheck, and synccheck.
cuda-memcheck --tool memcheck [options] app-name [options]
For the memcheck tool, this can be shortened to:
cuda-memcheck [options] app-name [options]
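Since the four tools share the same command shape, running all of them over one binary is a matter of changing the `--tool` value. The sketch below only builds the command lines; the helper name `memcheck_commands` is hypothetical, and actually executing the commands requires a CUDA toolkit that still ships cuda-memcheck.

```python
TOOLS = ("memcheck", "racecheck", "initcheck", "synccheck")


def memcheck_commands(app, app_args=()):
    """Build one cuda-memcheck argv list per tool in the suite."""
    return {
        tool: ["cuda-memcheck", "--tool", tool, app, *app_args]
        for tool in TOOLS
    }


for tool, cmd in memcheck_commands("./hello_world").items():
    print(" ".join(cmd))
```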
Hands-on practice
Profiling 2D matrix addition: https://face2ai.com/CUDA-F-3-3-%E5%B9%B6%E8%A1%8C%E6%80%A7%E8%A1%A8%E7%8E%B0/
Profiling and step-by-step optimization of a reduction that uses global memory: https://face2ai.com/CUDA-F-3-5-%E5%B1%95%E5%BC%80%E5%BE%AA%E7%8E%AF/