2-4 设备管理

发表于 2024-12-01 分类于 CUDA ， CUDA C编程权威指南笔记阅读次数：

2.4 设备管理

在本节，你将通过以下两种方法学习查询和管理GPU设备：

CUDA运行时API函数
NVIDIA系统管理界面（nvidia-smi）命令行实用程序

2.4.1 使用运行时API查询GPU信息

cudaError_t cudaSetDevice(int dev) 设置当前GPU设备

当有多个GPU时，选定cuda设备，默认是0即第一个主GPU，多GPU时0,1,2以此类推。

cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp* prop,int dev) 返回设备信息

以*prop形式返回设备dev的属性。struct cudaDevicePro结构体内定义了很多记录GPU设备信息的数据成员，比如multiProcessorCount表示设备上多处理器的数量。详见NVIDIA官网。

cudaError_t cudaGetDeviceCount( int* count )返回具体计算能力的设备数量

以 *count 形式返回可用于执行的计算能力大于等于 1.0 的设备数量。

代码清单2-8提供了一个示例，查询了大家通常感兴趣的一般属性

//chapter02/checkDeviceInfor.cu
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>

/*
 * Display a variety of information on the first CUDA device in this system,
 * including driver version, runtime version, compute capability, bytes of
 * global memory, etc.
 */

int main(int argc, char **argv)
{
    printf("%s Starting...\n", argv[0]);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    if (deviceCount == 0)
    {
        printf("There are no available device(s) that support CUDA\n");
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
    }

    int dev = 0, driverVersion = 0, runtimeVersion = 0;
    CHECK(cudaSetDevice(dev));
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Device %d: \"%s\"\n", dev, deviceProp.name);

    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("  CUDA Driver Version / Runtime Version          %d.%d / %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    printf("  CUDA Capability Major/Minor version number:    %d.%d\n",
           deviceProp.major, deviceProp.minor);
    printf("  Total amount of global memory:                 %.2f GBytes (%llu "
           "bytes)\n", (float)deviceProp.totalGlobalMem / pow(1024.0, 3),
           (unsigned long long)deviceProp.totalGlobalMem);
    printf("  GPU Clock rate:                                %.0f MHz (%0.2f "
           "GHz)\n", deviceProp.clockRate * 1e-3f,
           deviceProp.clockRate * 1e-6f);
    printf("  Memory Clock rate:                             %.0f Mhz\n",
           deviceProp.memoryClockRate * 1e-3f);
    printf("  Memory Bus Width:                              %d-bit\n",
           deviceProp.memoryBusWidth);

    if (deviceProp.l2CacheSize)
    {
        printf("  L2 Cache Size:                                 %d bytes\n",
               deviceProp.l2CacheSize);
    }

    printf("  Max Texture Dimension Size (x,y,z)             1D=(%d), "
           "2D=(%d,%d), 3D=(%d,%d,%d)\n", deviceProp.maxTexture1D,
           deviceProp.maxTexture2D[0], deviceProp.maxTexture2D[1],
           deviceProp.maxTexture3D[0], deviceProp.maxTexture3D[1],
           deviceProp.maxTexture3D[2]);
    printf("  Max Layered Texture Size (dim) x layers        1D=(%d) x %d, "
           "2D=(%d,%d) x %d\n", deviceProp.maxTexture1DLayered[0],
           deviceProp.maxTexture1DLayered[1], deviceProp.maxTexture2DLayered[0],
           deviceProp.maxTexture2DLayered[1],
           deviceProp.maxTexture2DLayered[2]);
    printf("  Total amount of constant memory:               %lu bytes\n",
           deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:       %lu bytes\n",
           deviceProp.sharedMemPerBlock);
    printf("  Total number of registers available per block: %d\n",
           deviceProp.regsPerBlock);
    printf("  Warp size:                                     %d\n",
           deviceProp.warpSize);
    printf("  Maximum number of threads per multiprocessor:  %d\n",
           deviceProp.maxThreadsPerMultiProcessor);
    printf("  Maximum number of threads per block:           %d\n",
           deviceProp.maxThreadsPerBlock);
    printf("  Maximum sizes of each dimension of a block:    %d x %d x %d\n",
           deviceProp.maxThreadsDim[0],
           deviceProp.maxThreadsDim[1],
           deviceProp.maxThreadsDim[2]);
    printf("  Maximum sizes of each dimension of a grid:     %d x %d x %d\n",
           deviceProp.maxGridSize[0],
           deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);
    printf("  Maximum memory pitch:                          %lu bytes\n",
           deviceProp.memPitch);

    exit(EXIT_SUCCESS);
}

我的电脑运行如下

./checkDeviceInfor
./checkDeviceInfor Starting...
Detected 1 CUDA Capable device(s)
Device 0: "Quadro P2000"
  CUDA Driver Version / Runtime Version          11.7 / 11.4
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4.93 GBytes (5296029696 bytes)
  GPU Clock rate:                                1481 MHz (1.48 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              160-bit
  L2 Cache Size:                                 1310720 bytes
  Max Texture Dimension Size (x,y,z)             1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers        1D=(32768) x 2048, 2D=(32768,32768) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes

2.4.2 确定最优GPU

有多GPU设备时，我们可以根据multiProcessorCount设备的多处理器的数量，来选择最优的GPU。

int numDevices = 0 ;
cudaGetDeviceCount(&numDevices);
if (numDevices > 1) {
	int maxMultiprocessors = 0,
	maxDevice = 0 ;
	for (int device=0; device<numDevices; device++){
		cudaDeviceProp props;
		cudaGetDeviceProperties(&props, device);
		if (maxMultiprocessors < props.multiProcessorCount){
			maxMultiprocessors = props.multiProcessorCount;
			maxDevice = device;
		}
	}
	cudasetDevice(maxDevice);
}

2.4.3 使用nvidia-smi查询GPU信息

nvidia-smi是一个命令行工具，用于管理和监控GPU设备，并允许查询和修改设备状态。

2.4.4 在运行时设置设备

支持多GPU的系统是很常见的。对于一个有N个GPU的系统，nvidia-smi从0到N―1标记设备ID。使用环境变量CUDA_VISIBLE_DEVICES，就可以在运行时指定所选的GPU且无须更改应用程序。

设置运行时环境变量CUDA_VISIBLE_DEVICES=2。nvidia驱动程序会屏蔽其他GPU，这时设备2作为设备0出现在应用程序中。

也可以使用CUDA_VISIBLE_DEVICES指定多个设备。例如，如果想测试GPU 2和GPU3，可以设置CUDA_VISIBLE_DEVICES=2，3。然后，在运行时，nvidia驱动程序将只使用ID为2和3的设备，并且会将设备ID分别映射为0和1。