ROCm runtime

November 30, 2023

HSA & Overview

AMD's GPU software stack follows the HSA standard, where CPU, GPU and other devices are treated as HSA agents from the system's view.

ROCt and ROCr

ROCt is a thin interface layer that exposes the amdgpu kernel module's APIs to userspace. ROCr is the layer above it and manages the interactions between the language-level userspace runtimes and the kernel driver/module.

ROCr

The core component of ROCr is core::Runtime, which maintains the connection to the kernel driver. AMD's implementation keeps a singleton instance of core::Runtime, so every interaction with the HSA runtime goes through that one instance, which avoids inconsistent state.

The single instance is constructed when core::Runtime::Acquire is called in the hsa_init function (itself called from roc::Device::init when a HIP API is first called). core::Runtime::Load is then invoked to initialize the HSA environment (runtime configuration, GPU agents, HSA extensions, API table, etc.).
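The acquire-then-load pattern can be illustrated with a minimal, self-contained sketch. The Runtime class below is a hypothetical stand-in, not the real rocr::core::Runtime: the reference count, the Release and IsOpen helpers, and the trivial Load are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a singleton runtime; not the real ROCr class.
class Runtime {
 public:
  // Create the singleton on first use, load it, and count the new user.
  static Runtime* Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (instance_ == nullptr) {
      instance_ = new Runtime();
      instance_->Load();
    }
    ++instance_->ref_count_;
    return instance_;
  }

  // Drop one reference; destroy the singleton when the last user leaves.
  // Returns false if the runtime was not open.
  static bool Release() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (instance_ == nullptr) return false;
    if (--instance_->ref_count_ == 0) {
      delete instance_;
      instance_ = nullptr;
    }
    return true;
  }

  static bool IsOpen() {
    std::lock_guard<std::mutex> lock(mutex_);
    return instance_ != nullptr;
  }

  bool loaded() const { return loaded_; }

 private:
  Runtime() : ref_count_(0), loaded_(false) {}
  // Placeholder for discovering agents, loading extensions, and so on.
  void Load() { loaded_ = true; }

  static Runtime* instance_;
  static std::mutex mutex_;
  int ref_count_;
  bool loaded_;
};

Runtime* Runtime::instance_ = nullptr;
std::mutex Runtime::mutex_;
```

Two calls to Runtime::Acquire() return the same pointer, and the instance only goes away after a matching number of Runtime::Release() calls.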

HSA signal

hsa_signal_t is an opaque type that wraps a single uint64_t handle. To work with a signal, ROCr provides a wrapper class, hsa_signal_handle:

struct hsa_signal_handle {
  hsa_signal_t signal;
  hsa_signal_handle() {}
  hsa_signal_handle(hsa_signal_t Signal) { signal = Signal; }
  operator hsa_signal_t() { return signal; }
  Signal* operator->() { return core::Signal::Convert(signal); }
};

It overloads operator-> to convert the raw handle into a pointer to rocr::core::Signal.

core::Signal is a reference-counted object: it stays alive as long as some object still refers to it.
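A minimal sketch of the two ideas above, handle-to-object conversion and reference counting, is shown below. All names here (signal_handle_t, Signal, SignalHandle, Retain, Release) are hypothetical stand-ins, and the sketch assumes the handle simply stores the object's address; the real core::Signal::Convert may do more than that.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opaque handle, shaped like hsa_signal_t.
struct signal_handle_t { uint64_t handle; };

class Signal {
 public:
  explicit Signal(int64_t initial) : value_(initial), refs_(1) {}

  // Pack the object's address into the opaque 64-bit handle.
  static signal_handle_t Convert(Signal* s) {
    signal_handle_t h = {static_cast<uint64_t>(reinterpret_cast<uintptr_t>(s))};
    return h;
  }
  // Recover the object from the opaque handle.
  static Signal* Convert(signal_handle_t h) {
    return reinterpret_cast<Signal*>(static_cast<uintptr_t>(h.handle));
  }

  void Retain() { ++refs_; }
  // Returns true when this call dropped the last reference and freed the object.
  bool Release() {
    if (--refs_ == 0) {
      delete this;
      return true;
    }
    return false;
  }

  int64_t value() const { return value_; }
  void store(int64_t v) { value_ = v; }

 private:
  ~Signal() {}  // Only Release() may destroy a Signal.
  int64_t value_;
  int refs_;
};

// Wrapper with the same shape as hsa_signal_handle: operator-> hides the cast.
struct SignalHandle {
  signal_handle_t signal;
  explicit SignalHandle(signal_handle_t s) : signal(s) {}
  Signal* operator->() const { return Signal::Convert(signal); }
};
```

With this wrapper, code that only holds a handle can write h->value() without spelling out the conversion each time.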

HSA agent

There are two primary agent types, the CPU agent and the GPU agent, both derived from Agent. The HSA runtime maintains separate lists of CPU and GPU devices. Each CPU socket is abstracted as one CPU agent, and likewise each GPU card is represented as one GPU agent.

HSA signal callback registration

ROCr offers a registration API, hsa_amd_signal_async_handler, that lets the user register a handler on a signal together with a condition. When the condition is met, the handler is executed asynchronously to process the signal-related work. The handler returns a bool, and the HSA runtime uses that return value to decide whether the handler is called again the next time the same condition is met or only once.

The ROCr API internally invokes Runtime::SetAsyncSignalHandler to register the handler on the specified signal. The HSA runtime maintains vectors that hold the registered signals, conditions and values. In Runtime::SetAsyncSignalHandler, the HSA runtime spawns a thread (a pthread-based software thread, with AsyncEventsLoop as its entry function) that asynchronously and continuously checks the signal values.
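The registration-and-polling scheme can be sketched without any threads by modelling a single pass of the loop. Everything below is a simplified, hypothetical model: AsyncEvents, PollOnce, CountingHandler, and the plain int64_t "signals" are illustrative names, and the sketch assumes that a handler returning true stays registered while one returning false is removed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Condition { COND_EQ, COND_NE, COND_LT, COND_GTE };
typedef bool (*Handler)(int64_t value, void* arg);

// Hypothetical model of the parallel vectors the runtime keeps.
struct AsyncEvents {
  std::vector<const int64_t*> signals;
  std::vector<Condition> conds;
  std::vector<int64_t> values;
  std::vector<Handler> handlers;
  std::vector<void*> args;

  void Register(const int64_t* s, Condition c, int64_t v, Handler h, void* a) {
    signals.push_back(s); conds.push_back(c); values.push_back(v);
    handlers.push_back(h); args.push_back(a);
  }

  static bool Met(int64_t observed, Condition c, int64_t expected) {
    switch (c) {
      case COND_EQ:  return observed == expected;
      case COND_NE:  return observed != expected;
      case COND_LT:  return observed < expected;
      case COND_GTE: return observed >= expected;
    }
    return false;
  }

  void Erase(size_t i) {
    signals.erase(signals.begin() + i); conds.erase(conds.begin() + i);
    values.erase(values.begin() + i); handlers.erase(handlers.begin() + i);
    args.erase(args.begin() + i);
  }

  // One pass of the event loop; returns how many handlers fired.
  size_t PollOnce() {
    size_t fired = 0;
    for (size_t i = 0; i < signals.size();) {
      int64_t observed = *signals[i];
      if (!Met(observed, conds[i], values[i])) { ++i; continue; }
      ++fired;
      if (handlers[i](observed, args[i])) { ++i; continue; }  // stay registered
      Erase(i);  // handler asked to be removed
    }
    return fired;
  }
};

// Example handler: counts calls through arg and asks to stop after the second.
inline bool CountingHandler(int64_t /*value*/, void* arg) {
  int* calls = static_cast<int*>(arg);
  ++*calls;
  return *calls < 2;
}
```

In a real runtime the body of PollOnce would run repeatedly on the dedicated thread and would wait on the signals rather than read plain integers.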

How to declare a new ROCr (HSA) API?

Put your implementation in core/runtime/hsa_ext_amd.cpp, and declare it in both core/inc/hsa_ext_amd_impl.h and inc/hsa_ext_amd.h.

HSA requires its API to be traceable, so the newly added API must be inserted into the AMD extension API table (with the _fn suffix) in inc/hsa_api_trace.h and core/common/hsa_table_interface.cpp, and bound in core/runtime/hsa_api_trace.cpp.

Finally, do not forget to expose your API in the hsacore.so.def script.

Loading ROC code object

As discussed below, a virtual device data structure is created and bound to each high-level HIP stream. ROCclr constructs the corresponding device::Program and then roc::LightningProgram (ROCr instead). The HSA API calling sequence is:

  1. hsa_executable_create_alt
  2. hsa_code_object_reader_create_from_memory
  3. hsa_executable_load_agent_code_object
  4. hsa_executable_freeze

Block Allocator in ROCr

To avoid frequent memory allocation commands between host and device, the ROCr runtime maintains a heap-like data structure in userspace that minimizes the number of allocation requests sent to the driver and hardware.

ROCclr

The ROC common language runtime (ROCclr) serves as the intermediate layer between a specific programming language (OpenCL, HIP, etc.) and the (AMD-extended) HSA userspace APIs.

The block allocator is implemented as rocr::AMD::MemoryRegion::BlockAllocator:

class BlockAllocator {
   private:
    MemoryRegion& region_;
    static const size_t block_size_ = 2 * 1024 * 1024;  // 2MB blocks.
   public:
    explicit BlockAllocator(MemoryRegion& region) : region_(region) {}
    void* alloc(size_t request_size, size_t& allocated_size) const;
    void free(void* ptr, size_t length) const { region_.FreeImpl(ptr, length); }
    size_t block_size() const { return block_size_; }
};

The default block size is 2 MB, and the class provides two common interfaces: alloc and free. When a user calls hipMalloc, the request eventually reaches rocr::AMD::MemoryRegion::AllocateImpl, where the block allocator is invoked to:

  • align up the requested size based on the block size
  • check whether any block has already been allocated but is currently unused; if so, return it to the user without actually allocating memory on the device
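The first step, rounding a request up to a whole number of blocks, can be sketched as follows. AlignUp and kBlockSize are names chosen for this example (the block size mirrors the 2 MB constant shown above), and the helper assumes a power-of-two alignment.

```cpp
#include <cassert>
#include <cstddef>

// Same value as block_size_ in the class above.
static const size_t kBlockSize = 2 * 1024 * 1024;

// Round size up to the next multiple of alignment (must be a power of two).
inline size_t AlignUp(size_t size, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}
```

For example, a 1-byte request rounds up to one full 2 MB block, and a request of 2 MB plus one byte rounds up to two blocks.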

The block allocator is wrapped by a SimpleHeap, which manages allocation and release through a heap-like structure with two levels of abstraction: blocks and fragments.

When a memory allocation request arrives, ROCr first checks whether the heap contains a "suitable" free fragment, that is, the smallest one whose size is equal to or larger than the requested size (found with lower_bound on a std::multimap). If so, that fragment is selected for the allocation, and the leftover space in the same block (free_size - request_size) is re-inserted into the heap. Otherwise, ROCr tries to claim a new free block, either from the userspace block cache or from the driver.
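The best-fit lookup and the re-insertion of the leftover can be sketched with a std::multimap keyed by fragment size. FragmentHeap and its methods are hypothetical names for this illustration; the sketch tracks addresses as plain integers, ignores block boundaries, and uses 0 to mean "no suitable fragment".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical model: free fragments keyed by size, value is the start address.
class FragmentHeap {
 public:
  void AddFree(uintptr_t base, size_t size) {
    free_.insert(std::make_pair(size, base));
  }

  // Best fit: take the smallest free fragment with size >= request.
  // Returns 0 when nothing fits (the caller would then claim a new block).
  uintptr_t Alloc(size_t request) {
    std::multimap<size_t, uintptr_t>::iterator it = free_.lower_bound(request);
    if (it == free_.end()) return 0;
    size_t free_size = it->first;
    uintptr_t base = it->second;
    free_.erase(it);
    if (free_size > request) {
      // Re-insert the leftover (free_size - request) right after the allocation.
      free_.insert(std::make_pair(free_size - request, base + request));
    }
    return base;
  }

  size_t FreeFragments() const { return free_.size(); }

 private:
  std::multimap<size_t, uintptr_t> free_;
};
```

Because the map is ordered by size, lower_bound finds the tightest fit in logarithmic time, which is the property the paragraph above relies on.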

When the user calls hipFree in a HIP program, ROCr does not discard the block and release it on the device immediately; instead, a caching layer in the host library keeps currently unused blocks. ROCr first checks whether the fragment being released has free neighboring fragments and, if so, merges them into one larger free fragment. If all fragments of a block are free, the entire block is reported as unused and appended to the caching layer, which is a double-ended queue (std::deque). Otherwise, only the merged free fragment is inserted into the free-fragment heap for later use. With this recycling mechanism, the heap-based allocator can claim a block from the userspace queue instead of the driver.

To actually release all blocks cached in userspace, one has to call the trim interface, which asks the underlying block allocator to return all unused blocks to the driver (and hardware).
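The caching and trimming behaviour can be modelled in a few lines. BlockCache is a hypothetical class for illustration only: blocks are represented by integer IDs, and the "driver" is reduced to two counters so the effect of the cache is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical model of a userspace block cache in front of a driver.
class BlockCache {
 public:
  BlockCache() : next_id_(1), driver_allocs_(0), driver_frees_(0) {}

  // Prefer a cached block; fall back to the (simulated) driver.
  int AcquireBlock() {
    if (!cache_.empty()) {
      int id = cache_.front();
      cache_.pop_front();
      return id;
    }
    ++driver_allocs_;
    return next_id_++;
  }

  // A fully free block is cached instead of being returned to the driver.
  void ReleaseBlock(int id) { cache_.push_back(id); }

  // Return every cached block to the driver; reports how many were released.
  size_t Trim() {
    size_t released = cache_.size();
    driver_frees_ += released;
    cache_.clear();
    return released;
  }

  size_t driver_allocs() const { return driver_allocs_; }
  size_t driver_frees() const { return driver_frees_; }

 private:
  std::deque<int> cache_;
  int next_id_;
  size_t driver_allocs_;
  size_t driver_frees_;
};
```

Releasing and re-acquiring a block leaves driver_allocs() unchanged, and only Trim() increases driver_frees().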

Physical device vs. virtual device

Derivation relationships:

  • physical device: amd::Device ➡️ roc::NullDevice ➡️ roc::Device
  • virtual device: device::VirtualDevice ➡️ device::VirtualGPU

Each HSA agent corresponds to a physical device (pdevice), and each HIP stream corresponds to a virtual device (vdevice).

ROCm amd::HostQueue vs. amd::DeviceQueue, and amd::Context

amd::HostQueue is a linked-list data structure derived from amd::CommandQueue. It contains a thread together with a virtual device, and its purpose is to manage host commands.

amd::DeviceQueue is also derived from amd::CommandQueue, and is embedded in the amd::Context object.

Kernel launch procedure

hipLaunchKernel receives the kernel launch request from the host and performs a series of parameter checks to ensure that the kernel parameters (work-group size, LDS usage, etc.) meet the hardware requirements. The procedure then continues in ihipModuleLaunchKernel, where the kernel launch command is generated and pushed to the host queue.
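The kind of validation described above might look like the sketch below. ValidateLaunch, Dim3, DeviceLimits and the status codes are hypothetical names introduced for this example, and the limits are passed in by the caller rather than queried from a device; the actual checks in the HIP runtime are more extensive.

```cpp
#include <cassert>
#include <cstddef>

struct Dim3 { unsigned x, y, z; };

// Hypothetical per-device limits; real values come from device properties.
struct DeviceLimits {
  unsigned max_threads_per_block;
  size_t max_lds_bytes;
};

enum LaunchStatus { LAUNCH_OK, LAUNCH_INVALID_CONFIG, LAUNCH_LDS_TOO_LARGE };

// Reject launches whose geometry or LDS usage exceeds the given limits.
inline LaunchStatus ValidateLaunch(const Dim3& grid, const Dim3& block,
                                   size_t lds_bytes, const DeviceLimits& limits) {
  if (grid.x == 0 || grid.y == 0 || grid.z == 0) return LAUNCH_INVALID_CONFIG;
  if (block.x == 0 || block.y == 0 || block.z == 0) return LAUNCH_INVALID_CONFIG;
  // Use a wide type so the product cannot overflow 32 bits.
  unsigned long long threads =
      static_cast<unsigned long long>(block.x) * block.y * block.z;
  if (threads > limits.max_threads_per_block) return LAUNCH_INVALID_CONFIG;
  if (lds_bytes > limits.max_lds_bytes) return LAUNCH_LDS_TOO_LARGE;
  return LAUNCH_OK;
}
```

Only a launch that passes every check would go on to have its command generated and enqueued.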

Stream creation & CU masking adjustment procedure

We mainly focus on one question: how is the CU mask configuration applied to the stream/queue?

Calling sequence of hipExtStreamCreateWithCUMask: hipExtStreamCreateWithCUMask ➡️ ihipStreamCreate ➡️ hip::Stream::Create ➡️ amd::HostQueue ➡️ hip::Device::SaveQueue

HIP stream synchronize API

hipEvent_t is a wrapper around hip::Event. When AMD_DIRECT_DISPATCH is set, ROCclr uses hip::EventDD, derived from hip::Event, as the marker to record activities. When hipEventCreate is called, ROCclr creates the corresponding HIP event, binds it to the passed-in hipEvent_t, and inserts it into the global event set.

For hipEventRecord, the runtime calls Event::addMarker to insert an HSA marker packet (wrapped by hip::EventMarker) into the HSA queue. In this procedure, HIP first calls Event::recordCommand to create a marker command and then invokes Event::enqueueRecordCommand to insert the newly created command into the corresponding queue (determined by the stream).

Declare and implement a new HIP runtime API

First, declare your API (hipStreamXXX in this example) in HIP/include/hip/hip_runtime_api.h, and provide the corresponding implementation in hipamd/src/hip_stream.cpp.

Then insert a new HIP_API_ID_hipStreamXXX at the end of enum hip_api_id_t in hipamd/include/hip/amd_detail/hip_prof_str.h. Remember to update the value of HIP_API_ID_LAST as well.

- HIP_API_ID_LAST = 359,
+ HIP_API_ID_hipStreamXXX = 360,
+ HIP_API_ID_LAST = 360,

After hip_prof_str.h is updated, run the following command (or rerun cmake --build build) to generate a new version of hip_prof_str.h that contains both the newly added HIP_API_ID and the callback ID for the new HIP runtime API:

# header, src dir, origin, new
python3 src/hip_prof_gen.py -v -t --priv \
  HIP/include/hip/hip_runtime_api.h \
  hipamd/src hipamd/include/hip/amd_detail/hip_prof_str.h \
  build/include/hip/amd_detail/hip_prof_str.h
# replace origin header with the new one
cp build/include/hip/amd_detail/hip_prof_str.h \
  hipamd/include/hip/amd_detail/hip_prof_str.h

Once the previous steps are done, the hipamd project builds successfully and produces your own libamdhip64.so. However, HIP programs will still fail to link against it, with errors such as:

undefined reference to `hipStreamXXX'
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

If you now inspect the exported symbols of libamdhip64.so (via the nm -D command), you will find that hipStreamXXX does not appear. This is caused by ELF shared-object symbol visibility on Linux, which in this project is controlled by the hipamd/src/CMakeLists.txt CMake script so that every symbol not explicitly listed is hidden. To make the newly added API a visible symbol, append it to hipamd/src/hip_hcc.map.in:

    hipGraphMemAllocNodeGetParams;
    hipGraphAddMemFreeNode;
    hipGraphMemFreeNodeGetParams;
+   hipStreamXXX;
local:
    *;
} hip_5.3;

After rebuilding, hipStreamXXX@hip_5.3 appears among the exported symbols of libamdhip64.so.

Tracing and Logging

Environment variables

The full list of supported environment variables and their effects can be found on the ROCm documentation site: https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html#summary-of-environment-variables-in-hip.

The environment variable AMD_LOG_LEVEL (defined by a macro in ROCclr:utils/flags.cpp) selects which kinds of information are printed at runtime. There are currently 5 supported logging levels:

  • LOG_NONE
  • LOG_ERROR
  • LOG_WARNING
  • LOG_INFO
  • LOG_DEBUG

These flags are passed, together with the logging format and message, to the ClPrint function for printing.

A second knob, AMD_LOG_MASK, controls which categories of logging information are printed. Each category corresponds to a place where logging functions are called, for example:

  • LOG_API: in the HIP runtime API
  • LOG_INIT: when HIP/ROCm software stack is initializing
  • LOG_AQL: decode and display AQL packet content

The default value for AMD_LOG_MASK is LOG_ALWAYS, which displays all logging information everywhere.

enum LogLevel { LOG_NONE = 0, LOG_ERROR = 1, LOG_WARNING = 2, LOG_INFO = 3, LOG_DEBUG = 4 };

enum LogMask {
  LOG_API      = 0x00000001,  //!< API call
  LOG_CMD      = 0x00000002,  //!< Kernel and Copy Commands and Barriers
  LOG_WAIT     = 0x00000004,  //!< Synchronization and waiting for commands to finish
  LOG_AQL      = 0x00000008,  //!< Decode and display AQL packets
  LOG_QUEUE    = 0x00000010,  //!< Queue commands and queue contents
  LOG_SIG      = 0x00000020,  //!< Signal creation, allocation, pool
  LOG_LOCK     = 0x00000040,  //!< Locks and thread-safety code.
  LOG_KERN     = 0x00000080,  //!< kernel creations and arguments, etc.
  LOG_COPY     = 0x00000100,  //!< Copy debug
  LOG_COPY2    = 0x00000200,  //!< Detailed copy debug
  LOG_RESOURCE = 0x00000400,  //!< Resource allocation, performance-impacting events.
  LOG_INIT     = 0x00000800,  //!< Initialization and shutdown
  LOG_MISC     = 0x00001000,  //!< misc debug, not yet classified
  LOG_AQL2     = 0x00002000,  //!< Show raw bytes of AQL packet
  LOG_CODE     = 0x00004000,  //!< Show code creation debug
  LOG_CMD2     = 0x00008000,  //!< More detailed command info, including barrier commands
  LOG_LOCATION = 0x00010000,  //!< Log message location
  LOG_MEM      = 0x00020000,  //!< Memory allocation
  LOG_ALWAYS   = 0xFFFFFFFF,  //!< Log always even mask flag is zero
};
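How a level and a mask could combine into a single print decision is sketched below. ShouldLog is a hypothetical helper, not the actual ClPrint logic; it assumes a message is printed when the configured level is at least the message's level and either the message is tagged LOG_ALWAYS or its category bit is set in the configured mask. Only a subset of the mask values above is repeated to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

enum LogLevel { LOG_NONE = 0, LOG_ERROR = 1, LOG_WARNING = 2, LOG_INFO = 3, LOG_DEBUG = 4 };

// Subset of the categories listed above, with the same values.
enum LogMask : uint32_t {
  LOG_API = 0x00000001,
  LOG_AQL = 0x00000008,
  LOG_INIT = 0x00000800,
  LOG_ALWAYS = 0xFFFFFFFF
};

// Hypothetical filter: env_level and env_mask stand for the values of
// AMD_LOG_LEVEL and AMD_LOG_MASK; msg_level and msg_mask tag one message.
inline bool ShouldLog(int env_level, uint32_t env_mask,
                      LogLevel msg_level, uint32_t msg_mask) {
  if (env_level < static_cast<int>(msg_level)) return false;  // too verbose
  return msg_mask == LOG_ALWAYS || (env_mask & msg_mask) != 0;
}
```

Under these assumptions, setting the level to 3 and the mask to 0x1 would print API messages up to LOG_INFO and nothing from other categories.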

To serialize kernel launch and execution, there is also an environment variable, AMD_SERIALIZE_KERNEL, which accepts the following values:

  • 1: wait for completion before enqueue
  • 2: wait for completion after enqueue
  • 3: both of above
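The three values can be decoded into two independent flags, as in the sketch below. SerializeMode and DecodeSerializeKernel are hypothetical names for this example; how the HIP runtime parses the variable internally is not shown here, and any value other than 1, 2 or 3 is treated as "no serialization" by assumption.

```cpp
#include <cassert>
#include <cstdlib>

struct SerializeMode {
  bool wait_before;  // wait for completion before enqueue
  bool wait_after;   // wait for completion after enqueue
};

// Decode the textual value of AMD_SERIALIZE_KERNEL (nullptr means unset).
inline SerializeMode DecodeSerializeKernel(const char* text) {
  SerializeMode mode = {false, false};
  if (text == nullptr) return mode;
  long v = std::strtol(text, nullptr, 10);
  mode.wait_before = (v == 1 || v == 3);
  mode.wait_after = (v == 2 || v == 3);
  return mode;
}
```

A caller could pass std::getenv("AMD_SERIALIZE_KERNEL") directly, since getenv returns a null pointer when the variable is not set.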

Miscellaneous

Occasionally the driver gets stuck. Rebooting fixes it but is cumbersome. To reset the AMD driver without rebooting, read the following file (this requires root privileges and a mounted debugfs):

cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

Terms

  • SVM: OpenCL shared virtual memory
  • Virtual Device: an abstraction of physical device in ROCclr, the backend of a HIP stream
  • Command Queue: a linked-list-like data structure underneath a hipStream, responsible for managing commands (host operations)