Update guided matmul example #2762
Open
chuanyuf wants to merge 2 commits into
2025.3 Guided Debugging Sample updates
Updates for oneAPI 2026.0 tools and drivers
Pull request overview
This PR updates the guided matrix-multiplication debugging samples for Intel® oneAPI 2026.0 by refreshing documentation/tooling requirements, bumping the CMake minimum version, and adding USM deallocation to reduce leaks in the sample code.
Changes:
- Add `sycl::free(...)` cleanup for USM allocations across multiple guided samples.
- Update sample READMEs for oneAPI 2026.0 tool/driver versions and revise guided-debug instructions/output snippets.
- Bump `cmake_minimum_required` from 3.4 to 3.5 for the affected samples.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/2_matrix_mul.cpp | Adds USM frees after q.wait() for the “working” variant. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp | Adds USM frees after q.wait() for the SLM-size failure variant. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md | Updates prerequisites, error messages, and guided-debug narrative for newer runtimes/tools. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md | Updates prerequisites and guided-debug instructions/output. |
| Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/2_matrix_mul.cpp | Adds device enumeration output and USM frees. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/1_matrix_mul_invalid_contexts.cpp | Adds device enumeration/output, device-specific queue selection, and conditional free behavior for the tutorial. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md | Significant refresh of tutorial steps, additional scenarios (ASAN/bonus), and updated prerequisites. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/3_matrix_mul.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/2_matrix_mul_multi_offload.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/1_matrix_mul_null_pointer.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md | Updates prerequisites and guided-debug narrative for newer runtime behavior. |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/b2_matrix_mul_usm.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/b1_matrix_mul_null_usm.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md | Expands tutorial with device-side AddressSanitizer guidance and updates prerequisites. |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt | Raises CMake minimum version to 3.5. |
Comment on lines +95 to +99
| #ifdef BAD_FREE | ||
| device selected_device = devices[0]; | ||
| #else | ||
| device selected_device = devices[1]; | ||
| #endif |
Comment on lines +81 to 83
| // Be very specific about the device to use. | ||
| queue q(devices[0]); | ||
|
|
Comment on lines 66 to 81
| property_list propList = property_list{property::queue::enable_profiling()}; | ||
|
|
||
| std::vector<sycl::device> devices = sycl::device::get_devices(); | ||
| cout << "Devices:" << std::endl; | ||
|
|
||
| for (size_t index = 0; index < devices.size(); index++){ | ||
| std::string device_name = devices[index].get_info<sycl::info::device::name>(); | ||
| std::string device_driver = devices[index].get_info<sycl::info::device::driver_version>(); | ||
| std::string sycl_version = devices[index].get_info<sycl::info::device::version>(); | ||
| std::string vendor = devices[index].get_info<sycl::info::device::vendor>(); | ||
| std::string backend = devices[index].get_info<sycl::info::device::backend_version>(); | ||
| std::cout << " [" << index << "] " << device_name << ", " << sycl_version << " [" << device_driver | ||
| << "] " << backend << ", " << vendor << std::endl; | ||
| } | ||
|
|
||
| queue q(default_selector_v); |
|
|
||
| q.wait(); | ||
|
|
||
| sycl::free(dev_a, q); |
| [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [25.18.33578] | ||
| [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000] | ||
| ``` | ||
| > **Note:** If you have only one `[level_zero:gpu]` device listed, or the order differs from the above, the main example below may not work. Try to follow through anyway, and then try the bonus sample at the end of this document, which should work regardless of system configuration. |
| ### Identify the Problem without Code Inspection | ||
|
|
||
| You must have already built the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool). Once you have built the utility, you can start it before your program (similar to using GDB). | ||
| You need to build the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool) before completing this section. Once you have built the utility, you can start it before your program (similar to using GDB). |
| 101 queue q2(devicecontext, selected_device); | ||
| 102 float * dev_c = sycl::malloc_device<float>(M*P, q2); | ||
| ``` | ||
| As the previous example suggests, the problem is that we are trying to free memory allocated through SYCL queue `q2`, which has a different device context from SYCL queue `q`, even though under the covers both point to the same hardware device. |
| ``` | ||
|
|
||
| Similarly, we specify targeting the CPU, which sometimes can avoid problems in your code that are specific to offloading to the GPU. | ||
| Similarly, we can force the program to run on the CPU, which sometimes can avoid problems in your code that are specific to offloading to the GPU. |
| #### Debugging the Problem | ||
|
|
||
| Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests as something that goes terribly wrong. Or one of the backends might have a bug. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address 0x0, while this may be legally allocated memory on the GPU. | ||
| Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests when something goes terribly wrong. Or one of the backends might have a bug while the others do not. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address `0x0`, while this may be legally allocated memory on the GPU. |
| ``` | ||
|
|
||
| We used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the work-group size (`groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1`), and the second shows how many total work-groups will be needed to process the global iteration range (`{16385, 1, 1}`). | ||
| At line 106 we used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the work-group size (`groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1`), and the second shows how many total work-groups will be needed to process the global iteration range (`{16385, 1, 1}`). |
Existing Sample Changes
Description
Guided Debugging Sample updates, Updates for oneAPI 2026.0 tools and drivers
Raise the CMake minimum version to 3.5, update the README documents, and free USM memory after use.
No new functionality is added in this PR.
Fixes Issue#
External Dependencies
List any external dependencies created as a result of this change.
Type of change
Please delete options that are not relevant. Add a 'X' to the one that is applicable.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration