Description
Is there an existing issue for this?
- I checked and did not find my issue in the already reported ones
Describe the bug
I am reporting performance issues on a brand new AMD Ryzen AI MAX+ 395 (Strix Halo) APU with Radeon 8060S graphics (detected as gfx1151).
Since this is an APU with unified memory, I expected near-zero copy overhead. However, darktable does not enable pinned memory transfers, which I believe results in excessive copying and tiling.
Key Findings:
Pinned Memory Refused on APU: Even with DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1, the log reports PINNED MEMORY TRANSFER: NO. On a Unified Memory architecture, this is a critical bottleneck.
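Performance Gap: The OpenCL command queue totals ~0.8s, but pixel pipeline processing takes ~3.1s and the export completes ~4.6s after startup. I attribute the difference to memory overhead.
CPU Fallback: The bilat (Local Contrast) module falls back to CPU (processed `bilat' on CPU), taking ~0.36s wall time (~8.5s CPU time). colorout also runs on CPU (~0.27s wall time, ~8.2s CPU time).
Excessive Tiling: denoiseprofile and atrous both run on the GPU with tiling (~0.69s and ~0.72s), presumably because darktable misjudges the APU's unified memory behavior.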
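To make the split behind these findings easier to check, the `[dev_pixelpipe]` lines of a `-d perf` log can be grouped by where each module ran. The following is a minimal Python sketch, not part of darktable. The names `MODULE_LINE`, `to_float` and `summarize_modules` are hypothetical, and the regular expression assumes the line format and decimal-comma output seen in the log in this report.

```python
import re
from collections import defaultdict

# Assumed line format (copied from the log in this report), e.g.:
# 2,1371 [dev_pixelpipe] took 0,686 secs (0,022 CPU) [export] processed
# `denoiseprofile' on GPU with tiling, blended on CPU
MODULE_LINE = re.compile(
    r"\[dev_pixelpipe\] took (?P<wall>\d+[.,]\d+) secs \((?P<cpu>\d+[.,]\d+) CPU\)"
    r".*processed `(?P<module>[^']+)' on (?P<where>GPU with tiling|GPU|CPU)"
)


def to_float(text):
    """Convert a number printed with a decimal comma (e.g. '0,686') to float."""
    return float(text.replace(",", "."))


def summarize_modules(log_text):
    """Group pixelpipe modules by placement: 'GPU', 'GPU with tiling' or 'CPU'.

    Returns {placement: [(module, wall_secs, cpu_secs), ...]}.
    Lines that do not match the assumed format are ignored.
    """
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = MODULE_LINE.search(line)
        if match:
            groups[match["where"]].append(
                (match["module"], to_float(match["wall"]), to_float(match["cpu"]))
            )
    return dict(groups)


if __name__ == "__main__":
    # Three lines taken verbatim from the log in this report.
    sample = (
        "2,1371 [dev_pixelpipe] took 0,686 secs (0,022 CPU) [export] processed "
        "`denoiseprofile' on GPU with tiling, blended on CPU\n"
        "2,4974 [dev_pixelpipe] took 0,044 secs (0,001 CPU) [export] processed "
        "`exposure' on GPU, blended on GPU\n"
        "3,9434 [dev_pixelpipe] took 0,361 secs (8,482 CPU) [export] processed "
        "`bilat' on CPU, blended on CPU\n"
    )
    for where, rows in summarize_modules(sample).items():
        total_wall = sum(wall for _, wall, _ in rows)
        names = ", ".join(name for name, _, _ in rows)
        print(f"{where}: {total_wall:.3f} s wall ({names})")
```

Feeding it the full log rather than the three sample lines gives a per-placement total that can be compared against the pixel pipeline time reported by `[dev_process_export]`.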
System:
Hardware: ASUS ROG Flow Z13 (2025 model)
CPU/APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (32 threads)
RAM: 32 GB (Unified)
OS: Arch Linux (CachyOS), Kernel 6.18.2
Driver: AMD ROCm / OpenCL 2.0 (Driver Version 3581.0)
Darktable Version: 5.4.0
Steps to Reproduce:
Run darktable-cli with DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1 and -d perf -d opencl on a Strix Halo APU.
Logs:
DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1 darktable-cli setubal.orf setubal.orf.xmp test_pinned.jpg --core -d perf -d opencl
output file already exists, it will get renamed
darktable 5.4.0
Copyright (C) 2012-2025 Johannes Hanika and other contributors.
Compile options:
Bit depth -> 64 bit
Exiv2 -> 0.28.7
Lensfun -> 0.3.4
Debug -> DISABLED
SSE2 optimizations -> ENABLED
OpenMP -> ENABLED
OpenCL -> ENABLED
Lua -> ENABLED - API version 9.6.0
Colord -> ENABLED
gPhoto2 -> ENABLED
OSMGpsMap -> ENABLED - map view is available
GMIC -> ENABLED - Compressed LUTs are supported
GraphicsMagick -> ENABLED
ImageMagick -> DISABLED
libavif -> ENABLED
libheif -> ENABLED
libjxl -> ENABLED
LibRaw -> ENABLED - Version 0.22.0-PreRC1
OpenJPEG -> ENABLED
OpenEXR -> ENABLED
WebP -> ENABLED
See https://www.darktable.org/resources/ for detailed documentation.
See https://github.com/darktable-org/darktable/issues/new/choose to report bugs.
0,0273 [opencl_init] opencl library 'libOpenCL' found on your system and loaded, preference 'default path'
0,2076 [opencl_init] found 2 platforms
0,2076 [check platform] platform 'rusticl' with key 'clplatform_rusticl' is NOT active
[opencl_init] found 1 device
[dt_opencl_device_init]
DEVICE: 0: 'gfx1151'
CONF KEY: cldevice_v5_amdacceleratedparallelprocessinggfx1151
PLATFORM, VENDOR & ID: AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc., ID=4098
CANONICAL NAME: amdacceleratedparallelprocessinggfx1151
DRIVER VERSION: 3581.0 (HSA1.1,LC)
DEVICE VERSION: OpenCL 2.0
DEVICE_TYPE: GPU, unified mem
GLOBAL MEM SIZE: 15610 MB
MAX MEM ALLOC: 13268 MB
MAX IMAGE SIZE: 16384 x 16384
MAX CONSTANT BUFFER: 13586692 KB
ADDRESS ALIGN: 256
MAX WORK GROUP SIZE: 256
MAX WORK ITEM DIMENSIONS: 3
MAX WORK ITEM SIZES: [ 1024 1024 1024 ]
ASYNC PIXELPIPE: NO
PINNED MEMORY TRANSFER: NO
AVOID ATOMICS: NO
MICRO NAP: 250
ROUNDUP WIDTH & HEIGHT 16x16
CHECK EVENT HANDLES: 128
TILING ADVANTAGE: 0,000
DEFAULT DEVICE: NO
KERNEL BUILD DIRECTORY: /usr/share/darktable/kernels
KERNEL DIRECTORY: /home/cf/.cache/darktable/cached_v5_kernels_for_AMDAcceleratedParallelProcessinggfx1151_35810HSA11LC
CL COMPILER OPTION: -cl-fast-relaxed-math
CL COMPILER COMMAND: -w -cl-fast-relaxed-math -DAMD=1 -I"/usr/share/darktable/kernels"
KERNEL LOADING TIME: 0,0117 sec
[opencl_init] OpenCL successfully initialized. internal numbers and names of available devices:
[opencl_init] 0 'AMD Accelerated Parallel Processing gfx1151'
0,2827 [opencl_init] FINALLY: opencl PREFERENCE=ON is AVAILABLE and ENABLED.
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_device_priority: '*/!0,*/*/*/!0,*'
[opencl_init] opencl_mandatory_timeout: 1000
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 -1 0 0 -1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities] image preview export thumbs preview2
[opencl_update_priorities] 0 0 0 0 0
[opencl_synchronization_timeout] synchronization timeout set to 200
UNIFIED MEM SIZE: 7805 MB reserved for 'amdacceleratedparallelprocessinggfx1151' id=0
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 -1 0 0 -1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities] image preview export thumbs preview2
[opencl_update_priorities] 0 0 0 0 0
[opencl_synchronization_timeout] synchronization timeout set to 200
1,1512 [dt_dev_load_raw] loading the image. took 0,370 secs (0,495 CPU)
1,1795 [export] creating pixelpipe took 0,025 secs (0,676 CPU)
1,1798 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
1,2041 [dev_pixelpipe] took 0,024 secs (0,142 CPU) [export] processed `rawprepare' on GPU, blended on GPU
1,2159 [dev_pixelpipe] took 0,012 secs (0,000 CPU) [export] processed `temperature' on GPU, blended on GPU
1,2358 [dev_pixelpipe] took 0,020 secs (0,001 CPU) [export] processed `highlights' on GPU, blended on GPU
1,2652 [dev_pixelpipe] took 0,029 secs (0,204 CPU) [export] processed `hotpixels' on CPU, blended on CPU
1,4512 [dev_pixelpipe] took 0,186 secs (0,081 CPU) [export] processed `demosaic' on GPU, blended on GPU
2,1371 [dev_pixelpipe] took 0,686 secs (0,022 CPU) [export] processed `denoiseprofile' on GPU with tiling, blended on CPU
2,4088 [dev_pixelpipe] took 0,272 secs (1,422 CPU) [export] processed `lens' on GPU, blended on GPU
2,4533 [dev_pixelpipe] took 0,045 secs (0,001 CPU) [export] processed `ashift' on GPU, blended on GPU
2,4974 [dev_pixelpipe] took 0,044 secs (0,001 CPU) [export] processed `exposure' on GPU, blended on GPU
2,5415 [dev_pixelpipe] took 0,044 secs (0,002 CPU) [export] processed `colorin' on GPU, blended on GPU
2,6113 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,036 secs (0,001 GPU) [channelmixerrgb]
2,6345 [dev_pixelpipe] took 0,093 secs (0,003 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
2,6609 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0,014 secs (0,391 CPU) [atrous]
3,3564 [dev_pixelpipe] took 0,722 secs (0,552 CPU) [export] processed `atrous' on GPU with tiling, blended on CPU
3,4688 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,035 secs (0,001 GPU) [colorbalancergb]
3,4938 [dev_pixelpipe] took 0,137 secs (0,004 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
3,5389 [dev_pixelpipe] took 0,045 secs (0,001 CPU) [export] processed `rgblevels' on GPU, blended on GPU
3,5826 [dev_pixelpipe] took 0,044 secs (0,001 CPU) [export] processed `sigmoid' on GPU, blended on GPU
3,6093 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0,018 secs (0,463 CPU) [bilat]
3,9434 [dev_pixelpipe] took 0,361 secs (8,482 CPU) [export] processed `bilat' on CPU, blended on CPU
4,2175 [dev_pixelpipe] took 0,274 secs (8,222 CPU) [export] processed `colorout' on CPU, blended on CPU
4,2947 [resample_cl] took 0,000 secs (0,000 CPU) 1:1 copy/crop of 8065x6046 pixels
4,3038 [dev_pixelpipe] took 0,086 secs (0,133 CPU) [export] processed `finalscale' on GPU, blended on GPU
4,3127 [opencl_profiling] profiling device 0 ('AMD Accelerated Parallel Processing gfx1151'):
4,3127 [opencl_profiling] spent 0,0502 seconds in [Write Image (from host to device)]
4,3127 [opencl_profiling] spent 0,0020 seconds in rawprepare_1f
4,3127 [opencl_profiling] spent 0,0025 seconds in whitebalance_1f
4,3127 [opencl_profiling] spent 0,0017 seconds in highlights_initmask
4,3127 [opencl_profiling] spent 0,0004 seconds in highlights_dilatemask
4,3127 [opencl_profiling] spent 0,0431 seconds in [Write Buffer (from host to device)]
4,3127 [opencl_profiling] spent 0,0036 seconds in highlights_chroma
4,3127 [opencl_profiling] spent 0,0031 seconds in [Read Buffer (from device to host)]
4,3127 [opencl_profiling] spent 0,0019 seconds in highlights_opposed
4,3127 [opencl_profiling] spent 0,0541 seconds in [Read Image (from device to host)]
4,3127 [opencl_profiling] spent 0,0003 seconds in border_interpolate
4,3127 [opencl_profiling] spent 0,0018 seconds in rcd_border_green
4,3127 [opencl_profiling] spent 0,0039 seconds in rcd_border_redblue
4,3127 [opencl_profiling] spent 0,0064 seconds in rcd_populate
4,3127 [opencl_profiling] spent 0,0030 seconds in rcd_step_1_1
4,3127 [opencl_profiling] spent 0,0027 seconds in rcd_step_1_2
4,3127 [opencl_profiling] spent 0,0013 seconds in rcd_step_2_1
4,3127 [opencl_profiling] spent 0,0040 seconds in rcd_step_3_1
4,3127 [opencl_profiling] spent 0,0029 seconds in rcd_step_4_1
4,3127 [opencl_profiling] spent 0,0014 seconds in rcd_step_4_2
4,3127 [opencl_profiling] spent 0,0040 seconds in rcd_step_5_1
4,3127 [opencl_profiling] spent 0,0058 seconds in rcd_step_5_2
4,3127 [opencl_profiling] spent 0,0069 seconds in rcd_write_output
4,3127 [opencl_profiling] spent 0,0077 seconds in denoiseprofile_precondition_Y0U0V0
4,3127 [opencl_profiling] spent 0,1027 seconds in denoiseprofile_decompose
4,3127 [opencl_profiling] spent 0,0292 seconds in denoiseprofile_reduce_first
4,3127 [opencl_profiling] spent 0,0001 seconds in denoiseprofile_reduce_second
4,3127 [opencl_profiling] spent 0,0787 seconds in denoiseprofile_synthesize
4,3127 [opencl_profiling] spent 0,0385 seconds in [Copy Image (on device)]
4,3127 [opencl_profiling] spent 0,0078 seconds in denoiseprofile_backtransform_Y0U0V0
4,3127 [opencl_profiling] spent 0,0138 seconds in lens_vignette
4,3127 [opencl_profiling] spent 0,0180 seconds in lens_distort_bicubic
4,3127 [opencl_profiling] spent 0,0103 seconds in ashift_bicubic
4,3127 [opencl_profiling] spent 0,0082 seconds in exposure
4,3127 [opencl_profiling] spent 0,0076 seconds in colorin_unbound
4,3127 [opencl_profiling] spent 0,0147 seconds in colorspaces_transform_lab_to_rgb_matrix
4,3127 [opencl_profiling] spent 0,0075 seconds in channelmixerrgb_CAT16
4,3127 [opencl_profiling] spent 0,1407 seconds in eaw_decompose
4,3127 [opencl_profiling] spent 0,0982 seconds in eaw_synthesize
4,3127 [opencl_profiling] spent 0,0081 seconds in colorbalancergb
4,3127 [opencl_profiling] spent 0,0073 seconds in rgblevels
4,3127 [opencl_profiling] spent 0,0075 seconds in sigmoid_loglogistic_per_channel
4,3127 [opencl_profiling] spent 0,8138 seconds totally in command queue (with 0 events missing)
4,3127 [dev_process_export] pixel pipeline processing took 3,133 secs (19,275 CPU)
4,6023 [export_job] exported to `test_pinned_01.jpg'
[opencl_summary_statistics] device 'AMD Accelerated Parallel Processing gfx1151' id=0: 165 out of 165 events were successful and 0 events lost. max event=164
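To see how the command-queue time divides between memory operations and kernels, the `[opencl_profiling]` lines above can be summed into two buckets. This is a minimal Python sketch under one assumption inferred from the log, not from darktable's source: entries whose name is wrapped in square brackets, such as `[Write Image (from host to device)]`, are transfer or copy operations, and every other entry is a kernel. The names `PROFILE_LINE` and `split_profile` are hypothetical.

```python
import re

# Assumed line format (copied from the log in this report), e.g.:
# 4,3127 [opencl_profiling] spent  0,0502 seconds in [Write Image (from host to device)]
PROFILE_LINE = re.compile(
    r"\[opencl_profiling\] spent\s+(?P<secs>\d+[.,]\d+) seconds in (?P<name>.+)$"
)


def split_profile(log_text):
    """Sum profiled seconds into (transfers, kernels) dicts keyed by entry name.

    Assumption: bracketed entry names are transfer/copy operations,
    all other entries are kernels.
    """
    transfers, kernels = {}, {}
    for line in log_text.splitlines():
        match = PROFILE_LINE.search(line.strip())
        if match is None:
            # Also skips the "... seconds totally in command queue" summary line,
            # which does not contain the literal text "seconds in".
            continue
        secs = float(match["secs"].replace(",", "."))  # decimal comma -> float
        name = match["name"]
        bucket = transfers if name.startswith("[") else kernels
        bucket[name] = bucket.get(name, 0.0) + secs
    return transfers, kernels


if __name__ == "__main__":
    # Lines taken verbatim from the log in this report.
    sample = (
        "4,3127 [opencl_profiling] spent 0,0502 seconds in [Write Image (from host to device)]\n"
        "4,3127 [opencl_profiling] spent 0,0541 seconds in [Read Image (from device to host)]\n"
        "4,3127 [opencl_profiling] spent 0,1407 seconds in eaw_decompose\n"
        "4,3127 [opencl_profiling] spent 0,8138 seconds totally in command queue (with 0 events missing)\n"
    )
    transfers, kernels = split_profile(sample)
    print(f"transfers/copies: {sum(transfers.values()):.4f} s")
    print(f"kernels:          {sum(kernels.values()):.4f} s")
```

Comparing the two sums with the `totally in command queue` figure and with the pixel pipeline time shows how much of the gap is visible to the OpenCL profiler and how much is spent outside the command queue.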
Steps to reproduce
https://math.dartmouth.edu/~sarunas/darktable_bench.html
Expected behavior
No response
Logfile | Screenshot | Screencast
No response
Commit
No response
Where did you obtain darktable from?
darktable.org / GitHub release
darktable version
5.4
What OS are you using?
Linux
What is the version of your OS?
CachyOS - Arch Linux
Describe your system
Hardware: ASUS ROG Flow Z13 (2025 model)
CPU/APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (32 threads)
RAM: 32 GB (Unified)
OS: Arch Linux (CachyOS), Kernel 6.18.2
Driver: AMD ROCm / OpenCL 2.0 (Driver Version 3581.0)
Darktable Version: 5.4.0
Are you using OpenCL GPU in darktable?
Yes
If yes, what is the GPU card and driver?
GPU / Device: AMD Radeon 8060S (integrated in Ryzen AI MAX+ 395 APU), OpenCL device name gfx1151
Memory Size: 32 GB unified system RAM (note: OpenCL reports 15610 MB Global Mem Size available for the GPU)
Driver Version: AMD ROCm / OpenCL Driver Version 3581.0 (HSA1.1,LC)
Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip
No response