[Ryzen AI Max+ 395 / gfx1151] Poor OpenCL performance on APU: Pinned Memory fails & CPU fallbacks #20071

@QorStorm

Is there an existing issue for this?

  • I checked and did not find my issue in the already reported ones

Describe the bug

I am seeing poor OpenCL performance on a new AMD Ryzen AI MAX+ 395 (Strix Halo) APU with Radeon 8060S graphics (detected as gfx1151).
Since this is an APU with unified memory, I expected near-zero copy overhead. However, darktable does not enable pinned memory transfers, which appears to result in excessive copying and tiling.

Key Findings:

  • Pinned memory refused on APU: Even with DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1, the log reports PINNED MEMORY TRANSFER: NO. On a unified memory architecture, I consider this a critical bottleneck.
  • Performance gap: The OpenCL command queue accounts for ~0.8 s, but the pixel pipeline takes ~3.1 s and the whole export ~4.6 s, which I attribute to memory overhead.
  • CPU fallback: The bilat (Local Contrast) module consistently falls back to the CPU (processed `bilat' on CPU), taking ~0.36 s wall time (~8.5 s CPU time).
  • Excessive tiling: denoiseprofile and atrous are tiling, presumably because darktable misjudges the APU's unified memory behavior.
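
To put numbers on the performance gap, the [opencl_profiling] lines can be split into transfers and kernels. The following is a minimal Python sketch, not part of darktable. It assumes that the bracketed entries (Write Image, Read Buffer, Copy Image and so on) are host/device transfers or copies and that every other entry is a kernel. split_profile is a hypothetical helper name, and the sample lines are copied from the log in this report.

```python
import re

# Sample lines copied from the [opencl_profiling] output in this report.
SAMPLE = """\
     4,3127 [opencl_profiling] spent  0,0502 seconds in [Write Image (from host to device)]
     4,3127 [opencl_profiling] spent  0,0431 seconds in [Write Buffer (from host to device)]
     4,3127 [opencl_profiling] spent  0,0031 seconds in [Read Buffer (from device to host)]
     4,3127 [opencl_profiling] spent  0,0541 seconds in [Read Image (from device to host)]
     4,3127 [opencl_profiling] spent  0,0385 seconds in [Copy Image (on device)]
     4,3127 [opencl_profiling] spent  0,1407 seconds in eaw_decompose
     4,3127 [opencl_profiling] spent  0,8138 seconds totally in command queue (with 0 events missing)
"""

LINE = re.compile(r"\[opencl_profiling\] spent\s+([\d,]+) seconds (totally )?in (.+)$")


def split_profile(log_text):
    """Return (transfer_seconds, kernel_seconds, reported_total_seconds)."""
    transfers = 0.0
    kernels = 0.0
    total = None
    for line in log_text.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        secs = float(m.group(1).replace(",", "."))  # the log uses a decimal comma
        if m.group(2):
            total = secs          # the "totally in command queue" summary line
        elif m.group(3).startswith("["):
            transfers += secs     # assumption: bracketed names are transfers/copies
        else:
            kernels += secs
    return transfers, kernels, total


transfers, kernels, total = split_profile(SAMPLE)
print(f"transfers={transfers:.4f}s kernels={kernels:.4f}s queue total={total}s")
```

For the full log in this report, the bracketed entries add up to roughly 0.19 s of the 0.81 s reported for the command queue. Profiled times cover only queued OpenCL commands, so host-side work between commands is not included in either figure.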

System:

  • Hardware: ASUS ROG Flow Z13 (2025 model)
  • CPU/APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (32 threads)
  • RAM: 32 GB (Unified)
  • OS: Arch Linux (CachyOS), Kernel 6.18.2
  • Driver: AMD ROCm / OpenCL 2.0 (Driver Version 3581.0)
  • darktable version: 5.4.0

Steps to Reproduce:

Run darktable-cli with DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1 and -d perf -d opencl on a Strix Halo APU.
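
For anyone who wants to repeat the run with and without the variable, here is a small Python sketch that only builds the command line and environment. The flags and the variable name are copied from this report; build_run is a hypothetical helper, and the sketch does not assume that darktable actually reads DT_OPENCL_TRANSFER_USE_PINNED_MEMORY.

```python
import os


def build_run(raw, xmp, out, pinned=True):
    """Return (argv, env) mirroring the darktable-cli call used in this report."""
    argv = ["darktable-cli", raw, xmp, out, "--core", "-d", "perf", "-d", "opencl"]
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    if pinned:
        env["DT_OPENCL_TRANSFER_USE_PINNED_MEMORY"] = "1"
    else:
        env.pop("DT_OPENCL_TRANSFER_USE_PINNED_MEMORY", None)
    return argv, env


argv, env = build_run("setubal.orf", "setubal.orf.xmp", "test_pinned.jpg")
print(" ".join(argv))
```

The pair can be handed to subprocess.run(argv, env=env, capture_output=True, text=True) to capture the log of each variant for a side-by-side comparison.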

Logs:

DT_OPENCL_TRANSFER_USE_PINNED_MEMORY=1 darktable-cli setubal.orf setubal.orf.xmp test_pinned.jpg --core -d perf -d opencl
output file already exists, it will get renamed
darktable 5.4.0
Copyright (C) 2012-2025 Johannes Hanika and other contributors.

Compile options:
  Bit depth              -> 64 bit
  Exiv2                  -> 0.28.7
  Lensfun                -> 0.3.4
  Debug                  -> DISABLED
  SSE2 optimizations     -> ENABLED
  OpenMP                 -> ENABLED
  OpenCL                 -> ENABLED
  Lua                    -> ENABLED  - API version 9.6.0
  Colord                 -> ENABLED
  gPhoto2                -> ENABLED
  OSMGpsMap              -> ENABLED  - map view is available
  GMIC                   -> ENABLED  - Compressed LUTs are supported
  GraphicsMagick         -> ENABLED
  ImageMagick            -> DISABLED
  libavif                -> ENABLED
  libheif                -> ENABLED
  libjxl                 -> ENABLED
  LibRaw                 -> ENABLED  - Version 0.22.0-PreRC1
  OpenJPEG               -> ENABLED
  OpenEXR                -> ENABLED
  WebP                   -> ENABLED

See https://www.darktable.org/resources/ for detailed documentation.
See https://github.com/darktable-org/darktable/issues/new/choose to report bugs.

     0,0273 [opencl_init] opencl library 'libOpenCL' found on your system and loaded, preference 'default path'
     0,2076 [opencl_init] found 2 platforms
     0,2076 [check platform] platform 'rusticl' with key 'clplatform_rusticl' is NOT active
[opencl_init] found 1 device

[dt_opencl_device_init]
   DEVICE:                   0: 'gfx1151'
   CONF KEY:                 cldevice_v5_amdacceleratedparallelprocessinggfx1151
   PLATFORM, VENDOR & ID:    AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc., ID=4098
   CANONICAL NAME:           amdacceleratedparallelprocessinggfx1151
   DRIVER VERSION:           3581.0 (HSA1.1,LC)
   DEVICE VERSION:           OpenCL 2.0 
   DEVICE_TYPE:              GPU, unified mem
   GLOBAL MEM SIZE:          15610 MB
   MAX MEM ALLOC:            13268 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX CONSTANT BUFFER:      13586692 KB
   ADDRESS ALIGN:            256
   MAX WORK GROUP SIZE:      256
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 1024 1024 1024 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   NO
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH & HEIGHT    16x16
   CHECK EVENT HANDLES:      128
   TILING ADVANTAGE:         0,000
   DEFAULT DEVICE:           NO
   KERNEL BUILD DIRECTORY:   /usr/share/darktable/kernels
   KERNEL DIRECTORY:         /home/cf/.cache/darktable/cached_v5_kernels_for_AMDAcceleratedParallelProcessinggfx1151_35810HSA11LC
   CL COMPILER OPTION:       -cl-fast-relaxed-math
   CL COMPILER COMMAND:      -w -cl-fast-relaxed-math -DAMD=1 -I"/usr/share/darktable/kernels"
   KERNEL LOADING TIME:       0,0117 sec
[opencl_init] OpenCL successfully initialized. internal numbers and names of available devices:
[opencl_init]           0       'AMD Accelerated Parallel Processing gfx1151'
     0,2827 [opencl_init] FINALLY: opencl PREFERENCE=ON is AVAILABLE and ENABLED.
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_device_priority: '*/!0,*/*/*/!0,*'
[opencl_init] opencl_mandatory_timeout: 1000
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities]              image   preview export  thumbs  preview2
[dt_opencl_update_priorities]           0       -1      0       0       -1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities]              image   preview export  thumbs  preview2
[opencl_update_priorities]              0       0       0       0       0
[opencl_synchronization_timeout] synchronization timeout set to 200
   UNIFIED MEM SIZE:         7805 MB reserved for 'amdacceleratedparallelprocessinggfx1151' id=0
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities]              image   preview export  thumbs  preview2
[dt_opencl_update_priorities]           0       -1      0       0       -1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities]              image   preview export  thumbs  preview2
[opencl_update_priorities]              0       0       0       0       0
[opencl_synchronization_timeout] synchronization timeout set to 200
     1,1512 [dt_dev_load_raw] loading the image. took 0,370 secs (0,495 CPU)
     1,1795 [export] creating pixelpipe took 0,025 secs (0,676 CPU)
     1,1798 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
     1,2041 [dev_pixelpipe] took 0,024 secs (0,142 CPU) [export] processed `rawprepare' on GPU, blended on GPU
     1,2159 [dev_pixelpipe] took 0,012 secs (0,000 CPU) [export] processed `temperature' on GPU, blended on GPU
     1,2358 [dev_pixelpipe] took 0,020 secs (0,001 CPU) [export] processed `highlights' on GPU, blended on GPU
     1,2652 [dev_pixelpipe] took 0,029 secs (0,204 CPU) [export] processed `hotpixels' on CPU, blended on CPU
     1,4512 [dev_pixelpipe] took 0,186 secs (0,081 CPU) [export] processed `demosaic' on GPU, blended on GPU
     2,1371 [dev_pixelpipe] took 0,686 secs (0,022 CPU) [export] processed `denoiseprofile' on GPU with tiling, blended on CPU
     2,4088 [dev_pixelpipe] took 0,272 secs (1,422 CPU) [export] processed `lens' on GPU, blended on GPU
     2,4533 [dev_pixelpipe] took 0,045 secs (0,001 CPU) [export] processed `ashift' on GPU, blended on GPU
     2,4974 [dev_pixelpipe] took 0,044 secs (0,001 CPU) [export] processed `exposure' on GPU, blended on GPU
     2,5415 [dev_pixelpipe] took 0,044 secs (0,002 CPU) [export] processed `colorin' on GPU, blended on GPU
     2,6113 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,036 secs (0,001 GPU) [channelmixerrgb]
     2,6345 [dev_pixelpipe] took 0,093 secs (0,003 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
     2,6609 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0,014 secs (0,391 CPU) [atrous]
     3,3564 [dev_pixelpipe] took 0,722 secs (0,552 CPU) [export] processed `atrous' on GPU with tiling, blended on CPU
     3,4688 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,035 secs (0,001 GPU) [colorbalancergb]
     3,4938 [dev_pixelpipe] took 0,137 secs (0,004 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
     3,5389 [dev_pixelpipe] took 0,045 secs (0,001 CPU) [export] processed `rgblevels' on GPU, blended on GPU
     3,5826 [dev_pixelpipe] took 0,044 secs (0,001 CPU) [export] processed `sigmoid' on GPU, blended on GPU
     3,6093 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0,018 secs (0,463 CPU) [bilat]
     3,9434 [dev_pixelpipe] took 0,361 secs (8,482 CPU) [export] processed `bilat' on CPU, blended on CPU
     4,2175 [dev_pixelpipe] took 0,274 secs (8,222 CPU) [export] processed `colorout' on CPU, blended on CPU
     4,2947 [resample_cl] took 0,000 secs (0,000 CPU) 1:1 copy/crop of 8065x6046 pixels
     4,3038 [dev_pixelpipe] took 0,086 secs (0,133 CPU) [export] processed `finalscale' on GPU, blended on GPU
     4,3127 [opencl_profiling] profiling device 0 ('AMD Accelerated Parallel Processing gfx1151'):
     4,3127 [opencl_profiling] spent  0,0502 seconds in [Write Image (from host to device)]
     4,3127 [opencl_profiling] spent  0,0020 seconds in rawprepare_1f
     4,3127 [opencl_profiling] spent  0,0025 seconds in whitebalance_1f
     4,3127 [opencl_profiling] spent  0,0017 seconds in highlights_initmask
     4,3127 [opencl_profiling] spent  0,0004 seconds in highlights_dilatemask
     4,3127 [opencl_profiling] spent  0,0431 seconds in [Write Buffer (from host to device)]
     4,3127 [opencl_profiling] spent  0,0036 seconds in highlights_chroma
     4,3127 [opencl_profiling] spent  0,0031 seconds in [Read Buffer (from device to host)]
     4,3127 [opencl_profiling] spent  0,0019 seconds in highlights_opposed
     4,3127 [opencl_profiling] spent  0,0541 seconds in [Read Image (from device to host)]
     4,3127 [opencl_profiling] spent  0,0003 seconds in border_interpolate
     4,3127 [opencl_profiling] spent  0,0018 seconds in rcd_border_green
     4,3127 [opencl_profiling] spent  0,0039 seconds in rcd_border_redblue
     4,3127 [opencl_profiling] spent  0,0064 seconds in rcd_populate
     4,3127 [opencl_profiling] spent  0,0030 seconds in rcd_step_1_1
     4,3127 [opencl_profiling] spent  0,0027 seconds in rcd_step_1_2
     4,3127 [opencl_profiling] spent  0,0013 seconds in rcd_step_2_1
     4,3127 [opencl_profiling] spent  0,0040 seconds in rcd_step_3_1
     4,3127 [opencl_profiling] spent  0,0029 seconds in rcd_step_4_1
     4,3127 [opencl_profiling] spent  0,0014 seconds in rcd_step_4_2
     4,3127 [opencl_profiling] spent  0,0040 seconds in rcd_step_5_1
     4,3127 [opencl_profiling] spent  0,0058 seconds in rcd_step_5_2
     4,3127 [opencl_profiling] spent  0,0069 seconds in rcd_write_output
     4,3127 [opencl_profiling] spent  0,0077 seconds in denoiseprofile_precondition_Y0U0V0
     4,3127 [opencl_profiling] spent  0,1027 seconds in denoiseprofile_decompose
     4,3127 [opencl_profiling] spent  0,0292 seconds in denoiseprofile_reduce_first
     4,3127 [opencl_profiling] spent  0,0001 seconds in denoiseprofile_reduce_second
     4,3127 [opencl_profiling] spent  0,0787 seconds in denoiseprofile_synthesize
     4,3127 [opencl_profiling] spent  0,0385 seconds in [Copy Image (on device)]
     4,3127 [opencl_profiling] spent  0,0078 seconds in denoiseprofile_backtransform_Y0U0V0
     4,3127 [opencl_profiling] spent  0,0138 seconds in lens_vignette
     4,3127 [opencl_profiling] spent  0,0180 seconds in lens_distort_bicubic
     4,3127 [opencl_profiling] spent  0,0103 seconds in ashift_bicubic
     4,3127 [opencl_profiling] spent  0,0082 seconds in exposure
     4,3127 [opencl_profiling] spent  0,0076 seconds in colorin_unbound
     4,3127 [opencl_profiling] spent  0,0147 seconds in colorspaces_transform_lab_to_rgb_matrix
     4,3127 [opencl_profiling] spent  0,0075 seconds in channelmixerrgb_CAT16
     4,3127 [opencl_profiling] spent  0,1407 seconds in eaw_decompose
     4,3127 [opencl_profiling] spent  0,0982 seconds in eaw_synthesize
     4,3127 [opencl_profiling] spent  0,0081 seconds in colorbalancergb
     4,3127 [opencl_profiling] spent  0,0073 seconds in rgblevels
     4,3127 [opencl_profiling] spent  0,0075 seconds in sigmoid_loglogistic_per_channel
     4,3127 [opencl_profiling] spent  0,8138 seconds totally in command queue (with 0 events missing)
     4,3127 [dev_process_export] pixel pipeline processing took 3,133 secs (19,275 CPU)
     4,6023 [export_job] exported to `test_pinned_01.jpg'
 [opencl_summary_statistics] device 'AMD Accelerated Parallel Processing gfx1151' id=0: 165 out of 165 events were successful and 0 events lost. max event=164
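
The CPU fallbacks and tiled modules can be pulled out of the [dev_pixelpipe] lines above mechanically. This is a minimal Python sketch under the assumption that every module line has the shape seen in this log; modules and flagged are hypothetical helper names, and the sample lines are copied from the log.

```python
import re

# Four [dev_pixelpipe] lines copied from the log in this report.
SAMPLE = """\
     1,2652 [dev_pixelpipe] took 0,029 secs (0,204 CPU) [export] processed `hotpixels' on CPU, blended on CPU
     1,4512 [dev_pixelpipe] took 0,186 secs (0,081 CPU) [export] processed `demosaic' on GPU, blended on GPU
     2,1371 [dev_pixelpipe] took 0,686 secs (0,022 CPU) [export] processed `denoiseprofile' on GPU with tiling, blended on CPU
     3,9434 [dev_pixelpipe] took 0,361 secs (8,482 CPU) [export] processed `bilat' on CPU, blended on CPU
"""

PIPE = re.compile(
    r"\[dev_pixelpipe\] took ([\d,]+) secs \(([\d,]+) CPU\) \[\w+\] "
    r"processed `(\w+)' on (GPU|CPU)( with tiling)?"
)


def modules(log_text):
    """Parse module lines into dicts with wall time, CPU time, device and tiling."""
    rows = []
    for m in PIPE.finditer(log_text):
        # The log uses a decimal comma, so convert before calling float().
        wall, cpu = (float(x.replace(",", ".")) for x in m.group(1, 2))
        rows.append({
            "module": m.group(3),
            "wall": wall,
            "cpu": cpu,
            "device": m.group(4),
            "tiled": m.group(5) is not None,
        })
    return rows


def flagged(rows):
    """Names of modules that ran on the CPU or needed tiling."""
    return [r["module"] for r in rows if r["device"] == "CPU" or r["tiled"]]


print(flagged(modules(SAMPLE)))
```

Lines such as "initing base buffer" do not match the pattern and are skipped, so the result lists only modules that were actually processed.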

Steps to reproduce

https://math.dartmouth.edu/~sarunas/darktable_bench.html

Expected behavior

No response

Logfile | Screenshot | Screencast

No response

Commit

No response

Where did you obtain darktable from?

darktable.org / GitHub release

darktable version

5.4

What OS are you using?

Linux

What is the version of your OS?

CachyOS - Arch Linux

Describe your system

Hardware: ASUS ROG Flow Z13 (2025 model)
CPU/APU: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (32 threads)
RAM: 32 GB (Unified)
OS: Arch Linux (CachyOS), Kernel 6.18.2
Driver: AMD ROCm / OpenCL 2.0 (Driver Version 3581.0)
Darktable Version: 5.4.0

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

GPU / Device: AMD Radeon 8060S (integrated in Ryzen AI MAX+ 395 APU)
OpenCL Device Name: gfx1151
Memory Size: 32 GB unified system RAM (note: OpenCL reports 15610 MB Global Mem Size available for the GPU)
Driver Version: AMD ROCm / OpenCL Driver Version 3581.0 (HSA1.1,LC)
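
As a rough sanity check on the memory figures, the arithmetic below compares the size of one full-resolution buffer with the memory darktable reports as reserved. It is a back-of-the-envelope Python sketch that assumes 4 channels of 32-bit float per pixel and treats the log's MB as MiB; both are assumptions, and darktable's real tiling decision also depends on per-module memory requirements that are not modeled here. buffer_mib is a hypothetical helper.

```python
MIB = 1024 * 1024


def buffer_mib(width, height, channels=4, bytes_per_channel=4):
    """Size in MiB of one image buffer (assumption: 4 x float32 per pixel)."""
    return width * height * channels * bytes_per_channel / MIB


global_mem_mib = 15610  # GLOBAL MEM SIZE reported for gfx1151
reserved_mib = 7805     # UNIFIED MEM SIZE "reserved" line in the log
full_image = buffer_mib(8065, 6046)  # export size from the resample_cl line

print(f"one full-size buffer: {full_image:.0f} MiB")
print(f"reserved / global:    {reserved_mib / global_mem_mib:.2f}")
print(f"buffers that fit:     {reserved_mib / full_image:.1f}")
```

Under these assumptions the reserved amount is exactly half of the reported global memory, and a single full-size buffer is a little under 750 MiB.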

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

No response

Labels: AMD ROCm OpenCL (Specific to AMD OpenCL hardware or driver)
