@madsmtm
Member

@madsmtm madsmtm commented Aug 12, 2025

Transactions are expensive, and the layer should be able to figure out the timing of when to render by itself (by virtue of being installed in a view). The only reason why we did it before was to avoid a fade transition between layer content changes.

Part of #83. I have not benchmarked this, but I have visibly confirmed less stuttering when resizing.

Note that removing this also removes any form of frame-limiting; that is tracked in #29.

@madsmtm madsmtm added the enhancement (New feature or request) and DS - CoreGraphics (macOS/iOS/tvOS/watchOS/visionOS backend) labels Aug 12, 2025
@madsmtm madsmtm changed the title from "Avoid the explicit CATransaction commit" to "Avoid the explicit CATransaction" Aug 12, 2025
@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from cecb0bc to e8ddecc Compare August 12, 2025 22:40
@nicoburns

nicoburns commented Aug 13, 2025

Wow. This is dramatically faster (up to 1000x !!!) for me. I'm seeing present times measured in 10s of microseconds rather than 10s of milliseconds. Specifically: running Blitz (a winit application) on my 14" MacBook Pro (M1), I'm getting the following for the times to call surface_buffer.present().unwrap();:

Test        | softbuffer 0.4 | This PR | buffer_mut | pixels 0.15
800x600 1x  | 1.5ms          | 17us    | 200us      | 500us
800x600 2x  | 6ms            | 25us    | 650us      | 1ms
1512x982 2x | 18ms           | 30us    | 1.5ms      | 3ms

To reproduce:

Then:

  • To test softbuffer 0.4: cargo run -rp readme --no-default-features --features comrak,log_frame_times,log_phase_times,cpu-softbuffer .
  • To test this PR, add the following to the bottom of the root-level (workspace) Cargo.toml:
    [patch.crates-io]
    softbuffer = { git = "https://github.com/rust-windowing/softbuffer", branch = "cg-avoid-transaction" }
    and then run the same command as for softbuffer 0.4.
  • To test the pixels crate: cargo run -rp readme --no-default-features --features comrak,log_frame_times,log_phase_times,cpu-pixels .

You should then see output like:

Resolve: 11ms (style: 76us, construct: 10ms, flush: 32us, layout: 224us)
Frame time: 12ms (cmd: 11ms, flush: 105us, render: 277us, swizel: 336us, present: 19us)

The present time is the time spent in the call to softbuffer's (or pixels's) present.
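
For anyone wanting to take a similar measurement, here is a minimal sketch of timing a single call. `timed` is a hypothetical helper (it is not part of softbuffer or Blitz), and it only captures the time spent inside the closure itself, not any work that gets deferred to elsewhere:

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the wall-clock time it took.
/// `timed` is a hypothetical helper, not part of softbuffer or Blitz.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}
```

It could be used as `let (result, present_time) = timed(|| surface_buffer.present());`, followed by `result.unwrap()` and a log line for `present_time`.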

@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from e8ddecc to 91c1904 Compare August 13, 2025 15:04
@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Well, to be fair, this PR is not actually doing any work in the present call any more.

The actual work now happens in buffer_mut (allocation of a new buffer, which is deallocated once the CGImage is no longer referenced) and internally somewhere in Apple's rendering pipeline (maybe -[CALayer display]). I think if you're going to benchmark this, you'll need to use Instruments.app, flamegraph, or some other whole-program benchmarking.

@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from 91c1904 to e23a8fb Compare August 13, 2025 15:26
@nicoburns

Well, to be fair, this PR is not actually doing any work in the present call any more.
The actual work now happens in buffer_mut ... and internally somewhere in Apple's rendering pipeline (maybe -[CALayer display]).

Hmm... I wasn't previously timing buffer_mut, but I just added it and the amount of time spent there doesn't seem to have changed much (~200us to ~1ms depending on buffer size for both 0.4 and this PR). I guess there could be some time being spent elsewhere within Apple frameworks that I'm not capturing. But this PR is visually much smoother for me, so I suspect it is genuinely faster overall.

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Hmm... I wasn't previously timing buffer_mut, but I just added it and the amount of time spent there doesn't seem to have changed much (~200us to ~1ms depending on buffer size for both 0.4 and this PR).

Sorry, that wasn't particularly clear; I meant that buffer_mut was, and still is, doing a large part of the work (and some of this work would be lessened by using IOSurface and/or swapping between buffers instead of reallocating).

But this PR is visually much smoother for me, so I suspect it is genuinely faster overall.

Definitely agree.

@MarijnS95
Member

MarijnS95 commented Aug 13, 2025

Reading the definition of CATransaction, doesn't this simply offload/postpone the cost to somewhere else (e.g. the "when the thread’s runloop next iterates.")?

Or were we accidentally waiting for the transaction to have completed, while this new model allows multiple implicit transactions to be created and submitted "asynchronously"?

Just curious to map this to all other platforms' "compositor transaction" abstractions :)

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Reading the definition of CATransaction, doesn't this simply offload/postpone the cost to somewhere else (e.g. the "when the thread’s runloop next iterates.")?

I think that's true, yeah. Disassembling setContents:, I found that it locks the CATransaction and inserts the change into that (such that it will be applied later).

I suspect that the real problem is actually in the way that Winit schedules redraws such that they happen outside display/drawRect: (and thereby outside the transaction) in the first place (good old rust-windowing/winit#2640 strikes again).

@MarijnS95
Member

MarijnS95 commented Aug 13, 2025

Thanks for looking into that! Some quick local testing shows that ::commit() on a MacBook Air M4 takes about 3.7ms on average on the animation example.

Could it be that this call is blocking, when applications use it directly? Assigning a completion handler shows that it completes at around the same time. Curious how you "disassembled" so that we can look into ::commit() instead.


Never ::begin()ing a new transaction shows that the completion handler is now called 5-15ms (with some 21ms outliers) after setContents(), so I'm really curious if we're just trading some predictable/visible "CPU overhead" (or blocking? - need to attach a profiler) for increased latency?
EDIT: Adding a frame index shows that these transactions are completing out of order a bunch of times...

Note that animationDuration is equal to 0.25 by default (and the animationTimingFunction is None) - setting that to 0f64 shows a consistent delay of 250µs in the completion handler...

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Curious how you "disassembled" so that we can look into ::commit() instead.

I did lldb target/debug/examples/rectangle and set a few breakpoints.

Never ::begin()ing a new transaction

Uhh, pretty sure that's invalid use of the API, otherwise you may be committing work done by something higher in your call stack.

Could it be that this call is blocking, when applications use it directly? Assigning a completion handler shows that it completes at around the same time.

I'm really curious if we're just trading some predictable/visible "CPU overhead" (or blocking? - need to attach a profiler) for increased latency?

I don't completely know how CATransaction::commit works when outside of a draw call issued by the OS (such as -[NSView drawRect:]), but I think it submits the result to the compositor immediately. Testing on current master by calling .present() a hundred times per requested redraw in the rectangle example seems to back this theory up.

And yeah, with this PR, you will get a bit of latency here, in that the result is not actually presented immediately, but instead only presented the next time the OS renders.

(I'm pretty sure all of these issues just go away if the Winit issue was fixed, since then the CATransaction would know that it was run inside a draw call by the OS, and the commit wouldn't render immediately).

@MarijnS95
Member

Apologies, I meant to also skip ::commit(), i.e. the layer update and the callback used to time it would run later, in the iteration of that thread's runloop that I linked before, which is what you do in this PR.

Curious to see those complete out of order, some frames taking very long.

I should've assumed the debugger might have "enough" debug symbols to see what is going on under the hood 😅


And yeah, I've been wanting to write better present-timing abstractions in Winit for years. RedrawRequested is also fundamentally broken on Android. Curious if you get less delay on Mac if it's running at a closer time before vblank (or does it always run right after the previous vblank?).

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Curious if you get less delay on Mac if it's running at a closer time before vblank (or does it always run right after the previous vblank?).

No idea honestly, and unsure of how I'd test it?

@MarijnS95
Member

No idea honestly, and unsure of how I'd test it?

This is what I did, perhaps we could add it to that draw(Rect): callback and see how much of a delay it has, respectively?

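use std::ops::Deref;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Instant;

let s = Instant::now();
unsafe { self.imp.layer.setContents(Some(image.as_ref())) };

// Per-frame index, so that out-of-order completions show up in the log.
// An atomic counter avoids the pitfalls of `static mut`.
static FRAME: AtomicU32 = AtomicU32::new(0);
let frame = FRAME.fetch_add(1, Ordering::Relaxed);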

unsafe {
    CATransaction::setCompletionBlock(Some(
        // Does this clone or otherwise reference the block? After the move,
        // the closure lifetime is 'static and could use StackBlock as well?
        block2::RcBlock::new(move || {
            println!("{frame:0>6}: {:?}", s.elapsed());
        })
        .deref(),
    ))
};

@nicoburns

nicoburns commented Aug 20, 2025

This PR seems to have a memory problem. I am able to get memory usage to spike as high as 3GB+ with this PR just by scrolling my app (which causes it to render frames). Interestingly, it does drop if I resize the window, but only down to ~700MB. Rendering this same app with pixels, it sits at around 150MB.

@madsmtm madsmtm added this to the Softbuffer v0.5 milestone Jan 14, 2026
@madsmtm
Member Author

madsmtm commented Jan 29, 2026

So, I reverse-engineered a bit, and found that CATransaction is actually committed ("flushed") once every iteration of the event loop. This is done by registering an observer with the run loop as follows:

CFRunLoopObserverCreate(
    NULL,                                     // allocator
    kCFRunLoopBeforeWaiting | kCFRunLoopExit, // activities
    1,                                        // repeats
    2000000,                                  // order
    CA::Transaction::observer_callback,       // callout
    ptr::null(),                              // context
);

This clashes a bit with Winit (as expected), since Winit issues RedrawRequested inside kCFRunLoopBeforeWaiting too, but with a much higher order value. This means that any work that is done inside RedrawRequested is actually queued up for the next iteration of the event loop.

This also matches the documentation for +[CATransaction flush] now that I read it again.
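
To make the ordering issue concrete, here is a toy simulation of a single run-loop activity with ordered observers. Nothing in it is a real CoreFoundation or Winit API: `Observer` and `simulate` are hypothetical names, and the only values taken from above are the order `2000000` and Winit's "much higher" order (modelled as `i64::MAX`). It assumes observers within one activity are processed in increasing order:

```rust
/// Toy model of one run-loop activity (e.g. "before waiting") with ordered observers.
/// Nothing here is a real CoreFoundation API; only the order values come from the thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Observer {
    /// Stands in for Core Animation's implicit transaction flush.
    FlushTransaction,
    /// Stands in for Winit emitting `RedrawRequested`, which stages new layer contents.
    Redraw,
}

/// Runs `iterations` run-loop iterations. For every frame that reached the compositor,
/// returns `(iteration it was flushed in, iteration it was drawn in)`.
pub fn simulate(mut observers: Vec<(i64, Observer)>, iterations: u32) -> Vec<(u32, u32)> {
    // Observers with a lower order value run first within an activity.
    observers.sort_by_key(|&(order, _)| order);

    let mut staged: Option<u32> = None; // frame drawn but not yet flushed
    let mut presented = Vec::new();
    for iteration in 0..iterations {
        for &(_, observer) in &observers {
            match observer {
                Observer::Redraw => staged = Some(iteration),
                Observer::FlushTransaction => {
                    if let Some(drawn_in) = staged.take() {
                        presented.push((iteration, drawn_in));
                    }
                }
            }
        }
    }
    presented
}
```

With the flush at order `2_000_000` and the redraw at `i64::MAX`, three iterations yield `[(1, 0), (2, 1)]`: every frame is flushed one iteration after it was drawn. Giving the redraw a lower order than the flush yields `[(0, 0), (1, 1), (2, 2)]` in this model.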


Furthermore, I tried stepping through a debugger, and found that right after issuing +[CATransaction begin] + +[CATransaction commit] (effectively +[CATransaction flush] when not inside an existing transaction / CoreGraphics context IIUC), the image in the window is updated (even though we haven't returned to the run loop or anything).

This leads me to believe that +[CATransaction flush] is what sends data to the compositor (WindowServer).

To put it in Wayland terms:

  • -[CALayer setContents:] is analogous to wl_surface.attach (though completely client-side).
  • +[CATransaction flush] is analogous to wl_surface.commit (except it's global instead of only for the current surface).
  • +[CATransaction flush] is automatically called each iteration of the runloop (roughly each frame).

This can be seen by drawing twice per frame:

// Draw black
let buffer = surface.buffer_mut().unwrap();
buffer.present().unwrap();

let mut buffer = surface.buffer_mut().unwrap();
draw(&mut buffer);
buffer.present().unwrap();

Before this PR, the window flickers between black and the actual contents; after this PR, it only shows the final contents.
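
The same behaviour can be sketched with a toy model of the attach/flush split. `Layer`, `Compositor` and `draw_twice` are hypothetical stand-ins (not Core Animation or Wayland types), and the model assumes that setting the contents only replaces a pending image until something flushes it:

```rust
/// Toy model of the staging behaviour described above. `Layer` and `Compositor`
/// are hypothetical stand-ins, not Core Animation or Wayland types.
#[derive(Default)]
pub struct Layer {
    pending: Option<&'static str>,
}

#[derive(Default)]
pub struct Compositor {
    /// Every image that actually became visible, in order.
    pub shown: Vec<&'static str>,
}

impl Layer {
    /// Like `-[CALayer setContents:]`: client-side only, replaces any pending image.
    pub fn set_contents(&mut self, image: &'static str) {
        self.pending = Some(image);
    }

    /// Like `+[CATransaction flush]`: sends the pending image (if any) to the compositor.
    pub fn flush(&mut self, compositor: &mut Compositor) {
        if let Some(image) = self.pending.take() {
            compositor.shown.push(image);
        }
    }
}

/// Draws "black" and then "contents" within one frame.
/// `flush_in_present` models committing a transaction inside `present`.
pub fn draw_twice(flush_in_present: bool) -> Vec<&'static str> {
    let mut layer = Layer::default();
    let mut compositor = Compositor::default();
    for image in ["black", "contents"] {
        layer.set_contents(image);
        if flush_in_present {
            layer.flush(&mut compositor);
        }
    }
    // The implicit flush at the end of the run-loop iteration.
    layer.flush(&mut compositor);
    compositor.shown
}
```

In this model `draw_twice(true)` makes both "black" and "contents" visible (the flicker), while `draw_twice(false)` only ever shows "contents".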


Now the question becomes: What should Buffer::present do? Submitting to the compositor directly inside present is closest to the other platforms, but arguably, it's the wrong behaviour for Softbuffer to touch global state like CATransaction, because we're submitting all in-flight work (which might interfere with other parts of the application, especially if rendering multiple windows on different threads).

See also #29.

@madsmtm
Member Author

madsmtm commented Jan 29, 2026

@nicoburns:

This PR seems to have a memory problem. I am able to get memory usage to spike as high as 3GB+ with this PR just by scrolling my app (which causes it to render frames). Interestingly, it does drop if I resize the window, but only down to ~700MB. Rendering this same app with pixels, it sits at around 150MB.

Hmm, weird. Is there a public example in Blitz that reproduces that? Or is there some other way I would be able to reproduce this?

If not, maybe I could get you to try the following diff in Winit 0.30 + this PR?

diff --git a/src/platform_impl/macos/observer.rs b/src/platform_impl/macos/observer.rs
index 833980308..3b83401fb 100644
--- a/src/platform_impl/macos/observer.rs
+++ b/src/platform_impl/macos/observer.rs
@@ -220,7 +220,7 @@ pub fn setup_control_flow_observers(mtm: MainThreadMarker, panic_info: Weak<Pani
         );
         run_loop.add_observer(
             kCFRunLoopExit | kCFRunLoopBeforeWaiting,
-            CFIndex::MAX,
+            10000, // Less than `2000000`
             control_flow_end_handler,
             &mut context as *mut _,
         );

@nicoburns

Is there a public example in Blitz that reproduces that? Or is there some other way I would be able to reproduce this?

cargo run -rp todomvc --no-default-features --features incremental,log_frame_times,cpu-softbuffer

The fast-softbuffer branch patches in this PR.


If not, maybe I could get you to try the following diff in Winit 0.30 + this PR?

Will this also work with Winit 0.31? I upgraded Blitz main already...

@madsmtm
Member Author

madsmtm commented Jan 29, 2026

Is there a public example in Blitz that reproduces that? Or is there some other way I would be able to reproduce this?

cargo run -rp todomvc --no-default-features --features incremental,log_frame_times,cpu-softbuffer

The fast-softbuffer branch patches in this PR.

Hmm, the incremental feature doesn't exist? And I tried scrolling around, but could never get it above 100MB memory usage?

If not, maybe I could get you to try the following diff in Winit 0.30 + this PR?

Will this also work with Winit 0.31? I upgraded Blitz main already...

The specific patch there would only apply cleanly on 0.30, but the situation should be the same on 0.31; the equivalent patch there is:

diff --git a/winit-appkit/src/observer.rs b/winit-appkit/src/observer.rs
index 48c934491..e4d98bcc7 100644
--- a/winit-appkit/src/observer.rs
+++ b/winit-appkit/src/observer.rs
@@ -160,7 +160,7 @@ pub fn setup_control_flow_observers(mtm: MainThreadMarker) {
         );
         run_loop.add_observer(
             CFRunLoopActivity::Exit | CFRunLoopActivity::BeforeWaiting,
-            CFIndex::MAX,
+            2000,
             Some(control_flow_end_handler),
             &mut context as *mut _,
         );

@nicoburns

Is there a public example in Blitz that reproduces that? Or is there some other way I would be able to reproduce this?

cargo run -rp todomvc --no-default-features --features incremental,log_frame_times,cpu-softbuffer
The fast-softbuffer branch patches in this PR.

Hmm, the incremental feature doesn't exist? And I tried scrolling around, but could never get it above 100MB memory usage? You can also run with cpu-pixels rather than cpu-softbuffer for comparison.

That exact command works for me :/ Do you have an old version somehow? Perhaps I hadn't pushed the fast-softbuffer branch?

It's resizing the window a lot where I see high memory usage.


I tried the Winit patch and it didn't seem to make any difference.

@madsmtm
Member Author

madsmtm commented Jan 29, 2026

Okay, so I pulled nicoburns/blitz@67ee43d (though I had to do a cargo update -p softbuffer on that to actually make it use the patched softbuffer). Then I ran the command you provided, on both macOS 15.7.3 and in a macOS 26.2 VM.

On macOS 15, Activity Monitor reports a steady <100MB memory usage for todomvc.

On macOS 26 though, if I moved the window offscreen and resized it to be larger than my screen, I could get the memory usage to spike a lot, easily to 1GB or more. But the thing is, I can reproduce that before this PR too; it's just "only" roughly half as bad [1]. And applying the Winit fix brings it down to pre-PR levels.

A guess: The allocator changed in macOS 26 (which has caused other problems), perhaps the memory allocations are more "lazily" released there, which means that Activity Monitor can't report it as accurately?

That hypothesis is supported by the following screenshot, where I resized the window to ~20 times my screen width (i.e. with most of it offscreen). todomvc is reported as using 6GB of memory [2], which clearly isn't true: the machine only has ~4GB, and the overall memory pressure isn't even going orange.
[screenshot: keep resizing]

Does this match the behaviour you're seeing? Or is there something else going on?

Footnotes

  1. Caveat emptor: these are human-eye heuristics; it's kinda hard to precisely resize a window with a cursor.

  2. For comparison, a window on macOS 15 resized to ~20 times screen width only needs ~350MB.

@nicoburns

nicoburns commented Jan 30, 2026

The problematic case isn't when the window is resized once to a large size. It's when it's frequently resized repeatedly. That being said, I'm not seeing behaviour that was as bad as I remember seeing (possibly due to OS updates??? Or I suppose Winit 0.31 upgrade?), and I'm also seeing basically the same behaviour with Blitz main (softbuffer version from crates.io).

Testing on macOS 15 btw.

And this PR is still much faster than the released version. The speed is dependent on window size. The current version of softbuffer gets progressively slower the larger the window gets and this PR doesn't.

Video of the testing process I've been using (apologies for poor quality - had to compress for file size reasons):

clip.mp4

@nicoburns

Hmm... I've just noticed that my original comment did explicitly say the issue happened during scrolling. Let me retest that.

@nicoburns

Ok, so I am able to get memory usage to increase pretty fast by running:

cargo run -rp readme --no-default-features --features comrak,log_frame_times,log_phase_times,cpu-softbuffer .

And then scrolling up and down like a madman for a few seconds.

However, this seems to happen on main too, and also when using pixels instead of softbuffer (although potentially a little slower?). It doesn't happen when using the vello backend, though (so perhaps it's an issue with vello_cpu?).

With the kind of differences (in speed of memory growth) I'm seeing now I'd say it's entirely possible that memory usage is increasing a little faster with this PR simply because it's able to render frames faster so it's rendering more frames in the same period of time.

@nicoburns

My softbuffer integration code is also allocating a new buffer-sized Vec every frame, which is probably terrible for memory fragmentation?

Screenshot 2026-01-30 at 00 44 55

@madsmtm
Member Author

madsmtm commented Jan 30, 2026

The problematic case isn't when the window is resized once to a large size. It's when it's frequently resized repeatedly.

Yeah, I tested that as well.

video

I can get the "Real Mem" up to about 1GB on macOS 15 too if I do that, but that's (seemingly) true both before and after this PR.

With the kind of differences (in speed of memory growth) I'm seeing now I'd say it's entirely possible that memory usage is increasing a little faster with this PR simply because it's able to render frames faster so it's rendering more frames in the same period of time.

Sounds likely.

I'll add, I'm not that familiar with exactly how memory allocation works, but I'm pretty sure that just because Activity Monitor is reporting a high memory usage doesn't actually mean that there are any leaks - it might again just be the allocator doing funny stuff. I tried running under Instruments' leak checker, the only reported leak is in Winit's AppState::setup_global (which is expected).

@madsmtm
Member Author

madsmtm commented Jan 30, 2026

My softbuffer integration code is also allocating a new buffer-sized Vec every frame, which is probably terrible for memory fragmentation?

Oh definitely! At the very least, you should allocate once and then re-use the buffer (and on window resize, as long as it's large enough, keep using that buffer). Softbuffer is currently also allocating on macOS on every frame, which just makes the problem even worse (it should be double-buffering IOSurfaces instead).
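
A minimal sketch of the "allocate once, re-use" idea on the application side. `FrameScratch` is a hypothetical helper (not a softbuffer or Blitz type); it assumes `u32` pixels and re-zeroes the slice each frame, which may or may not be needed depending on the renderer:

```rust
/// A scratch pixel buffer that is reused across frames.
/// `FrameScratch` is a hypothetical helper, not a softbuffer or Blitz type.
#[derive(Default)]
pub struct FrameScratch {
    pixels: Vec<u32>,
}

impl FrameScratch {
    /// Returns a zeroed slice of exactly `width * height` pixels.
    /// The backing allocation only grows; a smaller window keeps the old capacity.
    pub fn frame(&mut self, width: usize, height: usize) -> &mut [u32] {
        let len = width * height;
        self.pixels.clear(); // keeps the capacity
        self.pixels.resize(len, 0); // reallocates only if `len` exceeds the capacity
        &mut self.pixels[..]
    }

    /// Number of pixels the backing allocation can currently hold.
    pub fn capacity(&self) -> usize {
        self.pixels.capacity()
    }
}
```

The trade-off is that the buffer stays as large as the largest window size seen so far; shrinking it after a while would be a separate policy decision.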

I know you're aware, but I'll just note it here too that linebender/vello#1382 is tracking allowing vello_cpu to use Softbuffer's buffer more directly.

@nicoburns

I know you're aware, but I'll just note it here too that linebender/vello#1382 is tracking allowing vello_cpu to use Softbuffer's buffer more directly.

I think this is possible today. It just requires swizzling in-place. I'll probably give that another go if this PR lands.
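
A rough sketch of what swizzling in place could look like. The layouts are assumptions: it presumes the renderer has written RGBA8 bytes into the same memory, and that the target is a `0RGB` `u32` per pixel (`0x00RRGGBB`), with alpha simply discarded. `swizzle_rgba_to_0rgb` is a hypothetical name:

```rust
/// Converts pixels in place from RGBA8 byte order (as a CPU renderer might write them)
/// to a `0RGB` `u32` layout (`0x00RRGGBB`). Alpha is discarded.
/// Both layouts are assumptions; check the renderer's and softbuffer's documentation.
pub fn swizzle_rgba_to_0rgb(pixels: &mut [u32]) {
    for px in pixels.iter_mut() {
        // `to_ne_bytes` yields the bytes in memory order, i.e. [r, g, b, a].
        let [r, g, b, _a] = px.to_ne_bytes();
        *px = (u32::from(r) << 16) | (u32::from(g) << 8) | u32::from(b);
    }
}
```

Because each pixel is read and written at the same index, no second buffer-sized allocation is needed.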

@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from e23a8fb to 6964486 Compare January 30, 2026 02:15
@madsmtm
Member Author

madsmtm commented Jan 30, 2026

Thinking about it, I don't believe this PR actually affects real-world performance; sending the data to the compositor still has to happen somewhere.

E.g. doing:

let time = Instant::now();
CATransaction::begin();
CATransaction::commit();
println!("elapsed: {:?}", time.elapsed());

This reveals times of ~100 nanoseconds, which is probably "just" bookkeeping and a few checks to see if there are any pending modifications we need to send - it might not actually be sending anything to the compositor here!

I also tried adding a simple FPS counter to the raytracing.rs example, specifically one that counts the number of frames and not just the time it took to process those frames (because again part of that cost is hidden inside the implicit transaction in AppKit). This revealed similar FPS both before and after this PR.
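
For reference, a minimal sketch of a frame-counting FPS counter of that kind. `FpsCounter` is a hypothetical helper (this is not the code that was added to `raytracing.rs`); timestamps are passed in explicitly so that it counts frames per interval rather than summing per-frame processing time:

```rust
use std::time::{Duration, Instant};

/// Counts presented frames per fixed interval, independent of how long each frame took.
/// `FpsCounter` is a hypothetical helper, not code from the `raytracing.rs` example.
pub struct FpsCounter {
    window_start: Instant,
    frames: u32,
    interval: Duration,
}

impl FpsCounter {
    pub fn new(now: Instant, interval: Duration) -> Self {
        Self { window_start: now, frames: 0, interval }
    }

    /// Call once per presented frame.
    /// Returns the average FPS whenever at least `interval` has elapsed.
    pub fn tick(&mut self, now: Instant) -> Option<f64> {
        self.frames += 1;
        let elapsed = now.duration_since(self.window_start);
        if elapsed < self.interval {
            return None;
        }
        let fps = f64::from(self.frames) / elapsed.as_secs_f64();
        // Start a new measurement window.
        self.window_start = now;
        self.frames = 0;
        Some(fps)
    }
}
```

In a redraw handler this would be called as `counter.tick(Instant::now())` right after presenting, printing the value whenever it returns `Some`.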


So what does this PR do? My understanding (now) is that it allows resizing to work more smoothly, since the updates to the window frame and decorations (which are fully under the control of the client / AppKit / the app) are sent to the compositor at the same time as the re-rendered content.

Wayland solves this for subsurfaces IIUC with wl_subsurface.set_sync and wl_subsurface.set_desync. To put it in those terms, the current implementation desyncs the subsurface (Softbuffer's CALayer) from the parent surface (the entire window), while this PR makes them in-sync again (with the natural caveat that if no-one commits the parent surface / commits at the end, then the data won't be sent to the compositor).


I'm a bit unsure what the default behaviour should be, but I think this PR is the right choice (especially if Winit got better at this). This also matches Wayland in that subsurfaces are synchronized by default (though it's a bit different in that I think the surface Winit's Wayland impl provides to Softbuffer is the root surface, whereas in the macOS impl the contentView is more akin to a subsurface).

In the future, we might want to provide a way to configure this behaviour, but let's track that in #29, since it's relevant for the other stuff in there.

@madsmtm madsmtm dismissed notgull’s stale review January 30, 2026 02:19

Can be done in a follow-up

This effectively desyncs the surface from the rest of the window, which
isn't desirable when resizing. Instead, we now commit in the implicit
transaction that happens at the end of the current runloop iteration.
@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from 6964486 to 1c617ea Compare January 30, 2026 02:25