-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Description
In which: I notice a lot of issues with (intermittent?) timeouts that say bw: SpinningDown and wonder if it is a significant pattern.
Client reports timeouts from a slow EVAL command. But there was no specific symptom of timeout or slow EVAL command on the server side.
It occurs to me that there are a lot of issues mentioning "SpinningDown", and maybe they arise from a race condition related to the backlog, around the _backlogProcessorIsRunning and _backlogAutoReset state. So, let's see..
The theoretical analysis
The job of StartBacklogProcessor is to ensure a backlog processing thread is running and handling any backlog work. It handles this either by starting a new backlog processor, or by trying to wake up an existing one by setting the _backlogAutoReset event, if it sees that _backlogProcessorIsRunning.
The event is observed by doing WaitOne(_backlogAutoReset, 5 seconds), but there is a race as to:
a) when anybody sets the event, and whether WaitOne randomly returns after 5 seconds, or the event is set first.
b) if multiple people set the event how many times it is observed - in theory though, being observed once should be enough, so I'm not convinced this is an actual problem
If WaitOne sees the event, great, the backlog processing thread wakes up and loops to handle the work.
If WaitOne doesn't see the event, the event is set, and the backlog processing thread exits, releasing its hold on _backlogProcessorIsRunning, and leaving the _backlogStatus equal to SpinningDown. No new backlog thread got started up, since the event was triggered... just not in time.
So yes, we can get to the following undesirable state:
a) we have an item in the backlog
b) backlog status is spinning down
c) backlog processor is no long running, and released _backlogProcessorIsRunning
d) somebody might have tried to restart the backlog processor, but backlog thread didn't observe that
Example repro error message that I feel matches this scenario and motivated me to look deeper
(from 2025)
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=1KiB, 3168ms elapsed, timeout is 3000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 63738, serverEndpoint: x.x.x.x:15011, conn-sec: 2617.06, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxx(SE.Redis-v2.10.1.65101), PerfCounterHelperkeyHashSlot: 13549, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=10,Max=32767), POOL: (Threads=13,QueuedItems=0,CompletedItems=8851526,Timers=1526), v: 2.10.1.6510
Now I just wonder if there's an actual decent way to repro this to prove the theory...