Describe the bug
OneshotTransport can hang forever due to a race condition between send() and receive(). The issue is that tokio::sync::Notify::notify_waiters() only wakes tasks that are currently waiting. If notify_waiters() is called before receive() starts waiting (via notified().await), the notification is lost and receive() blocks indefinitely.
This happens because send() returns a 'static future that runs independently, while receive() is called through tokio::select! which cancels and recreates futures between iterations. If the timing lines up wrong, you get a deadlock.
In our testing, this only occurred when the tool returned an error (e.g., 404 from downstream API). Successful responses never triggered the hang. This might be due to different code paths or timing characteristics when is_error: true.
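The lost notification described above can be modeled without tokio. The sketch below is an analogy, not the OneshotTransport code: it uses std::sync::Condvar as a stand-in for Notify, because Condvar::notify_all(), like notify_waiters(), only wakes waiters that are already blocked. The function name and the timeout parameter are hypothetical, and the timeout exists only so the sketch returns instead of hanging.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Returns true when the waiter timed out, meaning the notification
/// that fired before the wait began was lost.
fn notify_before_wait_is_lost(timeout: Duration) -> bool {
    let lock = Mutex::new(());
    let cvar = Condvar::new();

    // "send" side: fire the notification while nobody is waiting.
    cvar.notify_all();

    // "receive" side: start waiting only afterwards. Nothing was stored,
    // so the only way out is the timeout (or a rare spurious wakeup).
    let guard = lock.lock().unwrap();
    let (_guard, result) = cvar.wait_timeout(guard, timeout).unwrap();
    result.timed_out()
}
```

Calling notify_before_wait_is_lost(Duration::from_millis(50)) is expected to return true. In the real transport there is no timeout on receive(), so the equivalent wait lasts until the Lambda timeout.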
To Reproduce
- Configure stateless HTTP mode:
      let session_manager = Arc::new(LocalSessionManager::default());
      let config = StreamableHttpServerConfig {
          stateful_mode: false,
          ..Default::default()
      };
- Deploy to Lambda or similar serverless environment
- Send requests that result in tool errors (e.g., 404 from downstream)
- Send concurrent requests, especially during cold starts
- Observe some requests hanging for the full timeout duration (30s in Lambda)
The race window is small but hits reliably under load. In our testing, ~3-5% of cold start requests would hang, always on error responses.
Expected behavior
After send() completes with a Response or Error message, the next call to receive() should return None promptly, allowing the serve loop to exit cleanly.
Logs
Successful request (no race):
Looking up order: TEST
serve finished quit_reason=Closed
REPORT Duration: 841.46 ms
Failed request (race condition hit - note: downstream returned 404):
Looking up order: TEST
Downstream API error (status 404)
Response(JsonRpcResponse { ... is_error: Some(true) ... })
REPORT Duration: 30000.00 ms Status: timeout
Note: no serve finished log - the serve loop never exited because receive() hung waiting for a notification that already fired.
Additional context
To make it work locally, I replaced Notify with Semaphore. Permits persist until acquired, so there's no race:
// In struct
termination: Arc<Semaphore>,
// In new()
termination: Arc::new(Semaphore::new(0)),
// In send()
if terminate {
    termination.add_permits(1); // Persists even if no one is waiting
}
// In receive()
if let Some(msg) = self.message.take() {
    return Some(msg);
}
let _ = self.termination.acquire().await; // Gets the permit immediately if one exists
None
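Why a permit fixes the race can be shown with a small std-only model. This is a sketch under assumptions, not tokio's Semaphore: PermitGate and acquire_timeout are hypothetical names, and the timeout is there only to keep the example from blocking. The point is that add_permits stores state, so a later acquire sees it regardless of ordering.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Minimal counting gate: permits are stored, not just signalled.
struct PermitGate {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl PermitGate {
    fn new() -> Self {
        PermitGate { permits: Mutex::new(0), cvar: Condvar::new() }
    }

    /// Adds permits; they persist even if no one is waiting yet.
    fn add_permits(&self, n: usize) {
        *self.permits.lock().unwrap() += n;
        self.cvar.notify_all();
    }

    /// Takes one permit, waiting up to `timeout`. Returns false if none arrived.
    fn acquire_timeout(&self, timeout: Duration) -> bool {
        let guard = self.permits.lock().unwrap();
        // The predicate is checked before sleeping, so an existing permit
        // is picked up immediately; spurious wakeups are handled too.
        let (mut guard, _result) = self
            .cvar
            .wait_timeout_while(guard, timeout, |p| *p == 0)
            .unwrap();
        if *guard == 0 {
            return false;
        }
        *guard -= 1;
        true
    }
}
```

With this model, calling add_permits(1) before acquire_timeout still lets the acquire succeed at once, which mirrors the "send() finishes before receive() starts waiting" ordering that hangs with Notify.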