Skip to content

Conversation

@ShahabT
Copy link
Contributor

@ShahabT ShahabT commented Nov 30, 2025

What changed?

Add revision number to Worker Deployment Version Data that is synced to Task Queues.

Why?

This allows making task queue registration also async, but more importantly, it prevent's race conditions between concurrent registrations, setCurrent/Ramping, drainage, and deletion.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

None.

@ShahabT ShahabT requested review from a team as code owners November 30, 2025 07:29
@ShahabT ShahabT requested a review from Shivs11 November 30, 2025 07:29
Copy link
Member

@Shivs11 Shivs11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did one pass of this PR - i think it's great from an initial scan but i think we should sync up to discuss a few things since if we miss out something, this could be breaking

Comment on lines +2109 to +2110
if cleanupOldDeletedVersions(tqWorkerDeploymentData) {
changed = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could just have this function placed inside of removeDeploymentVersions and call it something like removeAndCleanUpDeploymentVersions. It shall then have the contained logic of deleting old versions, marking the version in the new format as deleted and also Garbage Collecting any versions if present.

Comment on lines +2130 to +2132
// TODO: improve this logic to remove even more old versions if we're reaching the limit of max versions per task queue.
// max versions per task queue is not implemented yet but we should have it because as of now a task queue can be polled from
// numerous deployment versions (accross different deployments) and that leads to a very large user data size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about this a bit more - I think we should have a safeguard (dynamic config) for this prior to launching this as a GA. If folks use versioning with their CI/CD systems (as is done by our internal saas team) on the same task-queue, what you have placed in the comment can 100% come to life.

wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will follow up with a PR to clean up more TQs when reaching a limit

// If the routing config changed after applying this operation. Compared base on revision number.
bool routing_config_changed = 2;
// Deprecated. using this is not totaly safe in case of retries.
bool routing_config_changed = 2 [deprecated = true];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has confused me a little....might be worth discussing on call.

Comment on lines 55 to 58
// Tracks the version of the deployment version workflow when a particular run of a workflow starts base on the dynamic config of the
// worker who completes the first task of the workflow. `workflowVersion` remains the same until the workflow CaNs when it
// will get another chance to pick the latest manager version.
workflowVersion DeploymentWorkflowVersion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

// workflowVersion is set at workflow start based on the dynamic config of the worker
// that completes the first task. It remains constant for the lifetime of the run and
// only updates when the workflow performs continue-as-new.

Comment on lines +520 to +521
// Still need to check RoutingConfig not being nil because of edge cases during enabling dynamic config.
// i.e. the deployment workflow might run old version and not send the routing config.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as my understanding goes, you have placed this check since you are worried that the deployment workflow and the version workflow may not be in full sync when it comes to the version of the code they are running.

Comment on lines 1181 to 1188
// syncRegisteredTaskQueueAsync syncs the routing config and version data to the new task queue.
// This method does not increment version data revision number.
// Note: task queue registration does not affect WorkerDeploymentInfo.RoutingConfigUpdateState hence we do not signal
// the deployment workflow about propagation completion.
// TODO: should RoutingConfigUpdateState become IN_PROGRESS while registration is propagating? Current thoughts: It
// doesn't seem useful because the following setCurrent/Ramping will set the status to IN_PROGRESS. The only case is if
// user is calling setCurrent/Ramping before registration passing IgnoreMissingTaskQueues, but in that case still the
// registration is likely not in control of the operator but is async all the way from the worker side.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets discuss this on a call; have some concerns/thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW - I think we should be a bit conservative here and have the routing config state to be in progress. I think convoluting different API calls further with an in-progress propagation is more error prone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought more about this and I think we should do in the case that version is already ramping or current (but not generally). Updated my comment. I think it's fine to do it later since it's a very small edge case.

Comment on lines +599 to +603
// Ensure CurrentDeploymentVersion is populated from deprecated field if needed for backward compatibility
currentDeploymentVersion := d.State.RoutingConfig.CurrentDeploymentVersion
if currentDeploymentVersion == nil {
currentDeploymentVersion = worker_versioning.ExternalWorkerDeploymentVersionFromStringV31(d.State.RoutingConfig.CurrentVersion)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just make our life simpler by always populating both values while the workflow CAN's. It's simpler since then, in all our API handlers, we can just focus on using the new value and not worry about workflow versions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will follow up in another PR

@ShahabT ShahabT enabled auto-merge (squash) December 9, 2025 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants