Skip to content

[CELEBORN-2016] Add cooldown time in worker shutdown #3294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

s0nskar
Copy link
Contributor

@s0nskar s0nskar commented May 27, 2025

What changes were proposed in this pull request?

  • Adding addition cooldown time in worker shutdown logic, which will allow all shutdown hook to execute completely.
  • Minor improvement in shutdown logic.

Why are the changes needed?

We current shutdown logic we have seen worker getting shutdown abruptly with timeout exception without completely executing the shutdown hook because of which Celeborn is –

  • unable to print unreleased partition info on decommission
  • not able to update sorted file DB

Does this PR introduce any user-facing change?

NA

How was this patch tested?

NA

Comment on lines +4236 to +4243
val WORKER_GRACEFUL_SHUTDOWN_TIMEOUT: ConfigEntry[Long] =
buildConf("celeborn.worker.graceful.shutdown.timeout")
.categories("worker")
.doc(s"The worker's graceful shutdown timeout time. This should include " +
s"${WORKER_CHECK_SLOTS_FINISHED_TIMEOUT.key} and ${WORKER_PARTITION_SORTER_SHUTDOWN_TIMEOUT.key}.")
.version("0.2.0")
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("600s")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed ordering to fix forward referencing the variables in description.

@FMX
Copy link
Contributor

FMX commented Jun 4, 2025

Can we just set a larger value for celeborn.worker.graceful.shutdown.timeout?

@s0nskar
Copy link
Contributor Author

s0nskar commented Jun 4, 2025

Can we just set a larger value for celeborn.worker.graceful.shutdown.timeout?

For graceful shutdown, we can do this and let other user know by adding in the config description.

But we still need to handle this for decommission flow, so this config can be general / default way to provide some time buffer without requiring config tuning from users.

@FMX
Copy link
Contributor

FMX commented Jun 5, 2025

Can we just set a larger value for celeborn.worker.graceful.shutdown.timeout?

For graceful shutdown, we can do this and let other user know by adding in the config description.

But we still need to handle this for decommission flow, so this config can be general / default way to provide some time buffer without requiring config tuning from users.

The config celeborn.worker.graceful.shutdown.timeout is Celeborn worker's config, which means that the users should not tune it because Celeborn Cluster should not be exposed to the users. In what scenario will the user tune the configs for the Celeborn cluster?

@s0nskar
Copy link
Contributor Author

s0nskar commented Jun 9, 2025

@FMX By users i meant the Celeborn admins, not the client.

For worker decommission, the force exit timeout is 600s. But in some cases the shutdown hook will not fully execute, lets say sendWorkerDecommissionToMaster() took 2s. Due to which we will miss out on the information on unreleased shuffle which gets print at the end of decommission shutdown hook.

Similarly, for worker graceful shutdown, current default timeout is 600s, which accounts for check slots finished timeout (480s) and partition sorter timeout (120s). So either we can increase the graceful shutdown default value slightly or ask the celeborn admins to increase the timeout slightly grater than (check slots finished timeout + partition sorter timeout).

Both of the cases can be handled by adding small cooldown time.

@FMX
Copy link
Contributor

FMX commented Jun 10, 2025

@FMX By users i meant the Celeborn admins, not the client.

For worker decommission, the force exit timeout is 600s. But in some cases the shutdown hook will not fully execute, lets say sendWorkerDecommissionToMaster() took 2s. Due to which we will miss out on the information on unreleased shuffle which gets print at the end of decommission shutdown hook.

Similarly, for worker graceful shutdown, current default timeout is 600s, which accounts for check slots finished timeout (480s) and partition sorter timeout (120s). So either we can increase the graceful shutdown default value slightly or ask the celeborn admins to increase the timeout slightly grater than (check slots finished timeout + partition sorter timeout).

Both of the cases can be handled by adding small cooldown time.

In this scenario, why not increase the graceful shutdown time slightly? Looks like in both approaches, the cluster admin will need to change the config file.

@s0nskar
Copy link
Contributor Author

s0nskar commented Jun 11, 2025

Increasing the timeout approach will only solve for graceful shutdown, decommission flow will still have this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants