-
Notifications
You must be signed in to change notification settings - Fork 410
[CELEBORN-2016] Add cooldown time in worker shutdown #3294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
val WORKER_GRACEFUL_SHUTDOWN_TIMEOUT: ConfigEntry[Long] = | ||
buildConf("celeborn.worker.graceful.shutdown.timeout") | ||
.categories("worker") | ||
.doc(s"The worker's graceful shutdown timeout time. This should include " + | ||
s"${WORKER_CHECK_SLOTS_FINISHED_TIMEOUT.key} and ${WORKER_PARTITION_SORTER_SHUTDOWN_TIMEOUT.key}.") | ||
.version("0.2.0") | ||
.timeConf(TimeUnit.MILLISECONDS) | ||
.createWithDefaultString("600s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed ordering to fix forward referencing the variables in description.
Can we just set a larger value for |
For graceful shutdown, we can do this and let other user know by adding in the config description. But we still need to handle this for decommission flow, so this config can be general / default way to provide some time buffer without requiring config tuning from users. |
The config |
@FMX By users i meant the Celeborn admins, not the client. For worker decommission, the force exit timeout is 600s. But in some cases the shutdown hook will not fully execute, lets say Similarly, for worker graceful shutdown, current default timeout is 600s, which accounts for check slots finished timeout (480s) and partition sorter timeout (120s). So either we can increase the graceful shutdown default value slightly or ask the celeborn admins to increase the timeout slightly grater than (check slots finished timeout + partition sorter timeout). Both of the cases can be handled by adding small cooldown time. |
In this scenario, why not increase the graceful shutdown time slightly? Looks like in both approaches, the cluster admin will need to change the config file. |
Increasing the timeout approach will only solve for graceful shutdown, decommission flow will still have this problem. |
What changes were proposed in this pull request?
Why are the changes needed?
We current shutdown logic we have seen worker getting shutdown abruptly with timeout exception without completely executing the shutdown hook because of which Celeborn is –
Does this PR introduce any user-facing change?
NA
How was this patch tested?
NA