Skip to content

Add retries to local sync download #9191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

itaigilo
Copy link
Contributor

Change Description

Background

Adding extra retries to lakectl local sync (which is invoked by either clone or checkout).
These are an attempt to better handle socket hangouts of the storage backend, in extreme cases of very large files.

Testing Details

I haven't been able to reproduce the bug (which was reported by a client), but this addresses this specific reported failure.

Added unit-tests, and verified that a simple local clone and local checkout work as before.

@itaigilo itaigilo requested review from N-o-Z and a team June 16, 2025 21:28
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Download Retry Fails, Progress Bars Leak

When retrying a failed download, the file is not truncated. The code only seeks to the beginning of the file, which can lead to file corruption if a previous attempt wrote partial data and the retry writes less, leaving leftover data. Additionally, a new progress bar is created for each retry attempt, resulting in multiple progress indicators for the same download and potential resource leaks, as previous bars are not cleaned up.

pkg/local/sync.go#L186-L202

lakeFS/pkg/local/sync.go

Lines 186 to 202 in 4ca2f9e

if attempt > 1 {
// If we are retrying, we need to reset the file pointer to the beginning
if _, err := f.Seek(0, io.SeekStart); err != nil {
return fmt.Errorf("could not seek to start of file '%s': %w", destination, err)
}
}
b := s.progressBar.AddReader(fmt.Sprintf("download %s", path), sizeBytes)
defer func() {
if err != nil {
b.Error()
} else {
atomic.AddUint64(&s.tasks.Downloaded, 1)
b.Done()
}
}()

Fix in Cursor


Was this report helpful? Give feedback by reacting with 👍 or 👎

@itaigilo itaigilo added include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached labels Jun 16, 2025
Copy link
Member

@N-o-Z N-o-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things on the higher level which I don't understand:

  1. SyncManager is already receiving a retryable http client, why do we need this mechanism as well?
  2. We already have a retry configuration parameter, why are we not using it?
  3. The PR states the changes should address issues with httpclient hangs and large files, however we catch all errors and retry on them. I'm not sure that's what we want.
  4. I did not understand the use case. There's no issue linked, no reproduction scenario or a sighted bug linked. Need more context

}

// retryIfError attempts to execute the given operation up to maxAttempts times
func retryIfError(operation func(attempt int) error, maxAttempts int) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func retryIfError(operation func(attempt int) error, maxAttempts int) error {
func retryOnError(operation func(attempt int) error, maxAttempts int) error {

Or just simply retry

return nil
}

retriesCount := 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this magic number?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants