-
Notifications
You must be signed in to change notification settings - Fork 410
[CELEBORN-1768][SPARK-3][WRITER] Refactoring Shuffle Writer to extract common methods #3306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
final int serializedRecordSize = 4 + rowSize; | ||
|
||
if (dataSize != null) { | ||
dataSize.add(serializedRecordSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a decision point. Hash-based writer chooses to pass in the actual row size, and Sort-based writer chooses to add a 4-bit overhead.
Which one do we need to choose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the serializedRecordSize
in both the hash-based shuffle writer and the sort-based shuffle writer. After checking the earliest commit logs, I believe that this difference is unintentional.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3306 +/- ##
==========================================
- Coverage 63.54% 63.43% -0.11%
==========================================
Files 343 343
Lines 20812 20875 +63
Branches 1835 1842 +7
==========================================
+ Hits 13222 13239 +17
- Misses 6630 6670 +40
- Partials 960 966 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@zwangsheng Please fix the sbt CIs first. |
What changes were proposed in this pull request?
Refactoring shuffleWriters, extract common methods as BasedShuffleWriter.
Why are the changes needed?
Currently, HashBasedShuffleWriter and SortBasedShuffleWriter have a lot of the same methods, which makes it difficult for us to maintain the code. In many scenarios, we need to update multiple classes at the same time.
Therefore, we need to refactor Shuffle Writer, extract common methods without changing the specific method logic, and simplify the corresponding Shuffle Writer.
Does this PR introduce any user-facing change?
No, we do not change the logic code of apis or others.
How was this patch tested?
UT and local env