Skip to content

[CELEBORN-1768][SPARK-3][WRITER] Refactoring Shuffle Writer to extract common methods #3306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

zwangsheng
Copy link
Contributor

What changes were proposed in this pull request?

Refactoring shuffleWriters, extract common methods as BasedShuffleWriter.

Why are the changes needed?

Currently, HashBasedShuffleWriter and SortBasedShuffleWriter have a lot of the same methods, which makes it difficult for us to maintain the code. In many scenarios, we need to update multiple classes at the same time.

Therefore, we need to refactor Shuffle Writer, extract common methods without changing the specific method logic, and simplify the corresponding Shuffle Writer.

Does this PR introduce any user-facing change?

No, we do not change the logic code of apis or others.

How was this patch tested?

UT and local env

@zwangsheng zwangsheng changed the title [CELEBORN-1768][WRITER] Refactoring Shuffle Writer to extract common methods [CELEBORN-1768][SPARK-3][WRITER] Refactoring Shuffle Writer to extract common methods Jun 3, 2025
@zwangsheng
Copy link
Contributor Author

Ref to #2985, cc @FMX

@zwangsheng zwangsheng marked this pull request as draft June 3, 2025 12:25
@zwangsheng zwangsheng self-assigned this Jun 3, 2025
@zwangsheng zwangsheng marked this pull request as ready for review June 3, 2025 12:42
final int serializedRecordSize = 4 + rowSize;

if (dataSize != null) {
dataSize.add(serializedRecordSize);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a decision point. Hash-based writer chooses to pass in the actual row size, and Sort-based writer chooses to add a 4-bit overhead.

Which one do we need to choose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the serializedRecordSize in both the hash-based shuffle writer and the sort-based shuffle writer. After checking the earliest commit logs, I believe that this difference is unintentional.

Copy link

codecov bot commented Jun 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.43%. Comparing base (fff9725) to head (9a3f7db).
Report is 51 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3306      +/-   ##
==========================================
- Coverage   63.54%   63.43%   -0.11%     
==========================================
  Files         343      343              
  Lines       20812    20875      +63     
  Branches     1835     1842       +7     
==========================================
+ Hits        13222    13239      +17     
- Misses       6630     6670      +40     
- Partials      960      966       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@FMX
Copy link
Contributor

FMX commented Jun 4, 2025

@zwangsheng Please fix the sbt CIs first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants