Skip to content

UCT/DC: Adjust DC FHS latency #10685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

UCT/DC: Adjust DC FHS latency #10685

wants to merge 4 commits into from

Conversation

shasson5
Copy link
Contributor

What?

Adjust DC FHS latency

Why?

Impove perf by better selection of RC/DC switching point

How?

Distinguish AR enabled and forced FHS scenraios and add extra round trip latency

@@ -205,6 +211,46 @@ uct_dc_mlx5_ep_create_connected(const uct_ep_params_t *params, uct_ep_h* ep_p)
}
}

/* Check whether RDMA write is disabled */
static int uct_dc_mlx5_iface_rdma_wr_disabled(uct_dc_mlx5_iface_t *iface)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const where possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getter functions used here receive non-const args

{"FULL_HANDSHAKE_ADDED_LATENCY", "140ns",
"Amount of latency added to performance estimation of DC due to full "
"handshake (used when AR is enabled).",
ucs_offsetof(uct_dc_mlx5_iface_config_t, fhs_added_latency),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_added_ is a bit weird. Maybe _extra_ instead? Or simple fhs_latency?
(+ in all similar places)

@@ -140,6 +140,12 @@ ucs_config_field_t uct_dc_mlx5_iface_config_sub_table[] = {
ucs_offsetof(uct_dc_mlx5_iface_config_t, dcis_initial_capacity),
UCS_CONFIG_TYPE_ULUNITS},

{"FULL_HANDSHAKE_ADDED_LATENCY", "140ns",
"Amount of latency added to performance estimation of DC due to full "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amount of ->

Comment on lines +143 to +146
{"FULL_HANDSHAKE_LATENCY", "140ns",
"DC Full Handshake extra latency",
ucs_offsetof(uct_dc_mlx5_iface_config_t, fhs_latency),
UCS_CONFIG_TYPE_TIME_UNITS},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there dependency on iface_attr->latency ? then maybe we could calculate default value based on it.

Copy link
Contributor Author

@shasson5 shasson5 May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU default value today doesn't reflect real latency (just a relation in comparison to other transports)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iface_attr->latency should be real latency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a separate issue (not related to this PR)

@@ -42,6 +42,12 @@ do
new_caps=$(echo "$new_tl_caps" | grep -A 7 "Device: $device" || true)
for cap in bandwidth latency overhead
do
# DC latency was modified to account for FHS since v1.20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just skip latency in line 43 since it is not affecting wire compatibility

Comment on lines +143 to +146
{"FULL_HANDSHAKE_LATENCY", "140ns",
"DC Full Handshake extra latency",
ucs_offsetof(uct_dc_mlx5_iface_config_t, fhs_latency),
UCS_CONFIG_TYPE_TIME_UNITS},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iface_attr->latency should be real latency

Comment on lines 242 to 249
return (iface->super.config.dp_ordering >=
UCT_IB_MLX5_DP_ORDERING_OOO_RW) &&
(UCS_BIT(ib_iface->config.sl) & ooo_sl_mask) &&
/* If DevX isn't used for DCI, driver sets default value for
* 'rdma_wr_disabled' */
(!(md->flags & UCT_IB_MLX5_MD_FLAG_DEVX_DCI) ||
/* RDMA write should be enabled for FHS to be used */
!uct_dc_mlx5_iface_rdma_wr_disabled(iface));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. simplify and split to several if conditions
  2. what about DDP (in CX8), do we have FHS also in this case?
  3. IMO there is something wrong with the logic. if a device exposes UCT_IB_MLX5_MD_FLAG_NO_RDMA_WR_OPTIMIZED, it will use FHS unless UCT_DC_MLX5_IFACE_FLAG_DISABLE_PUT is set and UCT_IB_MLX5_MD_FLAG_DEVX_DCI is enabled. Otherwise, it will use HHS. It would be easier to review the logic after refactoring (1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DDP case is covered here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems DDP is making a difference only if have AR because there is && between them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first condition includes DDP because it tests >= UCT_IB_MLX5_DP_ORDERING_OOO_RW and DDP enabled when UCT_IB_MLX5_DP_ORDERING_OOO_ALL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants