
Conversation

@hpatro (Collaborator) commented Oct 17, 2025

I would prefer us to keep the node count in the gossip section bounded rather than unbounded. In a 2,000-node cluster, the worst case is 1,998 nodes in the gossip section, which seems quite expensive on both the sender and the receiver end.

Also, I wanted others to explore other node counts and see whether we should update the default. In #2291, Viktor's suggestion was to try out sqrt(n) rather than 10% of the total node count.

Related to #2291

  • Bound the node count in gossip section
  • Prioritize PFAIL nodes in the gossip section
  • Introduce a config to control the percentage of nodes in the gossip section and thereby cap the overhead
  • Default to 10% of the total node count (rough sketch below)
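
Roughly, the selection described in the list above could look like the sketch below; the config name and the exact clamping are placeholders, not the final code.

```c
/* Sketch only: bound the gossip section to a percentage of the cluster and
 * fill PFAIL nodes first. cluster_ping_gossip_percent is a hypothetical name. */
int total = dictSize(server.cluster->nodes);
int wanted = (total * server.cluster_ping_gossip_percent) / 100; /* default: 10% */
if (wanted < 3) wanted = 3;                   /* keep the existing floor of 3 entries */
if (wanted > freshnodes) wanted = freshnodes; /* never more than the nodes we can gossip about */

/* PFAIL nodes go in first so failure information spreads quickly; the
 * remaining slots are filled with randomly picked healthy nodes. */
int pfail_wanted = server.cluster->stats_pfail_nodes;
if (pfail_wanted >= wanted) pfail_wanted = wanted - 1; /* leave room for one healthy node */
int healthy_wanted = wanted - pfail_wanted;
```

For comparison with the sqrt(n) idea from #2291: in a 2,000-node cluster, 10% yields 200 gossip entries per ping, while sqrt(n) yields about 45.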

@hpatro hpatro requested a review from madolson October 17, 2025 18:29
@hpatro hpatro force-pushed the cluster_ping_gossip_count branch from 908e10e to 7b29e59 on October 17, 2025 18:35
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@hpatro hpatro force-pushed the cluster_ping_gossip_count branch from 7b29e59 to c5642e8 on October 17, 2025 18:40
Comment on lines 4600 to 4602
* Since we have non-voting replicas that lower the probability of an entry
* to feature our node, we set the number of entries per packet as
* 10% of the total nodes we have. */
Member

Do we need to update the comment?

wanted = floor(dictSize(server.cluster->nodes) / 10);
if (wanted < 3) wanted = 3;
if (wanted > freshnodes) wanted = freshnodes;
int overall = server.cluster_ping_message_gossip_max_count;
Member

Can we have it as a percentage? e.g. cluster_ping_message_gossip_max_perc

Also, we are naming it max, but will the number of nodes ever be less than that?

Contributor

Yeah, I was going to suggest the same. The default is a percentage so it seems appropriate to configure it as a percentage.

Collaborator Author

Yeah, I like this. It will be easier for folks to deal with scale-in/scale-out situations.

Contributor

How about supporting both options?
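
For what it's worth, a minimal sketch of how both knobs could combine; both config field names here are assumptions.

```c
/* Sketch: derive the target from a percentage of the cluster size, then
 * apply an absolute cap on top. Both config fields are hypothetical. */
int total = dictSize(server.cluster->nodes);
int wanted = (total * server.cluster_ping_message_gossip_max_perc) / 100;
if (server.cluster_ping_message_gossip_max_count > 0 &&
    wanted > server.cluster_ping_message_gossip_max_count)
    wanted = server.cluster_ping_message_gossip_max_count;
if (wanted < 3) wanted = 3; /* keep the existing lower bound */
```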

* information would be broadcasted. */
int pfail_wanted = server.cluster->stats_pfail_nodes;
if (pfail_wanted >= overall) {
pfail_wanted = overall - 1;
Member

Can we set pfail_wanted = overall?

Why are we reserving one spot in overall for wanted?

Collaborator Author

Yeah, I suggested that. Will update it.

Contributor

Do we foresee any regression if we don't gossip healthy nodes at all? I am wondering about scenarios where PFAIL nodes are never actually marked as FAIL or healthy.
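
For reference, the trade-off in question as a minimal sketch (same variable names as the snippet above): reserving one slot guarantees that every ping still carries at least one healthy node.

```c
/* Sketch: if PFAIL nodes alone would fill the section, keep one slot free
 * so information about healthy nodes keeps circulating as well. */
int pfail_wanted = server.cluster->stats_pfail_nodes;
if (pfail_wanted >= overall) pfail_wanted = overall - 1; /* reserve one healthy slot */
int healthy_wanted = overall - pfail_wanted;             /* >= 1 as long as overall >= 1 */
```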

@zuiderkwast (Contributor)

What are the theoretical implications of lowering the number?

Sending 10 pings with n/10 gossips achieves the same information-spreading effect as sending 20 pings with n/20 gossips? So failure detection and convergence of any changes slow down linearly with this config?

I'm fine with a config like this, but I (and others, you included?) have a feeling we can gossip smarter without sacrificing anything.

I really liked the idea of prioritizing gossips about nodes for which there was a recent change, this idea: #1897 (comment)

Can we add a last-modified timestamp to each node and do a weighted random selection?

Another idea is to increment a score for each node we gossip about and then prioritize the ones with lower score next time.
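
For illustration, a standalone sketch of the score idea; the struct and names are made up, not a proposal for the actual data layout.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: remember how often each node has been gossiped about and
 * prefer the least-gossiped nodes when filling the next gossip section. */
typedef struct gossipCandidate {
    int node_index;        /* stand-in for a reference to the cluster node */
    uint64_t gossip_score; /* bumped every time this node is included */
} gossipCandidate;

static int byLowestScore(const void *a, const void *b) {
    const gossipCandidate *ca = a, *cb = b;
    if (ca->gossip_score < cb->gossip_score) return -1;
    return ca->gossip_score > cb->gossip_score;
}

/* Pick the `wanted` lowest-score candidates and bump their scores so the
 * selection rotates over time instead of repeating the same nodes. */
static void pickGossipEntries(gossipCandidate *cands, size_t n, size_t wanted) {
    qsort(cands, n, sizeof(*cands), byLowestScore);
    for (size_t i = 0; i < n && i < wanted; i++) cands[i].gossip_score++;
}
```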

@hpatro (Collaborator, Author) commented Oct 17, 2025

> I really liked the idea of prioritizing gossips about nodes for which there was a recent change, this idea: #1897 (comment)

I came across this idea in HashiCorp's Serf. It requires a bit of work to index nodes by type and to add logic around which ones to prioritise more, but smarter gossip is quite achievable.

This PR is a guardrail for the current system to avoid CPU/network spikes.

@hpatro (Collaborator, Author) commented Oct 18, 2025

> What are the theoretical implications of lowering the number?

We need to guarantee that a message is received, directly or indirectly, from another node within the node-timeout/2 period. If that is met, we don't send out another message.

So, this might lead to more direct pings, which have higher overhead. Gossip node information is 106 bytes, while the entire payload is around 2200 B.
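
Put concretely (rough arithmetic from the numbers above): piggybacking one more node costs about 106 B inside a ping we are sending anyway, while delivering the same information via an extra direct ping costs a full ~2200 B message, so roughly twenty gossip entries fit in the wire budget of one additional ping.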

createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort),
createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL),
createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL),
createIntConfig("cluster-ping-message-gossip-max-count", NULL, MODIFIABLE_CONFIG, 0, 2000, server.cluster_ping_message_gossip_max_count, 0, INTEGER_CONFIG, NULL, NULL),
Contributor

Can the max be a function of dictSize(server.cluster->nodes)? I mean, it would be good to validate that we don't gossip more nodes than the cluster actually has.
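
One possible shape for that validation, as a sketch under assumed names (not necessarily how the PR resolves it): since createIntConfig() bounds are static, the clamp against the live node count would have to happen where the ping is assembled.

```c
/* Sketch: clamp the configured cap against the live cluster size at ping
 * time; the config table itself can't reference dictSize(). */
int overall = server.cluster_ping_message_gossip_max_count; /* assumed: 0 means no explicit cap */
int freshnodes = dictSize(server.cluster->nodes) - 2;       /* exclude myself and the receiver */
if (overall <= 0 || overall > freshnodes) overall = freshnodes;
```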
