-
Notifications
You must be signed in to change notification settings - Fork 933
Bound the number of nodes in gossip section #2746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
908e10e to
7b29e59
Compare
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
7b29e59 to
c5642e8
Compare
| * Since we have non-voting replicas that lower the probability of an entry | ||
| * to feature our node, we set the number of entries per packet as | ||
| * 10% of the total nodes we have. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update the comment?
| wanted = floor(dictSize(server.cluster->nodes) / 10); | ||
| if (wanted < 3) wanted = 3; | ||
| if (wanted > freshnodes) wanted = freshnodes; | ||
| int overall = server.cluster_ping_message_gossip_max_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a it as a percentage? cluster_ping_message_gossip_max_perc
Also we are naming it to be max but will the number of nodes ever be less than that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was going to suggest the same. The default is a percentage so it seems appropriate to configure it as a percentage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I like this. Will be easier for folks to deal with scale in/out situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about supporting both options?
| * information would be broadcasted. */ | ||
| int pfail_wanted = server.cluster->stats_pfail_nodes; | ||
| if (pfail_wanted >= overall) { | ||
| pfail_wanted = overall - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we set the pfail_wanted = overall
why are we reserving one spot in overall for wanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suggested that. Will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we foresee any regression we don't gossip healthy nodes at all? I am wondering in scenarios where PFAIL nodes are never actually marked as FAIL or healthy.
|
What are the theoretical implications of lowering the number? Sending 10 pings with n/10 gossips achieves the same information-spreading effect as sending 20 pings n/20 gossips? So failure detection and convenrgence of any changes slows down linearly with this config? I'm fine with a config like this, but I (and others, you included?) have a feeling we can gossip smarter without sacrificing anything. I really liked the idea of prioritizing gossips about node for which there was a recent change, this idea: #1897 (comment) Can we add a last-modified timetamp to each node and do a weighted random? Another idea is to increment a score for each node we gossip about and then prioritize the ones with lower score next time. |
I came across this idea in hashicorp's serf. It requires a bit of work to index node by different type and have logic around which ones to prioritise more. Quite achievable to have smarter gossip. This PR is a guardrail for the current system to avoid CPU/network spikes. |
We need to guarantee a message to be received directly or indirectly from another node within node-timeout/2 period. If that is met we don't send out another message. So, this might lead to more direct pings which has higher overhead. Gossip node information is 106 bytes. However, the entire payload is around 2200B. |
| createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort), | ||
| createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL), | ||
| createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL), | ||
| createIntConfig("cluster-ping-message-gossip-max-count", NULL, MODIFIABLE_CONFIG, 0, 2000, server.cluster_ping_message_gossip_max_count, 0, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the max be a function of dictSize(server.cluster->nodes)? I mean it would be good to validate that we don't oversend the number of nodes in gossip.
I would prefer us to keep the node count bounded in gossip section rather than unbounded. In a 2,000 nodes cluster, the worst case is 1998 nodes in the gossip section which seems quite expensive on both sender and receiver end.
Also, wanted others to explore other node count and see if we should update the default. In #2291 Viktor suggestion was to try out sqrt(n) rather than 10% of total node count.
Related to #2291