refactor: with announce_signed_peer, get_signed_peer #18
base: main
Conversation
Pkarr relay server at: http://pkarr.rustonbsd.com:6881
Thoughts:
@rustonbsd happy to see you take this seriously, and I will try to contact the Libtorrent maintainer to ask for his support, at least after we validate this implementation a bit further.

I want to note that: 1) the

```rust
fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    let dht = dht::Dht::builder()
        .server_mode()
        .port(6881)
        .build()
        .unwrap();

    loop {
        std::thread::sleep(std::time::Duration::from_secs(10));
        let info = dht.info();
        tracing::info!(?info);
    }
}
```

That being said, I am pleased to see that my laptop is seeing your node (as an extra node plus mine):

```shell
RUST_LOG=debug cargo run --example get_signed_peers FFBFBF52B8B91B946C688028AE1D45C8D4A3048D
```
```
...
Populated the routing table self_id=Id(....) table_size=87 signed_peers_table_size=2 << Horray
...
Done query id=Id(ffbfbf52b8b91b946c688028ae1d45c8d4a3048d) closest=5 visited=6 responders=2 << Extra Horray
```

Edit: double checked, and yes indeed the extra node is yours.

Edit2: Note, I didn't manually add your node to my bootstrapping list, I used the default, but your node was registered in the
Hi @Nuhvi,

my node was set up as follows:

```shell
git clone https://github.com/Nuhvi/pkarr
cd pkarr/relay
cargo build --release
# run "../target/release/pkarr-relay" in a tmux session for the moment
```

Let me know if I should make any adjustments!
That is really cool 🎉
I have done a quick refactor taking the feature route. I added a new feature

You can test it if you want with the following two examples:

```shell
cargo run --example e2e_test_experimental --features="iroh-gossip experimental"
cargo run --example chat_experimental --features="iroh-gossip experimental"
```

I am fighting GitHub Actions atm, but we should be able to look at the e2e test step in the Actions workflows with and without the experimental feature enabled, and we should see the difference in bootstrapping speed.
Execution times from GH Actions, e2e_test vs e2e_test_experimental:
Not necessarily, this setup works, but for example if you run a binary with

Another thing you might want to do is to explicitly use your node as an
src/dht_experimental.rs
Outdated
```rust
    self.reset()?;
}
let mut hasher = sha2::Sha512::new();
hasher.update(topic_bytes);
```
It might be useful here to add a namespace like b'Iroh', just to make sure that anyone using the same topic and sha512, but for another overlay network than Iroh, won't get the same info_hash. I described this in the BEP, but maybe I should have added it to the function signature? I didn't want to force it on people.
Yes absolutely! 1. It should be sha512, and 2. I added "/iroh/distributed-topic-tracker" as extra bytes to hash for namespacing.
Did some more minimal refactoring to move experimental features into their own submodule, for a cleaner API cut.
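The namespacing fix can be sketched with std-only code. Here `DefaultHasher` is just a stand-in for `sha2::Sha512` so the example runs without external crates, and the second namespace string is made up for contrast:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The namespace used in the PR; prefixing it means another overlay
// network hashing the same topic lands on a different info_hash.
const NAMESPACE: &[u8] = b"/iroh/distributed-topic-tracker";

// Stand-in for: Sha512::new() -> update(namespace) -> update(topic_bytes).
fn namespaced_id(namespace: &[u8], topic_bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    namespace.hash(&mut hasher);
    topic_bytes.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let topic = b"my-chat-room";
    // Same topic string, different overlay namespaces: different ids.
    let iroh_id = namespaced_id(NAMESPACE, topic);
    let other_id = namespaced_id(b"/some-other-overlay", topic);
    assert_ne!(iroh_id, other_id);
    println!("iroh id: {iroh_id:x}, other id: {other_id:x}");
}
```

The real implementation hashes the namespace bytes plus topic bytes with SHA-512; only the "prefix before hashing" shape is the point here.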
@Nuhvi my current status is:

What do you think? I will play with the timestamps of SignedAnnounce a bit more.
@rustonbsd I am not fully aware of what you are doing or testing, so I will have to spend some time going through it and let you know if there are any obvious footguns you can avoid. Will get back to you later tonight or early tomorrow.
Until then... since you are testing live and not with a local testnet, and since you are using

but I will also later test myself and see how my own node that is running without any rate limits behaves. But also, note that typically in practice, nodes at the same IP address aren't expected to make too many announcements/lookups, no?
@Nuhvi thank you for taking a look!

I agree that we can be much more chill about the timeouts and retry timings; I just want to know where the limits are and figure out why this was more flaky than the more complex distributed-topic-tracker's native behaviour, which is also not very chill with the timeouts.

I forked mainline and added a more_recent_than param to the get_signed_peers function, and that seems to work perfectly for nodes that are coming and going quickly (regardless of rate limits). IDK if that is interesting to you, but this would make it possible to avoid namespacing by time, e.g. unix minute or something like that. What do you think?
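Sketched with stand-in types, a `more_recent_than` post-filter could look like this (the `SignedPeer` struct and its fields are hypothetical, not mainline's actual types):

```rust
// Hypothetical stand-in for a signed peer announcement; mainline's
// real type carries a public key, signature, and more.
#[derive(Debug, Clone, PartialEq)]
struct SignedPeer {
    public_key: [u8; 32],
    unix_secs: u64, // announcement timestamp
}

// Keep only announcements newer than the cutoff, e.g. "seen in the
// last few minutes", so rapidly restarting test runs skip stale keys.
fn more_recent_than(peers: Vec<SignedPeer>, cutoff_unix_secs: u64) -> Vec<SignedPeer> {
    peers
        .into_iter()
        .filter(|p| p.unix_secs > cutoff_unix_secs)
        .collect()
}

fn main() {
    let peers = vec![
        SignedPeer { public_key: [1; 32], unix_secs: 100 },
        SignedPeer { public_key: [2; 32], unix_secs: 500 },
    ];
    let fresh = more_recent_than(peers, 200);
    assert_eq!(fresh.len(), 1);
    assert_eq!(fresh[0].public_key, [2; 32]);
}
```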
Well, likely because these requests were sharded over a massive DHT, and any rate limits encountered in the bootstrapping nodes are irrelevant, since they are only used to populate the routing table... while now, there are only two nodes; if they have any rate limits, things will fail.
I noticed that, and it is interesting... not sure how it fixes the rate limits? Is there a way to simulate the situation with or without this option using a local

My best guess here is that this indicates that rate limits aren't relevant at all (which makes sense since at least one node has none), but that you are announcing too many nodes on the same topic, and storage nodes return a random 20, and these random 20 might all be old and dead by the time you make a lookup.

It would really help if you separate (in your tests) the notion of failure in two: failure to get responses, and failure to get data. The former is a failure of the DHT; the latter is just a thing that any Bittorrent-like system needs to deal with... no matter how "recent" the announcement is, a malicious peer (or just a flaky client) can announce 100s of peers, but none of them is actually either listening to incoming requests or has the data.

So please try again and focus on assuming that peers are hit and miss: get peers from the DHT, try them, and if they fail, ask for another 20 random ones. Apologies if you are already doing that; I am trying to use my intuition without reviewing the code, because this way I can answer earlier than I could otherwise.
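The suggested split between the two failure modes could be captured in a small enum (the names here are made up for illustration):

```rust
/// Distinguish "the DHT itself failed" from "the DHT worked but
/// returned stale or dead peers" when judging a lookup.
#[derive(Debug, PartialEq)]
enum LookupOutcome {
    /// No storage node responded at all: a DHT/transport failure.
    DhtFailure,
    /// Nodes responded but no returned peer was reachable: normal
    /// churn; retry with another random 20 rather than blaming the DHT.
    NoLivePeers { responders: usize },
    /// At least one returned peer was actually reachable.
    Success { responders: usize, live_peers: usize },
}

fn classify(responders: usize, live_peers: usize) -> LookupOutcome {
    match (responders, live_peers) {
        (0, _) => LookupOutcome::DhtFailure,
        (r, 0) => LookupOutcome::NoLivePeers { responders: r },
        (r, n) => LookupOutcome::Success { responders: r, live_peers: n },
    }
}

fn main() {
    assert_eq!(classify(0, 0), LookupOutcome::DhtFailure);
    assert_eq!(classify(2, 0), LookupOutcome::NoLivePeers { responders: 2 });
    assert_eq!(classify(2, 3), LookupOutcome::Success { responders: 2, live_peers: 3 });
    println!("ok");
}
```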
I agree.

I will look deeper into querying the 20 nodes at a time sequentially and not randomly, for this highly specific scenario (rapidly running tests in succession, for example on every PR sync event, or locally without a testnet). I usually run my test like 10 times locally in quick succession just to see if or where we break with too many stale records/pubkeys still on the DHT.

I will rewrite the e2e test to reflect more nuanced failure cases, but if you run it locally and enable debug tracing you should get more insight.
Yes, the scenario you describe isn't helped by time-based filtering. I just came across the issue during testing on live. I go into my reasons a bit more below.
Implemented are two loops: a publisher that sleeps for a while after success, and a bootstrap loop that checks, based on num_peers > 0, what timing to use for the next get_signed_peers call; it gets 0-20 pubkeys for the topic (infohash) back, checks them against an already-tried hashset, tries to join the unknown peers, and adds them to the hashset. Repeat.

Maybe I am just blind, or haven't seen a way to get non-random lists of 20 peers to try? Like pagination? Or maybe a "seen before" filter that gets applied before selecting 20 random peers? So far I just use the get_signed_peers function, but better than more_recent would be a general filter ability, either via a passed "filter list" or something. I will take another look at the code in case I missed something and could, just through sequential calls, get the right peers with no overlap between the random samples from call to call (could very well be 🙈).

I am also noticing I wasn't really clear about the state of this right now. It's all overly aggressive because I want to find the best params for the shortest gossip network bootstrap that can be run over and over again on the same system without breaking. Whenever I use the distributed-topic-tracker in a project, repeated cold bootstraps are just part of the testing, and if the lib feels flaky there (even if not realistic) I would not use the library, since I would be afraid it would do that on users' machines if they do a couple of rapid restarts. But let me dial it in with max aggression before I add time-based namespacing, and then benchmark the tradeoffs of slower and fewer retries etc. Then we have some numbers and can quantify the tradeoffs.

The failure behavior: I can run the above in three terminals about six to ten times before, on the next run, I read 0 peers found in the logs; the first six to ten runs worked fine and read more and more known peers for the topic. Maybe some longer-term block based on IP?
But when I change the topic postfix and rerun, it works again for 6-10 iterations. If I wait a few minutes I can reuse an old topic postfix again. Probably a two-tiered rate limit? I will look through mainline in more detail and figure out what this is exactly.

Still working on the details before benchmarking. I just want to make the test work reliably first and get the lowest, repeatable bootstrap times on the live network (our two nodes for now ^^).

FYI: my main objectives are 1. fast cold bootstrap times: time till we connect to the first live iroh peer, and 2. (rapid) repeatability without impacting the speed of bootstrapping a completely new network with some stale entries still around.

More reasonable request limits and more nodes would probably solve all the rate limit issues in a "balanced" version. I am just not quite at a balanced node yet. I will test the "normal" distributed-topic-tracker with two custom nodes only and see how it behaves in that environment. Might be a couple days. If you have something specific I should test, let me know. I am happy to run some experiments. So far I have just been playing with the new functions and working towards benchmarking and optimizing.
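The "two-tiered rate limit" hypothesis (a short per-topic burst window plus a longer per-IP window) can be modeled with two token buckets. This is purely illustrative guesswork about the observed behaviour, not mainline's actual rate-limiting code, and all the limit numbers are invented:

```rust
// A minimal token bucket: `capacity` tokens, refilled at `refill_per_sec`.
struct Bucket {
    tokens: f64,
    capacity: f64,
    refill_per_sec: f64,
}

impl Bucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { tokens: capacity, capacity, refill_per_sec }
    }

    // Advance time by `elapsed_secs`, then try to spend one token.
    fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Hypothetical limits: a ~10-request burst per topic, slower per-IP budget.
    let mut per_topic = Bucket::new(10.0, 0.1);
    let mut per_ip = Bucket::new(50.0, 0.05);

    let mut allowed = 0;
    for _ in 0..20 {
        // Back-to-back requests (no elapsed time): only the burst passes.
        if per_topic.allow(0.0) && per_ip.allow(0.0) {
            allowed += 1;
        }
    }
    assert_eq!(allowed, 10); // the per-topic burst exhausts first

    // Waiting refills the topic bucket, matching "works again after a few minutes".
    assert!(per_topic.allow(120.0));
}
```

Under this model, switching the topic postfix resets the per-topic bucket (it works again immediately), while hammering the same topic only recovers after waiting, which matches the observed 6-10 iteration pattern.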
The reason BEP_0005 suggests that nodes return a random subset of peers (which is what I am doing in the new extension for signed peers) is that it is the only secure way to let honest peers pass through as fast as possible: pagination favors the earliest peers (who might be stale), or, if reversed, it favors spammy attackers who keep announcing on a loop. Please notice that you only need one honest peer; once you find them you can ask them for more peers they trust. The mental model using DHTs should be:
You should always assume that the DHT works rarely, and design your system so that it leverages the DHT for censorship resistance that occasionally works, then go super hard on caching and p2p gossip after you find the first peers etc... Whenever you find your tests flaky, please read the first sentence in BEP_0005:

If you need reliable and high-quality service, you have no other option but to use centralized trackers.
If I am designing a content discovery system that needs to be reliable and censorship resistant, I would:

This isn't a novel idea either; this is how Bittorrent works in the first place. It started with trackers, then added the DHT to enhance censorship resistance, not to have reliable peer discovery, and to this day that's how magnet links work. For example, here is a random one: notice how many hard-coded trackers there are. Yes, the infohash is enough to find peers on the DHT, but they still hardcode as many trackers as they can... because that is how things work reliably:
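The tracker-plus-DHT pattern described above amounts to merging peer sources and deduplicating. A rough sketch; the functions and peer names are stubs, not any real client API:

```rust
use std::collections::HashSet;

type Peer = String; // stand-in for a socket address or node id

// Stub: a real client would query its hard-coded tracker URLs here.
fn peers_from_trackers() -> Vec<Peer> {
    vec!["tracker-peer-1".into(), "tracker-peer-2".into()]
}

// Stub: a real client would do a DHT get_peers lookup here, which
// may well return nothing; that is expected churn, not an error.
fn peers_from_dht() -> Vec<Peer> {
    vec!["tracker-peer-1".into(), "dht-peer-1".into()]
}

// Union both sources, deduplicated: the DHT only ever adds peers,
// so an empty DHT result never makes discovery worse than trackers alone.
fn discover() -> Vec<Peer> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for peer in peers_from_trackers().into_iter().chain(peers_from_dht()) {
        if seen.insert(peer.clone()) {
            out.push(peer);
        }
    }
    out
}

fn main() {
    let peers = discover();
    assert_eq!(peers.len(), 3); // duplicates across sources collapse
    println!("{peers:?}");
}
```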
Oh, I forgot to mention another important point: getting 20 random peers from one node is not representative of what a DHT should be like. Usually, you should be getting a random 20 peers from each node, and you should have so many nodes that, more or less, you get 90% if not all of the announced peers in one round trip. I think if you are going to judge the performance of the new extension in the public network and not a testnet, then you are forced to run 10s of nodes to get correct empirical results representative of the state after enough adoption.
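The point about querying many storage nodes can be illustrated with a small simulation: assuming each of k nodes independently returns a uniform random 20-peer subset of the same N announced peers, the union covers nearly all of them in one round trip (std-only sketch with a toy xorshift PRNG, so no external crates are needed):

```rust
use std::collections::HashSet;

// Tiny deterministic PRNG so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// One storage node's response: a random 20-element subset of the
// `announced` peers (peers are just numbered 0..announced here).
fn random_20(rng: &mut XorShift, announced: u64) -> HashSet<u64> {
    let mut sample = HashSet::new();
    while sample.len() < 20 {
        sample.insert(rng.next() % announced);
    }
    sample
}

// How many distinct peers the union of `nodes_queried` responses covers.
fn coverage(nodes_queried: usize, announced: u64) -> usize {
    let mut rng = XorShift(0x9E3779B97F4A7C15);
    let mut seen = HashSet::new();
    for _ in 0..nodes_queried {
        seen.extend(random_20(&mut rng, announced));
    }
    seen.len()
}

fn main() {
    // One node queried: at most 20 of 40 announced peers are visible.
    assert_eq!(coverage(1, 40), 20);
    // Twenty nodes queried: the union covers (nearly) all 40.
    assert!(coverage(20, 40) >= 36);
}
```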
Yes, I agree. I wouldn't build any production system with the distributed-topic-tracker as its primary peer discovery mechanism, not if any scale is expected. I am sorry, I misunderstood. The original distributed-topic-tracker was an exercise in "can I use mainline to get reliable peer discovery by topic working, and can I make it reasonably fast" (it was always intended for very small projects (see my github ^^)), and I did the same thing again with the new

If you are still interested, I will build a balanced version that is intended for long-running systems with eventual consistency and no continuous bootstrapping after the first peer has been found. I can refactor and create some benchmark results? It of course makes more sense to build a reasonable PoC and not a very aggressive engineering exercise, since this is a protocol proposal. I apologize for the misunderstanding; I got carried away.
No need to apologize, I am just trying to manage expectations. Happy to help in any way you need going forward.
Hi @Nuhvi, sorry for the long delay.

So what are we thinking would be the best way to benchmark this and compare it to the current distributed-topic-tracker? Maybe we can go at this backwards: what do you think would be convincing evidence that this repo could show to make a strong case for your RFC to pass? Ideas:

How do you test and compare different versions of mainline and pkarr against each other? What are the metrics you track? Any ideas? Or maybe something completely different that shows the use case for this new bittorrent feature?

If adopted, it would make this whole project fold nicely into iroh-gossip, or even generally as a discovery mechanism for more than gossip topics, i.e. as a generalized discovery mechanism at the iroh endpoint level. Let me know your thoughts.
@rustonbsd I am biased of course, but I don't think we need to do benchmarking, for the following reasons:

Yes, running 200 nodes is going to be more resilient than running 2, but if you are going to shut them down later, it doesn't matter much. I think the most promising ways forward are:

For (1) and (2), we need to convince ourselves first that the implementation is solid and stable, then go ask Iroh devs and users that they should switch from

For (3), I think the RFC is enough; the maintainer is perfectly capable of judging the logic himself. The only issue is that I sent him an email and he hasn't responded yet, and from what I see in the commit history, I am assuming he is currently busy, but eventually we may open an issue on https://github.com/arvidn/libtorrent to get his attention, and having multiple people already testing (thank you) would hopefully lend more credibility to my claims about both relevance and efficacy.

So, to summarize: let's get as many people as we can to run the node in earnest, especially in the Iroh community, and maybe let's open an issue in Libtorrent.
That sounds like a solid plan. I will go over this implementation one more time and make it cleaner, then publish the new discovery mechanism as an experimental feature and merge this PR. I have a channel on the iroh discord server for the development of the distributed-topic-tracker, by the same name. I will write something about this in there after I merge the PR with the experimental feature.

Question: can we/should we do the implementation work in libtorrent ourselves, not only the proposal, so it is easier to judge and merge for the maintainer? I haven't done this before, so I'm just curious ^^
@rustonbsd I am not familiar enough with the Libtorrent code base nor with writing C++, so I wouldn't dare, to be honest. And even if I would, the first step would be to ask the maintainer regardless. If he asks us to open a PR, maybe we can try then.
Refactoring distributed-topic-tracker to use the new mainline primitives announce_signed_peer and get_signed_peer discussed in @Nuhvi's draft proposal.