@tw4l tw4l commented Nov 5, 2025

Work in progress

  • After all seeds from a seed file have been parsed successfully, store them in Redis and skip downloading the seed file on subsequent runs (e.g. after a crawl is paused, resumed from serialized state, or otherwise restarted)
  • Refactor parseSeeds into its own module to avoid circular imports
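
For reviewers, here is a minimal sketch of the "store once, reuse on restart" idea from the first bullet. It is not the code in this PR: `SeedStore`, `MemoryStore`, `SEEDS_KEY`, `loadSeeds`, and `fetchSeedFile` are hypothetical names, the one-URL-per-line format is an assumption, and an in-memory map stands in for Redis so the example runs on its own.

```typescript
// Hypothetical minimal store interface; a real Redis client would sit behind it.
interface SeedStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class MemoryStore implements SeedStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    const value = this.data.get(key);
    return value === undefined ? null : value;
  }

  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Hypothetical key name, not taken from the PR.
const SEEDS_KEY = "seedfile:parsed";

// Return previously stored seeds if present; otherwise download, parse, and store them.
async function loadSeeds(
  store: SeedStore,
  fetchSeedFile: () => Promise<string>,
): Promise<string[]> {
  const cached = await store.get(SEEDS_KEY);
  if (cached !== null) {
    // A restarted or resumed crawl takes this path and never re-downloads.
    return JSON.parse(cached) as string[];
  }

  const text = await fetchSeedFile();
  // Assumed format: one seed URL per line, blank lines ignored.
  const seeds = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // Store only after the whole file has been parsed successfully.
  await store.set(SEEDS_KEY, JSON.stringify(seeds));
  return seeds;
}
```

Calling `loadSeeds` twice with the same store should invoke `fetchSeedFile` only once. If the download or parse throws, nothing is stored, so the next run retries the download.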

Tested by pausing crawls in Browsertrix and by resuming interrupted crawls from serialized state YAML files with the crawler. Automated tests are still needed.

tw4l added 4 commits November 4, 2025 10:46
Use hacky any to avoid circular import, will fix properly in later commit
Also move parseSeeds to separate module to avoid circular import
Allow crawlState to be undefined in parseSeeds for use in scope tests
@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch 2 times, most recently from 1040dac to 6cb6a22 Compare November 6, 2025 16:27
@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch from dee29ca to 7c409f6 Compare November 6, 2025 17:57