Update Tranco list and derived files - 2025-11-05 #649
Closed
+1,059,070
−1,060,000
This commit adds a new step to filter out TLD-like domains from the Tranco list before processing. The excluded domains are second-level domains under generic TLDs that function as alternative TLDs (like net.ru, br.com, uk.com, etc.).

This filtering step:
- Removes 25 known TLD-like domains from the Tranco list
- Logs removal counts for transparency
- Runs before the configuration and processing steps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changes:
- Replace grep -c with grep | wc -l | tr -d ' ' to properly handle count
- Add TOTAL_REMOVED counter to track total exclusions
- Add summary output section for better visibility
- Use -E flag consistently for regex matching

The previous version had a newline issue with grep -c output that caused "integer expression expected" errors in the comparison. This fix ensures clean integer values for all comparisons.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

The Tranco CSV file uses Windows line endings (\r\n), which was preventing the grep pattern from matching domains correctly. The pattern ",domain$" was failing because there's a \r character before the \n.

Changes:
- Properly escape dots in domain names for regex matching
- Update pattern to match optional \r before end of line: ",domain(\r)?$"
- This now correctly handles both Unix (\n) and Windows (\r\n) line endings

Tested with actual Tranco file and confirmed removal of:
- 613,uk.com
- 2644,net.ru
- 6123,br.com

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Instead of trying to match \r in regex patterns (which has inconsistent behavior across different grep implementations), normalize the file to Unix line endings first using 'tr -d', then use simple end-of-line patterns.

Changes:
- Add tr -d '\r' step to strip all carriage returns before processing
- Simplified grep pattern from ",domain(\r)?$" to ",domain$"
- Use grep -c directly (safe now that output is clean)

Tested locally and confirmed:
- Removes net.ru, uk.com, br.com successfully
- File line count reduced from 1000000 to 999997

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

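
The normalization step described in this commit could look roughly like the sketch below. The function name `normalize_line_endings` and the file arguments are hypothetical, not taken from the workflow itself; only the `tr -d '\r'` call and the `",domain$"` pattern style come from the commit message.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name is an assumption): copy a CSV while deleting
# every carriage return, so that CRLF line endings become plain LF.
normalize_line_endings() {
  local input="$1" output="$2"
  tr -d '\r' < "$input" > "$output"
}
```

After normalization, a plain end-of-line anchor such as `grep -E ',uk\.com$' tranco.csv` can match without any special handling for `\r`.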
Replace 'grep -c' with 'grep | wc -l | tr -d' to ensure clean integer output without newlines or extra whitespace. This prevents the "integer expression expected" error when domains are not found.

The previous version using grep -c was outputting values with formatting that caused bash integer comparison to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

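
A minimal sketch of the counting approach this commit describes follows. The function name `count_matches` is hypothetical. One common way the "integer expression expected" error arises is that `grep -c` prints `0` but exits with status 1 when nothing matches, so a fallback such as `|| echo 0` yields two lines of output; the commit does not confirm that this was the cause here.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name is an assumption): print the number of lines in
# a file that match an extended regex, as a bare integer.
# wc -l always prints exactly one number, and tr removes the padding that
# some wc implementations add.
count_matches() {
  local pattern="$1" file="$2"
  grep -E -- "$pattern" "$file" | wc -l | tr -d ' '
}
```

The result can be used directly in an integer test, for example `[ "$(count_matches ',uk\.com$' tranco.csv)" -gt 0 ]`.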
Instead of maintaining a manual list of ~20 TLD-like domains, now fetch and use the complete Public Suffix List (PSL) from publicsuffix.org. Remove any Tranco entries that exactly match a PSL entry.

This is more comprehensive and maintainable:
- Covers ~9,754 public suffixes (vs 21 hardcoded)
- Automatically includes new suffixes as PSL is updated
- Removes infrastructure domains (workers.dev, github.io, herokuapp.com, etc.)
- Removes second-level TLDs (br.com, uk.com, net.ru, etc.)

Tested on top 10k Tranco domains:
- Found and removed 75 PSL entries
- Including: workers.dev, github.io, herokuapp.com, netlify.app, vercel.app, and all previously hardcoded domains

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

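
As a rough illustration of the fetch step, the sketch below downloads the list and reduces it to bare entries. The function names, the `PSL_URL` variable, and the exact download URL are assumptions; the commit only says the list comes from publicsuffix.org. The published list uses `//` for comment lines, which is what the filter relies on.

```shell
#!/usr/bin/env bash

# Assumed download location; the commit names only publicsuffix.org.
PSL_URL="https://publicsuffix.org/list/public_suffix_list.dat"

# Hypothetical helper: download the PSL to a file and fail on error or
# on an empty result.
download_psl() {
  local dest="$1"
  curl -fsSL "$PSL_URL" -o "$dest" || { echo "Error: PSL download failed" >&2; return 1; }
  [ -s "$dest" ] || { echo "Error: PSL file is empty" >&2; return 1; }
}

# Hypothetical helper: print PSL entries only, dropping comment lines
# (starting with //) and blank lines.
extract_psl_entries() {
  grep -v -e '^//' -e '^[[:space:]]*$' "$1"
}
```

Wildcard (`*.`) and exception (`!`) rules are left in the output as-is; under exact matching they would not equal any Tranco domain.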
…code quality

Changes based on Copilot review feedback:

1. **Removed redundant variable**: Eliminated FOUND_COUNT, using only TOTAL_REMOVED
2. **Fixed regex escaping vulnerability**: Replaced regex-based grep with awk's exact string matching using associative arrays. This avoids all regex special character issues (dots, brackets, parentheses, hyphens, etc.)
3. **Added comprehensive error handling**:
   - Check curl exit code and fail fast if PSL download fails
   - Verify PSL file has content before proceeding
   - Added explicit error messages
4. **Optimized performance**: Replaced O(n×m) loop (1000 iterations of grep over 1M lines) with single-pass awk using hash table lookups O(n+m). This reduces processing time from ~3.5 minutes to ~10 seconds.

The awk approach:
- Loads all PSL entries into an associative array (hash table)
- Processes tranco.csv in a single pass
- Uses exact string matching via 'in' operator (no regex)
- Outputs filtered data and removal count efficiently

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

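
The awk approach listed above can be sketched as follows. This is an illustration, not the workflow's actual script: the function name `filter_psl_entries` and its arguments are hypothetical, and it assumes the PSL file holds one entry per line and the Tranco file is `rank,domain` with Unix line endings.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name and arguments are assumptions): write every
# Tranco row whose domain is NOT an exact PSL entry to output_file, and
# print the number of removed rows on stdout.
filter_psl_entries() {
  local psl_file="$1" tranco_file="$2" output_file="$3"

  # An empty first file would break the NR == FNR idiom below, so fail fast.
  [ -s "$psl_file" ] || { echo "Error: PSL file is missing or empty" >&2; return 1; }

  awk -F',' '
    NR == FNR { suffixes[$1]; next }   # first file: load PSL entries as array keys
    !($2 in suffixes)                  # second file: print rows with no exact match
  ' "$psl_file" "$tranco_file" > "$output_file" || return 1

  # The removal count is the difference in line counts.
  local before after
  before=$(wc -l < "$tranco_file" | tr -d ' ')
  after=$(wc -l < "$output_file" | tr -d ' ')
  echo $((before - after))
}
```

A call such as `TOTAL_REMOVED=$(filter_psl_entries psl_entries.txt tranco.csv tranco.filtered.csv)` would capture the count; the file names here are placeholders. Because the `in` operator compares whole strings, a row like `foo.uk.com` is kept even when `uk.com` is in the list.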
This PR updates the Tranco top 1 million domains list and all derived files.