Conversation

@github-actions github-actions bot commented Nov 5, 2025

This PR updates the Tranco top 1 million domains list and all derived files.

aidenmitchell and others added 8 commits November 5, 2025 10:00
This commit adds a new step that filters TLD-like domains out of the Tranco list before processing. The excluded domains are second-level domains that function as alternative TLDs (e.g., net.ru, br.com, uk.com).

This filtering step:
- Removes 25 known TLD-like domains from the Tranco list
- Logs removal counts for transparency
- Runs before the configuration and processing steps
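
A minimal sketch of what this step could look like (file names and the exclusion list shown are assumptions, not the actual workflow code):

```bash
# Hypothetical subset of the 25 excluded TLD-like domains.
TLD_LIKE_DOMAINS="net.ru br.com uk.com"

for domain in $TLD_LIKE_DOMAINS; do
  BEFORE=$(wc -l < tranco.csv)
  # Escape dots so "br.com" cannot also match e.g. "brXcom".
  pattern=$(printf '%s' "$domain" | sed 's/\./\\./g')
  grep -v -E ",${pattern}\$" tranco.csv > tranco.tmp
  mv tranco.tmp tranco.csv
  AFTER=$(wc -l < tranco.csv)
  echo "Removed $((BEFORE - AFTER)) entries for ${domain}"  # log for transparency
done
```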

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Replace grep -c with grep | wc -l | tr -d ' ' to properly handle count
- Add TOTAL_REMOVED counter to track total exclusions
- Add summary output section for better visibility
- Use -E flag consistently for regex matching

The previous version had a newline issue with grep -c output that caused
"integer expression expected" errors in the comparison. This fix ensures
clean integer values for all comparisons.
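
A sketch of the corrected count capture (variable and file names are assumptions; dot escaping in the pattern is tightened in a later commit):

```bash
# wc -l always emits a bare number; tr -d ' ' strips the column padding
# that BSD/macOS wc adds, so the comparison sees a clean integer.
COUNT=$(grep -E ",${domain}\$" tranco.csv | wc -l | tr -d ' ')
TOTAL_REMOVED=$((TOTAL_REMOVED + COUNT))

echo "=== Summary ==="
echo "Total entries removed: ${TOTAL_REMOVED}"
```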

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The Tranco CSV file uses Windows line endings (\r\n), which prevented the
grep pattern from matching domains correctly. The pattern ",domain$"
failed because a \r character sits before the \n.

Changes:
- Properly escape dots in domain names for regex matching
- Update pattern to match optional \r before end of line: ",domain(\r)?$"
- This now correctly handles both Unix (\n) and Windows (\r\n) line endings

Tested with the actual Tranco file and confirmed removal of:
- 613,uk.com
- 2644,net.ru
- 6123,br.com
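
The pattern construction might look like this sketch (variable names assumed; a literal CR byte is embedded via bash's $'\r', since, as the next commit notes, the \r regex escape behaves inconsistently across grep implementations):

```bash
domain="net.ru"
escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')   # -> net\.ru
cr=$'\r'                                              # literal carriage return
# Optional \r before end-of-line matches both Unix and Windows endings.
grep -v -E ",${escaped}(${cr})?\$" tranco.csv > tranco.tmp
mv tranco.tmp tranco.csv
```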

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of trying to match \r in regex patterns (which has inconsistent
behavior across different grep implementations), normalize the file to
Unix line endings first using 'tr -d', then use simple end-of-line patterns.

Changes:
- Add tr -d '\r' step to strip all carriage returns before processing
- Simplify the grep pattern from ",domain(\r)?$" to ",domain$"
- Use grep -c directly (safe now that output is clean)

Tested locally and confirmed:
- Removes net.ru, uk.com, br.com successfully
- File line count reduced from 1000000 to 999997
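
A sketch of the normalize-then-match approach (file names are assumptions):

```bash
# Strip every carriage return once, up front.
tr -d '\r' < tranco.csv > tranco_unix.csv
mv tranco_unix.csv tranco.csv

# Plain end-of-line anchors now work on every line.
escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')
COUNT=$(grep -c -E ",${escaped}\$" tranco.csv)
# Note: grep -c still exits non-zero when nothing matches, which the
# next commit addresses.
```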

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace 'grep -c' with grep | wc -l | tr -d ' ' to ensure clean integer
output without newlines or extra whitespace. This prevents the
"integer expression expected" error when domains are not found.

The previous version's grep -c output was not always a bare integer (and
grep -c exits non-zero when nothing matches), which caused bash's integer
comparison to fail.
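
The failure mode and the fix, illustrated (a hedged sketch; the exact comparison used in the workflow is an assumption):

```bash
# grep -c prints "0" on no match but exits with status 1; depending on
# how the script captures it (set -e, a missing file, multiple inputs),
# the variable can end up empty or malformed, and
#   [ "$COUNT" -eq 0 ]
# then fails with "integer expression expected".
COUNT=$(grep ",nosuchdomain\$" tranco.csv | wc -l | tr -d ' ')
[ "$COUNT" -eq 0 ] && echo "No matching entries"   # COUNT is always a bare integer
```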

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of maintaining a manual list of ~20 TLD-like domains, now fetch
and use the complete Public Suffix List (PSL) from publicsuffix.org.
Remove any Tranco entries that exactly match a PSL entry.

This is more comprehensive and maintainable:
- Covers ~9,754 public suffixes (vs 21 hardcoded)
- Automatically includes new suffixes as PSL is updated
- Removes infrastructure domains (workers.dev, github.io, herokuapp.com, etc.)
- Removes second-level TLDs (br.com, uk.com, net.ru, etc.)

Tested on top 10k Tranco domains:
- Found and removed 75 PSL entries
- Including: workers.dev, github.io, herokuapp.com, netlify.app,
  vercel.app, and all previously hardcoded domains
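
A sketch of the PSL fetch and the matching pass as described; the URL is real, while file names and the loop structure are assumptions (the next commit replaces the loop with a single awk pass):

```bash
# Download the canonical PSL.
curl -sSf https://publicsuffix.org/list/public_suffix_list.dat -o psl.dat

# Keep exact suffix entries only: drop comments, blank lines, and
# wildcard/exception rules (*.example, !sub.example).
grep -v -E '^(//|$)' psl.dat | grep -v -E '^[*!]' > psl_entries.txt

# Naive per-suffix loop over the Tranco file.
TOTAL_REMOVED=0
while IFS= read -r suffix; do
  escaped=$(printf '%s' "$suffix" | sed 's/\./\\./g')
  COUNT=$(grep -E ",${escaped}\$" tranco.csv | wc -l | tr -d ' ')
  if [ "$COUNT" -gt 0 ]; then
    grep -v -E ",${escaped}\$" tranco.csv > tranco.tmp
    mv tranco.tmp tranco.csv
    TOTAL_REMOVED=$((TOTAL_REMOVED + COUNT))
  fi
done < psl_entries.txt
echo "Removed ${TOTAL_REMOVED} PSL entries"
```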

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…code quality

Changes based on Copilot review feedback:

1. **Removed redundant variable**: Eliminated FOUND_COUNT, using only TOTAL_REMOVED

2. **Fixed regex escaping vulnerability**: Replaced regex-based grep with awk's
   exact string matching using associative arrays. This avoids all regex
   special character issues (dots, brackets, parentheses, hyphens, etc.)

3. **Added comprehensive error handling**:
   - Check curl exit code and fail fast if PSL download fails
   - Verify PSL file has content before proceeding
   - Added explicit error messages

4. **Optimized performance**: Replaced the O(n×m) loop (1000 iterations of grep
   over 1M lines) with a single-pass awk filter using hash-table lookups, O(n+m).
   This reduces processing time from ~3.5 minutes to ~10 seconds.

The awk approach (sketched below):
- Loads all PSL entries into an associative array (hash table)
- Processes tranco.csv in a single pass
- Uses exact string matching via 'in' operator (no regex)
- Outputs filtered data and removal count efficiently
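
A sketch of that single-pass filter with the error handling described above (file names are assumptions):

```bash
# Fail fast if the PSL download fails or yields an empty file.
if ! curl -sSf https://publicsuffix.org/list/public_suffix_list.dat -o psl.dat; then
  echo "ERROR: failed to download the Public Suffix List" >&2
  exit 1
fi
if [ ! -s psl.dat ]; then
  echo "ERROR: downloaded PSL file is empty" >&2
  exit 1
fi

# One pass over each file: load PSL entries into an associative array,
# then stream tranco.csv and drop exact matches on the domain column.
awk -F',' '
  NR == FNR {
    # First input (psl.dat): skip comments, blanks, wildcards, exceptions.
    if ($0 ~ /^(\/\/|$)/ || $0 ~ /^[*!]/) next
    psl[$0] = 1
    next
  }
  {
    # Second input (tranco.csv, "rank,domain"): exact hash lookup, no regex.
    domain = $2
    sub(/\r$/, "", domain)   # tolerate any stray CRLF endings
    if (domain in psl) { removed++; next }
    print
  }
  END { printf "Removed %d PSL entries\n", removed > "/dev/stderr" }
' psl.dat tranco.csv > tranco_filtered.csv
mv tranco_filtered.csv tranco.csv
```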

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>