Conversation

@github-actions github-actions bot commented Nov 5, 2025

This PR updates the Tranco top 1 million domains list and all derived files.

aidenmitchell and others added 8 commits November 5, 2025 10:00
This commit adds a new step that filters TLD-like domains out of the Tranco list before processing. The excluded domains are second-level domains that function as alternative TLDs (e.g., net.ru, br.com, uk.com).

This filtering step:
- Removes 25 known TLD-like domains from the Tranco list
- Logs removal counts for transparency
- Runs before the configuration and processing steps
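
A minimal sketch of what this step could look like (file names and the exclusion list shown are assumptions, not the actual workflow code):

```bash
# Hypothetical subset of the 25 excluded TLD-like domains.
TLD_LIKE_DOMAINS="net.ru br.com uk.com"

for domain in $TLD_LIKE_DOMAINS; do
  BEFORE=$(wc -l < tranco.csv)
  # Escape dots so "br.com" cannot also match e.g. "brXcom".
  pattern=$(printf '%s' "$domain" | sed 's/\./\\./g')
  grep -v -E ",${pattern}\$" tranco.csv > tranco.tmp
  mv tranco.tmp tranco.csv
  AFTER=$(wc -l < tranco.csv)
  echo "Removed $((BEFORE - AFTER)) entries for ${domain}"  # log for transparency
done
```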

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Replace grep -c with grep | wc -l | tr -d ' ' to properly handle count
- Add TOTAL_REMOVED counter to track total exclusions
- Add summary output section for better visibility
- Use -E flag consistently for regex matching

The previous version had a newline issue with grep -c output that caused
"integer expression expected" errors in the comparison. This fix ensures
clean integer values for all comparisons.
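
A sketch of the corrected count capture (variable and file names are assumptions; dot escaping in the pattern is tightened in a later commit):

```bash
# wc -l always emits a bare number; tr -d ' ' strips the column padding
# that BSD/macOS wc adds, so the comparison sees a clean integer.
COUNT=$(grep -E ",${domain}\$" tranco.csv | wc -l | tr -d ' ')
TOTAL_REMOVED=$((TOTAL_REMOVED + COUNT))

echo "=== Summary ==="
echo "Total entries removed: ${TOTAL_REMOVED}"
```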

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The Tranco CSV file uses Windows line endings (\r\n), which prevented the
grep pattern from matching domains correctly. The pattern ",domain$"
failed because a \r character sits before the \n.

Changes:
- Properly escape dots in domain names for regex matching
- Update pattern to match optional \r before end of line: ",domain(\r)?$"
- This now correctly handles both Unix (\n) and Windows (\r\n) line endings

Tested with the actual Tranco file and confirmed removal of:
- 613,uk.com
- 2644,net.ru
- 6123,br.com
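
The pattern construction might look like this sketch (variable names assumed; a literal CR byte is embedded via bash's $'\r', since, as the next commit notes, the \r regex escape behaves inconsistently across grep implementations):

```bash
domain="net.ru"
escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')   # -> net\.ru
cr=$'\r'                                              # literal carriage return
# Optional \r before end-of-line matches both Unix and Windows endings.
grep -v -E ",${escaped}(${cr})?\$" tranco.csv > tranco.tmp
mv tranco.tmp tranco.csv
```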

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of trying to match \r in regex patterns (which has inconsistent
behavior across different grep implementations), normalize the file to
Unix line endings first using 'tr -d', then use simple end-of-line patterns.

Changes:
- Add tr -d '\r' step to strip all carriage returns before processing
- Simplify the grep pattern from ",domain(\r)?$" to ",domain$"
- Use grep -c directly (safe now that output is clean)

Tested locally and confirmed:
- Removes net.ru, uk.com, br.com successfully
- File line count reduced from 1000000 to 999997
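
A sketch of the normalize-then-match approach (file names are assumptions):

```bash
# Strip every carriage return once, up front.
tr -d '\r' < tranco.csv > tranco_unix.csv
mv tranco_unix.csv tranco.csv

# Plain end-of-line anchors now work on every line.
escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')
COUNT=$(grep -c -E ",${escaped}\$" tranco.csv)
# Note: grep -c still exits non-zero when nothing matches, which the
# next commit addresses.
```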

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace 'grep -c' with grep | wc -l | tr -d ' ' to ensure clean integer
output without newlines or extra whitespace. This prevents the
"integer expression expected" error when domains are not found.

The previous version's grep -c output was not always a bare integer (and
grep -c exits non-zero when nothing matches), which caused bash's integer
comparison to fail.
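
The failure mode and the fix, illustrated (a hedged sketch; the exact comparison used in the workflow is an assumption):

```bash
# grep -c prints "0" on no match but exits with status 1; depending on
# how the script captures it (set -e, a missing file, multiple inputs),
# the variable can end up empty or malformed, and
#   [ "$COUNT" -eq 0 ]
# then fails with "integer expression expected".
COUNT=$(grep ",nosuchdomain\$" tranco.csv | wc -l | tr -d ' ')
[ "$COUNT" -eq 0 ] && echo "No matching entries"   # COUNT is always a bare integer
```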

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of maintaining a manual list of ~20 TLD-like domains, now fetch
and use the complete Public Suffix List (PSL) from publicsuffix.org.
Remove any Tranco entries that exactly match a PSL entry.

This is more comprehensive and maintainable:
- Covers ~9,754 public suffixes (vs 21 hardcoded)
- Automatically includes new suffixes as PSL is updated
- Removes infrastructure domains (workers.dev, github.io, herokuapp.com, etc.)
- Removes second-level TLDs (br.com, uk.com, net.ru, etc.)

Tested on top 10k Tranco domains:
- Found and removed 75 PSL entries
- Including: workers.dev, github.io, herokuapp.com, netlify.app,
  vercel.app, and all previously hardcoded domains
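
A sketch of the PSL fetch and the matching pass as described; the URL is real, while file names and the loop structure are assumptions (the next commit replaces the loop with a single awk pass):

```bash
# Download the canonical PSL.
curl -sSf https://publicsuffix.org/list/public_suffix_list.dat -o psl.dat

# Keep exact suffix entries only: drop comments, blank lines, and
# wildcard/exception rules (*.example, !sub.example).
grep -v -E '^(//|$)' psl.dat | grep -v -E '^[*!]' > psl_entries.txt

# Naive per-suffix loop over the Tranco file.
TOTAL_REMOVED=0
while IFS= read -r suffix; do
  escaped=$(printf '%s' "$suffix" | sed 's/\./\\./g')
  COUNT=$(grep -E ",${escaped}\$" tranco.csv | wc -l | tr -d ' ')
  if [ "$COUNT" -gt 0 ]; then
    grep -v -E ",${escaped}\$" tranco.csv > tranco.tmp
    mv tranco.tmp tranco.csv
    TOTAL_REMOVED=$((TOTAL_REMOVED + COUNT))
  fi
done < psl_entries.txt
echo "Removed ${TOTAL_REMOVED} PSL entries"
```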

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…code quality

Changes based on Copilot review feedback:

1. **Removed redundant variable**: Eliminated FOUND_COUNT, using only TOTAL_REMOVED

2. **Fixed regex escaping vulnerability**: Replaced regex-based grep with awk's
   exact string matching using associative arrays. This avoids all regex
   special character issues (dots, brackets, parentheses, hyphens, etc.)

3. **Added comprehensive error handling**:
   - Check curl exit code and fail fast if PSL download fails
   - Verify PSL file has content before proceeding
   - Added explicit error messages

4. **Optimized performance**: Replaced the O(n×m) loop (1000 iterations of grep
   over 1M lines) with a single-pass awk filter using hash-table lookups, O(n+m).
   This reduces processing time from ~3.5 minutes to ~10 seconds.

The awk approach (sketched below):
- Loads all PSL entries into an associative array (hash table)
- Processes tranco.csv in a single pass
- Uses exact string matching via 'in' operator (no regex)
- Outputs filtered data and removal count efficiently
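
A sketch of that single-pass filter with the error handling described above (file names are assumptions):

```bash
# Fail fast if the PSL download fails or yields an empty file.
if ! curl -sSf https://publicsuffix.org/list/public_suffix_list.dat -o psl.dat; then
  echo "ERROR: failed to download the Public Suffix List" >&2
  exit 1
fi
if [ ! -s psl.dat ]; then
  echo "ERROR: downloaded PSL file is empty" >&2
  exit 1
fi

# One pass over each file: load PSL entries into an associative array,
# then stream tranco.csv and drop exact matches on the domain column.
awk -F',' '
  NR == FNR {
    # First input (psl.dat): skip comments, blanks, wildcards, exceptions.
    if ($0 ~ /^(\/\/|$)/ || $0 ~ /^[*!]/) next
    psl[$0] = 1
    next
  }
  {
    # Second input (tranco.csv, "rank,domain"): exact hash lookup, no regex.
    domain = $2
    sub(/\r$/, "", domain)   # tolerate any stray CRLF endings
    if (domain in psl) { removed++; next }
    print
  }
  END { printf "Removed %d PSL entries\n", removed > "/dev/stderr" }
' psl.dat tranco.csv > tranco_filtered.csv
mv tranco_filtered.csv tranco.csv
```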

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>