# SiteMiner

A .NET console application that crawls websites to discover domains and retrieves WHOIS information for domain registration details.
## Features

- Web Crawling: Recursively crawls websites to specified depth levels
- Concurrent Processing: Uses async/await and semaphores for efficient parallel crawling
- Domain Discovery: Extracts unique domains from discovered URLs
- 🌲 Interactive Tree Visualization: Beautiful ASCII tree view of domain relationships
- WHOIS Lookup: Retrieves domain registration information including:
  - Domain owner/registrant
  - Domain expiration date
  - Domain registration date
  - Registrar information
- SQLite Database: Stores discovered domains with efficient indexing
- Duplicate Prevention: Only saves new domains (not already in database)
- 📊 Rich Console UI: Powered by Spectre.Console with colors, tables, and charts
- Performance Optimized:
  - Bulk database queries
  - HashSet for O(1) lookups
  - Concurrent HTTP requests with throttling
  - Database indexes on frequently queried fields
## Prerequisites

- .NET 9.0 SDK or later
- Windows, macOS, or Linux
- Modern terminal recommended (Windows Terminal, iTerm2, etc.) for best visual experience
## Installation

- Clone or download this repository
- Navigate to the SiteMiner directory
- Restore packages:

```bash
dotnet restore
```
## Usage

### Interactive Mode

Run without arguments for the interactive menu:

```bash
dotnet run
```

Choose from:
- 🌐 Start new mining session - Mine a new website
- 🌲 View domain tree - Visualize discovered domains in a tree structure
- 📊 View database statistics - See mining statistics with charts
- ❌ Exit
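
A menu like this is a natural fit for Spectre.Console's `SelectionPrompt`. Here is a minimal sketch of that pattern (illustrative only, not the project's actual `Program.cs`):

```csharp
using Spectre.Console;

// Hypothetical sketch of the interactive menu; the choice strings mirror the
// menu above, but the surrounding code is illustrative, not the real source.
var choice = AnsiConsole.Prompt(
    new SelectionPrompt<string>()
        .Title("What would you like to do?")
        .AddChoices(
            "🌐 Start new mining session",
            "🌲 View domain tree",
            "📊 View database statistics",
            "❌ Exit"));

if (choice.StartsWith("❌"))
    return; // exit the top-level program
```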
### Command-Line Mode

```bash
dotnet run <url> <level>
```

Arguments:

- `url`: The starting URL to mine (e.g., https://example.com)
- `level`: The depth level to crawl (0-3 recommended for reasonable runtime)
Examples:

```bash
# Mine a website
dotnet run https://example.com 1
dotnet run https://news.ycombinator.com 2

# View domain tree directly
dotnet run --tree

# Interactive mode
dotnet run
```

## How It Works

- Initialization: Creates the SQLite database (`siteminer.db`) if it doesn't exist
- Crawling: Starts from the given URL and crawls to the specified depth:
  - Level 0: Only the starting URL
  - Level 1: Starting URL + all links found on it
  - Level 2: Level 1 + all links found on those pages
  - And so on...
- Domain Extraction: Extracts unique domain names from all discovered URLs
- New Domain Filtering: Queries the database to find which domains are new (see the sketch after this list)
- WHOIS Retrieval: Fetches WHOIS information for new domains (rate-limited to avoid blocking)
- Database Storage: Saves all new domains with their WHOIS information
- Statistics & Visualization: View results as statistics or interactive tree view
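
The extraction and filtering steps boil down to a host lookup plus a single bulk membership query. A rough sketch of that pattern, where `discoveredUrls` and the context instantiation are hypothetical stand-ins:

```csharp
using Microsoft.EntityFrameworkCore;

// Hypothetical inputs: URLs gathered by the crawl and the EF Core context.
var discoveredUrls = new[] { "https://example.com/about", "https://youtube.com/watch" };
await using var db = new SiteMinerDbContext();

// Domain Extraction: reduce every URL to its unique host name.
var domains = discoveredUrls
    .Select(u => new Uri(u).Host.ToLowerInvariant())
    .ToHashSet();

// New Domain Filtering: one bulk query finds the hosts already stored...
var existing = await db.Domains
    .Where(d => domains.Contains(d.DomainName))
    .Select(d => d.DomainName)
    .ToListAsync();

// ...and a HashSet keeps the duplicate check at O(1) per domain.
var existingSet = new HashSet<string>(existing);
var newDomains = domains.Where(d => !existingSet.Contains(d)).ToList();
```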
## Tree Visualization

The application includes a beautiful tree visualization feature that displays discovered domains in a hierarchical structure:
```
bvd.co.il ○
└── Level 1 (4 domains)
    ├── develops.co.il ○
    ├── pinterest.com ✓ | Owner: DNStination Inc. | Registrar: MarkMonitor, Inc.
    ├── wa.me ✓ | Owner: Whatsapp Inc. | Registrar: RegistrarSafe, LLC
    └── youtube.com ✓ | Owner: Google LLC | Registrar: MarkMonitor, Inc.
```
Features:
- Interactive domain selection from mined roots
- Color-coded information (owner, expiration, registrar)
- WHOIS status icons (✓ = has info, ○ = no info)
- Tree statistics and summaries
- Expiration date warnings with color coding
See TREE_VIEWER_GUIDE.md for detailed documentation.
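
The rendering side maps naturally onto Spectre.Console's `Tree` widget. A minimal sketch with hard-coded nodes (the real `DomainTreeVisualizer` builds them from database rows):

```csharp
using Spectre.Console;

// Illustrative only: build a two-level tree like the sample output above.
var root = new Tree("[bold]bvd.co.il[/] ○");
var level1 = root.AddNode("Level 1 (4 domains)");
level1.AddNode("develops.co.il ○");
level1.AddNode("pinterest.com ✓ | Owner: DNStination Inc.");
level1.AddNode("youtube.com ✓ | Owner: Google LLC");

AnsiConsole.Write(root); // renders the tree to the console
```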
## Architecture

- `Domain` (Model): Entity representing a domain with WHOIS information
- `SiteMinerDbContext`: Entity Framework Core database context for SQLite
- `WebCrawler`: Concurrent web crawler with throttling
- `WhoisService`: WHOIS information retrieval and parsing
- `DomainRepository`: Data access layer with optimized queries
- `DomainTreeVisualizer`: Interactive tree view for domain relationships
## Performance Optimizations

- Concurrent Crawling: Uses `SemaphoreSlim` to limit concurrent HTTP requests (default: 5); see the sketch after this list
- Bulk Database Operations: Single query to check multiple domains at once
- HashSet Lookups: O(1) time complexity for duplicate checking
- Database Indexing: Indexes on the `DomainName` and `DiscoveredAt` fields
- Efficient URL Management: `ConcurrentDictionary` to track visited URLs
- Rich Console UI: Powered by Spectre.Console for beautiful visualizations
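
A condensed sketch of the throttled crawling pattern described above, assuming a `frontier` list holding the URLs of the current level (this shows the pattern, not the actual `WebCrawler` source):

```csharp
using System.Collections.Concurrent;

// Cap parallel requests and remember visited URLs across the whole crawl.
var semaphore = new SemaphoreSlim(5);                 // MaxConcurrentRequests
var visited = new ConcurrentDictionary<string, byte>();
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

async Task<string?> FetchAsync(string url)
{
    if (!visited.TryAdd(url, 0)) return null;         // skip already-seen URLs
    await semaphore.WaitAsync();                      // throttle concurrency
    try     { return await http.GetStringAsync(url); }
    catch   { return null; }                          // tolerate dead links
    finally { semaphore.Release(); }
}

// One crawl level: fetch the whole frontier in parallel, at most 5 at a time.
var frontier = new List<string> { "https://example.com" };
var pages = await Task.WhenAll(frontier.Select(FetchAsync));
```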
## Database Schema

Domains table:

- `Id` (Primary Key)
- `DomainName` (Indexed, Unique)
- `SourceUrl`
- `Owner`
- `ExpirationDate`
- `RegistrationDate`
- `Registrar`
- `DiscoveredAt` (Indexed)
- `LastUpdated`
- `DiscoveryLevel`
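
As a sketch of how this table could map onto the EF Core entity and context (property types are inferred from the column list, not copied from the source):

```csharp
using Microsoft.EntityFrameworkCore;

public class Domain
{
    public int Id { get; set; }
    public string DomainName { get; set; } = "";
    public string? SourceUrl { get; set; }
    public string? Owner { get; set; }
    public DateTime? ExpirationDate { get; set; }
    public DateTime? RegistrationDate { get; set; }
    public string? Registrar { get; set; }
    public DateTime DiscoveredAt { get; set; }
    public DateTime LastUpdated { get; set; }
    public int DiscoveryLevel { get; set; }
}

public class SiteMinerDbContext : DbContext
{
    public DbSet<Domain> Domains => Set<Domain>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=siteminer.db");

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // The two indexes called out above: unique on DomainName,
        // non-unique on DiscoveredAt.
        modelBuilder.Entity<Domain>().HasIndex(d => d.DomainName).IsUnique();
        modelBuilder.Entity<Domain>().HasIndex(d => d.DiscoveredAt);
    }
}
```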
## Configuration

You can modify these constants in the source code:

- `MaxConcurrentRequests` in `WebCrawler.cs`: Number of simultaneous HTTP requests (default: 5)
- `RequestTimeoutSeconds` in `WebCrawler.cs`: HTTP request timeout (default: 10 seconds)
- WHOIS processing limit in `Program.cs`: Number of domains to fetch WHOIS for (default: 20)
## Notes

- WHOIS queries are rate-limited to 1 per second to avoid server blocking; see the sketch after this list
- Each run processes up to 20 domains with WHOIS info; additional domains are saved without WHOIS data
- Some WHOIS servers may block automated queries
- Deep crawling (level > 3) can take significant time and discover many domains
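
At the protocol level, a WHOIS lookup is just a domain name sent over TCP port 43. The sketch below shows that plus the 1-second pacing; the `newDomains` list and the hard-coded Verisign server (authoritative for `.com`) are assumptions, and the real `WhoisService` likely resolves the server per TLD:

```csharp
using System.Net.Sockets;
using System.Text;

// Minimal raw WHOIS query: write the domain, read until the server closes.
async Task<string> WhoisAsync(string domain, string server = "whois.verisign-grs.com")
{
    using var client = new TcpClient();
    await client.ConnectAsync(server, 43);
    using var stream = client.GetStream();
    await stream.WriteAsync(Encoding.ASCII.GetBytes(domain + "\r\n"));
    using var reader = new StreamReader(stream);
    return await reader.ReadToEndAsync();
}

var newDomains = new List<string> { "example.com" }; // hypothetical input
foreach (var domain in newDomains.Take(20))          // WHOIS limit per run
{
    var whois = await WhoisAsync(domain);
    await Task.Delay(1000);                          // 1 query per second
}
```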
The SQLite database file (`siteminer.db`) is created in the application's working directory.
## Example Output

```
=== Site Miner - Domain Discovery Tool ===
Initializing database...
Database ready.

Step 1: Crawling website...
------------------------------------------------------------
Starting crawl from: https://example.com
Max depth level: 1
[Level 0] Crawling: https://example.com
[Level 1] Crawling: https://example.com/about
...
Discovered 25 unique domains.

Step 2: Filtering new domains...
------------------------------------------------------------
Found 15 new domains (not in database).
Skipped 10 existing domains.

Step 3: Retrieving WHOIS information...
------------------------------------------------------------
[1/15] Processing example.com... ✓
[2/15] Processing another-domain.com... ✓
...

Step 4: Saving to database...
------------------------------------------------------------
Saved 15 domains to database.

=== Database Statistics ===
------------------------------------------------------------
Total domains in database: 15
Domains with WHOIS info: 15
Domains by discovery level:
  Level 1: 15 domains

Mining completed in 45.32 seconds.
```
## License

This is a demonstration project. Feel free to use and modify as needed.
## Disclaimer

- This tool is for educational and research purposes
- Always respect robots.txt and website terms of service
- Be mindful of server load and implement appropriate delays
- Some WHOIS servers have strict rate limits