adirel/www-domain-crawler

Site Miner - Domain Discovery Tool

A .NET console application that crawls websites to discover the domains they link to, then retrieves WHOIS registration details for each domain.

Features

  • Web Crawling: Recursively crawls websites to specified depth levels
  • Concurrent Processing: Uses async/await and semaphores for efficient parallel crawling
  • Domain Discovery: Extracts unique domains from discovered URLs
  • 🌲 Interactive Tree Visualization: Text-based tree view of domain relationships
  • WHOIS Lookup: Retrieves domain registration information including:
    • Domain owner/registrant
    • Domain expiration date
    • Domain registration date
    • Registrar information
  • SQLite Database: Stores discovered domains with efficient indexing
  • Duplicate Prevention: Only saves new domains (not already in database)
  • 📊 Rich Console UI: Powered by Spectre.Console with colors, tables, and charts
  • Performance Optimized:
    • Bulk database queries
    • HashSet for O(1) lookups
    • Concurrent HTTP requests with throttling
    • Database indexes on frequently queried fields

Prerequisites

  • .NET 9.0 SDK or later
  • Windows, macOS, or Linux
  • Modern terminal recommended (Windows Terminal, iTerm2, etc.) for best visual experience

Installation

  1. Clone or download this repository
  2. Navigate to the SiteMiner directory
  3. Restore packages:
    dotnet restore

Usage

Interactive Mode (Recommended)

Run without arguments for the interactive menu:

dotnet run

Choose from:

  • 🌐 Start new mining session - Mine a new website
  • 🌲 View domain tree - Visualize discovered domains in a tree structure
  • 📊 View database statistics - See mining statistics with charts
  • Exit

Command Line

dotnet run <url> <level>

Arguments:

  • url: The starting URL to mine (e.g., https://example.com)
  • level: The depth level to crawl (0-3 recommended for reasonable runtime)

Examples:

# Mine a website
dotnet run https://example.com 1
dotnet run https://news.ycombinator.com 2

# View domain tree directly
dotnet run --tree

# Interactive mode
dotnet run

How It Works

  1. Initialization: Creates SQLite database (siteminer.db) if it doesn't exist
  2. Crawling: Starts from the given URL and crawls to the specified depth
    • Level 0: Only the starting URL
    • Level 1: Starting URL + all links found on it
    • Level 2: Level 1 + all links found on those pages
    • And so on...
  3. Domain Extraction: Extracts unique domain names from all discovered URLs
  4. New Domain Filtering: Queries database to find which domains are new
  5. WHOIS Retrieval: Fetches WHOIS information for new domains (rate-limited to avoid blocking)
  6. Database Storage: Saves all new domains with their WHOIS information
  7. Statistics & Visualization: View results as statistics or interactive tree view
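
The level semantics and the extract-then-filter steps above can be sketched as follows. This is a minimal illustration, not the project's actual WebCrawler code: the in-memory `links` graph stands in for real HTTP fetches, the names `Crawl` and `ExtractDomains` are hypothetical, and the host name is used as the "domain" (the real tool may normalize hosts differently, for example by stripping `www.`).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical link graph: page URL -> links found on that page (no network access).
var links = new Dictionary<string, string[]>
{
    ["https://example.com/"] = new[] { "https://example.com/about", "https://blog.example.org/post" },
    ["https://example.com/about"] = new[] { "https://partner.example.net/" },
    ["https://blog.example.org/post"] = new[] { "https://deep.example.info/" },
};

// Breadth-first crawl. Level 0 is the start URL; level n adds the links
// found on level n-1 pages. Pages at maxLevel are not expanded further.
Dictionary<string, int> Crawl(string start, int maxLevel)
{
    var levelByUrl = new Dictionary<string, int> { [start] = 0 };
    var queue = new Queue<string>();
    queue.Enqueue(start);
    while (queue.Count > 0)
    {
        var url = queue.Dequeue();
        var level = levelByUrl[url];
        if (level >= maxLevel || !links.TryGetValue(url, out var found)) continue;
        foreach (var next in found)
        {
            // TryAdd returns false for URLs that were already discovered.
            if (levelByUrl.TryAdd(next, level + 1)) queue.Enqueue(next);
        }
    }
    return levelByUrl;
}

// Step 3: reduce URLs to a set of unique host names.
HashSet<string> ExtractDomains(IEnumerable<string> urls) =>
    urls.Select(u => new Uri(u).Host.ToLowerInvariant()).ToHashSet();

// Step 4: keep only hosts that are not already stored.
// "existing" stands in for the result of a single bulk database query.
var existing = new HashSet<string> { "example.com" };
var discovered = ExtractDomains(Crawl("https://example.com/", 1).Keys);
var newDomains = discovered.Where(d => !existing.Contains(d)).OrderBy(d => d).ToList();

Console.WriteLine(string.Join(", ", newDomains));
```

Loading the known domains into a HashSet once means each membership check is a constant-time lookup rather than a database round trip per domain.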

🌲 Domain Tree Viewer

The application includes a tree view that displays discovered domains hierarchically, grouped by discovery level under the mined root domain:

bvd.co.il ○
└── Level 1 (4 domains)
    ├── develops.co.il ○
    ├── pinterest.com ✓ | Owner: DNStination Inc. | Registrar: MarkMonitor, Inc.
    ├── wa.me ✓ | Owner: Whatsapp Inc. | Registrar: RegistrarSafe, LLC
    └── youtube.com ✓ | Owner: Google LLC | Registrar: MarkMonitor, Inc.

Features:

  • Interactive domain selection from mined roots
  • Color-coded information (owner, expiration, registrar)
  • WHOIS status icons (✓ = has info, ○ = no info)
  • Tree statistics and summaries
  • Expiration date warnings with color coding

See TREE_VIEWER_GUIDE.md for detailed documentation.
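
As a rough idea of how such a tree can be produced, the sketch below builds the same line layout with plain strings. It is an assumption-laden illustration only: the actual viewer is described as using Spectre.Console, and the `RenderTree` and `Describe` helpers, the tuple shape, and the convention that an empty owner means "no WHOIS info" are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical rows: an empty Owner means no WHOIS data was stored.
var level1 = new List<(string Domain, string Owner, string Registrar)>
{
    ("develops.co.il", "", ""),
    ("youtube.com", "Google LLC", "MarkMonitor, Inc."),
};

// Formats one domain with its WHOIS status icon (✓ = has info, ○ = no info).
string Describe((string Domain, string Owner, string Registrar) d) =>
    d.Owner.Length == 0
        ? $"{d.Domain} ○"
        : $"{d.Domain} ✓ | Owner: {d.Owner} | Registrar: {d.Registrar}";

// Builds the lines for a root label with a single "Level 1" group beneath it.
List<string> RenderTree(string rootLabel, IReadOnlyList<(string Domain, string Owner, string Registrar)> children)
{
    var lines = new List<string> { rootLabel, $"└── Level 1 ({children.Count} domains)" };
    for (var i = 0; i < children.Count; i++)
    {
        // The last child closes the branch; earlier children continue it.
        var branch = i == children.Count - 1 ? "└── " : "├── ";
        lines.Add("    " + branch + Describe(children[i]));
    }
    return lines;
}

foreach (var line in RenderTree("bvd.co.il ○", level1)) Console.WriteLine(line);
```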

Architecture

Key Classes

  • Domain (Model): Entity representing a domain with WHOIS information
  • SiteMinerDbContext: Entity Framework Core database context for SQLite
  • WebCrawler: Concurrent web crawler with throttling
  • WhoisService: WHOIS information retrieval and parsing
  • DomainRepository: Data access layer with optimized queries
  • DomainTreeVisualizer: Interactive tree view for domain relationships
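
To give a feel for the parsing side of WhoisService, here is a minimal sketch that reads "Key: Value" lines from a WHOIS response. It is not the project's implementation: `ParseFields` and `ParseDate` are hypothetical helpers, the sample text is made up, and the field names follow the style many gTLD registries use. Real responses differ between TLDs, so a production parser needs per-registry handling.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Made-up sample in the "Key: Value" style; not a real WHOIS response.
var sample = string.Join("\n", new[]
{
    "Domain Name: EXAMPLE.COM",
    "Registrar: Example Registrar, Inc.",
    "Creation Date: 1995-08-14T04:00:00Z",
    "Registry Expiry Date: 2026-08-13T04:00:00Z",
});

// Splits each line at its first colon and keeps the first value seen per key.
Dictionary<string, string> ParseFields(string whoisText)
{
    var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (var rawLine in whoisText.Split('\n'))
    {
        var idx = rawLine.IndexOf(':');
        if (idx <= 0) continue; // no key on this line
        var key = rawLine[..idx].Trim();
        var value = rawLine[(idx + 1)..].Trim();
        if (value.Length > 0) fields.TryAdd(key, value);
    }
    return fields;
}

// Returns the field as a UTC DateTime, or null when missing or unparseable.
DateTime? ParseDate(Dictionary<string, string> source, string key) =>
    source.TryGetValue(key, out var text) &&
    DateTime.TryParse(text, CultureInfo.InvariantCulture,
        DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal, out var parsed)
        ? (DateTime?)parsed
        : null;

var record = ParseFields(sample);
var registrar = record["Registrar"];
var expires = ParseDate(record, "Registry Expiry Date");
Console.WriteLine($"Registrar: {registrar}");
Console.WriteLine($"Expires: {expires:yyyy-MM-dd}");
```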

Performance Features

  1. Concurrent Crawling: Uses SemaphoreSlim to limit concurrent HTTP requests (default: 5)
  2. Bulk Database Operations: Single query to check multiple domains at once
  3. HashSet Lookups: O(1) time complexity for duplicate checking
  4. Database Indexing: Indexes on DomainName and DiscoveredAt fields
  5. Efficient URL Management: ConcurrentDictionary to track visited URLs
  6. Rich Console UI: Powered by Spectre.Console for beautiful visualizations
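
The throttling and visited-URL tracking described in items 1 and 5 can be sketched like this. It is a simplified model rather than the project's WebCrawler: `FetchAsync` is a hypothetical stand-in that simulates latency with `Task.Delay` instead of issuing HTTP requests, and the `peak` counter exists only to show that the limit holds.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int MaxConcurrentRequests = 5;
var gate = new SemaphoreSlim(MaxConcurrentRequests);
var visited = new ConcurrentDictionary<string, byte>();
var sync = new object();
var inFlight = 0;
var peak = 0;

// Hypothetical fetch: claims the URL, waits for a free slot, then "downloads".
async Task FetchAsync(string url)
{
    // TryAdd is atomic, so only one task ever processes a given URL.
    if (!visited.TryAdd(url, 0)) return;

    await gate.WaitAsync();
    try
    {
        var now = Interlocked.Increment(ref inFlight);
        lock (sync) { peak = Math.Max(peak, now); }
        await Task.Delay(20); // simulated network latency
        Interlocked.Decrement(ref inFlight);
    }
    finally
    {
        gate.Release(); // always free the slot, even if the fetch throws
    }
}

// 20 requests over 15 distinct URLs: 5 of them are duplicates and are skipped.
var urls = Enumerable.Range(0, 20).Select(i => $"https://example.com/page{i % 15}").ToArray();
await Task.WhenAll(urls.Select(FetchAsync));

Console.WriteLine($"visited={visited.Count}, peak concurrency={peak}");
```

Releasing the semaphore in a `finally` block matters here: a fetch that throws would otherwise leak a slot and gradually reduce the effective concurrency.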

Database Schema

Domains Table:

  • Id (Primary Key)
  • DomainName (Indexed, Unique)
  • SourceUrl
  • Owner
  • ExpirationDate
  • RegistrationDate
  • Registrar
  • DiscoveredAt (Indexed)
  • LastUpdated
  • DiscoveryLevel

Configuration

You can modify these constants in the source code:

  • MaxConcurrentRequests in WebCrawler.cs: Number of simultaneous HTTP requests (default: 5)
  • RequestTimeoutSeconds in WebCrawler.cs: HTTP request timeout (default: 10 seconds)
  • WHOIS processing limit in Program.cs: Number of domains to fetch WHOIS for (default: 20)

Limitations

  • WHOIS queries are rate-limited to 1 per second to avoid server blocking
  • First run processes up to 20 domains with WHOIS info; additional domains are saved without WHOIS
  • Some WHOIS servers may block automated queries
  • Deep crawling (level > 3) can take significant time and discover many domains
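
The first two limitations combine into a simple loop: look up at most a fixed number of domains, pausing between queries, and pass the rest through without WHOIS data. The sketch below shows one way to express that. `ProcessAsync` and its parameters are hypothetical, the lookup is a no-op placeholder, and the delay is shortened so the demo finishes quickly (the stated rate is one query per second).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Looks up the first `limit` domains with a pause between consecutive queries;
// the remaining domains are returned untouched so they can be saved without WHOIS.
async Task<(List<string> Looked, List<string> Skipped)> ProcessAsync(
    IReadOnlyList<string> domains, int limit, TimeSpan delay, Func<string, Task> lookup)
{
    var looked = new List<string>();
    foreach (var domain in domains.Take(limit))
    {
        if (looked.Count > 0) await Task.Delay(delay); // no pause before the first query
        await lookup(domain);
        looked.Add(domain);
    }
    return (looked, domains.Skip(limit).ToList());
}

var all = Enumerable.Range(1, 25).Select(i => $"site{i}.example").ToList();

// Placeholder lookup and a 10 ms delay for demonstration; use TimeSpan.FromSeconds(1)
// to match the one-query-per-second rate.
var result = await ProcessAsync(all, limit: 20, delay: TimeSpan.FromMilliseconds(10),
    lookup: _ => Task.CompletedTask);

Console.WriteLine($"WHOIS fetched: {result.Looked.Count}, saved without WHOIS: {result.Skipped.Count}");
```

At one query per second, a full batch of 20 lookups takes at least 19 seconds of waiting, which is worth keeping in mind when raising the limit.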

Database Location

The SQLite database file (siteminer.db) is created in the application's running directory.

Example Output

=== Site Miner - Domain Discovery Tool ===

Initializing database...
Database ready.

Step 1: Crawling website...
------------------------------------------------------------
Starting crawl from: https://example.com
Max depth level: 1
[Level 0] Crawling: https://example.com
[Level 1] Crawling: https://example.com/about
...

Discovered 25 unique domains.

Step 2: Filtering new domains...
------------------------------------------------------------
Found 15 new domains (not in database).
Skipped 10 existing domains.

Step 3: Retrieving WHOIS information...
------------------------------------------------------------
[1/15] Processing example.com... ✓
[2/15] Processing another-domain.com... ✓
...

Step 4: Saving to database...
------------------------------------------------------------
Saved 15 domains to database.

=== Database Statistics ===
------------------------------------------------------------
Total domains in database: 15
Domains with WHOIS info: 15

Domains by discovery level:
  Level 1: 15 domains

Mining completed in 45.32 seconds.

License

This is a demonstration project. Feel free to use and modify as needed.

Notes

  • This tool is for educational and research purposes
  • Always respect robots.txt and website terms of service
  • Be mindful of server load and implement appropriate delays
  • Some WHOIS servers have strict rate limits
