Site Miner - Domain Discovery Tool

A .NET console application that crawls websites, extracts the domains found in the discovered URLs, and looks up WHOIS registration details for them.

Features

- Web Crawling: Recursively crawls websites to a specified depth level
- Concurrent Processing: Uses async/await and semaphores for efficient parallel crawling
- Domain Discovery: Extracts unique domains from discovered URLs
- 🌲 Interactive Tree Visualization: Text-based tree view of domain relationships, drawn with box-drawing characters
- WHOIS Lookup: Retrieves domain registration information including:
  - Domain owner/registrant
  - Domain expiration date
  - Domain registration date
  - Registrar information
- SQLite Database: Stores discovered domains with efficient indexing
- Duplicate Prevention: Only saves new domains (not already in database)
- 📊 Rich Console UI: Powered by Spectre.Console with colors, tables, and charts
- Performance Optimized:
  - Bulk database queries
  - HashSet for O(1) lookups
  - Concurrent HTTP requests with throttling
  - Database indexes on frequently queried fields

Prerequisites

.NET 9.0 SDK or later
Windows, macOS, or Linux
Modern terminal recommended (Windows Terminal, iTerm2, etc.) for best visual experience

Installation

Clone or download this repository
Navigate to the SiteMiner directory
Restore packages:
```
dotnet restore
```

Usage

Interactive Mode (Recommended)

Run without arguments for the interactive menu:

```
dotnet run
```

Choose from:

🌐 Start new mining session - Mine a new website
🌲 View domain tree - Visualize discovered domains in a tree structure
📊 View database statistics - See mining statistics with charts
❌ Exit

Command Line

```
dotnet run <url> <level>
```

Arguments:

url: The starting URL to mine (e.g., https://example.com)
level: The depth level to crawl (0-3 recommended for reasonable runtime)
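
The argument handling described above could be sketched roughly as follows. This is only an illustration, not the code in Program.cs: the `ParseArgs` helper and the mode names ("interactive", "tree", "mine", "usage") are assumptions made for this example.

```csharp
using System;

// Hypothetical helper: maps raw command-line arguments to a mode, a URL, and a depth level.
static (string Mode, string Url, int Level) ParseArgs(string[] args)
{
    // No arguments: fall back to the interactive menu.
    if (args.Length == 0) return ("interactive", "", 0);

    // Single --tree flag: open the tree viewer directly.
    if (args.Length == 1 && args[0] == "--tree") return ("tree", "", 0);

    // <url> <level>: require an absolute http(s) URL and a non-negative integer level.
    if (args.Length == 2
        && Uri.TryCreate(args[0], UriKind.Absolute, out var uri)
        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
        && int.TryParse(args[1], out int level)
        && level >= 0)
    {
        return ("mine", uri.AbsoluteUri, level);
    }

    // Anything else: the caller should print usage help.
    return ("usage", "", 0);
}

Console.WriteLine(ParseArgs(new[] { "https://example.com", "1" }));
```

Validating the URL and level up front means a typo fails fast instead of starting a crawl with a bad starting point.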

Examples:

```
# Mine a website
dotnet run https://example.com 1
dotnet run https://news.ycombinator.com 2

# View domain tree directly
dotnet run --tree

# Interactive mode
dotnet run
```

How It Works

- Initialization: Creates SQLite database (siteminer.db) if it doesn't exist
- Crawling: Starts from the given URL and crawls to the specified depth
  - Level 0: Only the starting URL
  - Level 1: Starting URL + all links found on it
  - Level 2: Level 1 + all links found on those pages
  - And so on...
- Domain Extraction: Extracts unique domain names from all discovered URLs
- New Domain Filtering: Queries database to find which domains are new
- WHOIS Retrieval: Fetches WHOIS information for new domains (rate-limited to avoid blocking)
- Database Storage: Saves all new domains with their WHOIS information
- Statistics & Visualization: View results as statistics or interactive tree view
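
The crawl, extraction, and filtering steps above can be sketched without any network access. The example below is a simplified model under stated assumptions, not the project's WebCrawler or DomainRepository: the in-memory `links` map stands in for real pages, the helper names are made up, and `Uri.Host` is used as the "domain" (reducing hosts to registrable domains would need extra logic that the README does not describe).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory "web": page URL -> links found on that page.
var links = new Dictionary<string, string[]>
{
    ["https://example.com/"] = new[] { "https://example.com/about", "https://blog.example.org/post" },
    ["https://example.com/about"] = new[] { "https://partner.example.net/" },
};

// Level-by-level traversal: level 0 is the start URL, level n+1 adds links found on level n pages.
static Dictionary<string, int> CrawlLevels(string start, int maxLevel, IReadOnlyDictionary<string, string[]> graph)
{
    var discoveredAt = new Dictionary<string, int> { [start] = 0 };
    var frontier = new List<string> { start };
    for (int level = 0; level < maxLevel && frontier.Count > 0; level++)
    {
        var next = new List<string>();
        foreach (var page in frontier)
        {
            if (!graph.TryGetValue(page, out var outgoing)) continue;
            foreach (var url in outgoing)
                if (discoveredAt.TryAdd(url, level + 1)) next.Add(url); // skip URLs seen earlier
        }
        frontier = next;
    }
    return discoveredAt;
}

// Collects unique host names from the discovered URLs.
static HashSet<string> ExtractDomains(IEnumerable<string> urls)
{
    var domains = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var url in urls)
        if (Uri.TryCreate(url, UriKind.Absolute, out var uri) && uri.Host.Length > 0)
            domains.Add(uri.Host);
    return domains;
}

// Keeps only domains that are not in the set of already-known domains (O(1) per lookup).
static List<string> FilterNew(IEnumerable<string> discovered, HashSet<string> known) =>
    discovered.Where(d => !known.Contains(d)).ToList();

var found = CrawlLevels("https://example.com/", 1, links);
var domains = ExtractDomains(found.Keys);
var known = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "example.com" }; // pretend this came from the database
var fresh = FilterNew(domains, known);
Console.WriteLine(string.Join(", ", fresh));
```

With level 1, only the start page and the links on it are recorded, so `partner.example.net` (reachable at level 2) is not discovered.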

🌲 Domain Tree Viewer

The application includes a tree view that displays discovered domains in a hierarchical structure, grouped by discovery level:

```
bvd.co.il ○
└── Level 1 (4 domains)
    ├── develops.co.il ○
    ├── pinterest.com ✓ | Owner: DNStination Inc. | Registrar: MarkMonitor, Inc.
    ├── wa.me ✓ | Owner: Whatsapp Inc. | Registrar: RegistrarSafe, LLC
    └── youtube.com ✓ | Owner: Google LLC | Registrar: MarkMonitor, Inc.
```

Features:

Interactive domain selection from mined roots
Color-coded information (owner, expiration, registrar)
WHOIS status icons (✓ = has info, ○ = no info)
Tree statistics and summaries
Expiration date warnings with color coding

See TREE_VIEWER_GUIDE.md for detailed documentation.
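
To make the layout of the sample tree concrete, here is a minimal plain-string sketch that reproduces it. It is not the project's DomainTreeVisualizer (which uses Spectre.Console for color and interactivity); `RenderTree` and its parameters are hypothetical, and the sample data is copied from the tree shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical renderer: root line, one "Level N" group, then one child domain per line.
static string RenderTree(string root, bool rootHasWhois, int level,
    IReadOnlyList<(string Domain, string Owner, string Registrar)> children)
{
    var sb = new StringBuilder();
    sb.Append(root).Append(rootHasWhois ? " ✓" : " ○").Append('\n');
    sb.Append("└── Level ").Append(level).Append(" (").Append(children.Count).Append(" domains)\n");
    for (int i = 0; i < children.Count; i++)
    {
        var c = children[i];
        bool last = i == children.Count - 1;
        // A domain counts as "has info" when at least one WHOIS field is present.
        bool hasWhois = !string.IsNullOrEmpty(c.Owner) || !string.IsNullOrEmpty(c.Registrar);
        sb.Append("    ").Append(last ? "└── " : "├── ").Append(c.Domain).Append(hasWhois ? " ✓" : " ○");
        if (!string.IsNullOrEmpty(c.Owner)) sb.Append(" | Owner: ").Append(c.Owner);
        if (!string.IsNullOrEmpty(c.Registrar)) sb.Append(" | Registrar: ").Append(c.Registrar);
        sb.Append('\n');
    }
    return sb.ToString();
}

Console.OutputEncoding = Encoding.UTF8; // needed on some terminals for the box-drawing characters
var sample = new List<(string Domain, string Owner, string Registrar)>
{
    ("develops.co.il", "", ""),
    ("pinterest.com", "DNStination Inc.", "MarkMonitor, Inc."),
    ("wa.me", "Whatsapp Inc.", "RegistrarSafe, LLC"),
    ("youtube.com", "Google LLC", "MarkMonitor, Inc."),
};
Console.Write(RenderTree("bvd.co.il", false, 1, sample));
```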

Architecture

Key Classes

Domain (Model): Entity representing a domain with WHOIS information
SiteMinerDbContext: Entity Framework Core database context for SQLite
WebCrawler: Concurrent web crawler with throttling
WhoisService: WHOIS information retrieval and parsing
DomainRepository: Data access layer with optimized queries
DomainTreeVisualizer: Interactive tree view for domain relationships
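
As a rough idea of what the parsing half of a WHOIS service involves, the sketch below turns a "Key: Value" style response into the fields stored for each domain. It is not the project's WhoisService: the helper names are hypothetical, the sample response is made up, and the field labels ("Registrar", "Registrant Organization", "Creation Date", "Registry Expiry Date") are an assumption, since labels vary between WHOIS servers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Splits "Key: Value" lines into a case-insensitive map, keeping the first value per key.
static Dictionary<string, string> ParseWhois(string response)
{
    var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (var rawLine in response.Split('\n'))
    {
        var line = rawLine.Trim();
        int colon = line.IndexOf(':');
        // Skip comment lines and lines without a key.
        if (colon <= 0 || line.StartsWith('%') || line.StartsWith('#')) continue;
        var key = line.Substring(0, colon).Trim();
        var value = line.Substring(colon + 1).Trim();
        if (value.Length > 0) fields.TryAdd(key, value);
    }
    return fields;
}

// Returns the value of the first label that is present, or "" if none is.
static string FirstValue(Dictionary<string, string> fields, params string[] labels)
{
    foreach (var label in labels)
        if (fields.TryGetValue(label, out var value)) return value;
    return "";
}

// Parses a WHOIS date; returns null when the value is missing or not a recognizable date.
static DateTime? ParseDate(string value) =>
    DateTimeOffset.TryParse(value, CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal, out var parsed)
        ? parsed.UtcDateTime
        : (DateTime?)null;

// Made-up response used only for illustration.
var sample = string.Join("\n", new[]
{
    "% Sample WHOIS response (not real data)",
    "Domain Name: EXAMPLE.TEST",
    "Registrar: Example Registrar, Inc.",
    "Creation Date: 1995-08-14T04:00:00Z",
    "Registry Expiry Date: 2030-08-13T04:00:00Z",
    "Registrant Organization: Example Org",
});

var fields = ParseWhois(sample);
var registrar = FirstValue(fields, "Registrar");
var owner = FirstValue(fields, "Registrant Organization", "Registrant Name");
var expires = ParseDate(FirstValue(fields, "Registry Expiry Date", "Expiration Date"));
Console.WriteLine($"Registrar: {registrar}, Owner: {owner}, Expires: {expires:yyyy-MM-dd}");
```

Accepting several candidate labels per field is one way to cope with servers that name the same information differently.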

Performance Features

Concurrent Crawling: Uses SemaphoreSlim to limit concurrent HTTP requests (default: 5)
Bulk Database Operations: Single query to check multiple domains at once
HashSet Lookups: O(1) time complexity for duplicate checking
Database Indexing: Indexes on DomainName and DiscoveredAt fields
Efficient URL Management: ConcurrentDictionary to track visited URLs
Rich Console UI: Powered by Spectre.Console for beautiful visualizations
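
The throttling and visited-URL tracking listed above can be illustrated with a small self-contained sketch. This is an assumption-laden example rather than the actual WebCrawler: `FetchAllAsync` and `FakeFetch` are hypothetical names, and the fake fetch only sleeps and records how many calls overlap.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Runs fetch for each distinct URL, with at most maxConcurrent fetches in flight at once.
static async Task<int> FetchAllAsync(IEnumerable<string> urls, Func<string, Task> fetch, int maxConcurrent)
{
    using var gate = new SemaphoreSlim(maxConcurrent);
    var visited = new ConcurrentDictionary<string, bool>();
    var tasks = new List<Task>();
    foreach (var url in urls)
    {
        if (!visited.TryAdd(url, true)) continue; // already scheduled, skip the duplicate
        tasks.Add(RunAsync(url));
    }
    await Task.WhenAll(tasks);
    return visited.Count;

    async Task RunAsync(string url)
    {
        await gate.WaitAsync();          // wait for a free slot
        try { await fetch(url); }
        finally { gate.Release(); }      // always free the slot, even if fetch throws
    }
}

// Fake fetch that records the highest number of overlapping calls.
int inFlight = 0, peak = 0;
var sync = new object();
async Task FakeFetch(string url)
{
    lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
    await Task.Delay(20);
    lock (sync) { inFlight--; }
}

// 20 URLs, but only 15 distinct ones.
var urls = Enumerable.Range(0, 20).Select(i => $"https://example.com/page{i % 15}");
int fetched = await FetchAllAsync(urls, FakeFetch, 5);
Console.WriteLine($"fetched={fetched}, peak concurrency={peak}");
```

Releasing the semaphore in a `finally` block matters here: a failed request would otherwise permanently consume one of the concurrency slots.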

Database Schema

Domains Table:

Id (Primary Key)
DomainName (Indexed, Unique)
SourceUrl
Owner
ExpirationDate
RegistrationDate
Registrar
DiscoveredAt (Indexed)
LastUpdated
DiscoveryLevel

Configuration

You can modify these constants in the source code:

MaxConcurrentRequests in WebCrawler.cs: Number of simultaneous HTTP requests (default: 5)
RequestTimeoutSeconds in WebCrawler.cs: HTTP request timeout (default: 10 seconds)
WHOIS processing limit in Program.cs: Number of domains to fetch WHOIS for (default: 20)

Limitations

WHOIS queries are rate-limited to 1 per second to avoid server blocking
First run processes up to 20 domains with WHOIS info; additional domains are saved without WHOIS
Some WHOIS servers may block automated queries
Deep crawling (level > 3) can take significant time and discover many domains
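
The first two limitations (a delay between WHOIS queries and a cap on how many domains get a lookup) could be combined along these lines. This is a sketch under assumptions, not the logic in Program.cs: `LookupWithLimitsAsync` and `FakeLookup` are hypothetical, an empty string stands for "saved without WHOIS", and the demo uses a 10 ms delay instead of 1 second so it finishes quickly.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Looks up at most maxLookups domains, pausing between queries; the rest get an empty result.
static async Task<Dictionary<string, string>> LookupWithLimitsAsync(
    IReadOnlyList<string> domains, Func<string, Task<string>> lookup, int maxLookups, TimeSpan delay)
{
    var results = new Dictionary<string, string>();
    for (int i = 0; i < domains.Count; i++)
    {
        if (i >= maxLookups)
        {
            results[domains[i]] = "";           // over the cap: keep the domain, skip WHOIS
            continue;
        }
        if (i > 0) await Task.Delay(delay);     // space out consecutive queries
        try { results[domains[i]] = await lookup(domains[i]); }
        catch (Exception) { results[domains[i]] = ""; } // a blocked or failed query does not stop the run
    }
    return results;
}

var domains = new List<string>();
for (int i = 1; i <= 25; i++) domains.Add($"site{i}.example");

// Fake lookup standing in for a real WHOIS query.
Task<string> FakeLookup(string domain) => Task.FromResult($"Registrar for {domain}");

var results = await LookupWithLimitsAsync(domains, FakeLookup, maxLookups: 20, delay: TimeSpan.FromMilliseconds(10));
int withWhois = 0;
foreach (var r in results.Values) if (r.Length > 0) withWhois++;
Console.WriteLine($"{withWhois} with WHOIS, {results.Count - withWhois} without");
```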

Database Location

The SQLite database file (siteminer.db) is created in the application's running directory.

Example Output

```
=== Site Miner - Domain Discovery Tool ===

Initializing database...
Database ready.

Step 1: Crawling website...
------------------------------------------------------------
Starting crawl from: https://example.com
Max depth level: 1
[Level 0] Crawling: https://example.com
[Level 1] Crawling: https://example.com/about
...

Discovered 25 unique domains.

Step 2: Filtering new domains...
------------------------------------------------------------
Found 15 new domains (not in database).
Skipped 10 existing domains.

Step 3: Retrieving WHOIS information...
------------------------------------------------------------
[1/15] Processing example.com... ✓
[2/15] Processing another-domain.com... ✓
...

Step 4: Saving to database...
------------------------------------------------------------
Saved 15 domains to database.

=== Database Statistics ===
------------------------------------------------------------
Total domains in database: 15
Domains with WHOIS info: 15

Domains by discovery level:
  Level 1: 15 domains

Mining completed in 45.32 seconds.
```

License

This is a demonstration project. Feel free to use and modify as needed.

Notes

This tool is for educational and research purposes
Always respect robots.txt and website terms of service
Be mindful of server load and implement appropriate delays
Some WHOIS servers have strict rate limits
