piedomains predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.
Install and classify domains in 3 lines:
pip install piedomains
from piedomains import DomainClassifier
classifier = DomainClassifier()
# Classify current content
result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
print(result[['domain', 'pred_label', 'pred_prob']])
# Expected output:
# domain pred_label pred_prob
# 0 cnn.com news 0.876543
# 1 amazon.com shopping 0.923456
# 2 wikipedia.org education 0.891234- High Accuracy: Combines text analysis + visual screenshots for 90%+ accuracy
- Historical Analysis: Classify websites from any point in time using archive.org
- Fast & Scalable: Batch processing with caching for 1000s of domains
- Easy Integration: Modern Python API with pandas output
- 41 Categories: From news/finance to adult/gambling content
Basic Classification
from piedomains import DomainClassifier
classifier = DomainClassifier()
# Combined analysis (most accurate)
result = classifier.classify(["github.com", "reddit.com"])
# Text-only (faster)
result = classifier.classify_by_text(["news.google.com"])
# Images-only (good for visual content)
result = classifier.classify_by_images(["instagram.com"])Historical Analysis
# Analyze how Facebook looked in 2010 vs today
old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
new_facebook = classifier.classify(["facebook.com"])
print(f"2010: {old_facebook.iloc[0]['pred_label']}")
print(f"2024: {new_facebook.iloc[0]['pred_label']}")Batch Processing
# Process large lists efficiently
domains = ["site1.com", "site2.com", ...] # 1000s of domains
results = classifier.classify_batch(
domains,
method="text", # text|images|combined
batch_size=50, # Process 50 at a time
show_progress=True # Progress bar
)News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
- Speed: ~10-50 domains/minute (depends on method and network)
- Accuracy: 85-95% depending on content type and method
- Memory: <500MB for batch processing
- Caching: Automatic content caching for faster re-runs
Requirements: Python 3.9+
# Basic installation
pip install piedomains
# For development
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e .Old API (still supported):
from piedomains import domain
result = domain.pred_shalla_cat_with_text(["example.com"])New API (recommended):
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify_by_text(["example.com"])- API Reference: https://piedomains.readthedocs.io
- Examples: /examples directory
- Notebooks: /piedomains/notebooks (training & analysis)
# Setup development environment
git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
# Run tests
pytest piedomains/tests/ -v
# Run linting
flake8 piedomains/MIT License - see LICENSE file.
If you use piedomains in research, please cite:
@software{piedomains,
title={piedomains: AI-powered domain content classification},
author={Chintalapati, Rajashekar and Sood, Gaurav},
year={2024},
url={https://github.com/themains/piedomains}
}---
For legacy API documentation, see LEGACY_API.rst