A flexible documentation crawler that can scrape and process documentation from any website.
First install dependencies:
```bash
pip install -r requirements.txt
```

Then install the package in editable mode:

```bash
pip install -e .
```

The `-e` flag installs the package in "editable" mode, which means:
- The package is installed in your Python environment
- Python looks for the package in your current directory instead of copying files into `site-packages`
- Changes to the source code take effect immediately without reinstalling
- Required for running the package as a module with `python -m`
Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_api_key_here
```
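If you want to verify the key is picked up, here is a minimal sketch using the `python-dotenv` package (an assumption; the project may load the variable differently):

```python
# Sketch only: assumes the key is read via python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("Key loaded:", "OPENAI_API_KEY" in os.environ)
```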
Run the scraper with a URL from the `src` directory:

```bash
cd src
python main.py https://docs.example.com
```

Available options:

- `-o, --output`: Output directory (default: `output_docs`)
- `-m, --max-pages`: Maximum pages to scrape (default: 1000)
- `-c, --concurrent`: Number of concurrent pages to scrape (default: 1)
Example with all options:

```bash
python main.py https://docs.example.com -o my_docs -m 500 -c 2
```

If you get a `ModuleNotFoundError`, make sure you:
- Have run `pip install -e .` from the project root
- Are running the command from the `src` directory
The crawler accepts the following parameters:
- `base_url`: The starting URL to crawl
- `output_dir`: Directory where scraped docs will be saved
- `max_pages`: Maximum number of pages to crawl
- `max_concurrent_pages`: Number of concurrent pages to process
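For programmatic use, a minimal sketch of how these parameters might be wired up; the `DocCrawler` class name, module path, and `crawl()` method are assumptions, not confirmed by this README:

```python
# Hypothetical entry point -- adjust the import and class name to match
# the actual source; only the four parameters above come from this README.
from crawler import DocCrawler

crawler = DocCrawler(
    base_url="https://docs.example.com",  # starting URL to crawl
    output_dir="output_docs",             # where scraped docs are saved
    max_pages=1000,                       # crawl budget
    max_concurrent_pages=1,               # concurrent pages to process
)
crawler.crawl()
```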
Requirements:

- Python 3.8+
- Chrome/Chromium browser (for Selenium)
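Because the crawler drives Chrome through Selenium, a headless setup along these lines is what it needs to be able to do at runtime (illustrative sketch; the actual driver configuration lives in the source):

```python
# Illustrative headless Chrome check via Selenium 4; the project's real
# driver setup may differ.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a window
driver = webdriver.Chrome(options=options)
driver.get("https://docs.example.com")
print(driver.title)
driver.quit()
```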