This project aims to provide a practical solution for developers to bypass the Cloudflare Challenge by combining the crawling capabilities of Crawl4AI with the CAPTCHA and anti-bot solving services of CapSolver. When scraping web data, anti-bot mechanisms like Cloudflare often become obstacles. This solution uses API-level integration to simulate real browser behavior, ensuring crawling tasks run smoothly.
As a developer, I deeply understand the challenges encountered during data scraping. The Cloudflare Challenge is particularly complex, combining techniques such as browser fingerprinting, User-Agent validation, and JavaScript execution to identify and block automated traffic. This project is my exploration of an efficient strategy to address this pain point.
- CapSolver `AntiCloudflareTask` Integration: Leverage CapSolver's specialized anti-Cloudflare task type to obtain challenge solutions (token, cookies, User-Agent).
- Crawl4AI Browser Configuration: Precisely configure Crawl4AI's browser environment based on the solution returned by CapSolver, ensuring consistency with the environment where the challenge was solved.
- Seamless Cloudflare Bypass: Enable Crawl4AI to access Cloudflare-protected websites like a real user.
- Python Implementation: Provide clear, executable Python code examples.
- Request a CapSolver Solution: Before launching Crawl4AI, call the CapSolver API using the `AntiCloudflareTask` type, providing the target website URL, a proxy (if needed), and a User-Agent that matches the one CapSolver uses internally.
- Obtain Challenge Credentials: CapSolver processes the challenge and returns a `solution` object containing a `token`, `cookies`, and the recommended `userAgent`.
- Configure the Crawl4AI Browser: Use the `token`, `cookies`, and `userAgent` obtained from CapSolver to configure Crawl4AI's `BrowserConfig`, ensuring Crawl4AI's browser instance matches the environment in which the challenge was solved.
- Execute the Crawling Task: Crawl4AI then executes its `arun` method with this specially configured browser, successfully accessing the target URL without triggering the Cloudflare Challenge again.
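The first step can also be driven through CapSolver's raw HTTP API (`createTask` followed by `getTaskResult`) instead of the Python SDK. Below is a minimal sketch of building and submitting the task; the `build_task_payload` helper is illustrative and not part of either library, and the network call is left commented out:

```python
import json
import urllib.request

CAPSOLVER_API = "https://api.capsolver.com"


def build_task_payload(client_key, website_url, proxy=None, user_agent=None):
    """Build the createTask request body for an AntiCloudflareTask."""
    task = {"type": "AntiCloudflareTask", "websiteURL": website_url}
    if proxy:
        task["proxy"] = proxy
    if user_agent:
        task["userAgent"] = user_agent
    return {"clientKey": client_key, "task": task}


def create_task(payload):
    """POST the payload to /createTask and return the parsed JSON response."""
    req = urllib.request.Request(
        CAPSOLVER_API + "/createTask",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_task_payload(
    "CAP-XXX", "https://example.com", proxy="http://127.0.0.1:13120"
)
print(payload["task"]["type"])  # AntiCloudflareTask
# result = create_task(payload)  # then poll /getTaskResult with result["taskId"]
```

The SDK's `capsolver.solve` wraps this create-then-poll cycle for you, which is why the full example below stays shorter.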
Before running the code, please ensure you have installed the following libraries:

```bash
pip install capsolver crawl4ai
```

Please replace `api_key` with your CapSolver API key. You can obtain it from the CapSolver Dashboard.

```python
# TODO: set your config
api_key = "CAP-XXX"  # your api key of capsolver
```

The following Python code demonstrates how to integrate CapSolver's API with Crawl4AI to solve the Cloudflare Challenge. This example targets a news article page protected by Cloudflare.
```python
import asyncio
import time

import capsolver
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode

# TODO: set your config
api_key = "CAP-XXX"  # your api key of capsolver
site_url = "https://www.tempo.co/hukum/polisi-diduga-salah-tangkap-pelajar-di-magelang-yang-dituduh-perusuh-demo-2070572"  # page url of your target site
captcha_type = "AntiCloudflareTask"  # type of your target captcha
api_proxy = "http://127.0.0.1:13120"  # If you need a proxy for CapSolver, configure it here

capsolver.api_key = api_key

user_data_dir = "./crawl4ai_/browser-profile/Default1493"  # Persistent browser profile directory
# or
# cdp_url = "ws://localhost:xxxx"  # If connecting to an existing CDP session


async def main():
    print("solver token start")
    start_time = time.time()

    # get cloudflare token using capsolver sdk
    solution = capsolver.solve({
        "type": captcha_type,
        "websiteURL": site_url,
        "proxy": api_proxy,
        # Important: use a User-Agent that CapSolver supports and that matches your intended browser
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36",
    })
    token_time = time.time()
    print(f"solver token: {token_time - start_time:.2f} s")

    # CapSolver may return cookies as a dict or a list; normalize to a list of dicts for Crawl4AI
    cookies = solution.get("cookies", [])
    if isinstance(cookies, dict):
        cookies = [
            {"name": name, "value": value, "url": site_url}
            for name, value in cookies.items()
        ]
    elif not isinstance(cookies, list):
        cookies = []

    token = solution["token"]
    print("challenge token:", token)

    # Configure the Crawl4AI browser with the solution from CapSolver
    browser_config = BrowserConfig(
        verbose=True,  # Enable verbose logging for debugging
        headless=False,  # Set to True for production scraping without a visible browser UI
        use_persistent_context=True,  # Use a persistent browser context
        user_data_dir=user_data_dir,  # User data directory for profile persistence
        # cdp_url=cdp_url,  # Uncomment if connecting to an existing CDP session
        user_agent=solution["userAgent"],  # Use the User-Agent recommended by CapSolver
        cookies=cookies,  # Inject the cookies provided by CapSolver
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=site_url,
            cache_mode=CacheMode.BYPASS,  # Bypass cache to ensure fresh content
            session_id="session_captcha_test",  # Unique session ID for this crawl
        )
        print(result.markdown[:500])  # Print first 500 characters of the scraped markdown


if __name__ == "__main__":
    asyncio.run(main())
```

- CapSolver SDK Call: The `capsolver.solve` method is central here, using the `AntiCloudflareTask` type. It requires `websiteURL`, `proxy` (optional), and a specific `userAgent`. CapSolver processes the challenge and returns a `solution` object containing a `token`, `cookies`, and the `userAgent` that was used to solve the challenge.
- Crawl4AI Browser Configuration: The `BrowserConfig` for Crawl4AI is set up using the information from CapSolver's solution. This includes `user_agent` and `cookies`, so that the Crawl4AI browser instance matches the conditions under which the Cloudflare Challenge was solved. `user_data_dir` is also specified to maintain a consistent browser profile.
- Crawler Execution: Crawl4AI then executes its `arun` method with this carefully configured `browser_config`, allowing it to access the target URL without triggering the Cloudflare Challenge again.
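If the injected token or cookies have expired, Cloudflare may serve its interstitial page instead of the target content. One way to catch this is a small heuristic check on the returned HTML before trusting the result; the marker strings below are common phrases from Cloudflare challenge pages, not an official API:

```python
def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic: detect common Cloudflare interstitial markers in a page."""
    markers = (
        "just a moment",          # <title> of the JS challenge page
        "cf-chl",                 # challenge-related element ids/classes
        "checking your browser",  # legacy interstitial text
    )
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


# Example usage after the crawl (result.html is the raw page source):
# if looks_like_cloudflare_challenge(result.html):
#     raise RuntimeError("Challenge page served; the token/cookies may be stale")
print(looks_like_cloudflare_challenge("<title>Just a moment...</title>"))  # True
print(looks_like_cloudflare_challenge("<h1>Article body</h1>"))            # False
```

When the check fires, re-requesting a fresh solution from CapSolver and rebuilding the `BrowserConfig` is usually enough to recover.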
Contributions of any kind are welcome! If you have better methods or find bugs, please feel free to submit an Issue or Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.