Solve Cloudflare Turnstile in Scrapy
Get past a Cloudflare Turnstile challenge in a Scrapy spider using the CaptchaSonic Python SDK: solve the token once, then reuse the resulting session cookie across paginated requests.
A Scrapy spider needs to scrape a site fronted by Cloudflare Turnstile. This recipe solves the token via CaptchaSonic and submits the protected form so the spider can crawl the gated pages.
NOTE
Turnstile tokens are single-use and tied to the IP that requested them. Pass a proxy= so the SDK solves through the same egress IP your Scrapy spider uses; otherwise Cloudflare may reject the token.
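One way to keep the egress IP consistent is to read a single proxy URL and hand it to both the solver and every Scrapy request. The sketch below assumes the proxy URL lives in the SPIDER_PROXY environment variable and relies on Scrapy's built-in HttpProxyMiddleware, which reads request.meta["proxy"]. The helper name proxy_meta is made up for illustration.

```python
import os


def proxy_meta(env_var: str = "SPIDER_PROXY") -> dict:
    """Return a meta dict that routes a scrapy.Request through the proxy.

    Hypothetical helper: Scrapy's HttpProxyMiddleware honours meta["proxy"].
    Returns an empty dict when the variable is unset, so requests go direct.
    """
    proxy = os.environ.get(env_var)
    return {"proxy": proxy} if proxy else {}
```

In a spider this would be passed as meta=proxy_meta() on each request, including the form submission, while the same SPIDER_PROXY value goes to the solver's proxy= argument.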
What you'll build
A Scrapy spider that posts a Turnstile-protected form, receives the cookies that gate the rest of the site, and crawls the protected pages. ~10 minutes.
Setup
pip install scrapy captchasonic
export CAPTCHASONIC_API_KEY=sonic_xxx
You need the page's Turnstile sitekey (in <div class="cf-turnstile" data-sitekey="…">) and the URL of the form it gates.
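To confirm which sitekey a page exposes before wiring up the spider, the markup can be checked with the standard library alone. This is a minimal sketch: the class and function names are illustrative, and the sitekey in the usage line is a placeholder rather than a real key.

```python
from html.parser import HTMLParser


class SitekeyFinder(HTMLParser):
    """Collect data-sitekey values from elements carrying the cf-turnstile class."""

    def __init__(self):
        super().__init__()
        self.sitekeys = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Valueless attributes arrive as None, so fall back to an empty string.
        classes = (attr_map.get("class") or "").split()
        if "cf-turnstile" in classes and attr_map.get("data-sitekey"):
            self.sitekeys.append(attr_map["data-sitekey"])


def find_sitekey(html: str):
    """Return the first Turnstile sitekey found in the markup, or None."""
    finder = SitekeyFinder()
    finder.feed(html)
    return finder.sitekeys[0] if finder.sitekeys else None
```

For example, find_sitekey('<div class="cf-turnstile" data-sitekey="0xPLACEHOLDER"></div>') picks out the placeholder value. If the widget is injected by JavaScript, the attribute will not be in the raw HTML and this check returns None.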
The recipe
import os

import scrapy
from captchasonic import CaptchaSonic


class GatedSpider(scrapy.Spider):
    name = "gated"
    start_urls = ["https://example.com/protected-form"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = CaptchaSonic(os.environ["CAPTCHASONIC_API_KEY"])

    def parse(self, response):
        sitekey = response.css(".cf-turnstile::attr(data-sitekey)").get()
        if not sitekey:
            self.logger.error("no turnstile widget on %s", response.url)
            return

        # 1. Solve the token (auto-polls; ~6-12 s typical).
        result = self.solver.solve_turnstile(
            website_url=response.url,
            website_key=sitekey,
            # Solve through the same proxy your spider uses; the token must
            # be IP-coherent for Cloudflare to accept it.
            proxy=os.environ.get("SPIDER_PROXY"),
        )
        token = result["token"]

        # 2. Submit the form with the token attached.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"cf-turnstile-response": token},
            callback=self.after_pass,
        )

    def after_pass(self, response):
        # Cookies set on this response will be reused for the rest of the crawl.
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(href, self.parse_item)

    def parse_item(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }
The solver call returns a Turnstile token that you submit as the form's cf-turnstile-response field. Once the form accepts it, Cloudflare's cf_clearance cookie is set and Scrapy reuses it on every following request in the same session.
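A quick way to see whether the form response actually set the cookie is to inspect its Set-Cookie headers. The helper below is a sketch using only the standard library. The name has_clearance is illustrative, and it accepts the bytes values that Scrapy's response.headers.getlist("Set-Cookie") returns.

```python
from http.cookies import SimpleCookie


def has_clearance(set_cookie_headers) -> bool:
    """Return True if any Set-Cookie header defines a cf_clearance cookie.

    Accepts bytes or str header values.
    """
    for raw in set_cookie_headers:
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        cookie = SimpleCookie()
        cookie.load(text)
        if "cf_clearance" in cookie:
            return True
    return False
```

Inside after_pass this could be called as has_clearance(response.headers.getlist("Set-Cookie")). If the form submission redirects, the cookie may be set on an intermediate response, so treat a False result as a prompt to turn on Scrapy's COOKIES_DEBUG setting rather than as proof that the solve failed.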
Common pitfalls
- cf_clearance cookie not set after form submit. Cloudflare requires the token-bearing request to come from the same IP and User-Agent that solved it. In settings.py, pin USER_AGENT and route both the SDK call (proxy=) and Scrapy (HTTPPROXY_ENABLED) through the same proxy.
- Hitting the rate limit. MinuteLimitExceededError carries a retry_after (seconds). Hook it into DownloaderMiddleware.process_response to back off cleanly.
- Solving for every page. You only need to solve once per session: the form's response sets cf_clearance, which Scrapy will carry on subsequent requests via its cookie jar.
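The rate-limit pitfall can also be handled by wrapping the solver call itself rather than going through a downloader middleware. The sketch below takes that simpler route. RateLimited is a stand-in exception defined here only so the example runs on its own; in a real spider you would pass the SDK's MinuteLimitExceededError as rate_limit_error, imported from wherever the SDK exposes it. The function name and its parameters are assumptions, not part of the SDK.

```python
import time


class RateLimited(Exception):
    """Stand-in for the SDK's MinuteLimitExceededError (hypothetical shape)."""

    def __init__(self, retry_after: float):
        super().__init__(f"retry after {retry_after}s")
        self.retry_after = retry_after


def solve_with_backoff(solve, rate_limit_error=RateLimited,
                       max_attempts=3, sleep=time.sleep):
    """Call solve(); on a rate-limit error wait retry_after seconds and retry.

    Re-raises the error once max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return solve()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise
            # Fall back to 1 s if the exception has no retry_after attribute.
            sleep(getattr(exc, "retry_after", 1))
```

In the spider, the solver call would be wrapped as solve_with_backoff(lambda: self.solver.solve_turnstile(...), rate_limit_error=MinuteLimitExceededError). Note that time.sleep blocks Scrapy's event loop, which is tolerable here only because the solve happens once per session before the crawl fans out.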
See also
- Python SDK: Cloudflare Turnstile
- Capability Matrix: token field for solve_turnstile is result["token"]
- Configuration: polling/timeout tuning for headless crawlers