
Solve Cloudflare Turnstile in Scrapy

Get past a Cloudflare Turnstile challenge in a Scrapy spider using the CaptchaSonic Python SDK: solve the token once, then reuse the resulting session cookies across paginated requests.

A Scrapy spider needs to scrape a site fronted by Cloudflare Turnstile. This recipe solves the token via CaptchaSonic and submits the protected form so the spider can crawl the gated pages.

NOTE

Turnstile tokens are single-use and tied to the IP that requested them. Pass proxy= so the SDK solves through the same egress IP your Scrapy spider uses; otherwise Cloudflare may reject the token.
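
The Scrapy side of that IP-coherence requirement can be sketched as below. This is a minimal helper, not part of the SDK: proxy_meta is a hypothetical name, and it assumes the SPIDER_PROXY environment variable (the one the recipe passes to proxy=) holds a proxy URL in a format both the SDK and Scrapy accept. Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].

```python
import os


def proxy_meta(env=os.environ):
    """Build a Request.meta dict that points Scrapy at the solver's proxy.

    Assumes SPIDER_PROXY holds the proxy URL. Returns an empty dict when the
    variable is unset or empty, so requests go out directly.
    """
    proxy = env.get("SPIDER_PROXY")
    return {"proxy": proxy} if proxy else {}


# Hypothetical use inside a spider:
#   yield scrapy.Request(url, meta=proxy_meta(), callback=self.parse)
```

Requests generated from start_urls do not carry this meta by default, so a spider relying on it would likely need to build its first request itself. HttpProxyMiddleware also honours the standard http_proxy / https_proxy environment variables, which may be simpler when every request should use one proxy.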


What you'll build

A Scrapy spider that posts a Turnstile-protected form, receives the cookies that gate the rest of the site, and crawls the protected pages. ~10 minutes.


Setup

pip install scrapy captchasonic
export CAPTCHASONIC_API_KEY=sonic_xxx

You need the page's Turnstile sitekey (in <div class="cf-turnstile" data-sitekey="…">) and the URL of the form it gates.


The recipe

import os
import scrapy
from captchasonic import CaptchaSonic

class GatedSpider(scrapy.Spider):
    name = "gated"
    start_urls = ["https://example.com/protected-form"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.solver = CaptchaSonic(os.environ["CAPTCHASONIC_API_KEY"])

    def parse(self, response):
        sitekey = response.css(".cf-turnstile::attr(data-sitekey)").get()
        if not sitekey:
            self.logger.error("no turnstile widget on %s", response.url)
            return

        # 1. Solve the token (auto-polls; ~6–12 s typical).
        result = self.solver.solve_turnstile(
            website_url=response.url,
            website_key=sitekey,
            # Solve through the same proxy your spider uses: the token must
            # be IP-coherent for Cloudflare to accept it.
            proxy=os.environ.get("SPIDER_PROXY"),
        )
        token = result["token"]

        # 2. Submit the form with the token attached.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"cf-turnstile-response": token},
            callback=self.after_pass,
        )

    def after_pass(self, response):
        # Cookies set on this response will be reused for the rest of the crawl.
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(href, self.parse_item)

    def parse_item(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }

solve_turnstile returns a result whose token you submit as the form's cf-turnstile-response field. Once the form accepts it, Cloudflare sets the cf_clearance cookie, and Scrapy's cookie middleware resends it on every subsequent request in the same session.
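
To confirm that the form submit actually produced the cookie, a check along these lines could be called from after_pass. It is a sketch under assumptions: cookie_names and has_clearance are hypothetical helpers, and they only inspect the Set-Cookie headers handed to them, so a cookie set on an intermediate redirect response would not be seen.

```python
def cookie_names(set_cookie_headers):
    """Return the cookie names found in a list of raw Set-Cookie header values.

    Accepts bytes (as Scrapy's response.headers.getlist returns) or str.
    Only the leading name=value pair of each header is examined.
    """
    names = set()
    for raw in set_cookie_headers:
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        first_pair = text.split(";", 1)[0]
        name, sep, _value = first_pair.partition("=")
        if sep:  # skip malformed headers that have no "="
            names.add(name.strip())
    return names


def has_clearance(set_cookie_headers):
    """True if any of the given Set-Cookie headers sets cf_clearance."""
    return "cf_clearance" in cookie_names(set_cookie_headers)


# Hypothetical use in after_pass:
#   if not has_clearance(response.headers.getlist("Set-Cookie")):
#       self.logger.warning("no cf_clearance on %s", response.url)
```

For a broader view, setting COOKIES_DEBUG = True in settings.py makes Scrapy log the cookies sent and received on each request.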


Common pitfalls

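  • cf_clearance cookie not set after form submit. Cloudflare requires the token-bearing request to come from the same IP and User-Agent that solved it. Pin USER_AGENT in settings.py, and route both the SDK call (proxy=) and Scrapy through the same proxy. On the Scrapy side, HTTPPROXY_ENABLED must stay on, and the proxy itself is set via request.meta["proxy"] or the http_proxy / https_proxy environment variables.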
  • Hitting the rate limit. MinuteLimitExceededError carries a retry_after (seconds). Hook it into DownloaderMiddleware.process_response to back off cleanly.
  • Solving for every page. You only need to solve once per session: the form's response sets cf_clearance, which Scrapy carries on subsequent requests via its cookie jar.
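
The rate-limit pitfall can also be handled at the call site. The sketch below wraps the solver call directly rather than using a downloader middleware, which is a different technique from the one the bullet names. The exception class is a local stand-in that mirrors only what is stated above (a retry_after in seconds); in a real spider you would catch the SDK's own MinuteLimitExceededError. solve_with_backoff is a hypothetical helper name.

```python
import time


class MinuteLimitExceededError(Exception):
    """Stand-in for the SDK exception; assumed to expose retry_after (seconds)."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def solve_with_backoff(solve, max_attempts=3, sleep=time.sleep):
    """Call solve(), waiting retry_after seconds after each rate-limit error.

    Re-raises the error once max_attempts calls have all been rate limited.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return solve()
        except MinuteLimitExceededError as exc:
            if attempt == max_attempts:
                raise
            sleep(exc.retry_after)


# Hypothetical use in parse():
#   result = solve_with_backoff(lambda: self.solver.solve_turnstile(
#       website_url=response.url, website_key=sitekey))
```

time.sleep blocks Scrapy's event loop while it waits, which is usually tolerable for a single solve at the start of a crawl but would stall other in-flight requests in a busier spider.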
