Anti-Scraping Solutions

Current anti-scraping defenses, and the techniques used to get past them, generally fall into two main categories: the browser layer (application layer) and the network layer (TLS/TCP fingerprinting).

Here are widely used methods and open-source libraries on GitHub for each layer, as of 2025:

1. Browser Automation (Solving JS Challenges & Captchas)

Used when sites require full browser rendering (e.g., Cloudflare Turnstile, ReCaptcha, heavily dynamic content).

  • DrissionPage (Python) - Currently Recommended
    • Why: It combines browser automation (like Selenium/Playwright) with direct HTTP requests in a single library. It drives the browser over the Chrome DevTools Protocol (CDP) directly instead of going through WebDriver, which avoids many WebDriver detection vectors.
    • GitHub: g1879/DrissionPage
  • Undetected Chromedriver / Nodriver (Python)
    • Why: Undetected Chromedriver patches the Chromedriver binary to remove "leaks" that reveal automation (like the cdc_ variable). Nodriver is its newer successor project; it drops Chromedriver and Selenium and talks to the browser over CDP directly.
    • GitHub: ultrafunkamsterdam/undetected-chromedriver or ultrafunkamsterdam/nodriver
  • Puppeteer Extra Plugin Stealth (Node.js)
    • Why: The de facto standard for Node.js scraping. It patches browser properties (like navigator.webdriver) so the session looks like a regular, non-automated browser.
    • GitHub: berstend/puppeteer-extra
  • FlareSolverr (Docker/Proxy)
    • Why: A proxy server that automatically launches a browser to solve Cloudflare challenges and then returns the cookies to your script. Great for easy integration.
    • GitHub: FlareSolverr/FlareSolverr
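
The FlareSolverr flow described above (send it a URL, get back the cookies the browser earned) can be sketched with the standard library alone. This is a minimal sketch, not a full client: it assumes FlareSolverr is running locally on its default port 8191, and the helper names build_payload, extract_session, and solve are hypothetical names chosen for illustration.

```python
import json
import urllib.request

# Assumption: FlareSolverr is running locally on its default port (8191).
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(target_url, max_timeout_ms=60000):
    """Build the JSON body for a FlareSolverr `request.get` command."""
    return {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}


def extract_session(reply):
    """Pull the cookies and the User-Agent out of a FlareSolverr reply."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr failed: {reply.get('message')}")
    solution = reply["solution"]
    # Flatten the browser-style cookie list into a simple name -> value dict.
    cookies = {c["name"]: c["value"] for c in solution["cookies"]}
    return cookies, solution["userAgent"]


def solve(target_url):
    """Ask FlareSolverr to load the page and return (cookies, user_agent)."""
    body = json.dumps(build_payload(target_url)).encode("utf-8")
    req = urllib.request.Request(
        FLARESOLVERR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=90) as resp:
        return extract_session(json.load(resp))
```

When reusing the returned cookies in your own HTTP client, send the same User-Agent that FlareSolverr reports, since clearance cookies are generally tied to the browser identity that obtained them.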

2. High-Performance Requests (TLS Fingerprinting)

Used for fast, large-scale scraping without a browser. Standard libraries like Python's requests or Node's axios are easily blocked because their TLS handshake fingerprint (JA3/JA4) differs from that of a real Chrome or Safari browser.
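
To make the fingerprint idea concrete, here is a minimal sketch of how a JA3 hash is derived from the fields of a TLS ClientHello: five fields are joined with commas, the values inside each field are joined with dashes, and the resulting string is hashed with MD5. The function names are hypothetical, and the numeric values in the usage example are illustrative only, not a capture from any real client.

```python
import hashlib

# GREASE values (RFC 8701) are placeholder IDs such as 0x0A0A, 0x1A1A, ... 0xFAFA
# that some browsers inject; JA3 drops them so the hash stays stable.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: version,ciphers,extensions,curves,point_formats."""
    parts = [str(version)]
    for values in (ciphers, extensions, curves, point_formats):
        # Values keep the order in which the client sent them.
        parts.append("-".join(str(v) for v in values if v not in GREASE))
    return ",".join(parts)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Illustrative values only: two clients offering the same ciphers
    # in a different order produce different fingerprints.
    client_a = ja3_hash(771, [4865, 4866, 4867], [0, 10, 11], [29, 23], [0])
    client_b = ja3_hash(771, [4867, 4866, 4865], [0, 10, 11], [29, 23], [0])
    print(client_a != client_b)
```

Because the hash depends on the exact values and their order, an HTTP library whose TLS stack offers a different cipher or extension list than Chrome cannot match Chrome's fingerprint just by changing its User-Agent header.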

  • Curl Impersonate & Curl CFFI (C / Python / Node) - Best for Speed
    • Why: It modifies the low-level curl library to mimic the exact TLS handshake of Chrome, Firefox, or Safari. curl_cffi is the Python binding.
    • GitHub: yifeikong/curl_cffi (Python) or lwthiker/curl-impersonate (Core)
  • TLS Client (Go / Python)
    • Why: A library specifically designed to spoof TLS fingerprints (JA3).
    • GitHub: bogdanfinn/tls-client

Summary Recommendation

| Scenario | Recommended Tool | GitHub Keyword |
| --- | --- | --- |
| Complex Logins / Captchas | DrissionPage (Python) or Nodriver | g1879/DrissionPage |
| High Speed / API Scraping | curl_cffi (Python) | yifeikong/curl_cffi |
| Cloudflare "Under Attack" | FlareSolverr | FlareSolverr/FlareSolverr |
| Node.js Environment | Puppeteer Stealth | berstend/puppeteer-extra |

If you need to simulate a login for a crawler, I recommend a two-stage approach: use DrissionPage or Puppeteer with Stealth to handle the initial login interactively, save the resulting cookies, and then pass those cookies to a faster HTTP client (like curl_cffi) for the actual data scraping.
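
The cookie hand-off can be sketched as follows. This is a minimal sketch under stated assumptions: the browser cookies are assumed to have been exported as a list of dicts with "name" and "value" keys (the shape Puppeteer's page.cookies() and CDP's Network.getCookies use), the helper names are hypothetical, and the fetch step assumes curl_cffi is installed with a version that accepts impersonate="chrome".

```python
import json
from pathlib import Path


def save_cookies(browser_cookies, path):
    """Persist the cookie list exported from the logged-in browser session."""
    Path(path).write_text(json.dumps(browser_cookies), encoding="utf-8")


def load_cookie_dict(path):
    """Load the saved cookies as the name -> value dict HTTP clients expect."""
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    return {c["name"]: c["value"] for c in cookies}


def fetch(url, cookies):
    """Fetch a page with the saved session, using a browser-like TLS handshake."""
    # Imported lazily so the helpers above work without curl_cffi installed.
    from curl_cffi import requests

    # Assumption: "chrome" is a valid impersonation target in the installed
    # curl_cffi version; older releases need a pinned name such as "chrome110".
    return requests.get(url, cookies=cookies, impersonate="chrome")
```

A typical run would call save_cookies once after the interactive login, then call load_cookie_dict and fetch in the scraping loop, logging in again through the browser when the session expires.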