Anti-Scraping Solutions

Current anti-scraping defenses generally operate at two layers: the Browser Layer (Application Layer) and the Network Layer (TLS/TCP Fingerprinting).

Here are widely used methods and open-source libraries on GitHub for getting past each layer, as of 2025:

1. Browser Automation (Solving JS Challenges & CAPTCHAs)

Used when sites require full browser rendering (e.g., Cloudflare Turnstile, reCAPTCHA, heavily dynamic content).

DrissionPage (Python) - Currently Recommended
- Why: It combines browser automation (like Selenium/Playwright) with a requests-style mode that sends HTTP requests directly, without a browser. It drives the browser over the Chrome DevTools Protocol (CDP) rather than WebDriver, which avoids many WebDriver detection vectors.
- GitHub: g1879/DrissionPage
Undetected Chromedriver / Nodriver (Python)
- Why: undetected-chromedriver patches the Chromedriver binary to remove "leaks" that reveal automation (like the cdc_ variable). nodriver is its newer successor project; it drops Chromedriver and Selenium and talks to the browser over CDP directly.
- GitHub: ultrafunkamsterdam/undetected-chromedriver or ultrafunkamsterdam/nodriver
Puppeteer Extra Plugin Stealth (Node.js)
- Why: A widely used stealth plugin for Node.js scraping. It patches browser properties (like navigator.webdriver) that would otherwise reveal automation.
- GitHub: berstend/puppeteer-extra
FlareSolverr (Docker/Proxy)
- Why: A proxy server that automatically launches a browser to solve Cloudflare challenges and then returns the page content and cookies to your script. It exposes a plain HTTP API, so it is easy to integrate from any language.
- GitHub: FlareSolverr/FlareSolverr
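
As a rough illustration of the FlareSolverr integration described above, the sketch below builds a `request.get` command and extracts the cookies from the reply. It assumes a FlareSolverr instance listening on its default local address (`http://localhost:8191/v1`); the helper names `build_payload`, `extract_session`, and `solve` are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

# Assumption: FlareSolverr is running locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(target_url, max_timeout_ms=60000):
    """Build the JSON body for a FlareSolverr `request.get` command."""
    return {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}


def extract_session(reply):
    """Pull the cookies and user agent out of a decoded FlareSolverr reply."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr failed: {reply.get('message')}")
    solution = reply["solution"]
    cookies = {c["name"]: c["value"] for c in solution.get("cookies", [])}
    return cookies, solution.get("userAgent")


def solve(target_url):
    """Send the command to FlareSolverr (this performs a real HTTP request)."""
    body = json.dumps(build_payload(target_url)).encode("utf-8")
    req = urllib.request.Request(
        FLARESOLVERR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=90) as resp:
        return extract_session(json.load(resp))
```

Calling `solve("https://example.com")` would return a `(cookies, user_agent)` pair. Clearance cookies are generally tied to the user agent and IP address that solved the challenge, so later requests should reuse both.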

2. High-Performance Requests (TLS Fingerprinting)

Used for fast, large-scale scraping without a browser. Standard libraries like Python's requests or Node's axios are easily blocked because their TLS handshake fingerprint (JA3/JA4) differs from that of a real Chrome or Safari browser.

Curl Impersonate & Curl CFFI (C / Python / Node) - Best for Speed
- Why: curl-impersonate is a patched build of the low-level curl library that mimics the exact TLS handshake of Chrome, Firefox, or Safari. curl_cffi is the Python binding.
- GitHub: yifeikong/curl_cffi (Python) or lwthiker/curl-impersonate (Core)
TLS Client (Go / Python)
- Why: A library specifically designed to spoof TLS fingerprints (JA3).
- GitHub: bogdanfinn/tls-client
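
To make the fingerprinting idea concrete, here is a minimal sketch of how a JA3 fingerprint is derived: five ClientHello fields are written as decimal values, joined with dashes inside a field and commas between fields, and the resulting string is hashed with MD5. The function names and the sample values are illustrative assumptions, not a capture from any real client.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields into a JA3 string: commas between fields, dashes within."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(ja3):
    """Return the MD5 hex digest that servers compare against known clients."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical values: two clients offering the same ciphers in a different order.
client_a = ja3_string(771, [4865, 4866], [0, 23], [29, 23], [0])
client_b = ja3_string(771, [4866, 4865], [0, 23], [29, 23], [0])
print(client_a)
print(ja3_hash(client_a) == ja3_hash(client_b))
```

Because the order of ciphers and extensions feeds into the hash, an HTTP library with its own TLS stack produces a different fingerprint from a browser even when the headers are identical. That is the gap tools like curl_cffi and tls-client try to close.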

Summary Recommendation

| Scenario | Recommended Tool | GitHub Keyword |
| --- | --- | --- |
| Complex Logins / CAPTCHAs | DrissionPage (Python) or Nodriver | `g1879/DrissionPage` |
| High Speed / API Scraping | curl_cffi (Python) | `yifeikong/curl_cffi` |
| Cloudflare "Under Attack" | FlareSolverr | `FlareSolverr/FlareSolverr` |
| Node.js Environment | Puppeteer Stealth | `berstend/puppeteer-extra` |

If you need to simulate a login for a crawler, I recommend a two-stage approach: use DrissionPage or Puppeteer with Stealth to handle the initial login interactively, save the resulting cookies, and then pass those cookies to a faster HTTP client (like curl_cffi) for the actual data scraping.
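
The cookie hand-off described above could look roughly like the sketch below. It assumes the browser tool returns cookies as a list of records with `name` and `value` keys (the shape used by CDP-based tools); the helper names and the `cookies.json` path are hypothetical. The final function uses curl_cffi's requests-style API and imports it lazily, so the storage helpers work without it installed.

```python
import json
from pathlib import Path


def to_cookie_dict(browser_cookies):
    """Flatten browser-style cookie records into a {name: value} mapping."""
    return {c["name"]: c["value"] for c in browser_cookies}


def save_cookies(browser_cookies, path):
    """Persist the cookies after the interactive login has finished."""
    Path(path).write_text(json.dumps(to_cookie_dict(browser_cookies)), encoding="utf-8")


def load_cookies(path):
    """Read the saved cookies back for the fast HTTP client."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def fetch_with_cookies(url, cookies):
    """Fetch a page with a browser-like TLS fingerprint (requires curl_cffi)."""
    from curl_cffi import requests  # lazy import: optional third-party dependency

    return requests.get(url, cookies=cookies, impersonate="chrome")
```

A typical run would call `save_cookies(cookies_from_browser, "cookies.json")` once after logging in, then `fetch_with_cookies(url, load_cookies("cookies.json"))` for each page. Flattening to name/value pairs drops the domain, path, and expiry of each cookie, which is a simplification that only suits a single-site crawl.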