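Web crawlers help a site get indexed and promoted, but a flood of spiders can overload the site and bring it down. Nginx can be configured to limit the request rate of crawlers. In Nginx's conf.d configuration directory, add a file that defines the request rate limits, user-agent-rate-limit.conf.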
    # define two rules: hard and soft
    map $http_user_agent $rate_bot {
        default "";
        "~Googlebot" 1;
        "~Bingbot" 2;
        # could add more spiders
    }

    # 429: too many requests
    limit_req_status 429;

    # soft rate limit
    map $rate_bot $rate_bot_soft {
        default "";
        1 "ratebot_soft";
    }
    limit_req_zone $rate_bot_soft zone=ratebot_soft:16m rate=5r/s;

    # hard rate limit
    map $rate_bot $rate_bot_hard {
        default "";
        2 "ratebot_hard";
    }
    limit_req_zone $rate_bot_hard zone=ratebot_hard:16m rate=1r/s;
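The definitions work because Nginx does not count requests whose limit key is empty: ordinary visitors map to "" and are never limited, while every request from a matched spider shares a single bucket for its tier. As a rough sketch of how the first map might be extended, the block below adds two more spiders. The Baiduspider and YandexBot entries and their assignment to the hard tier are assumptions for illustration and are not part of the original configuration. The `~*` prefix makes each match case-insensitive.

```nginx
# Sketch only: an extended version of the $rate_bot map.
# It would replace the original map, not sit alongside it.
map $http_user_agent $rate_bot {
    default          "";   # everyone else: empty key, not rate limited
    "~*Googlebot"    1;    # soft tier (5r/s)
    "~*Bingbot"      2;    # hard tier (1r/s)
    "~*Baiduspider"  2;    # assumption: placed in the hard tier
    "~*YandexBot"    2;    # assumption: placed in the hard tier
}
```

Since all spiders in one tier share the same key, the rate applies to the tier as a whole and not to each spider separately.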
Applying the rules
    server {
        # ...
        limit_req zone=ratebot_soft nodelay;
        limit_req zone=ratebot_hard nodelay;
        # ...
    }
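With `nodelay` and no `burst`, any request that exceeds the zone's rate is rejected immediately with the status set by `limit_req_status` (429 here). If that turns out to be too strict for crawlers that fetch pages in short bursts, one possible variant is to allow a small burst. The burst sizes below are assumed values chosen for illustration, not figures from the original post.

```nginx
# Sketch only: same zones as defined above, with an assumed burst allowance.
server {
    # ...
    # Up to 10 requests over the soft rate are served at once; further excess gets 429.
    limit_req zone=ratebot_soft burst=10 nodelay;
    # The hard tier gets a smaller allowance.
    limit_req zone=ratebot_hard burst=2 nodelay;
    # ...
}
```

One way to check the result is to send repeated requests with a spider's User-Agent header and confirm that responses switch to 429 once the limit is exceeded.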
References: