Filtering Requests in Nginx by Client Identifier

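Web crawlers help a site get indexed and promoted, but a flood of spiders can overload the server and take the site down. Nginx can be configured to throttle crawler requests based on the User-Agent header. Add a rate-limit definition file, user-agent-rate-limit.conf, to the conf.d directory of the Nginx configuration: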

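# Define two rate-limit rules: soft and hard

# Classify the client by its User-Agent header.
# "~*" makes the regex case-insensitive; Bing's crawler identifies
# itself as lowercase "bingbot", so a case-sensitive "~Bingbot" would miss it.
map $http_user_agent $rate_bot {
    default "";
    "~*Googlebot" 1;
    "~*bingbot" 2;

    # more spiders can be added here
}

# 429: Too Many Requests
limit_req_status 429;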

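# Soft rate limit: applies to clients classified as 1.
# Requests whose key is empty are not counted against the zone,
# so ordinary visitors are never throttled.
map $rate_bot $rate_bot_soft {
    default "";
    1 "ratebot_soft";
}

limit_req_zone $rate_bot_soft zone=ratebot_soft:16m rate=5r/s;

# Hard rate limit: applies to clients classified as 2.
map $rate_bot $rate_bot_hard {
    default "";
    2 "ratebot_hard";
}

limit_req_zone $rate_bot_hard zone=ratebot_hard:16m rate=1r/s;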

Applying the rules

server {

    # ...
    # Both zones are checked; a request is only counted in the zone
    # whose key is non-empty for it.
    limit_req zone=ratebot_soft nodelay;
    limit_req zone=ratebot_hard nodelay;
    # ...

}
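
With no burst parameter, the burst size defaults to 0, so any request that arrives faster than the configured rate is rejected immediately with 429. If that proves too strict for crawlers that fetch pages in short spikes, one possible variant is to allow a small burst. The following is only a sketch: the server_name, the choice of location, and the burst values are assumptions for illustration, not part of the original setup.

```nginx
server {
    listen 80;
    server_name example.com;   # placeholder host name (assumption)

    location / {
        # Googlebot: 5 r/s sustained; up to 10 excess requests are
        # served immediately, anything beyond that gets 429.
        limit_req zone=ratebot_soft burst=10 nodelay;

        # Bingbot: 1 r/s sustained; up to 2 excess requests are
        # served immediately, anything beyond that gets 429.
        limit_req zone=ratebot_hard burst=2 nodelay;
    }
}
```

Because each zone is keyed on a constant string rather than on the client address, the limit is shared by all requests from that crawler as a whole, regardless of which IP they come from.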
