Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the addition of a single line of code will enable compression support:

    $ua->default_header('Accept-Encoding' => 'gzip');

You then need to make sure that you always refer to 'decoded_content' when dealing with the response object.
For other languages, all you need to do is to add

    Accept-Encoding: gzip

to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' header in the response.
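
As a rough illustration of the "other languages" case, here is a minimal sketch in Python using only the standard library. The function names `fetch` and `decode_body` are made up for this example, and it assumes the server uses gzip (or nothing) as the only content coding; a real spider would also want error handling and retries.

```python
import gzip
import urllib.request


def decode_body(body, content_encoding):
    """Gunzip the body only if the server said it was gzip-encoded."""
    # Hypothetical helper: the server is free to ignore Accept-Encoding,
    # so the response header decides, not the request we sent.
    if content_encoding and content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body


def fetch(url, timeout=30):
    """Fetch a URL while advertising gzip support; return the decoded bytes."""
    # urllib does not decompress responses on its own, so we do it here.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        return decode_body(raw, response.headers.get("Content-Encoding"))
```

The point of checking the response header rather than assuming compression is that an uncompressed reply is still perfectly valid, and the spider has to cope with both.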
Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp do (to name but two). Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression was a bug, which would be fixed shortly.
Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.
Crawler | Host | Last IP used |
---|---|---|
DF Bot 1.0 | gladstonefamily.net | 192.29.97.49 |
DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot) | gladstonefamily.net | 136.243.59.237 |
masscan/1.0 (https://github.com/robertdavidgraham/masscan) | - | 159.65.205.62 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | blog.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | blog1.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | charon.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | gladstone.name | 216.244.66.228 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | pond.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | pond1.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | www.gladstone.name | 216.244.66.228 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | 73.253.74.102 | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | blog.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | blog1.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | c-73-253-74-102.hsd1.ma.comcast.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | charon.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | pond.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | pond1.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | pond1.gladstonefamily.net:8080 | 194.153.113.223 |
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/) | www.gladstonefamily.net | 206.253.224.14 |
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 | pond1.gladstonefamily.net | 135.181.198.164 |
Mozilla/5.0 (X11; Linux x86_64; rv:109) | pond1.gladstonefamily.net | 23.81.27.114 |