Listing of web crawlers that do not support compression

If you are the author of any of these spiders, then please add support for content compression when you crawl the web. This will save you bandwidth on your crawling system, and it saves bandwidth on the servers that you crawl.

Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then the addition of a single line of code will enable compression support.

$ua->default_header('Accept-Encoding' => 'gzip');
You then need to make sure that you always use 'decoded_content' (rather than 'content') when reading the response object, so that the body is transparently decompressed for you.

For other languages, all you need to do is add

Accept-Encoding: gzip

to the HTTP requests that you send, and then be prepared to handle a 'Content-Encoding: gzip' header in the response -- i.e. gunzip the body before parsing it.
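As a sketch of what that looks like in practice (Python here, purely for illustration -- the helper name decode_body is my own, not from any crawler), the client advertises gzip support in its request headers and gunzips the body only when the server says it actually compressed it:

```python
import gzip

# Request headers a compression-aware crawler would send.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}

def decode_body(headers, body):
    """Return the response body, decompressing it if the server
    sent 'Content-Encoding: gzip'."""
    # HTTP header names are case-insensitive, so normalise before checking.
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("content-encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body

# Simulated exchange: a server that compressed the page, and one that didn't.
page = b"<html>hello</html>"
assert decode_body({"Content-Encoding": "gzip"}, gzip.compress(page)) == page
assert decode_body({}, page) == page
```

The key point is the conditional: a server is free to ignore Accept-Encoding and reply uncompressed, so the crawler must check the Content-Encoding response header rather than blindly gunzipping.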

Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp, to name but two. Since I started prodding crawler implementors, a couple have added compression support (one within hours), and another reported that the missing support was a bug that would be fixed shortly.

Crawlers that account for more than 5% of the total (uncompressed) crawling activity are marked in bold below.

Crawler                                                          Last IP used
curl/7.54.0
DomainStatsBot/1.0 (                                             gladstonefamily.net 148.251.121.91
facebookexternalhit/1.1 (+                                       blog.gladstonefamily.net 173.252.83.30
fasthttp
fasthttp
Mozilla/5.0 (compatible; DotBot/1.2; +;                          pond.gladstonefamily.net 216.244.66.194
Mozilla/5.0 (compatible; SeekportBot; +                          pond.gladstonefamily.net 45.32.88.65
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36    138.246.253.24
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20070725 Firefox/    pond1.gladstonefamily.net 65.109.140.152

Comments, problems, etc. to Philip Gladstone

Last modified Sunday, 19 November 2006