Listing of web crawlers that do not support compression

If you are the author of any of these spiders, then please add support for content compression when you crawl the web. This saves bandwidth both on your crawling system and on the servers that you crawl.

Adding compression support can be very simple -- if your spider is coded in Perl using LWP::UserAgent, then a single line of code enables it:

    $ua->default_header('Accept-Encoding' => 'gzip');

You then need to make sure that you always read the body through 'decoded_content' (rather than 'content') when dealing with the response object.

For other languages, all you need to do is to add

    Accept-Encoding: gzip

to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' in the response.
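
As a rough illustration of that recipe in another language, here is a minimal Python sketch using only the standard library (urllib.request and gzip). The function names build_request, decode_body and fetch are made up for this example and do not come from any particular crawler; it only handles gzip and assumes the whole body fits in memory.

```python
import gzip
import urllib.request


def build_request(url):
    """Build a GET request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Return the uncompressed body, honouring 'Content-Encoding: gzip'."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    # The server is free to ignore Accept-Encoding and reply uncompressed.
    return body


def fetch(url, timeout=10.0):
    """Fetch a URL and return its decoded body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

A crawler would then call something like fetch("http://example.com/") (a placeholder URL) wherever it previously read the raw response body. Keeping decode_body separate means the uncompressed case still works when a server does not compress its reply.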

Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp, to name but two. Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression support was a bug which would be fixed shortly.

Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.

Crawler | Host requested | Last IP used
LinqiaMetadataDownloaderBot/1.0 (eng@linqia.com) | blog1.gladstonefamily.net | 107.23.124.93
masscan/1.0 (https://github.com/robertdavidgraham/masscan) | - | 195.2.253.2
Mozilla/4.0 | gladstonefamily.net | 111.20.178.90
Mozilla/5.0 (compatible; DomainAppender /1.0; +http://www.profound.net/domainappender) | gladstonefamily.net | 54.169.180.224
Mozilla/5.0 (compatible; DomainAppender /1.0; +http://www.profound.net/domainappender) | www.gladstonefamily.net | 54.169.180.224
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) | pond1.gladstonefamily.net | 216.244.66.244
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html) | gladstonefamily.net | 138.201.30.66
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html) | pond1.gladstonefamily.net | 138.201.30.66
Mozilla/5.0 (compatible; Windows; U; Windows NT 6.2; en-US; rv:12.0) Gecko/20120403211507 Firefox/12.0 | pond1.gladstonefamily.net | 144.76.94.111
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 OPR/36.0.2130.32 | pond1.gladstonefamily.net | 178.137.83.79
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 OPR/36.0.2130.32 | pond1.gladstonefamily.net:8080 | 178.137.83.166
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html | pond1.gladstonefamily.net | 136.243.16.102
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729) | pond1.gladstonefamily.net | 213.186.1.207
Python-urllib/2.7 | pond1.gladstonefamily.net | 54.175.195.72

Comments, problems etc. to Philip Gladstone

Last modified Sunday, 19 November 2006