Thus spake scroogle@xxxxxxxxxxx (scroogle@xxxxxxxxxxx):

> My efforts to counter abuse occasionally cause some
> programmers to consider using Tor to get Scroogle's
> results. About a year ago I began requiring any and all
> Tor searches at Scroogle to use SSL. Using SSL is always
> a good idea, but the main reason I did this is that the
> SSL requirement discouraged script writers who didn't
> know how to add this to their scripts. This policy
> helped immensely in cutting back on the abuse I was
> seeing from Tor.
>
> Now I'm seeing script writers who have solved the SSL
> problem. This leaves me with the user-agent, the search
> terms, and as a last resort, blocking Tor exit nodes.
> If they vary their search terms and user-agents, it can
> take hours to analyze patterns and accurately block them
> by returning a blank page. That's the way I prefer to do
> it, because I don't like to block Tor exit nodes. Those
> who are most sympathetic with what Tor is doing are also
> sympathetic with what Scroogle is doing. There's a lot of
> collateral damage associated with blocking Tor exit nodes,
> and I don't want to alienate the Tor community except as
> a last resort.

Great, now that we know the motivations of the scrapers and the history
of the arms race so far, it becomes a bit easier to do some things to
mitigate their efforts. I particularly like the idea of feeding them
random, incorrect search results once you can fingerprint them.

If you want my suggestions for the next steps in this arms race (having
written some benevolent scrapers and web scanners myself), it would be
to do things that require your adversary to implement and load more and
more bits of a proper web browser into their crawlers before they can
successfully issue queries to you. Some examples:

1. A couple of layers of crazy CSS.

   If you use CSS style sheets that fetch other randomly generated and
   programmatically controlled style elements that are also keyed to
   the form submit for the search query (via an extra hidden parameter
   or something that is their hash), then you can verify on your
   server side that a given query also loaded sufficient CSS to be
   genuine. The problem with this is that it will interfere with
   people who use your search plugin or search keywords, but you could
   instead do it on a brief landing page that is displayed *after* the
   query, but before a 302 or meta-refresh to the actual results, for
   problem IPs. (There's a rough sketch of the server side of this at
   the end of this mail.)

2. Storing identifiers in the cache.

   http://crypto.stanford.edu/sameorigin/safecachetest.html has a
   proof of concept of this. Torbutton protects against long-term
   cache identifiers, but for performance reasons the memory cache is
   enabled by default, so you could use this to differentiate crawlers
   that do not properly obey all browser caching semantics. Caching is
   actually pretty darn hard to get right, so there's probably quite a
   bit more room here than just plain identifiers. (A sketch of one
   variant is at the end of this mail as well.)

3. Javascript "proof of work".

   If the client supports javascript, you can have it factor some
   medium-sized integers and post the factorization along with the
   query string, to prove some level of periodic work. The factors
   could be stored in cookies and given a lifetime. The obvious
   downside of this is that I bet a fair share of your users are
   running NoScript, or prefer to disable javascript and cookies.
   (Again, there's a sketch at the end.)

Anyways, thanks for your efforts with Scroogle. Hopefully the above
ideas are easy enough to implement on your infrastructure to make them
worth using for all problem IPs, not just Tor.
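For what it's worth, here is roughly what I have in mind for the server
side of #1, as an untested Python sketch. The function names
(issue_form_token, record_css_fetch, query_is_genuine), the in-memory
dict, and the timing window are all placeholders for whatever your
setup actually uses, not a drop-in implementation:

    # Idea 1: tie each search form to a one-time CSS "beacon" and refuse
    # (or feed garbage to) queries whose beacon was never fetched.
    import hashlib
    import hmac
    import os
    import time

    SERVER_SECRET = os.urandom(32)   # rotate periodically
    css_fetch_times = {}             # token -> time its CSS was fetched
    BEACON_WINDOW = 120              # seconds a fetched beacon stays valid

    def issue_form_token():
        """Called when rendering the search form or landing page.

        The token goes into a hidden form field *and* into the URL of a
        generated stylesheet, e.g. <link href="/css/<token>.css">.
        """
        nonce = os.urandom(16).hex()
        mac = hmac.new(SERVER_SECRET, nonce.encode(),
                       hashlib.sha256).hexdigest()[:16]
        return nonce + mac

    def token_is_well_formed(token):
        nonce, mac = token[:32], token[32:]
        good = hmac.new(SERVER_SECRET, nonce.encode(),
                        hashlib.sha256).hexdigest()[:16]
        return hmac.compare_digest(mac, good)

    def record_css_fetch(token):
        """Called from the handler that serves /css/<token>.css."""
        if token_is_well_formed(token):
            css_fetch_times[token] = time.time()

    def query_is_genuine(token):
        """Called when the search query arrives with the hidden field."""
        if not token_is_well_formed(token):
            return False
        fetched = css_fetch_times.pop(token, None)   # one-time use
        return fetched is not None and time.time() - fetched < BEACON_WINDOW

The query handler could then quietly hand random, incorrect results to
anything that fails the check, rather than tipping the scraper off with
a blank page.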
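Here is a similarly rough sketch of one way to do #2, planting the
identifier in the ETag of a tiny cacheable resource in the spirit of
the safecachetest PoC. Again, the handler shape, the header handling,
and the /probe.css name are assumptions about your stack:

    # Idea 2: a real browser stores the ETag and revalidates with
    # If-None-Match; a scraper that ignores caching semantics never does.
    import os

    def serve_cache_probe(if_none_match):
        """Handler for a tiny resource (say /probe.css) linked from every
        page. Returns (status, headers, body). "no-cache" means "store it,
        but revalidate every time", so obedient clients echo the planted
        identifier back on each visit."""
        if if_none_match:
            ident = if_none_match.strip('"')
            # `ident` was planted on an earlier visit; correlate it with
            # the current request (exit node, query pattern, ...) here.
            return 304, {"ETag": '"%s"' % ident,
                         "Cache-Control": "no-cache"}, b""

        # First visit, or a client that never revalidates: plant a new one.
        ident = os.urandom(8).hex()
        headers = {
            "ETag": '"%s"' % ident,
            "Cache-Control": "no-cache",
            "Content-Type": "text/css",
        }
        return 200, headers, b"/* intentionally empty */"

A client that keeps fetching the probe without ever sending
If-None-Match is a decent hint that it isn't running anything close to
a real browser cache.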
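And finally a sketch of the server half of #3. The challenge size,
lifetime, and names are made up; the actual factoring would happen in
client-side javascript and get POSTed back alongside the query:

    # Idea 3: hand out a medium-sized semiprime, check the factors the
    # client posts back before answering the query.
    import random
    import time

    def _is_prime(n):
        """Trial division; fine for the ~24-bit primes used here."""
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    def _random_prime(bits=24):
        while True:
            # force the size and oddness, then test
            p = random.getrandbits(bits) | (1 << (bits - 1)) | 1
            if _is_prime(p):
                return p

    outstanding = {}        # challenge n -> time it was issued
    CHALLENGE_TTL = 600     # how long the client gets to do the work

    def issue_challenge():
        """Embed the returned n in the search page for the client js."""
        n = _random_prime() * _random_prime()
        outstanding[n] = time.time()
        return n

    def check_work(n, p, q):
        """Called when the query arrives with the claimed factors p, q."""
        issued = outstanding.pop(n, None)
        if issued is None or time.time() - issued > CHALLENGE_TTL:
            return False
        return 1 < p < n and 1 < q < n and p * q == n

A passing check_work() could set a signed cookie with a lifetime, as
suggested above, so honest clients only pay the cost periodically.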
--
Mike Perry
Mad Computer Scientist
fscked.org evil labs