Scraping through Tor
If, for whatever reason, you want to scrape a website through a proxy, this is pretty easy in Perl, using WWW::Mechanize and Vidalia / Tor:
Free domain? Free hosting? Yup!
Awesome:
I have to say thanks for the tip-off, which came from some rather unsubtle comment spam sent to this very blog.
Firefox gains market share…
According to French survey company XiTi Monitor, Mozilla’s Firefox browser hit a whopping 28% of European internet ‘visit share’ in December 2007, gaining at the expense of Internet Explorer.
One wonders how much, if at all, this has to do with scraping. I, for one, do 99% of my scraping with a Firefox user agent, and only use the IE agent if there is an explicit reason for doing so.
Clearly IE7 has been a big turn-off for a lot of people, but is that the only reason for Firefox’s gained share (other than the fact that Firefox is clearly the best browser)?
Random loss of sessions in ASP.NET
We recently had a problem with our website where people would call us up and say that their shopping cart kept randomly emptying itself just as they were about to buy something. We couldn’t replicate this problem on our computers, so we assumed it might be their firewall or anti-virus software cycling through and periodically deleting our cookies.
The other day I realised that our site was setting a separate cookie for each product page a user viewed. This seemed odd, so I investigated further and found out that Internet Explorer has a hard-coded limit of 20 cookies from any one domain. Adding 20 distinct products to my shopping basket pushed the cookie count past that limit, presumably at the expense of the session cookie, and I finally managed to replicate the problem.
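To make the failure mode concrete, here is a toy simulation in Python. It is only a sketch: it assumes the browser drops the oldest cookie once the 20-per-domain limit is hit, and the `product_N` cookie names are made up for illustration. `ASP.NET_SessionId` is the default name ASP.NET uses for its session cookie.

```python
from collections import OrderedDict

IE_COOKIE_LIMIT = 20  # per-domain limit described in the post


class CookieJar:
    """Toy per-domain cookie jar that evicts the oldest cookie when full."""

    def __init__(self, limit=IE_COOKIE_LIMIT):
        self.limit = limit
        self.cookies = OrderedDict()  # insertion order = age

    def set(self, name, value):
        # Re-setting an existing cookie just updates it in place.
        if name in self.cookies:
            self.cookies[name] = value
            return
        # Assumption: when the jar is full, the oldest cookie is dropped.
        if len(self.cookies) >= self.limit:
            self.cookies.popitem(last=False)
        self.cookies[name] = value


def session_survives(product_pages_viewed):
    """Return True if the session cookie is still present afterwards."""
    jar = CookieJar()
    jar.set("ASP.NET_SessionId", "abc123")  # set first, so it is the oldest
    for i in range(product_pages_viewed):
        jar.set(f"product_{i}", "viewed")  # hypothetical per-product cookie
    return "ASP.NET_SessionId" in jar.cookies
```

With these assumptions, 19 product pages leave the session intact (20 cookies in total), while the 20th product cookie evicts the session cookie, which matches the point at which the basket emptied.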
Swedish domain hacking
I wrote a script a while back that took words ending in ‘se’ and looked up whether the domain-hacked alternative (the same word with a dot before the ‘se’) was still available. As I’m too poor to register any more of these right now, I thought I’d share the results. Please bear in mind I wrote this script in January, so some of these may have been taken since.
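The word-to-domain step is simple enough to sketch. The snippet below is in Python, not the original script, and the function names `domain_hack` and `hack_all` are invented for illustration. It only builds the candidate names; checking availability against whois is left out.

```python
def domain_hack(word, tld="se"):
    """Return the domain-hacked form of a word ending in the TLD, else None."""
    word = word.strip().lower()
    stem = word[: -len(tld)]
    # Require the TLD suffix, at least one character before the dot,
    # and alphabetic characters only.
    if not word.endswith(tld) or not stem or not word.isalpha():
        return None
    return f"{stem}.{tld}"


def hack_all(words, tld="se"):
    """Keep only the words that produce a domain hack, in input order."""
    return [d for d in (domain_hack(w, tld) for w in words) if d]
```

For example, `domain_hack("cheese")` gives `"chee.se"`, while a word such as `"dog"` gives `None`.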
You can register .se domains at various places on the web, but the cheapest I found was Crystone. The Swedish whois service is here. Domains are sorted by the word’s popularity in the English language – enjoy:
Why Google sucks and Yahoo! rocks
Despite being a champion of open source, Google have discontinued their SOAP search API. This sucks because anyone who wants to get information from Google’s SERPs programmatically (i.e. to check their website’s ranking for certain keywords, etc.) now has to resort to screen scraping, which is incidentally against Google’s T&Cs.
Admittedly the Google API was at best patchy, but it was a step in the right direction. Anyway, Yahoo! rocks because I’ve discovered their search API is tons better – solid, reliable, and it allows 5x more queries per day than the Google one did.
Fun with Perl & search logs
A while ago AOL released a large amount of anonymised search data from its users. This consisted of over 4 GB of data.
Now that’s a hell of a lot of search queries to be looking through manually. However, thanks to some Perl & Excel trickery, it’s possible to get some useful info from all this. ‘Perl’ is often expanded as ‘Practical Extraction and Report Language’, so it is naturally perfect for the task.
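As a rough illustration of the kind of processing involved, here is a minimal query-counting sketch. It is written in Python and is not the script from this post. It assumes the logs are tab-separated with a header row and the columns AnonID, Query, QueryTime, ItemRank and ClickURL; the function name `top_queries` is made up.

```python
import csv
from collections import Counter


def top_queries(lines, n=10):
    """Count queries in tab-separated log lines and return the n most common."""
    # Assumed column order: AnonID, Query, QueryTime, ItemRank, ClickURL.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip malformed rows and the header row.
        if len(row) < 2 or row[0] == "AnonID":
            continue
        query = row[1].strip().lower()  # fold case so variants are merged
        if query:
            counts[query] += 1
    return counts.most_common(n)
```

Passing an open file handle as `lines` streams the data row by row, so the whole log never has to fit in memory. The resulting list of (query, count) pairs is small enough to paste into Excel.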
I’m far from being a Perl expert, but here’s the script I wrote to parse this data: