
Adventures with my Technorati ranks "toy"

As I mentioned here before, a couple of days ago I coded a program that takes an OPML file and generates a table in which the sites listed in that file appear ordered by Technorati rank. It also shows each site’s number of incoming links (again, from Technorati) and its PageRank.

(By the way: no, this is not ready for release yet. But it will be. Soon.)

Initially, the data-collecting part of my program started by clearing a table in a MySQL database, which it would then fill with the values it got from Technorati and Google. However, this had two problems:

  1. Technorati allows only a limited number of accesses per day. I discovered this while running several tests: after about half a dozen or so, it stopped giving me data. The problem, then, was that my program had already cleared the table… so I ended up with an empty one.
  2. From time to time, Technorati gives me “wrong” ranks / links for a blog: values much lower than they should be (not absurd or “bogus”, just wrong). It’s weird and not reproducible, and usually asking Technorati again returns the correct value.

To solve the first problem, I obviously needed some way to keep the data from the previous run while fetching the new values, so that, if Technorati told me to get stuffed, I would still have the data from the day before.

The second problem was a little more complicated, though, in a way, the solution to the first helped me crack it.

My method was this: when running the script, start by copying the original table to another (let’s call it temp1) and clearing the original table. Then fetch the new data into yet another table (temp2). Afterwards, regenerate the original table with data from temp1 and temp2, in the following way:

  • if an entry (identified by the site’s URL) exists in only one of the tables, use it.
  • if an entry exists in both, use the common values (URL, site’s name), and for each of the 3 numeric values, choose the best of the two tables’ values. “Best” means the highest number of incoming links, the highest PageRank, and the lowest Technorati rank.

This way, if once in a while Technorati gives my program a much worse value than it should (I’ve never seen it rate a blog better than it really is), the program still has a more correct value to use instead.
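
To make the merge rule concrete, here is a minimal sketch in Python. It is not the actual script: the real thing works on MySQL tables, while this version uses plain dictionaries keyed by URL to stand in for temp1 and temp2. The names SiteStats and merge_runs, the field names, and the example URLs are all made up for illustration.

```python
from typing import Dict, NamedTuple


class SiteStats(NamedTuple):
    """One row of the table (hypothetical field names)."""
    name: str
    technorati_rank: int   # lower is better
    incoming_links: int    # higher is better
    pagerank: int          # higher is better


def merge_runs(previous: Dict[str, SiteStats],
               fresh: Dict[str, SiteStats]) -> Dict[str, SiteStats]:
    """Rebuild the main table from the previous run (temp1) and the new run (temp2)."""
    merged = {}
    for url in previous.keys() | fresh.keys():
        old = previous.get(url)
        new = fresh.get(url)
        if old is None or new is None:
            # The entry exists in only one of the tables: use it as it is.
            merged[url] = new if old is None else old
        else:
            # The entry exists in both: keep the best value of each number.
            merged[url] = SiteStats(
                name=new.name,
                technorati_rank=min(old.technorati_rank, new.technorati_rank),
                incoming_links=max(old.incoming_links, new.incoming_links),
                pagerank=max(old.pagerank, new.pagerank),
            )
    return merged


# Example with invented numbers: the new run returns a "wrong" rank for site A,
# has no data at all for site B, and sees site C for the first time.
previous = {
    "http://a.example/": SiteStats("A", 1200, 340, 5),
    "http://b.example/": SiteStats("B", 90000, 12, 3),
}
fresh = {
    "http://a.example/": SiteStats("A", 45000, 20, 5),
    "http://c.example/": SiteStats("C", 700, 900, 6),
}
merged = merge_runs(previous, fresh)
```

In this example, site A keeps its rank of 1200 and its 340 links from the previous run, site B survives even though the new run returned nothing for it, and site C is simply added. If Technorati cuts me off completely and the new run comes back empty, the merged table is just a copy of the previous one.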

Sounds fine, doesn’t it? But there’s a problem with this method… which I solved later, but which I’ll discuss in the next post. Until then… any guesses as to what it was? :)




Creative Commons Attribution-NonCommercial-NoDerivs 2.5 Portugal