“301 Moved Permanently” is what I get while trying to get wordpress.com/tag/X/ on php. However, Python urllib still works.
The tip I gave about getting wordpress /page/2 of tags no longer works…
I could pay some serious hosting but I’m under the impression they would make crawling impossible sooner rather than later. Also, without page2+ crawling there is no way I would be able to do any real search.
After months thinking about how to crawl wordpresses jumping from blog to blog or simply asking for bloggers to register their blogs or even asking random users to index a post I was quite upset. No alternative seemed good enough. Relevant content needs to be updated continually, needs to be fresh. I tried finding a feed for specific tags on wordpress.com, failing. I even thought today about filtering bing’s results for a given tag a selecting only *.wordpress.com websites, then showing to my users. But then I tried getting a wordpress.com/some_tag page using python script(not mozilla). And all posts urls showed up, as well as titles and even descriptions; a crawl-able website. Not only that, but I can also get older pages by adding “/page/x” where x is [2..Inf]
Unbelievable. That’s exactly what I wanted. Now let’s hope PHP can also get those crawler-friendly pages so I don’t need to pay a python-enabled host.