Crawler | siliconvalleylatebloomer

Tag Archives: Crawler

April 29, 2014 · 2:34 am

Hard to believe

When I tried to get post URLs from the WordPress reader, I first looked for the posts’ names in Mozilla’s source code viewer. When I couldn’t find them, I concluded that the site was built with Flash/JavaScript and that I would need a way around it.

After months of thinking about how to crawl WordPress blogs by jumping from blog to blog, or simply asking bloggers to register their blogs, or even asking random users to index a post, I was quite upset. No alternative seemed good enough. Relevant content needs to be updated continually; it needs to be fresh. I tried to find a feed for specific tags on wordpress.com and failed. Today I even thought about filtering Bing’s results for a given tag, selecting only *.wordpress.com websites, and showing those to my users. But then I tried fetching a wordpress.com/some_tag page with a Python script (not Mozilla). All the post URLs showed up, as well as titles and even descriptions: a crawlable website. Not only that, but I can also get older pages by appending “/page/x”, where x is in [2..Inf].

Unbelievable. That’s exactly what I wanted. Now let’s hope PHP can also get those crawler-friendly pages so I don’t need to pay for a Python-enabled host.
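Here is a minimal sketch of the kind of script described above. The URL layout follows this post (wordpress.com/some_tag, plus “/page/x” for older pages). The names tag_page_url and fetch_page, the https scheme, and the User-Agent header are my own assumptions for illustration, and the site may well respond differently today.

```python
from urllib.request import Request, urlopen


def tag_page_url(tag, page=1, base="https://wordpress.com"):
    """Build the URL of one listing page for a tag.

    Page 1 is the plain tag URL; older pages append /page/x (x >= 2),
    as described in the post. The layout is an assumption, not a documented API.
    """
    url = f"{base}/{tag}"
    if page > 1:
        url += f"/page/{page}"
    return url


def fetch_page(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Some servers reject requests that carry no User-Agent, so send a generic one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_page(tag_page_url("some_tag", 2)) would then return the raw HTML of the second page, ready to be searched for post URLs and titles.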

Python – Slicing with two Delimiters

While making a crawler for blogs I faced the following problem:

——————————-

Info_That_I_Want_Follows: “Important Information”

——————————-

Problem: How to extract “Important Information”?

The ideal way would probably be to parse the HTML and see which tag contains “Important Information”, but I don’t know how to do that, so I wrote the following “hack”:

i1 = s1.find("Want_Follows: \"")
i2 = s1[i1 + len("Want_Follows: \""):].find("\"")
FinalString = s1[i1 + len("Want_Follows: \""):i1 + len("Want_Follows: \"") + i2]

But now that I think about it, a simpler way would be:

s2 = s1.split("Want_Follows: \"", 1)[1]
FinalString = s2.split("\"", 1)[0]

A lot easier to read. Yet another cool thing would be to make a function like this:

def Slice2Ways(Delimiter1, Delimiter2, StringToSlice):
    String2 = StringToSlice.split(Delimiter1, 1)[1]
    return String2.split(Delimiter2, 1)[0]

This is handy because this kind of slicing comes up frequently when extracting data without building a full-blown parser.
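One thing to watch for: Slice2Ways raises an IndexError when the first delimiter is missing, and silently returns the rest of the string when the second one is missing. Below is a sketch of a more forgiving variant built on str.partition. The name slice_between and the choice to return None in both cases are my own assumptions, not the only reasonable behavior.

```python
def slice_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns None if either delimiter is not found.
    """
    # partition() returns (before, separator, after); the separator is
    # an empty string when it does not occur in the text.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle


s1 = 'Info_That_I_Want_Follows: "Important Information"'
result = slice_between(s1, 'Want_Follows: "', '"')
```

Returning None lets the crawler skip a page whose layout changed instead of crashing halfway through a run.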
