Tag Archives: Search Engine

May 5, 2014 · 7:01 am

Crawling WordPress

“301 Moved Permanently” is what I get when I request wordpress.com/tag/X/ from PHP. Python’s urllib still works, though.

The tip I gave about getting /page/2 of a WordPress tag no longer works…

I could pay for some serious hosting, but I’m under the impression they would make crawling impossible sooner rather than later. Also, without crawling page 2 and beyond there is no way I could offer any real search.

Hard to believe

When I tried getting post URLs from the WordPress reader, I first looked for the posts’ names in Mozilla’s source code viewer. Being unable to find them, I concluded the website was built with Flash/JavaScript and that I would need a way around it.

After months of thinking about how to crawl WordPress blogs by jumping from blog to blog, or simply asking bloggers to register their blogs, or even asking random users to index a post, I was quite upset. No alternative seemed good enough. Relevant content needs to be updated continually; it needs to be fresh. I tried to find a feed for specific tags on wordpress.com and failed. Today I even thought about filtering Bing’s results for a given tag, selecting only *.wordpress.com websites, and showing those to my users. But then I tried getting a wordpress.com/some_tag page using a Python script (not Mozilla). All the post URLs showed up, as well as titles and even descriptions: a crawlable website. Not only that, but I can also get older pages by adding “/page/x”, where x is in [2..Inf].
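As a rough illustration of the “/page/x” pattern described above, here is a minimal Python sketch. It is not the script I actually ran: the https scheme, the /tag/ path prefix (taken from the wordpress.com/tag/X/ form), and the names tag_page_url, LinkCollector, extract_links and fetch_tag_links are all assumptions made for the example. It also collects every link on the page, so picking out only the post links would still need filtering that depends on the page’s actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen


def tag_page_url(tag, page=1):
    """Build the URL of one page of a wordpress.com tag listing (assumed layout)."""
    base = "https://wordpress.com/tag/" + quote(tag) + "/"
    # Page 1 is the bare tag URL; older pages append /page/x with x >= 2.
    return base if page < 2 else base + "page/" + str(page) + "/"


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_tag_links(tag, page=1):
    """Download one tag page and return its links (needs network access)."""
    with urlopen(tag_page_url(tag, page)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html)
```

Calling fetch_tag_links("fiction", 2) would request the second page of that tag, assuming the site still serves plain HTML there.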

Unbelievable. That’s exactly what I wanted. Now let’s hope PHP can also get those crawler-friendly pages so I don’t need to pay for a Python-enabled host.

Thoughts about my new website…

I’ll fix the issue of getting files from other webpages onto my host by doing it locally (client-side).

This will free up my host’s bandwidth. When you think about it, even if doing it from PHP worked, someday someone would request analytics for a file too large to transfer to my host. By asking the client to download the file, I can also ask it to send and/or analyze only specific parts of it. For example, instead of fetching a whole 1 GB movie, I could ask for just the first hundred bytes, conclude that they are the header of a video, and give the file the tag “video”.
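The “first hundred bytes” idea can be sketched as follows. This is only an illustration in Python, not the client-side code itself, and the function names (looks_like_video, tags_for_header, fetch_first_bytes) are hypothetical. The three signatures checked are well-known container markers, but the check is a rough heuristic: an “ftyp” box, for instance, also appears in audio-only files of the same family.

```python
from urllib.request import Request, urlopen


def looks_like_video(header):
    """Guess from the leading bytes whether a file is a common video container."""
    # MP4 family: bytes 4..8 hold the box type "ftyp" (audio-only files match too).
    if len(header) >= 8 and header[4:8] == b"ftyp":
        return True
    # AVI: a RIFF container whose form type at bytes 8..12 is "AVI ".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return True
    # Matroska/WebM: files start with the EBML magic number.
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return True
    return False


def tags_for_header(header):
    """Return the list of tags suggested by a file's first bytes."""
    return ["video"] if looks_like_video(header) else []


def fetch_first_bytes(url, count=100):
    """Ask the server for only the first `count` bytes via an HTTP Range header."""
    request = Request(url, headers={"Range": "bytes=0-%d" % (count - 1)})
    with urlopen(request) as response:
        # A server may ignore Range and send everything, so cap the read too.
        return response.read(count)
```

With this, tags_for_header(fetch_first_bytes(url)) would tag a file without downloading the whole thing, provided the server honors range requests.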

Another idea that hit me, though, is to have a search engine that could filter WordPress posts by multiple tags. It would still require a user (likely the blog owner) to run an open-source Python script to extract tags from posts and upload them to my index. But it would probably be easier than going straight to my auto-tagger/analytics project.

The question though is: is there any demand for either of these websites?

I can see myself using the analytics website IF it could give significant insight into videos and/or foreign accents in text. But those are algorithms I haven’t developed (not even as drafts).

A website that allowed me to filter “Short story” AND “Fiction”? Hmmmm
I guess the tag Flash Fiction already does that… (it combines those two)… Well… I guess there is no tag unifying “Fantasy” and “Short story”… but I would need lots of posts to get any decent results from a multiple-tag search engine…

The promising side is that neither would take much of my time, so there is no reason not to do both if my internet connection allows. (I need to look at Python’s and PHP’s online references to code.)
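
A multiple-tag filter like “Short story” AND “Fiction” boils down to intersecting sets of posts. Here is a minimal sketch in Python, assuming the index simply maps each tag to the set of post URLs carrying it; the TagIndex class, its methods, and the example URLs are made up for illustration, and nothing like this exists yet.

```python
from collections import defaultdict


class TagIndex:
    """A tiny in-memory index from tag to the set of post URLs carrying it."""

    def __init__(self):
        self._posts_by_tag = defaultdict(set)

    def add_post(self, url, tags):
        """Register a post under each of its tags (case-insensitive)."""
        for tag in tags:
            self._posts_by_tag[tag.strip().lower()].add(url)

    def search(self, *tags):
        """Return, sorted, the URLs of posts carrying every requested tag."""
        if not tags:
            return []
        sets = [self._posts_by_tag.get(t.strip().lower(), set()) for t in tags]
        # AND semantics: a post must appear in the set of every tag.
        return sorted(set.intersection(*sets))


index = TagIndex()
index.add_post("http://a.example/1", ["Short story", "Fiction"])
index.add_post("http://b.example/2", ["Fiction", "Fantasy"])
index.add_post("http://c.example/3", ["short story", "fantasy"])
print(index.search("Fantasy", "Short story"))
```

The last line prints only the third post, the one tagged with both “Fantasy” and “Short story”.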
