PHP’s get_headers is no good either: the Last-Modified date seems to change every time I fetch the headers of a blog URL, even when the content hasn’t changed. Anyway, I think the main obstacle to updating indexed entries regularly wouldn’t be bandwidth so much as processing speed, plus script-level limitations like getting a PHP script to fetch several pages in a row. How many could I fetch before hitting the max execution time? I could try raising that limit, but it still sounds like my Python script would do a better job. 000webhost doesn’t support Python, though, which takes that off the table.
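For reference, a minimal sketch of what the Python updater’s freshness check might look like, assuming I store each blog’s previously seen Last-Modified value somewhere (the function name and the stubbed headers are mine, not from any existing script):

```python
from email.utils import parsedate_to_datetime

def needs_update(headers, last_seen):
    """Decide whether a feed should be re-fetched, given its response
    headers and the Last-Modified value stored on the previous run.
    Returns True when the header is missing or unparsable (play it
    safe and re-fetch), or when it is newer than the stored value."""
    value = headers.get("Last-Modified")
    if value is None or last_seen is None:
        return True  # no basis for comparison: re-fetch
    try:
        return parsedate_to_datetime(value) > parsedate_to_datetime(last_seen)
    except (TypeError, ValueError):
        return True

# In a real run the headers would come from an HTTP HEAD request
# (e.g. urllib.request with Request(url, method="HEAD")); stubbed here.
print(needs_update({"Last-Modified": "Tue, 02 Jan 2024 00:00:00 GMT"},
                   "Mon, 01 Jan 2024 00:00:00 GMT"))
```

Of course this only helps when the server’s Last-Modified header is honest, which the get_headers experiment above suggests is often not the case for blogs.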
An alternative would be to drop my blog-driven search plan and stick to my initial all-encompassing search engine/analytics plan.
A pure blog indexer would require, at the very least, a smart updater that refreshes some blogs sooner than others. Without that, WordPress search, even lacking the ability to search multiple tags, would be preferable.
100 GB, that is a lot per month. If every blog’s RSS required 10 KB: 100 GB = 100,000 MB = 100,000,000 KB, and 100,000,000 KB / 10 KB = 10,000,000 blogs updated per month, or, divided by 30, about 333,333 blogs updated every day.
Hmmm… But that is actually just for checking whether they should be updated.
A heftier 20-30 KB would be needed for each blog actually updated. And that is going straight to the last post (and assuming there is just one new post), which is different from my previous Python script, which grabbed the last post’s link from the blog’s front page.
At 30 KB each, that is 100,000,000 KB / 30 KB ≈ 3,333,333 updates per month, or about 111,111 blogs updated daily.
A lot, actually. But there is also the time spent requesting the RSS, waiting for it to arrive, writing to disk…
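Most of that time is idle waiting, though, so the fetches can overlap. A rough sketch using a thread pool, with the fetch function stubbed out to mimic network latency (everything here is a stand-in, not the real updater):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_feed(url):
    # Stand-in for the real RSS request; sleep mimics network latency.
    time.sleep(0.1)
    return url, "<rss>...</rss>"

urls = [f"http://blog{i}.example/feed" for i in range(20)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map preserves input order while requests run concurrently.
    results = list(pool.map(fetch_feed, urls))
elapsed = time.time() - start
# 20 fetches of 0.1 s each take ~2 s serially, but ~0.2 s with 10 workers.
```

With waits overlapped like this, the wall-clock cost per blog drops to roughly the per-request time divided by the number of workers, so disk writes and parsing become the real bottleneck.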