Circular Dependency Problem

I was thinking this morning (morning from my POV) about how to solve the problem of running several analytics, some of which depend on others.

For instance, suppose I have 2 programs: A and preA; and A depends on preA.

That would require me to try to run A and, if it fails due to unsatisfied dependencies, put it in a wait-list and run preA first, then go back to A.

Now, what about postAB, A, B, preA, preB?

I guess a possible (but inefficient) solution would be to try running all of them, putting the ones that do not run in a wait-list, then iterating over the wait-list again and again until all of them have run or have no hope of having their dependencies satisfied.

Like this:

postAB(depends on A and B, wait-list)
A(depends on preA, preA didn’t run yet so put A in wait-list)
B(depends on preB, wait-list)
preA(runs)
preB(runs)

2nd iteration:

postAB(wait-list again)
A(runs)
B(runs)

3rd:

postAB(runs)

———————————-

Now, this leads to quite some processing time: N+(N-1)+(N-2)+..+(N-(N-1)) = N(N+1)/2 attempts in a worst-case scenario.
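
The wait-list idea above can be sketched in Python roughly like this. It is only a minimal sketch under my own assumptions: the name run_with_waitlist, the deps dictionary, and the simplification that a program always "runs" successfully once its dependencies have run are all hypothetical. It also counts attempts, so the worst case can be checked by hand.

```python
def run_with_waitlist(order, deps):
    """Try programs in `order`, retrying those whose dependencies have not run.

    `deps` maps a program name to the list of programs it depends on
    (hypothetical structure, for illustration only).
    Returns (programs in the order they ran, attempts made, programs left stuck).
    """
    done = []                # programs that have run so far
    waitlist = list(order)   # programs still to be tried
    attempts = 0
    while waitlist:
        still_waiting = []
        for prog in waitlist:
            attempts += 1
            if all(d in done for d in deps.get(prog, [])):
                done.append(prog)           # "runs"
            else:
                still_waiting.append(prog)  # back to the wait-list
        if len(still_waiting) == len(waitlist):
            break  # nothing ran in this pass, so the rest have no hope
        waitlist = still_waiting
    return done, attempts, waitlist


deps = {"postAB": ["A", "B"], "A": ["preA"], "B": ["preB"]}
ran, attempts, stuck = run_with_waitlist(["postAB", "A", "B", "preA", "preB"], deps)

# Worst case: a chain given in exactly the wrong order (N = 4).
chain = {"p4": ["p3"], "p3": ["p2"], "p2": ["p1"]}
_, chain_attempts, _ = run_with_waitlist(["p4", "p3", "p2", "p1"], chain)
```

With the example list this takes 5 + 3 + 1 = 9 attempts, and the reversed chain of 4 takes 4 + 3 + 2 + 1 = 10. Two programs that depend on each other never run and come back in the stuck list.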

One way of mitigating this might be to sort the initial list in a way that programs with fewer dependencies go first:
preB,preA,A,B,postAB

In this case we would go from 3 iterations in the initial case to just 1. :)
The worst case scenario would remain the same though.
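
A small sketch of that sorting step, again with hypothetical names (count_passes, deps) and assuming a program runs as soon as its dependencies have run. It sorts by the number of direct dependencies and counts how many iterations over the wait-list are needed.

```python
def count_passes(order, deps):
    """Count how many iterations over the wait-list are needed for `order`."""
    done, waitlist, passes = set(), list(order), 0
    while waitlist:
        passes += 1
        still_waiting = []
        for prog in waitlist:
            if all(d in done for d in deps.get(prog, [])):
                done.add(prog)
            else:
                still_waiting.append(prog)
        if len(still_waiting) == len(waitlist):
            break  # no progress: remaining dependencies can never be satisfied
        waitlist = still_waiting
    return passes


deps = {"postAB": ["A", "B"], "A": ["preA"], "B": ["preB"]}
initial = ["postAB", "A", "B", "preA", "preB"]

# Programs with fewer direct dependencies go first (sorted() keeps ties in place).
by_dep_count = sorted(initial, key=lambda p: len(deps.get(p, [])))

passes_initial = count_passes(initial, deps)
passes_sorted = count_passes(by_dep_count, deps)
```

Here the initial order needs 3 iterations and the sorted one needs 1. Counting only direct dependencies says nothing about how long a chain is, which is one reason a bad case can still show up.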

Filed under Uncategorized

Search Engine project (thoughts)

“As of 2011, over 100,000 new WordPresses are created every day”

Well, 111 111 blogs updated daily doesn’t seem like that many anymore. User-submitted URLs ftw? Dumb crawling looks like an ill-fated idea. Also, it would gear my project towards a niche market (which isn’t bad actually, I’m not planning on tackling Bing/Google head on).


get_headers

PHP’s get_headers is no good either. The “modified” date seems to change every time I get the headers of a blog URL, even if there was no change to the content. Anyway, I think the main problem with updating indexed entries regularly wouldn’t be so much bandwidth as processing speed, plus possible script-wise limitations like getting a PHP script to fetch several pages in a row. How many would I be able to fetch without running into the max time limit? I could try to modify that limit, but still, it sounds like my Python script would do a better job. 000webhost doesn’t support Python, though, which takes that off the table.

An alternative would be to drop my blog-driven search plan and stick to my initial all-encompassing search engine/analytics plan.

A pure blog indexer would require, at the very least, a smart updater that would update some blogs sooner than others. Without that, WordPress search, even lacking the ability to search multiple tags, would be preferable.

100gb, that is a lot per month. If every blog RSS required 10kb:
100gb = 100 000mb = 100 000 000kb
100 000 000kb / 10kb = 10 000 000 blogs updated per month
10 000 000 / 30 days = 333 333 blogs updated every day

Hmmm… But that is actually just for checking if they should be updated.
A hefty 20kb-30kb would be needed for each blog actually updated. And that is going straight to the last post (and assuming there is just one new post), which is different from my previous py script that grabbed the last post’s link from the blog’s front page.

At 30kb each: 100 000 000kb / 30kb / 30 days = 111 111 blogs updated daily.

A lot, actually. But there is also the time spent requesting the RSS, waiting for it to arrive, writing to disk…
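
The arithmetic above can be double-checked with a few lines of Python. This is just a sketch: the constant and function names are made up for illustration, and it assumes 1gb = 1000mb = 1 000 000kb and a 30-day month, as in the numbers above.

```python
MONTHLY_BUDGET_KB = 100 * 1000 * 1000  # 100gb expressed in kb
DAYS_PER_MONTH = 30


def blogs_per_day(kb_per_blog):
    """How many blogs fit in the daily share of the monthly bandwidth."""
    return MONTHLY_BUDGET_KB // kb_per_blog // DAYS_PER_MONTH


checks_per_day = blogs_per_day(10)   # only fetching a ~10kb RSS feed
updates_per_day = blogs_per_day(30)  # fetching ~30kb for an actual update
print(checks_per_day, updates_per_day)
```

This gives 333 333 checks or 111 111 updates per day. It ignores request latency and disk writes, which would be the real limit.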


RSS usefulness

I was writing my search engine yesterday when I realized that just having a bunch of blogs (hundreds or thousands) indexed wouldn’t make for a good website, because I would need to constantly refresh that information, and that would be difficult given my website’s limited bandwidth. So today I decided to research RSS. It does reduce the amount of data needed to check for updates somewhat, but not as much as one would think. My own blog’s front page is around 18kb, while my RSS feed is 8kb. Another blog that I like to read has a 30kb/10kb ratio. That is a bit under half and 1/3 of the size, respectively. Not really that much. They should make a feed that simply shows the URL and the last modified date (or I should discover it, if it already exists). All in all, RSS is more useful for its standardization than for its size savings.


Design First…

…Code Later

Design First, Code Later.

Design First, Code Later.

Design First, Code Later.

Design First, Code Later.

Basic Functionality First, Details Later.

Basic Functionality First…


Python – Slicing with Two Delimiters

While making a crawler for blogs I faced the following problem:

——————————-

Info_That_I_Want_Follows: “Important Information”

——————————-

Problem: How to extract “Important Information”?

The ideal way would probably be parsing the HTML and seeing which tag contains “Important Information”, but I don’t know how to do that, so I wrote the following “hack”:

i1 = s1.find("Want_Follows: \"")
i2 = s1[i1+len("Want_Follows: \""):].find("\"")
FinalString = s1[i1+len("Want_Follows: \""):i1+len("Want_Follows: \"")+i2]

But now that I think about it, a simpler way would be:

s2 = s1.split("Want_Follows: \"", 1)[1]
FinalString = s2.split("\"", 1)[0]

A lot easier to read. Yet another cool thing would be to make a function like this:

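def Slice2Ways(Delimiter1, Delimiter2, StringToSlice):
    # Keep everything after the first Delimiter1...
    String2 = StringToSlice.split(Delimiter1, 1)[1]
    # ...then keep everything before the first Delimiter2
    return String2.split(Delimiter2, 1)[0]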

That would be handy because this kind of slicing is done frequently if one is to extract data without building a full-blown parser.
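
One thing the split version does not handle is a missing delimiter: if the first delimiter is not in the string, indexing [1] raises an IndexError. A possible variant, sketched here with str.partition and a made-up name (slice_between), returns None in that case instead. Returning None rather than raising is just my assumption about what a crawler would want.

```python
def slice_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns None if either delimiter is missing (an assumed convention).
    """
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle


s1 = 'Info_That_I_Want_Follows: "Important Information"'
final_string = slice_between(s1, 'Want_Follows: "', '"')
print(final_string)
```

For the example string this prints Important Information, and a page that lacks the marker gives None instead of crashing the crawler.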


At a Snail’s Pace

Originally posted on chris white writes:

Arriving to Arratha, City of Science – Michal Matczak

The University had stood for an age, harvesting the finest minds of an entire system, spreading slowly across the skies, a sandstone and ivy cancer, blotting out the sun.

For those of us living in its shadow, it was a reminder of what we could never have, and of the privileges of the ivory tower. Our women stolen, as well as our children – and we were supposed to thank them! I always listened to my mother: Never talk to strangers. We ran, and hid in the mountains.

Yet still they came.

The air-bladders on the ships inflated, at a snail’s pace.

Surely they would discover us. Surely.

They didn’t – and we struck, our flotilla silently approaching, the setting sun at our backs and revenge before us.

Stone doesn’t burn.

But books sure do.
