Monthly Archives: April 2014

April 29, 2014 · 12:38 pm

Going nuclear

I’ve been thinking since yesterday about writing down an algo I developed mentally around a year ago.

It pinpoints an author’s “characteristic”, so to speak, just from her/his text.

I’m not into people or analyzing ’em, but I had to develop said algo because it was related to a philosophical question regarding human intelligence. And I wanted to create strong AI at the time (I still believed it was possible).

I was saving the algo for later… but it has been several months since I gave up developing strong AI in September, having concluded it was incompatible with my model of how information in our universe works; and still no big hit, no website. This is indeed my last resort.

The WordPress multi-tag search engine probably wouldn’t be as cool as I thought at first: with any weird combination of even a small number of tags the website would take a while to load, and its results would be outdated.

Filed under Uncategorized

April 29, 2014 · 2:34 am

Hard to believe

When I tried getting post URLs from the WordPress reader, I first tried finding the posts’ names in Mozilla’s source code viewer. Being unable to do that, I concluded the website was made with Flash/JavaScript and that I would need a way around it.

After months of thinking about how to crawl WordPress blogs by jumping from blog to blog, or simply asking bloggers to register their blogs, or even asking random users to index a post, I was quite upset. No alternative seemed good enough. Relevant content needs to be updated continually; it needs to be fresh. I tried finding a feed for specific tags on wordpress.com and failed. Today I even thought about filtering Bing’s results for a given tag, selecting only *.wordpress.com websites and showing those to my users. But then I tried getting a wordpress.com/some_tag page using a Python script (not Mozilla). And all the post URLs showed up, as well as titles and even descriptions; a crawlable website. Not only that, but I can also get older pages by adding “/page/x”, where x is in [2..Inf].
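A minimal sketch of how such a crawl could be set up in Python. The URL layout (wordpress.com/some_tag plus “/page/x”) is taken from the description above and may not match what the site actually serves; the names BASE, tag_page_url, LinkCollector and extract_links are made up for illustration, and the sample HTML is invented rather than copied from a real page.

```python
from html.parser import HTMLParser

# Assumption: tag pages live directly under this base, as described in the post.
BASE = "https://wordpress.com"


def tag_page_url(tag, page=1):
    """Build the URL of a tag page; pages 2 and up get a /page/x suffix."""
    url = f"{BASE}/{tag}"
    if page > 1:
        url += f"/page/{page}"
    return url


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, keeping first-seen order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href not in self.links:
                self.links.append(href)


def extract_links(html):
    """Return the unique links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Invented sample standing in for a fetched tag page.
sample_html = """
<div><a href="https://someblog.wordpress.com/2014/04/29/a-post/">A post</a></div>
<div><a href="https://otherblog.wordpress.com/2014/04/28/b-post/">B post</a></div>
<div><a href="https://someblog.wordpress.com/2014/04/29/a-post/">Read more</a></div>
"""

print(tag_page_url("some_tag"))
print(tag_page_url("some_tag", 3))
print(extract_links(sample_html))
```

The fetch itself is left out on purpose; something like urllib.request.urlopen(tag_page_url(tag, page)).read() would supply the HTML, and the links would still need filtering down to actual post URLs.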

Unbelievable. That’s exactly what I wanted. Now let’s hope PHP can also get those crawler-friendly pages so I don’t need to pay for a Python-enabled host.

Filed under Uncategorized

Tagged as Crawler, PHP, Python, Relief, Search Engine, Wordpress

April 27, 2014 · 10:02 am

Apple, Google, Intel, Adobe lawsuit

“Some Silicon Valley companies refused to enter into no-hire agreements. Facebook Chief Operating Officer Sheryl Sandberg, for instance, rebuffed an entreaty from Google in 2008 that they refrain from poaching each other’s employees.”

The interesting part is that if you were to ask a random consumer to rank those companies’ CEOs by likeability, the order would probably be: Apple > Google > Facebook.

So much for appearances.

Filed under Uncategorized

Tagged as Apple, Facebook, Google, Lawsuit

April 26, 2014 · 11:29 pm

Tech shares fall

http://www.bloomberg.com/news/2014-04-25/amazon-falls-as-increased-spending-limits-profit-growth.html

One aspect that may or may not be reflected in those stocks’ value is that you don’t get as much brand permanence with internet businesses as you do with banks, retailers, and other more traditional companies.

All they have is data, lots of it, flowing in. And the talent to analyze that data and target ads better. Once people realize that the vast majority of those analyses do not require a huge amount of data, only smarter ways of collecting samples, we may see a decrease in their value.

Filed under Uncategorized

Tagged as Big Data, Finance, Internet, Statistics, Stock Market, Tech Companies

April 20, 2014 · 6:57 pm

Circular Dependency Problem

I was thinking this morning (morning from my POV) about how to solve the problem of having several analytics running, some of which depend on others.

For instance, suppose I have 2 programs: A and preA; and A depends on preA.

That would require me to try to run A and, if it fails due to unsatisfied dependencies, put it in a wait-list and run preA first. Then go back to A.

Now, what about postAB,A,B,preA,preB?

I guess a possible (but inefficient) solution would be to try running all of them, putting the ones that do not run in a wait-list, then iterating over the wait-list again and again until all of them have run or have no hope of having their dependencies satisfied.

Like this:

postAB(depends on A and B, wait-list)
A(depends on preA, preA didn’t run yet so put A in wait-list)
B(depends on preB, wait-list)
preA(runs)
preB(runs)

2nd iteration:

postAB(wait-list again)
A(runs)
B(runs)

3rd:

postAB(runs)

———————————-

Now, this leads to quite some processing time: N+(N-1)+(N-2)+…+1 = N(N+1)/2 attempts in a worst case scenario.

One way of mitigating this might be to sort the initial list in a way that programs with fewer dependencies go first:
preB,preA,A,B,postAB

In this case we would go from 3 iterations in the initial case to just 1. 🙂
The worst case scenario would remain the same though.
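A minimal Python sketch of the wait-list idea, written only to check the counts above. The names run_with_waitlist, by_dependency_count and deps are hypothetical, and "running" a program is simulated by just recording its name. The loop stops when a full pass makes no progress, which is how programs with no hope of having their dependencies satisfied (including circular ones) are caught.

```python
def run_with_waitlist(order, deps):
    """Try programs in order; anything with unmet dependencies waits for the next pass.

    Returns (done, unresolved, passes, attempts).
    """
    done = []
    pending = list(order)
    passes = 0
    attempts = 0
    while pending:
        passes += 1
        waitlist = []
        for name in pending:
            attempts += 1
            if all(dep in done for dep in deps.get(name, ())):
                done.append(name)      # simulated run
            else:
                waitlist.append(name)  # try again on the next pass
        if len(waitlist) == len(pending):
            # Nothing ran in this pass: the rest can never be satisfied.
            return done, waitlist, passes, attempts
        pending = waitlist
    return done, [], passes, attempts


def by_dependency_count(order, deps):
    """Heuristic from the post: programs with fewer direct dependencies go first."""
    return sorted(order, key=lambda name: len(deps.get(name, ())))


deps = {"postAB": ["A", "B"], "A": ["preA"], "B": ["preB"]}
initial = ["postAB", "A", "B", "preA", "preB"]

_, _, passes, attempts = run_with_waitlist(initial, deps)
print(passes, attempts)  # 3 passes, 9 attempts

_, _, passes, attempts = run_with_waitlist(by_dependency_count(initial, deps), deps)
print(passes, attempts)  # 1 pass, 5 attempts
```

Counting direct dependencies is only a heuristic: two programs with one dependency each can still end up in the wrong relative order when one of them depends on the other. A topological sort of the dependency graph would give an order that always finishes in a single pass when no cycle exists.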

Filed under Uncategorized

April 20, 2014 · 6:05 am

Search Engine project(thoughts)

“As of 2011, over 100,000 new WordPresses are created every day”

Well, 111 111 blogs updated daily doesn’t seem like that many anymore. User-submitted URLs ftw? Dumb crawling looks like an ill-fated idea. Also, it would gear my project towards a niche market (which isn’t bad actually, I’m not planning on tackling Bing/Google head on).

Filed under Uncategorized

April 20, 2014 · 5:53 am

get_headers

PHP get_headers is no good either. The “modified” date seems to change every time I get the headers of a blog URL, even if there was no change to the content. Anyway, I think the main problem with updating indexed entries regularly wouldn’t be so much bandwidth as processing speed, plus possible script-wise limitations like getting a PHP script to fetch several pages in a row. How many would I be able to fetch without running into the max time limit? I could try to modify that limit, but still, it sounds like my Python script would do a better job. 000webhost doesn’t support Python, which takes that off the table.

An alternative would be to drop my blog-driven search plan and stick to my initial all-encompassing search engine/analytics plan.

A pure blog indexer would require, at the very least, a smart updater that would update some blogs sooner than others. Without that, WordPress search, even lacking the ability to search multiple tags, would be preferable.

100gb, that is a lot per month. If every blog RSS required 10kb: 100gb = 100 000mb = 100 000 000kb, and 100 000 000kb / 10kb = 10 000 000 blogs updated per month.
10 000 000 / 30 = 333 333 blogs updated every day.

Hmmm… But that is actually just for checking if they should be updated.
A hefty 20kb-30kb would be needed for each blog actually updated. And that is going straight to the last post (and assuming there is just one new post), which is different from my previous py script that grabbed the last post’s link from the blog’s front page.

At 30kb per blog that comes to 111 111 blogs updated daily.

A lot actually. But there is the time requesting rss, waiting for it to arrive, writing to disk…
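The arithmetic above can be checked with a few lines of Python. This is only a sketch under the post’s own assumptions: a 100gb monthly allowance counted in decimal units (1gb = 1 000 000kb), a 30-day month, 10kb per RSS check, and 30kb (the upper end of the 20kb-30kb estimate) per blog actually updated, which is the figure the 111 111 number is consistent with. The names are made up for illustration.

```python
# Assumptions from the post: decimal units, 30-day month.
MONTHLY_BUDGET_KB = 100 * 1000 * 1000  # 100gb expressed in kb
DAYS_PER_MONTH = 30


def blogs_per_day(kb_per_blog, budget_kb=MONTHLY_BUDGET_KB, days=DAYS_PER_MONTH):
    """How many blogs fit in the daily share of the bandwidth budget."""
    per_month = budget_kb // kb_per_blog
    return per_month // days


print(blogs_per_day(10))  # checks only: 333333
print(blogs_per_day(30))  # full update at 30kb each: 111111
```

This only covers bandwidth; it says nothing about the time spent waiting on each request.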

Filed under Uncategorized

April 19, 2014 · 11:48 pm

RSS usefulness

I was writing my search engine yesterday when I realized that just having a bunch of blogs (hundreds or thousands) indexed wouldn’t make for a good website, because I would need to constantly refresh that information, and that would be difficult given my website’s limited bandwidth. So today I decided to research RSS. It does reduce the amount of data needed to check for updates somewhat, but not as much as one would think. My own blog’s front page is around 18kb, while my RSS is an 8kb page. Another blog that I like to read has a 30kb/10kb ratio. So the feeds are roughly half and 1/3 the size of the front page, respectively. Not really that much. They should make a feed that simply shows the URL and last modified date (or I should discover it, if it already exists). All in all, RSS is more useful for its standardization than for its size savings.
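A minimal sketch of what the standardization buys: pulling just the URL and date of each post out of a feed with Python’s standard library. It assumes an RSS 2.0 feed whose items carry link and pubDate elements; the sample feed and the names latest_entries and needs_update are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Invented sample feed, trimmed to the fields used below.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.wordpress.com/</link>
    <item>
      <title>Older post</title>
      <link>https://example.wordpress.com/2014/04/19/older-post/</link>
      <pubDate>Sat, 19 Apr 2014 23:48:00 +0000</pubDate>
    </item>
    <item>
      <title>Newer post</title>
      <link>https://example.wordpress.com/2014/04/20/newer-post/</link>
      <pubDate>Sun, 20 Apr 2014 06:05:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def latest_entries(rss_text):
    """Return (link, published) pairs from an RSS 2.0 string, newest first."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if link and pub:
            entries.append((link.strip(), parsedate_to_datetime(pub.strip())))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def needs_update(entries, last_seen):
    """True if the newest entry was published after the last indexed date."""
    return bool(entries) and entries[0][1] > last_seen


entries = latest_entries(SAMPLE_RSS)
last_indexed = datetime(2014, 4, 20, 0, 0, tzinfo=timezone.utc)
print(entries[0][0])
print(needs_update(entries, last_indexed))
```

The whole feed still has to be downloaded before it can be parsed, so this helps with deciding what to re-index, not with the bandwidth spent on the check itself.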
