I am using apache manifoldcf open source project for indexing documents from Google Drive into my solr. Often I have seen it is quite inconsistent in indexing the data. Also it takes time to reflect even small number of documents in solr . Do you really think its a good option to index Google Drive using it?
It is currently bit on slow side, due to response time and throttling constraints from google drive itself. But this limit can probably relieved if you buy additional bandwidth from google. With current setup if you are looking to index a large set of documents in google drive it may not be quick as you may expect
Manifold CF is good for crawling through file-system. You can go for Apache Nutch if you are interested in web crawling.
Yes ManifoldCF does take a lot of time to reflect a small number of document. Also it has very less documentation. Although, you can join the mailing list where you can ask questions to the lead developer "Karl". He is very helpful and usually answers withing a few hours.
P.S. :I have worked using ManifoldCF over a project for a span of 10 months.
Related
I am making a little personal project.
Ideally I would like to be able to make programmatically a google search and have the count of results. (My goal is to compare the results count between a lot (100000+) of different phrases).
Is there a free way to make a web search and compare the popularity of different texts, by using Google Bing or whatever (the source is not really important).
I tried Google but seems that freely I can do only 10 requests per day.
Bing is more permissive (5000 free requests per month).
Is there other tools or way to have a count of number of results for a particular sentence freely ?
Thanks in advance.
There are several things you're going to need if you're seeking to create a simple search engine.
First of all you should read and understand where the field of information retrieval started with G. Salton's paper or at least read the wiki page on the vector space model. It will require you learning at least some undergraduate linear algebra. I suggest Gilbert Strang's MIT video lectures for this.
You can then move to the Brin/Page Pagerank paper which outlays the original concept behind the hyperlink matrix and quickly calculating eigenvectors for ranking or read the wiki page.
You may also be interested in looking at the code for Apache Lucene
To get into contemporary search algorithm techniques you need calculus and regression analysis to learn machine learning and deep learning as the current google search has moved away from Pagerank and utilizes these. This is partially due to how link farming enabled people to artificially engineer search results and the huge amount of meta data that modern browsers and web servers allow to be collected.
EDIT:
For the webcrawler only portion I'd recommend WebSPHINX. I used this in my senior research in college in conjunction with Lucene.
I've a Lucene based application and obviously a problem.
When the number of indexed document is low no problems appear. When then number of documents increase, seems that no single word are indexing. What we obtain is that searching with single word (single term) is an empty set.
The version of Lucene is 3.1 on 64 bit machine and the index is 10GB.
Do you have any idea?
Thanks
According to the Lucene documentation, Lucene should be able to handle 274 billion distinct terms. I don't believe it is possible that you have reached that limitation in a 10GB index.
Without more information, it is difficult to help further. However, being that you only see problems with large numbers of documents, I suspect you are running into exceptional conditions of some form, causing the system to fail to read or respond correctly. File Handle leaks or Memory Overflow perhaps, to take a stab in the dark.
Does anyone have some idea as to how come questions posted here on SO are showing up so quickly on Google?.
Sometimes questions submitted are appearing as the first 10 entries or so - on the first page within 30 minutes of submitting a question. Pray tell, what sort of magic is being wielded here?
Anybody have some ideas, suggestions?. My first thought is that they have info in their sitemap that tells google robots to trawl every N minutes or so - is that whats going on?
BTW, I am aware that simply instructing Googlebots to scan your site every N minutes will not work if you dont have quality information (that is constantly being updated on your site).
I'd just like to know if there is something else that SO may be doing right (apart from the marvelous content of course)
To put it simply, more popular websites with more quality content and more frequent changes are ranked higher with Google's algorithm, and are indexed and cached more frequently than sites that are less popular or change less frequently.
Broadly speaking, it's only content that does it. The size and quality of content has reached Google's threshold for "spider as fast as the site will permit". SO has to actively throttle the Googlebot; Jeff has said on Coding Horror that they were getting more then 50,000 requests per day from Google, and that was over a year ago.
If you scan through non-news sites from the Alexa top 500 you will find virtually all of those have results in Google that are just minutes old. (i.e. type site:archive.org into Google and choose "Latest" in the menu on the left)
So there's nothing practical you can do to your own site to speed up spidering, except to increase the amount of traffic to your site...
It is really simple.
SO is a PageRank 6 site that gives the world new information.
Google has a strong bias on new information. It will crawl the site many times a day and it will immediately add the pages to its index. It will favor a page (top 10) to say a specific query for a small period of time (a few days) and then it will stop favoring that page and rank it as normal.
This is standard G procedure and it happens with many many sites.
As you might guess, grayhat/blackhat seo uses that fact in many ways.
Also helped by SO providing an RSS feed, I think google likes feeds from reliable sources.
I am looking for a project idea in distributed processing on Unix based systems. I wish to use only the C programming language. I have to finish the project in 4 months and it's a part of my course work. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you were coming out with all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program which you can send new AIs to and then after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems which can be broken into smaller chunks.
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out mapreduce for ideas! I was really motivated by this work, personally.
We used distributed processing here at work, but it's such a broad field..
Yeah.
Why not write a distributed compiler. You may then present an interface for people to compile things on the fly, and it will be passed to your distribute compilenet. Java is probably well-suited, and you'll get to do fun things, like be very mindful of security and so on.
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web,
look into B-Trees.
B-Trees and offspring/variants are the working horse of the internet.
Google uses them extensively to index the web.
Database indexes/indices are B-Tree offspring/variants.
Every LAMP system uses a database and indexes/indices.
Also, they are used extensively in distributed VLDB (Very Large DataBases)
Perhaps you can improve existing distributed databases (Cassandra and HBase)
These are lofty goals, but for me, this would leave a lasting mark
in the way Web data is processed, indexed and stored.
Write a distributed, fault tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++.
It's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree
The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.
full pdf content is in the amazon cloud.
while there are > 600k papers on arXiv the total size of the pdf is < 1/2 TB
http://arxiv.org/help/bulk_data_s3
T.
arXiv recommends squid in httpd accelerator mode for precisely this purpose. Any particular reason why this is not good enough?
My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.