For testing purposes I want to run a small fetch-generate-parse-index cycle. I currently have a few thousand records in my database, and when I run the index step it starts indexing a very large number of records, which takes hours.
Do you have any tips on how to test this use case? Is there a way to limit the number of pages generated in one batch?
In general, I'd appreciate any advice on testing Nutch 2.2.1 with HBase and Solr 4.
It shouldn't take ages for a thousand records in the database; have you enabled multithreading?
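As for limiting the batch size: if I remember the Nutch 2.x tooling correctly (please verify the exact flags against your 2.2.1 build's usage output), a small test cycle can be capped with -topN on the generate step and sped up with more fetcher threads, roughly like this:
    # Rough sketch of a small test cycle; flag names are from memory, check "bin/nutch <command>" usage on 2.2.1.
    bin/nutch inject urls/seed.txt
    bin/nutch generate -topN 100              # cap the batch at ~100 pages
    bin/nutch fetch -all -threads 10          # fetch with multiple threads
    bin/nutch parse -all
    bin/nutch updatedb
    bin/nutch solrindex http://localhost:8983/solr/ -all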
I am indexing more than 1,500,000 items from MySQL with Apache Solr 5.4.1. When I check the Solr Admin page each day I find there are more than 5,000 deleted items that should be optimized away, so I click Optimize and everything is fine again.
Is there a simple URL I can put in the crontab to automate the optimization of the indexes in Apache Solr 5.4.1?
Thank you.
Example from UpdateXMLMessages:
This example will cause the index to be optimized down to at most 10
segments, but won't wait around until it's done (waitFlush=false):
curl 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false'
.. but in general, you don't have to optimize very often. It might not be worth the time spent doing the actual optimize and the extra disk activity. If you're re-indexing the index completely each time as well, indexing to a fresh collection and then swapping the collections afterwards is also a possible solution.
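If you still want to schedule it, a crontab entry wrapping that same URL is enough; this is just a sketch, so adjust the core name and schedule to your setup (here it runs nightly at 02:00):
    # Nightly optimize at 02:00; "core" is a placeholder for your core name.
    0 2 * * * curl -s 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false' > /dev/null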
I'm beginning a new project using CakePHP. I like the "auto-magic" features and I think it's a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day, hundreds of thousands of database writes and reads a day, and about 50,000 to 500,000 users, often with 3,000 using the site concurrently. I'm making heavy use of stored procedures to offset this, and I'm running several servers including a load balancer.
I'm wondering about the computational cost of some of the auto-magic and how well Cake copes with session requests that make many DB hits. Has anyone had success running Cake on a single server array with this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks with this framework and I'm interested in advice from anyone who has worked with Cake in production. I've researched, but I would love a second opinion. Thank you for your time.
This is not a problem, but the optimization is up to you.
There are different cache back-ends you can use: Memcache, Redis, full-page caching... All of that is already supported by Cake. What you cache, and where, is up to you.
For searching you could try Elasticsearch to speed things up.
There are dispatcher filters that run before the controller is even instantiated, so you can bypass it entirely in special cases (check the asset filter for an example).
Use nginx, not Apache.
Also, I would not start by over-optimizing and over-thinking this before any code is written. Start well and think about caching, but analyse and fix bottlenecks as you actually come across them. Otherwise you'll waste a lot of time on premature optimization before you've written anything that works.
Cake itself is very fast. Just to prove the bullshit factor of the fancy benchmarks some frameworks publish, we did one ourselves using a dispatcher filter to "optimize" things and even beat Yii, which seems pretty eager to show how fast it is. But benchmarks are pointless, especially in a huge project where so much human-made failure can be introduced.
I know there have been some semi-similar questions, but in this case I am building an index that stays offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and another has about 6.6 million records, also with full text (this is the page index).
Given that this index is being built offline, the only real performance concern is build speed. No one should be querying this data.
The auto-commit would apparently fire if I stopped adding items for 50 seconds, which I never do: I am adding ten at a time, every couple of seconds.
So, should I commit more often? I feel like the longer this runs the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say I am using Solr 3.1 and SolrNet.
Although it's the commits that are taking the time for you, you might want to consider tweaks other than commit frequency.
Is it the indexing core that also does the searching, or is the index replicated somewhere else after indexing concludes? If the latter, turning off caches might have a very noticeable impact on performance (Solr rebuilds its caches every time you commit).
You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is specified as part of the document add command. I believe this is supported by SolrNet - please see the "Using the commitWithin attribute" thread for more details.
autoCommit is a Solr configuration value added to the update handler section.
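For reference, a minimal sketch of what that section of solrconfig.xml can look like; the thresholds are placeholders, not recommendations:
    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- commit automatically once either threshold is reached -->
      <autoCommit>
        <maxDocs>10000</maxDocs>   <!-- after this many uncommitted documents -->
        <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
      </autoCommit>
    </updateHandler>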
I was researching CMSs to use and ran into a review of vBulletin 4.0, which uses about 200 queries on one page load.
I was then worried.
Further research brought me to other sites to see how many queries they use, and I found that some forum software, such as Invision Power Board and PHPBB, uses as few as 6 or 8.
Currently, my site uses about 25 to 40 queries.
Should I be worried?
Don't be worried about the number of queries.
Be worried about:
Pages loading too slowly
The SQL being too complicated to maintain.
Clarification:
SQL being too complicated can come from either too many queries OR a few queries that are very complicated (lots of joins and subqueries, etc.).
If you aim for something, aim for 3 reads and 1 write per HTTP hit.
While these are somewhat arbitrary numbers (they are actually taken from Advanced PHP Programming), they emphasize two ideas:
the number of SQL round trips per HTTP call should be low, certainly under 10
there is a difference between reads and writes, and the ratio should favour reads; writes create contention
Also remember that not all reads are equal: the 3 reads should be highly optimized reads, not full table scans with 4-5 outer joins...
It depends. The more you hit the db, the more load you have. If you need to display values from several different tables, you will probably need to run several queries. If you only have a couple of users and you know you're not going to have lots of data, it probably doesn't matter.
Some things to consider:
Are you running the same query multiple times per page load? If you can reuse the result, do it.
Are you running a query per result of another query? If so, let the DB do the join so you only do one pull (see the sketch after this list).
If your page is slow from hitting the db too much, look at memcached.
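As a small illustration of the query-per-result point above, here is a hedged sketch in Java/JDBC with a hypothetical users/posts schema, replacing a COUNT query per user with one joined query:
    // Hypothetical schema (users, posts); one joined query instead of 1 + N COUNT queries.
    import java.sql.*;
    import java.util.*;

    public class OneQueryInsteadOfN {
        public static Map<String, Long> postCounts(Connection con) throws SQLException {
            Map<String, Long> counts = new LinkedHashMap<>();
            String sql =
                "SELECT u.name, COUNT(p.id) AS post_count " +
                "FROM users u LEFT JOIN posts p ON p.user_id = u.id " +
                "GROUP BY u.name";
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    counts.put(rs.getString("name"), rs.getLong("post_count"));
                }
            }
            return counts;  // a single round trip, however many users there are
        }
    }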
You might try re-factoring your code over time to decrease the number of round-trips to SQL Server. One way to do this could be to utilize caching. For example, data you need frequently can be loaded when the application is started, then grabbed from the cache when it is needed.
Another approach could be to de-normalize your data into tables that are specifically designed to give you the data your site needs in a fewer number of queries.
Also consider whether some of those queries (those you use to populate lookup values, for instance) can be cached. That way, if the same query is called on multiple pages, or each time you move from one group of records to another, the database isn't hit again to run exactly the same query. I remember one time we were trying to determine why the site was so slow when the stored proc that was running was very fast, and we found using Profiler that it was being sent over and over and over again when it didn't need to be.
You can cache all those queries with vbulletin. If you look at pbnation.com they have over a million visitors a day and only around 3-4 queries per page load. Everything else is cached in memcached.
I am trying to create a Lucene index of around 2 million records. The indexing time is around 9 hours.
Could you please suggest how to increase performance?
I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
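As a rough illustration of that chunk-and-merge idea (this is not the code from the post mentioned above, just a sketch against a recent Lucene API, where fetchChunks is a placeholder for your own data access):
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;
    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelIndexer {

        // Index one chunk of records into its own directory; each call runs on a worker thread.
        static Directory indexChunk(List<String> records, String path) throws Exception {
            Directory dir = FSDirectory.open(Paths.get(path));
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String text : records) {
                    Document doc = new Document();
                    doc.add(new TextField("body", text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
            return dir;
        }

        public static void main(String[] args) throws Exception {
            List<List<String>> chunks = fetchChunks();   // placeholder: split your source data into pieces
            ExecutorService pool = Executors.newFixedThreadPool(4);

            List<Future<Directory>> futures = new ArrayList<>();
            for (int i = 0; i < chunks.size(); i++) {
                final int n = i;
                futures.add(pool.submit(() -> indexChunk(chunks.get(n), "chunk-index-" + n)));
            }

            // Merge the per-chunk indexes into one final index.
            try (IndexWriter merged = new IndexWriter(
                    FSDirectory.open(Paths.get("final-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                for (Future<Directory> f : futures) {
                    merged.addIndexes(f.get());
                }
            }
            pool.shutdown();
        }

        // Placeholder for whatever produces your records in chunks.
        static List<List<String>> fetchChunks() { return new ArrayList<>(); }
    }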
With the approach described above, I'm able to index 4+ million records in approx. 2 hours.
Hope this gives you an idea of where to go from here.
Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from a database. Sometimes you find that the query that fetches this data is too complicated and too slow to return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that alone still takes on the order of 5-9 hours, you've found your place to optimize (the SQL).
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html
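On recent Lucene versions the mergeFactor knob mentioned above has moved off IndexWriter and onto the merge policy, with a separate RAM buffer setting controlling how much is held in memory before a flush; a hedged sketch (the values are illustrative only):
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogDocMergePolicy;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class TunedWriter {
        // Open an IndexWriter tuned for bulk indexing (illustrative values only).
        public static IndexWriter open(String path) throws Exception {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            cfg.setRAMBufferSizeMB(256);       // buffer more documents in RAM before flushing a segment
            LogDocMergePolicy mp = new LogDocMergePolicy();
            mp.setMergeFactor(30);             // merge segments less aggressively while bulk loading
            cfg.setMergePolicy(mp);
            return new IndexWriter(FSDirectory.open(Paths.get(path)), cfg);
        }
    }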