Elasticsearch much slower when returning all results - indexing

I have about 20,000 documents stored in Elasticsearch, at about 200 KB each.
I'm running a search that has 733 hits in total; it takes about 50 ms to complete when returning 10 results.
If I set the size to 1000 so that it returns all results, the search takes 3-5 seconds to return.
Normally I would assume that this is because it has to continue searching until it finds all of them, which takes extra time. However, when returning only 10 results, the search still reports 733 hits in total, so it already knows which documents are to be returned!
Note that I am not returning the _source field here; all I want is the list of _ids back, so I can't imagine that it would have to read any more data from disk, as all the _ids are surely stored in the indices anyway.
Am I missing something in the way this works?
(My _ids are guids that we use internally).
EDIT: Since posting I've re-indexed with two changes to the mapping:
Set _source to false, so now the actual documents aren't stored.
Changed the index for the field that I was searching on to be not_analyzed.
This solved the problem: I'm now getting all 733 _ids back in ~50 ms. I'm not sure which change fixed it, though, so I'll take one of them back out and re-index.
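For anyone wanting to try the same two changes, here is a rough sketch of what such a mapping might look like, written as a Python dict that serializes to the JSON body. It assumes the older mapping syntax that still has "string" fields and "not_analyzed"; the type name "document" and the field name "category" are made up for illustration and are not from my real index.

```python
import json


def build_ids_only_mapping(doc_type, field):
    """Build a mapping that disables _source and leaves one field un-analyzed.

    doc_type and field are hypothetical names supplied by the caller.
    """
    return {
        "mappings": {
            doc_type: {
                # Change 1: do not store the original document body.
                "_source": {"enabled": False},
                "properties": {
                    # Change 2: index the searched field as a single exact term.
                    field: {"type": "string", "index": "not_analyzed"}
                },
            }
        }
    }


mapping = build_ids_only_mapping("document", "category")
print(json.dumps(mapping, indent=2))
```

Dropping one of the two settings from the dict and re-indexing would be one way to find out which change made the difference.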

It will take that time, because it needs to fetch all the data from ES and calculate the score for your query.
Try:
1) Set the fields that you don't search in to not_analyzed.
2) Change the ES store type from simplefs to mmapfs (set "index.store.type: mmapfs" in elasticsearch.yml).
3) Configure as few shards as possible. The number of shards should match the number of nodes you are going to use.

Related

How to get a single item from my Amazon Dynamodb Table

My .Net code below is always returning a search.Matches.Count of 0 even though the movie is in the table. I've literally searched the whole internet but have not been able to get an answer, even on Amazon's AWS Developer website.
Please let me know what I am doing wrong. I appreciate your help. I'm totally new to this.
client = New AmazonDynamoDBClient(config)
table = Table.LoadTable(client, "MovieTable")
scanFilter = New ScanFilter
With scanFilter
.AddCondition("KeyCode", ScanOperator.NotEqual, MovieName)
.AddCondition("Status", ScanOperator.Equal, "In")
End With
search = table.Scan(scanFilter)
If search.Matches.Count = 1 then getMovieName
As the documentation explains, "Scan", a function which is supposed to go through the entire database, cannot go through the entire database at one fell swoop. Instead, it goes through it 1MB at a time, and after 1MB of data it returns to the caller, and you're supposed to ask to continue in the next page (again, see the documentation on how).
In your case, you have a very specific filter which matches only one item, but still - Scan will return after having read 1MB of data, even if none of the items in this 1MB match your request. It doesn't wait until 1MB of results have been collected! So in your use case it is not surprising that you're getting an empty result set, with LastEvaluatedKey set signalling that there are more pages to read.
By the way, in your use case, where you are looking for just one item, doing a Scan of the entire database is obviously not a great choice (unless you're only doing this for debugging). A GetItem or Query operation will make more sense, if you can use one, and a secondary index may be useful if you're searching by attributes that are not in the key.
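To illustrate the paging behaviour described above, here is a minimal sketch in Python (not the question's VB.NET). It only assumes the low-level response shape, where a page carries "Items" and, if more data remains, "LastEvaluatedKey", which is passed back as "ExclusiveStartKey". The helper scan_all and the stand-in fake_scan are hypothetical; a real client's scan call would take the place of fake_scan.

```python
def scan_all(scan_page, **scan_kwargs):
    """Call scan_page repeatedly until a page arrives without LastEvaluatedKey."""
    items = []
    start_key = None
    while True:
        kwargs = dict(scan_kwargs)
        if start_key is not None:
            # Resume where the previous page stopped.
            kwargs["ExclusiveStartKey"] = start_key
        page = scan_page(**kwargs)
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return items


# Stand-in for a real table: the first two pages contain no matches at all.
FAKE_PAGES = [
    {"Items": [], "LastEvaluatedKey": {"KeyCode": "page-1-end"}},
    {"Items": [], "LastEvaluatedKey": {"KeyCode": "page-2-end"}},
    {"Items": [{"KeyCode": "Jaws", "Status": "In"}]},
]


def fake_scan(**kwargs):
    """Return the page that follows the given ExclusiveStartKey."""
    start = kwargs.get("ExclusiveStartKey")
    if start is None:
        return FAKE_PAGES[0]
    if start == {"KeyCode": "page-1-end"}:
        return FAKE_PAGES[1]
    return FAKE_PAGES[2]


first_page = fake_scan(TableName="MovieTable")
matches = scan_all(fake_scan, TableName="MovieTable")
print(len(first_page["Items"]), len(matches))
```

Looking only at the first page gives zero matches here, even though the item exists; following the key through all pages finds it.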

Solr Re-indexing taking time

We have indexed 143 million rows (docs) into Solr. It takes around 3 hours to index. I used csvUpdateHandler and indexed the CSV file by remote streaming.
Now, when I re-index the same CSV data, it still takes 3+ hours.
Ideally, since there are no changes in the _id values, it should have finished quickly. Is there any way to speed up re-indexing?
Please help with this.
You're probably almost as efficient as you can be when it comes to actual submission of data - a possible change is to only submit the data that you know has changed due to some external factor.
Solr would have to query the index for each value anyway, then determine which fields have changed before reindexing, which would probably be more expensive than it already is.
For that number of documents, 3 hours is quite good. You should work on reducing the number of rows submitted instead, so that the total amount of work is less than what it used to be. If the CSV is sorted and rows are only appended, keep track of the last _id you submitted and, before sending the CSV to Solr, cut it down to the rows that come after that _id.
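A minimal sketch of that filtering step, assuming an append-only CSV with a header row and an "_id" column; the function name rows_after and the sample data are made up for illustration. Note that if the remembered _id is not found in the file, this version yields nothing, so a real script would want to treat that case as an error.

```python
import csv
import io


def rows_after(csv_file, last_id, id_column="_id"):
    """Yield the rows that appear after the row whose id equals last_id.

    Assumes rows are only ever appended, so everything up to and including
    last_id has already been indexed. With last_id=None every row is yielded.
    """
    reader = csv.DictReader(csv_file)
    seen_last = last_id is None
    for row in reader:
        if seen_last:
            yield row
        elif row[id_column] == last_id:
            # Everything after this row is new.
            seen_last = True


sample = "_id,title\n1,a\n2,b\n3,c\n4,d\n"
new_rows = list(rows_after(io.StringIO(sample), last_id="2"))
all_rows = list(rows_after(io.StringIO(sample), last_id=None))
print([row["_id"] for row in new_rows])
```

The filtered rows can then be written to a smaller CSV and submitted in place of the full file.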

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document ids; that makes 4 bytes + 4 bytes. The similarity seems to be between 0.00 and 1.00, so I guess a byte would do, encoding 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2) * 9 / 2 bytes are needed, that's about 17 TB.
Of course that's if you have just a basic table. Since you don't plan on querying it very often I guess performance isn't that much of an issue. So you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between its coordinates. This would "only" require about 3.6 TiB (4 * 10^12 bytes), but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest to use a hybrid approach, a table with 2 columns. First column would hold the 'left' document-id (4 bytes), 2nd column would hold a string of all values of documents starting with an id above the id in the first column using a varbinary. Since a varbinary only takes the space that it needs, this helps us win back some space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (incl. overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
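To make the layout a bit more concrete, here is a small in-memory sketch in Python. It is only an illustration of the addressing, not a storage engine: the class name TriangularStore and its methods are made up, and a plain dict of bytearrays stands in for the table with the varbinary column.

```python
class TriangularStore:
    """Row i holds one byte per document j > i, scores encoded as 0..100."""

    def __init__(self, n_docs):
        self.n_docs = n_docs
        # Row for document i (1-based ids) has n_docs - i bytes.
        self.rows = {i: bytearray(n_docs - i) for i in range(1, n_docs + 1)}

    def _locate(self, a, b):
        """Return (row id, byte position) for the unordered pair (a, b)."""
        if a == b:
            raise ValueError("a document has no stored similarity with itself")
        i, j = (a, b) if a < b else (b, a)
        return i, j - i - 1

    def put(self, a, b, score):
        i, pos = self._locate(a, b)
        self.rows[i][pos] = round(score * 100)

    def get(self, a, b):
        i, pos = self._locate(a, b)
        return self.rows[i][pos] / 100


def total_bytes(n_docs):
    """Payload size of all rows together: one byte per unordered pair."""
    return n_docs * (n_docs - 1) // 2


store = TriangularStore(5)
store.put(1, 3, 0.42)
print(store.get(3, 1), total_bytes(2_000_000))
```

For 2 million documents the payload comes to 1,999,999,000,000 bytes, which is where the figure of about 2 TB comes from; the lookup works in either argument order because the pair is sorted before addressing.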
Of course the system is far from optimal. In fact, querying the information will require some patience as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record at the end. Operations like that will be costly though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, though, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save another 250 GB or so (one eighth of the ~2 TB), but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
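For what it's worth, the 8-values-in-7-bytes trick could look roughly like this in Python. This is just one possible bit layout (most significant bits first); the function names pack7 and unpack7 are made up for the sketch.

```python
def pack7(values):
    """Pack integers in 0..127 into bytes, using 7 bits per value."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for v in values:
        if not 0 <= v <= 127:
            raise ValueError("value does not fit in 7 bits")
        acc = (acc << 7) | v
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:
        # Pad the final partial byte with zero bits on the right.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack7(data, count):
    """Recover count 7-bit values from bytes produced by pack7."""
    acc = 0
    nbits = 0
    values = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7 and len(values) < count:
            nbits -= 7
            values.append((acc >> nbits) & 0x7F)
            acc &= (1 << nbits) - 1
    return values


scores = [0, 100, 35, 42, 99, 4, 45, 38]
packed = pack7(scores)
print(len(packed), unpack7(packed, len(scores)))
```

The count has to be stored or known separately, because padding bits in the last byte are otherwise indistinguishable from a trailing zero value.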
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

Youtube API problem - when searching for playlists, start-index does not work past 100

I have been trying to get the full list of playlists matching a certain keyword. I have discovered however that using start-index past 100 brings the same set of results as using start-index=1. It does not matter what the max-results parameter is - still the same results. The total results returned however is way above 100, thus it cannot be that the query returned only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example - the queries bring the same result set, whether you use start-index=1, or start-index=101, or start-index = 201 etc:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo
I made an interface for my site, and the way I avoided this problem is to do a query for a large number, then store the results. Let your web page then break up the results and present them however is needed.
For example, if someone wants to do a search of over 100 videos, do the search and collect the results, but only present them with the first group, say 10. Then when the person wants to see the next ten, you get them from the list you stored, rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
Hope this makes sense and helps.
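A rough sketch of the idea in Python, in case it helps. The fetch function is a placeholder for whatever single large query you run against the API; CachedSearch and paginate are made-up names for this illustration.

```python
def paginate(results, page_size):
    """Split an already-fetched result list into pages."""
    return [results[i:i + page_size] for i in range(0, len(results), page_size)]


class CachedSearch:
    """Run each query once, then serve pages from the stored results."""

    def __init__(self, fetch, page_size=10):
        self._fetch = fetch          # callable: query -> list of results
        self._page_size = page_size
        self._cache = {}

    def page(self, query, number):
        """Return page `number` (0-based), or an empty list past the end."""
        if query not in self._cache:
            self._cache[query] = paginate(self._fetch(query), self._page_size)
        pages = self._cache[query]
        return pages[number] if 0 <= number < len(pages) else []


calls = []


def fake_fetch(query):
    """Placeholder for the real remote query; records how often it runs."""
    calls.append(query)
    return [f"{query}-{n}" for n in range(25)]


search = CachedSearch(fake_fetch, page_size=10)
first = search.page("playlists", 0)
third = search.page("playlists", 2)
print(len(first), len(third), len(calls))
```

Both page requests are served from one remote call; only a new search term triggers another query.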

What index would speed up my XQuery in X-Hive / Documentum xDB?

I have approx 2500 documents in my test database and searching the xpath /path/to/@attribute takes approximately 2.4 seconds. Doing distinct-values(/path/to/@attribute) takes 3.0 seconds.
I've been able to speed up queries on /path/to[@attribute='value'] to hundreds or tens of milliseconds by adding a Path value index on /path/to[@attribute<STRING>] but no index I can think of gets picked up for the more general query.
Anybody know what indexes I should be using?
The index you propose is the correct one (/path/to[@attribute]), but unfortunately the xDB optimizer currently doesn't recognize this specific case, since the 'target node' stored in the index is always an element and not an attribute. If /path/to/@attribute has few results then you can optimize this by slightly modifying your query to this: distinct-values(/path/to[@attribute]/@attribute). With this query the optimizer recognizes that there is an index it can use to get to the 'to' element, but it still has to access the target document to retrieve the attribute for the @attribute step. This is precisely why it will only benefit cases where there are few hits: each hit will likely access a different data page.
What you also can do is access the keys in the index directly through the API: XhiveIndexIf.getKeys(). This will be very fast, but clearly this is not very user friendly (and should be done by the optimizer instead).
Clearly the optimizer could handle this. I will add it to the bug tracker.