We have indexed 143 million rows (docs) into Solr. It takes around 3 hours to index. I used the CSV update handler and index the CSV file via remote streaming.
Now, when I re-index the same CSV data, it still takes 3+ hours.
Ideally, since there are no changes in the _id values, it should finish quickly. Is there any way to speed up re-indexing?
Please help with this.
You're probably almost as efficient as you can be when it comes to actual submission of data - a possible change is to only submit the data that you know has changed due to some external factor.
Solr would have to query the index for each value anyway, then determine which fields have changed before reindexing, which would probably be more expensive than what it already does.
For that number of documents, 3 hours is quite good. You should work on reducing the number of rows submitted instead, so that the total amount of work is less than it used to be. If the CSV is sorted and rows are only appended, keep track of the last _id and only submit the rows that come after it before sending the CSV to Solr.
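If the _ids grow monotonically, that could look roughly like this; a minimal Python sketch, assuming the requests library, a local results.csv, and a hypothetical load_last_id() helper that returns the _id you saved after the previous run (the Solr URL and core name are placeholders):

import requests

last_indexed_id = load_last_id()  # hypothetical helper: the last _id persisted after the previous run

with open("results.csv") as src, open("delta.csv", "w") as out:
    out.write(next(src))  # keep the header row
    for line in src:
        # assumes _id is the first CSV column and ids only ever grow
        if int(line.split(",", 1)[0]) > last_indexed_id:
            out.write(line)

# post only the delta to the CSV update handler
with open("delta.csv", "rb") as delta:
    requests.post(
        "http://localhost:8983/solr/mycore/update?commit=true",  # core name is a placeholder
        data=delta,
        headers={"Content-Type": "text/csv"},
    )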
I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was advised to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
It takes time to return that much data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using SELECT * unless you specifically want the image column. If you are using it in places where it is not necessary, removing it can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
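For example, from application code you could fetch only the columns you actually display and load the image on demand. A minimal Python sketch, assuming pyodbc; CONN_STRING, profile_id and the column names are placeholders:

import pyodbc

conn = pyodbc.connect(CONN_STRING)  # assumption: a valid SQL Server connection string
cur = conn.cursor()

# list the columns explicitly instead of SELECT *; the image column stays out of the hot path
cur.execute("SELECT id, firstName, lastName, email FROM userProfile ORDER BY id")
profiles = cur.fetchall()

# fetch the picture only when a single profile is actually opened
cur.execute("SELECT profileImage FROM userProfile WHERE id = ?", profile_id)
image_bytes = cur.fetchone()[0]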
If it is still possible, take one step back: drop the idea of storing images in the table. Instead, save the path in the DB and the image in a folder. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use *, and why are you using ORDER BY? You can order at the UI layer instead.
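A minimal Python sketch of the path-in-DB idea (the folder, file naming and column name are assumptions, not your schema): write the uploaded bytes to disk and keep only the path in the table.

import os
import uuid

IMAGE_DIR = "/var/www/profile-images"  # assumption: a folder the app can read and write

def save_profile_image(cursor, user_id, image_bytes):
    # store the file on disk ...
    filename = f"{user_id}-{uuid.uuid4().hex}.jpg"
    path = os.path.join(IMAGE_DIR, filename)
    with open(path, "wb") as f:
        f.write(image_bytes)
    # ... and keep only the (small) path in the database
    cursor.execute("UPDATE userProfile SET imagePath = ? WHERE id = ?", (path, user_id))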
Hi, I am new to BigQuery. If I need to fetch a very large set of data, say more than 1 GB, how can I break it into smaller pieces for quicker processing? I will need to process the result and dump it into a file or Elasticsearch, and I need an efficient way to handle it. I tried the QueryRequest.setPageSize option, but it doesn't seem to work: I set 100 and it doesn't break on every 100 records. I put this line in to see how many records I get back before I turn to a new page:
result = result.getNextPage();
It displays a seemingly random number of records: sometimes 1000, sometimes 400, etc.
thanks
Not sure if this helps you, but in our project we have something that seems to be similar: we process lots of data in BigQuery and need to use the final result for later usage (it is roughly 15 GB for us when compressed).
What we did was to first save the results to a table with AllowLargeResults set to True and then export the result by compressing it into cloud storage using the Python API.
It automatically breaks the results into several files.
After that we have a Python script that downloads concurrently all files, reads through the whole thing and builds some matrices for us.
I don't quite remember how long it takes to download all the files, I think it's around 10 minutes. I'll try to confirm this one.
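Roughly, the flow with the google-cloud-bigquery Python client looks like this; a sketch, not our exact code, and the project, dataset, table and bucket names are placeholders. With standard SQL, setting a destination table is enough for large results (with legacy SQL you would also set allow_large_results).

from google.cloud import bigquery

client = bigquery.Client()
dest = bigquery.TableReference.from_string("my-project.my_dataset.big_results")

# 1) run the query into a destination table so large results are allowed
job_config = bigquery.QueryJobConfig(destination=dest)
client.query("SELECT * FROM `my-project.my_dataset.huge_table`", job_config=job_config).result()

# 2) export the table to Cloud Storage as compressed files; the wildcard makes BigQuery shard the output automatically
extract_config = bigquery.ExtractJobConfig(destination_format="CSV", compression="GZIP")
client.extract_table(dest, "gs://my-bucket/big_results-*.csv.gz", job_config=extract_config).result()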
I have about 20,000 documents stored in elastic search, at about 200kb each.
I have a search I'm running which has 733 hits in total and takes about 50ms to complete when returning 10 results.
If I set the size to 1000 so that it returns all results, the search takes 3-5 seconds to return.
Normally I would assume this is because it has to continue searching until it finds all of them, which takes extra time. However, when returning only 10 results, the search still says 733 hits in total, so it already knows which documents are to be returned!
Note that I am not returning the _source field here; all I want is the list of _ids back, so I can't imagine that it would have to read any more data from the disk, as all the _ids are surely stored in the indices anyway.
Am I missing something in the way this works?
(My _ids are guids that we use internally).
EDIT: Since posting I've re-indexed with two changes to the mapping:
Set _source to false, so now the actual documents aren't stored.
Changed the index for the field that I was searching on to be not_analyzed.
This solves the problem, now I'm getting all 733 _ids back in ~50ms. Not sure which change solved it though. I'll take one of them back out and re-index.
It will take that time because it needs to fetch all the data from ES and calculate a score for your query.
Try:
1) Set the fields that you don't search in to not_analyzed.
2) Change the store type of ES from simplefs to mmapfs (set "index.store.type: mmapfs" in elasticsearch.yml).
3) Configure as few shards as possible; the number of shards should be no more than the number of nodes you are going to use.
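For completeness, here is roughly what the two mapping changes from the question's edit look like with the Python elasticsearch client, using the old pre-5.x syntax the question implies (the index, type and field names are assumptions), plus a search that brings back only the _ids:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# old-style (pre-5.x) mapping: no stored _source, exact-match field
es.indices.create(index="docs", body={
    "mappings": {
        "doc": {
            "_source": {"enabled": False},
            "properties": {
                "guid": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
})

# ask only for the hit metadata; the _ids come back without touching _source
res = es.search(index="docs", body={
    "size": 1000,
    "_source": False,
    "query": {"term": {"guid": "some-guid"}}
})
ids = [hit["_id"] for hit in res["hits"]["hits"]]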
I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document id's; that makes 4 bytes + 4 bytes; the comparison seems to be between 0.00 and 1.00, I guess a byte would do by encoding the 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2) * 9 / 2 bytes are needed; that's about 17 TB.
Of course, that's if you have just a basic table. Since you don't plan on querying it very often, I guess performance isn't that much of an issue. So you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between its coordinates. This would "only" require about 3.6 TB, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest a hybrid approach: a table with 2 columns. The first column would hold the 'left' document id (4 bytes); the 2nd column would hold, as a varbinary, the values for all documents whose id is greater than the id in the first column. Since a varbinary only takes the space it needs, this helps win back some of the space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (incl. overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Of course the system is far from optimal. In fact, querying the information will require some patience, as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach is that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record at the end. Operations like that will be costly though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, though, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
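To make the byte-packing concrete, a small Python sketch; purely illustrative, and the helper names are made up:

def pack_row(scores):
    # scores: similarity values in [0.0, 1.0] for pairs (i, j) with j > i, in increasing j order
    return bytes(round(s * 100) for s in scores)

def get_score(row_blob, i, j):
    # row_blob is the packed value column for document i; pair (i, j) must have j > i
    return row_blob[j - i - 1] / 100.0

# record for document 1 covers documents 2, 3, 4, ...
blob = pack_row([0.35, 0.42, 0.99])
print(get_score(blob, 1, 3))   # 0.42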
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save roughly another 250 GB, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)
Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be the 10-digit number) is present in the dataset.
The approach I have planned:
1) Parse the dataset and put it in a DB (to be done at the start of the week) like MySQL or Postgres. The reason I want an RDBMS in the first step is that I want full time-series data.
2) Then generate some kind of key-value store out of this database with the latest valid data, which supports checking whether each item is present in the dataset (thinking of some kind of NoSQL db here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3) Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast, and will be using this functionality as the back-end to a REST API.
Sidenote: My language of preference is Python.
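Since Python is the preference, here is a minimal sketch of step 2): building the read-only lookup set in Redis with redis-py. The input file name, field layout and key name are assumptions.

import redis

r = redis.Redis()

# load the latest valid numbers into one big set, pipelined in batches
pipe = r.pipeline(transaction=False)
with open("latest_numbers.txt") as f:
    for count, line in enumerate(f, 1):
        phone = line.split()[0]          # assumes the 10-digit number is the first field
        pipe.sadd("phone:numbers", phone)
        if count % 10000 == 0:
            pipe.execute()
pipe.execute()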
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis SINTER, which performs set intersection (see the sketch at the end of this answer).
You might benefit from a grid structure, distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones; you have to experiment). With a good hash, this would reduce the size per node to around 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
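The SINTER idea from the first point could look roughly like this with redis-py; the key names are assumptions and match the "phone:numbers" set built in the question's step 2:

import redis

r = redis.Redis()

def check_batch(numbers):
    # put the numbers to test in a temporary set, intersect with the main set, clean up
    r.sadd("tmp:query", *numbers)
    present = r.sinter("tmp:query", "phone:numbers")
    r.delete("tmp:query")
    return {n.decode() for n in present}

hits = check_batch(["5551234567", "5559876543"])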