Solr indexing performance with a large amount of data

I have an index of 10 million documents, and per a new requirement I need to add an extra field to my Solr schema.
So my question is: what would be the ideal approach? Simply add the field to the schema and re-index all the data? Do some kind of partial indexing? Or something else?

The data has to be submitted regardless of what you do, so the answer would probably depend on the size of the rest of your fields. If they are small, just add the field and reindex everything.
If there's a very large field present, it might be more efficient to let Solr fetch the existing content internally (i.e. a partial/atomic update) instead of submitting it over the network again, but that requires that all your fields are already defined as stored or have docValues.
It's impossible to say for certain, so run a test with a small number of documents to see exactly how performance looks with your dataset.
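
If you go the atomic (partial) update route, a minimal SolrJ sketch could look like the one below. This is only a sketch: the Solr URL, collection name, field name, and document id are placeholders, and the client class may differ with your SolrJ version. The Map with a "set" key is what tells Solr to update just that field instead of replacing the whole document.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.Map;

    public class AtomicUpdateExample {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr base URL and collection name.
            try (SolrClient client = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-42");                      // existing unique key of the document
                doc.addField("new_field", Map.of("set", "value")); // "set" makes this an atomic update
                client.add("my_collection", doc);
                client.commit("my_collection");
            }
        }
    }

Remember that atomic updates only work when every field is stored or has docValues, as noted above; otherwise the non-stored fields are lost and a full reindex is the safer choice.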

Related

The efficiency of putting big data into a node on AgensGraph

As far as I know, a wide-column approach is not applicable here.
But is there a difference in efficiency when putting big data into a node?
I'd like to add an index to distinguish the values, and I want to know how efficient that would be.
"There are few efficiency considerations when putting big data into nodes (accurately property). Search filtering may become slower due to search scope and size increases. Changes may result in overhead such as wal log.
Although it's difficult to determine how much big data you have, but I think you should save it as a file and save its description as property type. Information saved in the property can reduce access expense by the creation of a seperate property index. "
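
As a rough illustration of that suggestion, the sketch below keeps the big payload on disk and stores only its path and description as vertex properties, then indexes the small property. Everything here is an assumption: the JDBC URL, graph setup, label, and property names are made up, and the CREATE PROPERTY INDEX statement may differ between AgensGraph versions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RecordingNodeSketch {
        public static void main(String[] args) throws Exception {
            // Assumed AgensGraph JDBC URL and credentials; the graph and vertex label
            // are assumed to have been created and graph_path to be set already.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:agensgraph://localhost:5432/mydb", "user", "password");
                 Statement stmt = conn.createStatement()) {

                // Keep the big data in a file; store only the path and a short description.
                stmt.execute("CREATE (:recording {path: '/data/blobs/rec-0001.bin', "
                           + "description: 'raw capture 2020-01-01'})");

                // Index the small, searchable property instead of the big data itself.
                stmt.execute("CREATE PROPERTY INDEX ON recording (description)");
            }
        }
    }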

Is it a good idea to create tables dynamically to store user-content?

I'm currently designing an application where users can create/join groups, and then post content within a group. I'm trying to figure out how best to store this content in a RDBMS.
Option 1: Create a single table for all user content. One of the columns in this table will be the groupID, designating which group the content was posted in. Create an index using the groupID, to enable fast searching of content within a specific group. All content reads/writes will hit this single table.
Option 2: Whenever a user creates a new group, we dynamically create a new table. Something like group_content_{groupName}. All content reads/writes will be routed to the group-specific dynamically created table.
Pros for Option 1:
It's easier to search for content across multiple groups, using a single simple query, operating on a single table.
It's easier to construct simple cross-table queries, since the content table is static and well-defined.
It's easier to implement schema changes and changes to indexing/triggers etc, since there's only one table to maintain.
Pros for Option 2:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
What are the general recommendations between the above 2 options, from performance/development/maintenance perspectives?
One of the cardinal sins in computing is optimizing too early. It is the opinion of this DBA of 20+ years that you're overestimating the I/O that's going to happen to these groups. RDBMSs are very good at querying and writing this type of information within a standard set of tables. Worst case, you can partition them later. You'll have far more search capability and ease of management with one set of tables instead of a set per group.
Imagine the schema needs to change: do you really want to update hundreds or thousands of tables, or write some long script to fix a mundane issue? Stick with a single set of tables and ignore sharding. Instead, think "maybe we'll partition the tables someday, if necessary".
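
As a rough sketch of that single-table layout (the JDBC URL, table, and column names are made up, and the DDL uses PostgreSQL types), one content table holds every group's posts, and an index on the group column keeps per-group reads fast:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SingleContentTable {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL; any mainstream RDBMS supports the same pattern.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "app", "secret");
                 Statement stmt = conn.createStatement()) {

                // One table for all groups' content; group_id tells the groups apart.
                stmt.execute("CREATE TABLE group_content ("
                           + "  content_id BIGSERIAL PRIMARY KEY,"
                           + "  group_id   BIGINT NOT NULL,"
                           + "  author_id  BIGINT NOT NULL,"
                           + "  body       TEXT NOT NULL,"
                           + "  created_at TIMESTAMPTZ NOT NULL DEFAULT now())");

                // The index that makes lookups within a specific group fast.
                stmt.execute("CREATE INDEX idx_group_content_group "
                           + "ON group_content (group_id, created_at)");
            }
        }
    }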
It is a no-brainer: option (1) is the way to go.
You list the following as optimizations for the second method. All of them are misconceptions; see the comments below each point.
"All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)"
Reads and writes can just as easily be distributed within a table. The only issue would be write conflicts within a page. That is probably a pretty minor consideration, unless you are dealing with more than dozens of transactions per second.
Because of the next item (partially filled pages), you are actually much better off with a single table and pages that are mostly filled.
"Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc"
Smaller tables can be a performance disaster. Tables are stored on data pages, and each small table ends up with its own partially filled last page. What you end up with is:
A lot of wasted space on disk.
A lot of wasted space in your page cache -- space that could be used to store records.
A lot of wasted I/O reading in partially filled pages.
"If we want to shard the DB in future, the transition would be easier if all the data is already 'sharded' across different tables."
Postgres supports table partitioning, so you can store different parts of a table in different places. That should be sufficient for your purpose of spreading the I/O load.
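
If the single table ever does become an I/O problem, the partitioning mentioned above can be added later without changing the application's queries. A hedged sketch follows, assuming PostgreSQL 11 or newer for hash partitioning; the names are again made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionedContentTable {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL; hash partitioning requires PostgreSQL 11+.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "app", "secret");
                 Statement stmt = conn.createStatement()) {

                // Same logical table, declared as hash-partitioned on group_id.
                stmt.execute("CREATE TABLE group_content ("
                           + "  content_id BIGSERIAL,"
                           + "  group_id   BIGINT NOT NULL,"
                           + "  body       TEXT NOT NULL"
                           + ") PARTITION BY HASH (group_id)");

                // Four physical partitions; queries still address group_content as one table.
                for (int i = 0; i < 4; i++) {
                    stmt.execute("CREATE TABLE group_content_p" + i
                               + " PARTITION OF group_content"
                               + " FOR VALUES WITH (MODULUS 4, REMAINDER " + i + ")");
                }
            }
        }
    }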
Option 1: Performance = Normal; Development = Easy; Maintenance = Easy
Option 2: Performance = Fast; Development = Complex; Maintenance = Hard
I suggest choosing Option 1. For the big table, you can manage performance with better indexes or index caching (in some databases). Nothing makes Option 2 worthwhile, because the development and maintenance time is the fatal factor.

Lucene: fast(er) to get docs in bulk?

I'm trying to build some real-time aggregates on Lucene as part of an experiment. The documents have their values stored in the index. This works very nicely for up to 10K documents.
For larger numbers of documents, this gets rather slow. I assume not much effort has been invested in retrieving documents in bulk, as that somewhat defeats the purpose of a search engine.
However, it would be great to be able to do this. So basically my question is: what could I do to retrieve documents from Lucene faster? Or are there smarter approaches?
I already retrieve only the fields I need.
[edit]
The index is quite large (>50 GB) and does not fit in memory. The number of fields differs, as I have several types of documents. Aggregation will mostly take place on a fixed document type, but there is no way to tell beforehand which one.
Have you put the index in memory? If the entire index fits in memory, that is a huge speedup.
Once you get the hits (which come back super quickly, even for 10k records), I would open up multiple threads/readers to access them.
Another thing I have done is store only some properties in Lucene (i.e. don't store all 50 attributes from a class). You can sometimes make things faster just by fetching a list of IDs from Lucene and getting the rest of the content from a service or database.
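
To make the "only fetch what you need" advice concrete, here is a hedged sketch that loads just two stored fields per hit instead of whole documents. It assumes Lucene 9.5+ (older versions use reader.document(docId, fieldsToLoad) instead of storedFields()), and the index path, field names, and document type are invented.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.StoredFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;
    import java.util.Set;

    public class BulkFieldFetch {
        public static void main(String[] args) throws Exception {
            // Invented index path, field names, and document type.
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("docType", "order")), 100_000);

                // Load only the fields the aggregate needs, not every stored field.
                StoredFields storedFields = reader.storedFields();
                Set<String> fieldsToLoad = Set.of("id", "amount");
                long total = 0;
                for (ScoreDoc hit : hits.scoreDocs) {
                    Document doc = storedFields.document(hit.doc, fieldsToLoad);
                    total += Long.parseLong(doc.get("amount"));
                }
                System.out.println("sum(amount) = " + total);
            }
        }
    }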

Database or other method of storing and dynamically accessing HUGE binary objects

I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the files on disk but create a DB-based index. Most filesystems get cranky or slow when you have more than ~10k files in a single folder/directory. Your application can generate the filenames and store the metadata in the DB, then organize the files by generated name on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins and a sophisticated partitioning, incremental-backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.), and there is generally better support for these operations in a file system than in a DB.
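
A minimal sketch of that "metadata in the DB, bytes on disk" layout, assuming the fixed-length, 1-byte-per-sample recordings described in the question (the file path, recording index, and sample range are invented):

    import java.io.RandomAccessFile;

    public class RecordingStore {
        // Assumed layout: each recording is 2,000,000 one-byte samples, stored back to back.
        static final long SAMPLES_PER_RECORDING = 2_000_000L;

        /** Reads samples [firstSample, firstSample + count) of one recording from the flat file. */
        static byte[] readSampleRange(String path, long recording, long firstSample, int count)
                throws Exception {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long offset = recording * SAMPLES_PER_RECORDING + firstSample; // 1 byte per sample
                file.seek(offset);            // jump straight to the range; nothing else is read
                byte[] samples = new byte[count];
                file.readFully(samples);
                return samples;
            }
        }

        public static void main(String[] args) throws Exception {
            // The generated filename would come from the DB index along with the recording's metadata.
            byte[] window = readSampleRange("/data/sets/set-0001.bin", 42, 1_000, 1_000);
            System.out.println("read " + window.length + " samples");
        }
    }

Fetching the same sample range for every recording (the "column slice" from the question) is just this seek-and-read repeated once per recording.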
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, as long as you know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column, and if needed, an additional metadata table describing the LOB contents so that you have some kind of keyword->(offset, length) mapping available for partial loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (ok, can be possible, but at least in my experience tend to have a nasty price tag attached).
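
To illustrate the partial LOB access mentioned above, here is a hedged JDBC sketch; the connection URL, table, and column names are made up, and whether the driver streams only the requested range rather than the whole LOB depends on the database and driver (Oracle's can).

    import java.sql.Blob;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PartialLobRead {
        public static void main(String[] args) throws Exception {
            // Made-up connection URL and schema.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "user", "pw");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT samples FROM recordings WHERE recording_id = ?")) {
                ps.setLong(1, 42L);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        Blob blob = rs.getBlob("samples");
                        // JDBC Blob positions are 1-based: read samples 1,000-1,999 (1 byte each).
                        byte[] window = blob.getBytes(1_001L, 1_000);
                        System.out.println("read " + window.length + " bytes from the LOB");
                        blob.free();
                    }
                }
            }
        }
    }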
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If each recording is 2,000,000 samples and the 100,000 recordings together make up 200 GB, then each sample is about 1 byte and each recording about 2 MB.
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: http://technet.microsoft.com/en-us/library/bb933993.aspx
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)
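
A hedged sketch of the table side of that, again via JDBC: it assumes FILESTREAM is already enabled on the SQL Server instance and the database has a FILESTREAM filegroup, and every name below is illustrative. The descriptive columns are ordinary row data, while the varbinary(MAX) FILESTREAM column lives in NTFS.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class FilestreamTable {
        public static void main(String[] args) throws Exception {
            // Placeholder SQL Server connection; FILESTREAM setup is assumed to exist already.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=recordings;encrypt=false", "user", "pw");
                 Statement stmt = conn.createStatement()) {

                // FILESTREAM tables need a ROWGUIDCOL uniqueidentifier with a UNIQUE constraint.
                stmt.execute("CREATE TABLE recording ("
                           + "  recording_id UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),"
                           + "  collected_on DATETIME2 NOT NULL,"
                           + "  sample_rate  INT NOT NULL,"
                           + "  samples      VARBINARY(MAX) FILESTREAM NULL)");
            }
        }
    }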

Store fields in database or in Lucene index file

My domain object has 20 properties (columns, attributes, whatever you want to call them) and simple relationships. I want to index 5 properties for full-text search and 3 for sorting. There might be 100,000 records.
To keep my application simple, I want to store the fields in a Lucene index file to avoid introducing a database. Will there be a performance problem?
Depending on how you access stored fields, they may all be loaded into memory (basically, if you use a FieldCache everything will be cached into memory after the first use). And if you have a gig of storage which is taking up memory, that's a gig less to use for your actual index.
Depending on how much memory you have, this may be a performance enhancement, or a performance detriment.
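
For a sense of scale, here is a hedged sketch of how such a document might be indexed so that Lucene can serve as the only store: searchable properties as analyzed text fields, sort keys as doc values, and the rest as stored-only fields. The field names and values are invented, and only a couple of the 20 properties are shown.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    import java.nio.file.Paths;

    public class DomainObjectIndexer {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                doc.add(new StringField("id", "42", Field.Store.YES));       // exact-match key

                // Properties needed for full-text search: analyzed and also stored for display.
                doc.add(new TextField("title", "Example title", Field.Store.YES));
                doc.add(new TextField("description", "Example description", Field.Store.YES));

                // A property used for sorting: doc values, plus a stored copy to display it.
                doc.add(new SortedDocValuesField("author", new BytesRef("Alice")));
                doc.add(new StoredField("author", "Alice"));

                // Remaining properties: stored only, retrievable but not searchable.
                doc.add(new StoredField("internalNote", "retrievable but not searchable"));

                writer.addDocument(doc);
            }
        }
    }

With 100,000 records of this size, the stored data should stay small enough that field caching is unlikely to crowd out the index, but measuring with your own data is the only way to know for sure.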