Does Solr optimize storing of frequently repeated field values across documents?

Our setup is like this. We emulate what is traditionally understood as a relational database (with one-to-many connections between entities) by having two Solr indices. One of them (A) stores documents with fields logically attributed to every document in the other (B). Not only do we establish a relation this way, but we also (I believe) keep the indices from growing unnecessarily.
At the moment we are evaluating merging the two indices, such that every field of document(id=i) in A would be copied to every document(foreign_key=i) in B. After that, A is not needed anymore.
My question is: does Solr optimize the storage of frequently repeated values across the index? Will merging in such a scenario cause B to bloat?
The one-to-many relation from A to B has on average 10k links.

It turned out that after moving the fields from A to B, B grew by less than one percent, which implies that Solr stores frequently repeated field values efficiently (indexed terms are kept only once in the underlying Lucene term dictionary, and stored fields are compressed).

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name FROM EMP WHERE id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) into memory and then return only the first_name?
(In other words: the page that contains the row with id = 10, if it isn't in memory already.)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is this the default behavior?
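One common mitigation, if this exact lookup matters, is a covering index; a minimal sketch in SQL Server syntax (the index name is made up, EMP and first_name come from the question):
-- The engine can then answer the query from the index alone,
-- without fetching the row's full data page.
CREATE NONCLUSTERED INDEX ix_emp_id_first_name ON EMP (id) INCLUDE (first_name);
-- SELECT first_name FROM EMP WHERE id = 10  -- becomes an index-only lookup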
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with hundreds of millions of rows and 100 columns each, and we receive all sorts of SELECT queries on them. The WHERE clause also changes, but no incoming request needs all columns. Many of the cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases load only the requested columns. Compression should also help save space and, hopefully, improve performance.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY needs to be accessed; for a million rows, that is about 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
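A rough sketch of where that figure comes from, assuming on the order of 100 key/pointer pairs fit in one 16KB block:
depth ≈ ceil( log_100(1,000,000) ) = 3 block reads from root to leaf
so a PK lookup touches roughly three blocks regardless of how wide the rows are.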
For MariaDB's ColumnStore: The contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to it, the clump of 64K rows must be located; after getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space, for both simple and complex queries.
Your simple query is easy and efficient for a regular RDBMS, but messier for a columnstore. Columnstores serve a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing a query, especially when I/O is required. There is a cache of blocks in RAM.

Multiple tables vs one table with more columns

My chosen database is MongoDB, but the question should be database-independent.
For example, each record will have a flag that can take one of two possible values.
What are the pros and cons of:
Having one table with a column to hold the value of this flag,
versus:
Having two tables to hold the two different types of records, distinguished by the aforementioned flag?
Would this be cheaper in terms of storage, since you don't have that extra column?
Would this also be faster in queries, since you know exactly which table to look without having to perform a filter?
What is the common practice in industry?
Storage for a single column holding just a flag (e.g. active and archived) should be negligible. The queries could be faster with two tables; however, your application becomes more complex, since you have to write two queries.
When you have only two distinct values and these values are more or less evenly distributed, an index will not be used, so the performance should be equal, unless you select the entire table.
It might be useful to have two tables if the flag values are not evenly distributed; for example, a rather small active data set which is queried frequently, and an archive data set which is much bigger but hardly ever queried.
If available, you can also work with partitions, which is effectively a good combination of both.
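As an illustration, a minimal sketch in MySQL syntax (table and column names are made up; note that MySQL requires the partitioning column to be part of the primary key):
CREATE TABLE records (
    id      BIGINT  NOT NULL,
    status  TINYINT NOT NULL,   -- the flag: 0 = active, 1 = archived
    payload VARCHAR(255),
    PRIMARY KEY (id, status)    -- partition key must appear in every unique key
) PARTITION BY LIST (status) (
    PARTITION p_active   VALUES IN (0),
    PARTITION p_archived VALUES IN (1)
);
Queries filtering on the flag then touch only one partition, while the application still sees a single table.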

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I need to store lists of possible data.
Currently I have an SQL table with the following structure, accessed with PHP.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The primary key here would be boxID, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. It's possible that in the future there will be 5-10 different named data items, or more. It's also possible to store 1,000-10,000 entries of data in one week for just one named data item. It will be searched as well, e.g. getting the place for a specific serialNum, and then getting all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each named data item. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if I were to do it this way.)
To sum it up: which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of 1,000-10,000 rows per week is not that much data. That turns into 50,000-500,000 rows per year, a volume SQL Server should easily be able to handle. You don't say how many named entities you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.
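As a sketch of that intuitive design (the column types are guesses; the names come from the question):
CREATE TABLE boxes (
    boxID     INT         NOT NULL,
    place     VARCHAR(50) NOT NULL,
    serialNum VARCHAR(50),
    itemNum   VARCHAR(50),
    idLock    VARCHAR(50),
    PRIMARY KEY (boxID, place)
);
CREATE INDEX ix_serialNum ON boxes (serialNum);
-- e.g. SELECT place FROM boxes WHERE serialNum = 'ABC123';
The secondary index supports the "get place from a specific serialNum" lookup described in the question.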

Correlation between number of rows and number of columns in database performance

Is there a correlation between the number of rows and the number of columns used, and their impact on (MS)SQL database performance?
A little more background:
We have to store lots of data from measurement devices. These devices send a string of data back to us around 100 times a day, and each string contains roughly 300 fields. Assuming we have 100 devices in operation, that means we get 10,000 records back every day. At our back-end we split these data strings and have to put them into the database. If these data strings were fixed, that would mean we add around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and adding a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we'd be adding 100*100*300 = 3,000,000 rows each day. Data has to be stored for a year and a month (395 days), so we'd end up with around 1.2 billion rows, not counting expected growth.
From a performance perspective, is it smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity-Attribute-Value models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are:
What is so bad about EAV, anyway?
dave’s guide to the eav
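For concreteness, a sketch of the two layouts under discussion (all table and column names here are made up):
-- 'Vertical' (EAV): one row per field per reading
CREATE TABLE readings_eav (
    device_id   INT          NOT NULL,
    reading_ts  DATETIME     NOT NULL,
    field_name  VARCHAR(64)  NOT NULL,
    field_value VARCHAR(255),
    PRIMARY KEY (device_id, reading_ts, field_name)
);
-- 'Horizontal': one row per reading, one column per field
CREATE TABLE readings_wide (
    device_id  INT      NOT NULL,
    reading_ts DATETIME NOT NULL,
    field_001  VARCHAR(255),
    field_002  VARCHAR(255),
    -- ... roughly 300 columns, extended with ALTER TABLE ... ADD COLUMN
    PRIMARY KEY (device_id, reading_ts)
);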
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or new sensor code for that to happen, and that's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you, this argument is null and void and you may be stuck with an EAV.
For the horizontal option you can split tables, putting the frequently used columns in one table and the less used in a second; both tables have the same primary key values, so you can link the less-used columns to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to split each day's (or week's, or month's) data from the others'.
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
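In SQL terms (with assumed table and column names), those choices look like this:
CREATE INDEX ix_by_time   ON measurements (entry_ts);             -- entries for a date or date range
CREATE INDEX ix_by_device ON measurements (device_id);            -- entries from one device
CREATE INDEX ix_device_ts ON measurements (device_id, entry_ts);  -- one device within a date range
Note that the composite index also serves lookups on device_id alone, so you would rarely need all three.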
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is equally split between the types. So you have one table with 100 rows of 33 columns. That's it; it never changes. Linked to that is a table with at least 100 rows of 33 columns, to which maybe several new rows are added each day. Finally, linked to the second table is a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the growth in space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.
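A minimal sketch of that three-way split (names and column counts are assumed from the example above):
CREATE TABLE device_static (            -- Type 1: never changes; ~100 rows
    device_id INT PRIMARY KEY
    -- ... 33 columns of fixed attributes
);
CREATE TABLE device_slow (              -- Type 2: rarely changes
    device_id  INT  NOT NULL,
    valid_from DATE NOT NULL,
    -- ... 33 slowly-changing columns
    PRIMARY KEY (device_id, valid_from)
);
CREATE TABLE device_fast (              -- Type 3: changes often; up to 10K rows/day
    device_id INT      NOT NULL,
    entry_ts  DATETIME NOT NULL,
    -- ... 33 frequently-changing columns
    PRIMARY KEY (device_id, entry_ts)
);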

Efficiently storing 7.300.000.000 rows

How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000 inclusive)
date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10))
value_1 (takes on values between 1 and 1.000.000 inclusive)
value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id, i.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why
"all queries will be made on a
specific entity_id. I.e. retrieve all
rows describing entity_id = 12345."
You have 2.000.000 entities. Partition based on entity number:
level1 = entity_id / 10000        (integer division)
level2 = (entity_id / 100) % 100
level3 = entity_id % 100
Each file of data is then stored at level1/level2/level3/batch_of_data.
You can then read all of the files in a given part of the directory to return samples for processing.
If someone wants a relational database, then load files for a given entity_id into a database for their use.
Edit On day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to ensure that the data really reaches disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
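A hedged sketch of that in MySQL syntax (the surrogate id column is dropped here because MySQL requires every unique key to include the partitioning column, and (entity_id, date_id) is already unique):
CREATE TABLE entity_values (
    entity_id INT      NOT NULL,   -- 1 .. 2.000.000
    date_id   SMALLINT NOT NULL,   -- 1 .. 3.650
    value_1   INT      NOT NULL,   -- 1 .. 1.000.000
    value_2   INT      NOT NULL,
    PRIMARY KEY (entity_id, date_id)
) PARTITION BY HASH (entity_id) PARTITIONS 64;
-- The read pattern then touches exactly one partition:
-- SELECT * FROM entity_values WHERE entity_id = 12345;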
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
HTH,
S
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here
Imagine your data is laid out on disk as an array of 2M fixed-length entries (one per entity), each containing 3650 rows (one per day) of 20 bytes (the row for one entity for one day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 milliseconds) and a read of 3650x20 = about 80K at maybe 100MB/sec... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 milliseconds, so it would take 2M*8ms = 4.5 hours. If you spread the data across 4 "raid0" disks it could take 1.125 hours.
However, the places are only 80K apart, which means there are 200 such places within a 16MB block (a typical disk cache size), so it could operate at anything up to 200 times faster (about 1 minute). Reality is somewhere between the two.
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id, if you don't really need it, since (entity id, date) is unique; then store the two values as 3-byte ints. Your stored row would be 6 bytes, giving you 700 updates per 16MB and therefore faster inserts and a smaller file.
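To check that arithmetic: 3650 days x 6 bytes = 21,900 bytes per entity, and 16MB / 21,900 bytes ≈ 760, which is where the figure of roughly 700 update targets per 16MB cache block comes from.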
Edit Compare to Flat Files
I notice that comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system, and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, and it has a relatively high overhead for open and close, and for each file that is open. All of those "relatively" are relative to the typical use of a database.
Using file system names as an index for an entity id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In programming you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT; you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array lookup", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id,day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on 8 millisec average seek time disks). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0 but that isn't making much difference).
If you re-order the table into entity order
ALTER TABLE x ORDER BY ENTITY, DAY
Then the same SELECT takes 198 milliseconds (because it is reading the ordered entity data in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There are a few other things the measurements tell you:
1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) the index operates at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is logarithmic. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 per second ≈ 3.6 hours to do your daily insert of 2M rows (divided by whatever speed-up you can get by partitioning across systems or disks).
I had a similar problem (although at a much bigger scale: about your yearly volume, every day).
Using one big table got me screeching to a halt; you can pull a few months' worth, but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with a tiny trickle of data on every query; oh, and if you want to do mass queries, use flat files.
Your description of the read patterns is not sufficient. You'll need to describe what amounts of data will be retrieved, how often and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
If you want to handle huge data volumes with millions of rows, this can be treated like a time-series database, which logs the time and saves the data alongside it. Some of the options for storing the data this way are InfluxDB and MongoDB.