I'm building a screener that should be able to search through a table about 50 columns wide and 7000 rows long really fast.
Each row is composed of the following columns.
primary_key, quantity1, quantity2, quantity3...quantity50.
All quantities are essentially floats or integers. Hence a typical screener would look like this.
Get all rows which have quantity1 > x and quantity2 < y and quantity3 >= z.
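For concreteness, a screener query of that shape might look something like this (the table name and the threshold values are just placeholders for the description above):

SELECT primary_key, quantity1, quantity2, quantity3
FROM screener_data          -- hypothetical table name
WHERE quantity1 > 100.0     -- x
  AND quantity2 < 50.0      -- y
  AND quantity3 >= 3;       -- z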
Indexing all columns should lead to really fast search times; however, some of the columns will be updating in real time, and indexing everything obviously makes inserts and updates very slow.
A portion of the columns are fairly static, though. Hence one idea was to segregate the data into two tables, one containing all the columns that are static and the other containing the data that is dynamic. Any screener would then be applied to both tables, based on the actual query, and the results combined at the end.
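A rough sketch of that split, with hypothetical table names and only a few of the 50 quantity columns shown; a screener that touches both groups joins them on the shared primary key:

-- Static columns, which can be indexed heavily.
CREATE TABLE screener_static (
  primary_key INT PRIMARY KEY,
  quantity1 DOUBLE,
  quantity2 DOUBLE
  -- ... remaining static columns
);

-- Realtime columns, kept lightly indexed so updates stay cheap.
CREATE TABLE screener_dynamic (
  primary_key INT PRIMARY KEY,
  quantity3 DOUBLE
  -- ... remaining realtime columns
);

SELECT s.primary_key
FROM screener_static s
JOIN screener_dynamic d ON d.primary_key = s.primary_key
WHERE s.quantity1 > 100.0
  AND s.quantity2 < 50.0
  AND d.quantity3 >= 3;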
I am currently planning on using a MySQL engine, most probably InnoDB. However, I'm looking to get much faster response times. An implementation of the same problem on a certain site was very snappy: regardless of the query size, I was getting the results within 500 ms. I'm wondering what other options are available out there to implement this function.
Related
According to the documented limits, Postgres supports up to 1600 columns per table.
https://www.postgresql.org/docs/current/limits.html
I understand that it's bad practice to have so many columns but what are the consequences of approaching this limit?
For example, will a table with 200 columns perform fine in an application? How can you tell when you're approaching too many columns for a given table?
The hard limit is that a table row has to fit inside a single 8kB block.
The "soft limits" you encounter with many columns are
writing the SELECT list becomes more and more annoying (never, ever, use SELECT *)
each UPDATE has to write a large row version, so lots of data churn
extracting the 603rd column from a row requires skipping the previous 602 columns, which is a performance hit
it is plain annoying if the output of \d is 50 pages long
Say we have a big table in a relational database that we need to query.
We have two options:
query the whole table
query subsets of the data inside the table, i.e. rows 1 to 1000, then 1001 to 2000, etc.
Does this separation make some sense?
Does it depend on query structure?
Let's add some math. Suppose query execution time is proportional to n^3, where n is the number of rows in the table. In the first case, execution time is then proportional to n^3. The second option is different: with three subsets, the total time would be (n/3)^3 + (n/3)^3 + (n/3)^3 = 3 * n^3/27 = n^3/9, which is better.
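To make the second option concrete, a sketch of what splitting by primary-key ranges might look like (the table, column, and filter are only illustrative, not the n^3 query itself); the three statements would be run separately, possibly on separate connections, and the results combined afterwards:

SELECT * FROM big_table WHERE id BETWEEN 1    AND 1000 AND status = 'open';
SELECT * FROM big_table WHERE id BETWEEN 1001 AND 2000 AND status = 'open';
SELECT * FROM big_table WHERE id BETWEEN 2001 AND 3000 AND status = 'open';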
Real life is more complicated: the query would not be the same in this case; we have to spend some time limiting the rows to each subset.
Also, the number of connections and the concurrency of the database may be limited, so we would not be able to run, say, 10 such queries simultaneously, at least not at the same speed.
But do these arguments make sense? Could this help cut query times for some big tables?
It depends on a lot of criteria, some of them being:
How busy is the database? That is, how many queries run in parallel?
Reason: if a large number of queries are running, or any query uses a number of parallel sessions, then querying the big table will be slow while the smaller tables will work faster.
Into how many smaller tables has the bigger table been divided?
Reason: if a big table is divided into several small tables and the query is run on each of the smaller tables, then the individual results need to be aggregated afterwards. This may take time depending on the query.
What type of query is being executed?
Reason: if you are running a query with a filtering condition on a column, and you divide the large table based on the values of that column, then you can skip some of the tables entirely based on the query condition and hence reduce the time needed to produce the output.
Overall, in such a scenario, instead of dividing a big table into smaller ones it is better to partition the table. Range partitioning can be used on the bigger table for faster query execution.
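For example, in MySQL a range-partitioned table might be declared like this (table, column, and partition names are hypothetical); a query that filters on the partitioning column can then skip whole partitions:

CREATE TABLE measurements (
  id BIGINT NOT NULL,
  recorded_on DATE NOT NULL,
  reading DOUBLE,
  PRIMARY KEY (id, recorded_on)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (YEAR(recorded_on)) (
  PARTITION p2021 VALUES LESS THAN (2022),
  PARTITION p2022 VALUES LESS THAN (2023),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Only the p2022 and pmax partitions need to be read here.
SELECT * FROM measurements WHERE recorded_on >= '2022-01-01';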
I'm currently working on a project where the client has handed me a database that includes a table with over 200 columns and 3 million rows of data. This is definitely poorly designed, and I'm currently exploring some options. I developed the app on my 2012 MBP with 16 GB of RAM and a 512 GB SSD. I had to develop the app using MVC4, so I set up the development and test environment using Parallels 8 on OS X. As part of the design, I developed an interface for the client to create custom queries against this large table with hundreds of columns, so I am sending a query string to the controller, which is passed using dynamic LINQ, and the results are sent to the view as JSON (to populate a Kendo UI grid). On my MBP, when testing queries using the interface I created, it takes at most 10 seconds (which I find too long) to return the results to my Kendo UI grid. Similarly, when I test queries directly in SQL Server, it never takes very long.
However, when I deployed this to the client for testing, these same queries take in excess of 3 minutes. So, long story short, the client will be upgrading the server hardware, but in the meantime they still need to test the app.
My question is this: despite the fact that the table holds 200 columns, each row is unique. More specifically, the design is:
PK-(GUID) OrganizationID (FK) --- 200 columns (tax fields)
If I redesign this to:
PK (GUID) OrganizationID (FK) FieldID(FK) Input
Field table:
FieldID FieldName
This would turn this 3-million-row table into one with 600 million rows but only a few columns. Will I see performance enhancements?
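For reference, the proposed redesign would look roughly like this in SQL Server (table and column names are placeholders based on the description above):

-- Lookup table for the 200 tax fields.
CREATE TABLE TaxField (
  FieldID INT IDENTITY PRIMARY KEY,
  FieldName NVARCHAR(100) NOT NULL
);

-- One row per (organization, field) value instead of one wide row.
CREATE TABLE TaxValue (
  ID UNIQUEIDENTIFIER PRIMARY KEY,
  OrganizationID UNIQUEIDENTIFIER NOT NULL,              -- FK to the organization table
  FieldID INT NOT NULL REFERENCES TaxField (FieldID),
  Input NVARCHAR(255) NULL
);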
Any insight would be appreciated - I understand normalization but most of my experience is in programming.
Thanks in advance!
It is very hard to make any judgements without knowing the queries that you are running on the table.
Here are some considerations:
Be sure that the queries are using indexes if they are returning only a handful of rows (see the sketch after these considerations).
Check that you have enough memory to store the table in memory.
When doing timings, be sure to ignore the first run, because this is just loading the page cache.
For testing purposes, just reduce the size of the table. That should speed things up.
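As a minimal sketch of the first point, assuming a hypothetical TaxData table and a commonly filtered TaxYear column, an index lets such a query return a handful of rows without scanning all 3 million:

-- Hypothetical names: an index on the column the custom queries filter on.
CREATE INDEX IX_TaxData_TaxYear ON TaxData (TaxYear);

SELECT OrganizationID, TaxYear
FROM TaxData
WHERE TaxYear = 2012;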
As for your question about normalization. Your denormalized structure takes up much less disk space than a normalized structure, because you do not need to repeat the keys for each value. If you are looking for one value on one row, normalization will not help you. You will still need to scan the index to find the row and then load the row. And, the row will be on one page, regardless of whether it is normalized or denormalized. In fact, normalization might be worse, because the index will be much larger.
There are some examples of queries where normalizing the data will help. But, in general, you already have a more efficient data structure if you are fetching the data by rows.
You can take a paging approach. There will be two queries: the initial one returns all matching rows, but only the column with the unique IDs. That array of IDs can be split into pages, say 100 IDs per page. When the user selects a specific page, you pass those 100 IDs to the second query, which this time returns all 200 columns, but only for the requested 100 rows. This way you don't have to return all the columns across all the rows at once, which should yield a significant performance boost.
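A minimal sketch of the two queries, with hypothetical table and column names:

-- Query 1: only the unique IDs that match the user's filter.
SELECT ID
FROM BigTable
WHERE SomeColumn = 'some value'   -- the user's filter conditions
ORDER BY ID;

-- Query 2: all 200 columns, but only for the IDs of the page the user selected.
SELECT *
FROM BigTable
WHERE ID IN (101, 102, 103);      -- ... the 100 IDs of the requested page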
Let's suppose I have a table in my database with 1.000.000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if I have that table with 1000 records and just do:
SELECT * FROM [Table]
?
I'm not looking for if it will take exactly the same time. I just want to know if the first one will take much more time to execute than the second one.
I said 1.000.000 records, but it could be 20.000.000. That was just an example.
Edit:
Of course, when comparing a query with LIMIT against one without it on the same table, the query built using LIMIT should execute faster, but I'm not asking that...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database, with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5.000.000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM and LINQ to SQL to make the queries.
I need to perform basically three kinds of non-simple queries, and I was thinking about showing the user a limited number of records (just like lots of websites do). If the user wants to see more records, the option he/she has is to narrow the search further.
So, the question came up because I was thinking about doing this (limiting to X records per query) or storing only X results (the most recent ones) in the database, which would require doing some deletions in the database, but I was just thinking...
So, that table could contain 5.000.000 records or more, and what I don't want is to show the user 1000 or so and still have the query be as slow as if it were returning all 5.000.000 rows.
TAKE 1000 from a table of 1000000 records will be roughly 1000000/1000 (= 1000) times faster, because it only needs to look at (and return) 1000 of the 1000000 records. Since it does less, it is naturally faster.
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of the two cases below becomes true:
The ORDER BY clause follows an index - the above statement is still true.
The ORDER BY clause cannot use any index - it will be only marginally faster than without the TAKE, because
it has to inspect ALL records and sort them according to the ORDER BY,
then deliver only a subset (the TAKE count),
so it is not faster in the first step, but the second step involves less IO/network than returning ALL records.
If you TAKE 1000 records from a table of 1000 records, it will be equivalent (with no significant difference) to a TAKE of 1000 records from 1 billion, as long as you are in case (1), no ORDER BY, or case (2), ORDER BY against an index.
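A sketch of the two ordered cases, using the LIMIT syntax from the question and hypothetical table/column names:

-- ORDER BY follows an index (here, the primary key): the engine reads the
-- first 1000 entries in index order and stops early.
SELECT * FROM Orders ORDER BY OrderID LIMIT 1000;

-- No usable index on the ORDER BY column: every row must be read and sorted
-- before the first 1000 can be returned; only delivering the result is cheaper.
SELECT * FROM Orders ORDER BY Notes LIMIT 1000;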
Assuming both tables are equivalent in terms of index, row-sizing and other structures. Also assuming that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run at a similar speed, since the query engine will be going through the records in exactly the same order -- basically a table scan -- for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then also similar speed.
However, if Y >> X (meaning Y has many, many more rows than X), then the LIMIT version MAY be slower. Not because of the query plan -- again, it should be the same -- but simply because the internal structure of the data layout may have several more levels. For example, if the data is stored as leaves of a tree, there may be more tree levels, so it may take slightly more time to access the same number of pages.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say. 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with more than a few hundred rows: transporting them would take a long time, and they couldn't scroll through them in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarily for the exact purpose of creating chunks of a list for paging as you describe.
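For example (MySQL syntax; the table, columns, and page size of 100 are all assumptions):

-- Page 1: the first 100 matching rows.
SELECT * FROM Posts WHERE ThreadID = 42 ORDER BY CreatedAt LIMIT 100 OFFSET 0;

-- Page 2: the next 100.
SELECT * FROM Posts WHERE ThreadID = 42 ORDER BY CreatedAt LIMIT 100 OFFSET 100;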
It's way counterproductive to start thinking about schema design and record purging until you've used up this and a bunch of other strategies. In this case don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.
I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these data points are very small numbers ("40", "2.23", "-1024" are good examples of the types of data).
I know the maximum number of columns for MySQL is quite high (4000+), but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) If fewer columns is better, the 140 datapoints could be broken up into 20 rows of 7 data points, all with the same 'experiment_id'. HOWEVER, I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc.), so I wouldn't think this would perform better than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40", "2.23", "-1024" are good examples of what will be stored), I'm thinking SMALLINT for the column type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also, it matters whether the 140 columns all have the same meaning or whether the meaning differs per experiment - properly modeling the data according to normalization rules means asking how the data is related to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35, etc. One shape could of course be infinitesimally quicker, but have you considered the extra complexity in the PHP code needed to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows; however, if you want to retrieve a single data point from lots of experiments, then the server will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of the rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints don't fit in RAM (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but the table will use less disk space, and perhaps be faster to populate, if you use lots of columns.
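To make the two layouts discussed above concrete (names and types are illustrative):

-- Wide layout: one row per experiment, 140 value columns.
CREATE TABLE experiment_wide (
  experiment_id INT PRIMARY KEY,
  value_001 DOUBLE,
  value_002 DOUBLE
  -- ... through value_140
);

-- Narrow layout: one row per datapoint, very few columns.
CREATE TABLE experiment_narrow (
  experiment_id INT NOT NULL,
  datapoint_id SMALLINT NOT NULL,
  reading DOUBLE,
  PRIMARY KEY (experiment_id, datapoint_id)
);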