Problem description
In an ETL pipeline, we update a table in an SQL database from a pandas dataframe. The table has about 2 million rows, and the dataframe updates approximately 1 million of them. We do it with SQLAlchemy in Python, and the database is SQL Server (I think this is not too relevant for the question, but I'm including it for completeness).
At the moment the code is "as found", and the update consists of the following steps:
Split the dataframe into many sub-dataframes (their number appears to be fixed and arbitrary; it does not depend on the dataframe size).
For each sub-dataframe, do an update query.
As it is, the process takes what (in my admittedly very limited SQL experience) appears to be too much time, about 1-2 hours. The table schema consists of 4 columns:
An id as the primary key
2 columns that are foreign keys (primary keys in their respective tables)
A 4th column
Questions
What can I do to make the code more efficient and faster? Since the UPDATE is done in blocks, I'm unsure whether the index is recalculated every time (since the id value is not changed, I don't know why that would be the case). I also don't know how the foreign key values (which can change for a given row) enter into the complexity calculation.
At what point, if any, does it make sense to insert all the rows into a new auxiliary table, recalculate the index only at the end, truncate the original table and copy the auxiliary table back into it? Are there any subtleties to this approach regarding the indexes, foreign keys, etc.?
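For concreteness, here is a minimal T-SQL sketch of that auxiliary-table approach, using hypothetical names (dbo.target_table with columns id, fk_a, fk_b, val); the real load step, constraints and index definitions would differ:

-- 1) Create the auxiliary table as a heap (no indexes yet) with the same structure.
SELECT id, fk_a, fk_b, val
INTO dbo.target_table_aux
FROM dbo.target_table
WHERE 1 = 0;   -- copy only the structure

-- (bulk-load the updated rows into dbo.target_table_aux here,
--  e.g. with pandas to_sql or bcp)

-- 2) Build the index once, after the load.
CREATE UNIQUE CLUSTERED INDEX ux_aux_id ON dbo.target_table_aux (id);

-- 3) Swap the data back in a single transaction.
--    TRUNCATE fails if other tables reference this one via foreign keys;
--    in that case DELETE, or drop and recreate those constraints.
--    If id is an IDENTITY column, SET IDENTITY_INSERT would also be needed.
BEGIN TRANSACTION;
TRUNCATE TABLE dbo.target_table;
INSERT INTO dbo.target_table (id, fk_a, fk_b, val)
SELECT id, fk_a, fk_b, val FROM dbo.target_table_aux;
COMMIT;
DROP TABLE dbo.target_table_aux;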
Related
I have a question about the performance of an UPDATE on a big table that is around 8 to 10 GB in size.
I have a task where I'm supposed to detect distinct values from a table of the mentioned size, with about 4.3 million rows, and insert them into another table. That part is not really a problem, but the update that follows afterwards is. I need to update a column in the table I imported into, based on the id of the rows created in the other table. An example of the query I'm executing is:
UPDATE billinglinesstagingaws AS s
SET product_id = p.id
FROM product AS p
WHERE p.key = (s.data->'product'->>'sku')::varchar(75) || '-' || (s.data->'lineitem'->>'productcode')::varchar(75)
  AND cloudplatform_id = 1
So, as mentioned, the staging table is around 4.3 million rows and 8-10 GB, and as can be seen from the query, it has a JSONB field; the product table has around 1,500 rows.
This takes about 12 minutes, which I'm not really sure is OK, and I'm wondering what I can do to speed it up. There are no foreign key constraints; there is a unique constraint on two columns together. There are no indexes on the staging table.
I attached the query plan, so any advice would be helpful. Thanks in advance.
This is a bit too long for a comment.
Updating 4.3 million rows in a table is going to take some time. Updates take time because the ACID properties of databases require that something be committed to disk for each update -- typically log records. And that doesn't count the time for reading the records, updating indexes, and other overhead.
So, roughly 6,000 updates per second (4.3 million rows in about 12 minutes) isn't so bad.
There might be ways to speed up your query. However, you describe these as new rows. That makes me wonder if you can just insert the correct values when creating the table. Can you lookup the appropriate value during the insert, rather than doing so afterwards in an update?
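A hedged sketch of what that could look like in Postgres, assuming a hypothetical source relation raw_import that feeds the staging table (the real load may come from COPY or the application instead):

-- Resolve product_id while inserting the new rows, so no later UPDATE is needed.
INSERT INTO billinglinesstagingaws (data, cloudplatform_id, product_id)
SELECT r.data,
       r.cloudplatform_id,
       p.id
FROM raw_import AS r                 -- hypothetical source of the new rows
LEFT JOIN product AS p
  ON p.key = (r.data->'product'->>'sku')::varchar(75)
             || '-' ||
             (r.data->'lineitem'->>'productcode')::varchar(75)
WHERE r.cloudplatform_id = 1;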
This question is about how to achieve the best possible performance with SQLite.
If you have a table with one column that you update very often and another column that you update very rarely, is it better to split that table up into 2 different tables, so that you have 2 tables with 1 column (excluding the primary key) each, instead of 1 table with 2 columns (excluding the primary key)?
I have tried to find information about this in the SQLite documentation, but I have not been able to find an explanation of what exactly happens when updating one column of one row of a table. The closest answer to my question that I found was this:
During part of SQLite's INSERT and SELECT processing, the complete content of each row in the database is encoded as a single BLOB. So the SQLITE_MAX_LENGTH parameter also determines the maximum number of bytes in a row.
(Quoted from here: https://www.sqlite.org/limits.html)
To me that sounds as if every row is internally stored as one big blob of all columns, and that would mean that updating a single column of a row leads to the whole row being internally re-encoded again, including all the other columns of that row, even if they have not been modified as part of the UPDATE. But I am not sure if I understand the sentence I quoted correctly.
I am thinking about a case where you have one column that stores some big blob of multiple MB in size, and another column that stores some integer. The column with the big blob might only be updated once a month, while the column with the integer might be updated once per second. As far as I currently understand it, based on that quote, updating the integer once per second will lead to the multi-MB blob being re-encoded every time you update the integer, and that would be very inefficient. Having one table for the blob and a different table for the integer would be a lot better in that case.
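A minimal SQLite sketch of that split, with hypothetical table names, where the frequently updated integer lives in its own narrow table keyed by the same id:

-- Rarely updated, multi-MB payload.
CREATE TABLE doc_blob (
    id      INTEGER PRIMARY KEY,
    payload BLOB
);

-- Updated often; rewriting this tiny row does not touch the blob.
CREATE TABLE doc_counter (
    id    INTEGER PRIMARY KEY,          -- same id as doc_blob
    value INTEGER NOT NULL DEFAULT 0
);

-- The frequent update only re-encodes the small doc_counter row:
UPDATE doc_counter SET value = value + 1 WHERE id = 42;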
My data is similar to currency in many aspects so I will use it for demonstration.
I have 10-15 different groups of data; we can say different currencies, like Dollar or Euro.
They need to have these columns:
timestamp INT PRIMARY KEY
value INT
Each of them will have more than 1 billion rows, and I will append new rows as time passes.
I will just select them in some intervals and create graphs, probably with multiple currencies in the same graph.
The question is: should I add a group column and store them all in one table, or keep them separate? If they are all in the same table, the timestamp will no longer be unique, and I would probably need more advanced SQL techniques to keep it efficient.
10 - 15 "currencies"? 1 billion rows each? Consider list partitioning in Postgres 11 or later. This way, the timestamp column stays unique per partition. (Although I am not sure why that is a necessity.)
Or simply have 10 - 15 separate tables without storing the "currency" redundantly per row. Size matters with this many rows.
Or, if you typically have multiple values (one for each "currency") for the same timestamp, you might use a single table with 10-15 dedicated "currency" columns. Much smaller overall, as it saves the tuple overhead for each "currency" (28 bytes per row or more). See:
Making sense of Postgres row sizes
The practicality of a single row for multiple "currencies" depends on detailed specs. For example: might not work so well for many updates on individual values.
You added:
I have read about clustered indexes, which order data in physical order on disk. I will not insert new rows in the middle of the table.
That seems like a perfect use case for BRIN indexes, which are dramatically smaller than their B-tree relatives. Typically a bit slower, but with your setup maybe even faster. Related:
How do I improve date-based query performance on a large table?
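For illustration, a hedged sketch of such a BRIN index, reusing the hypothetical currency_values table from the partitioning sketch above; BRIN pays off here because rows are appended in roughly timestamp order, so physical position correlates with the timestamp value:

CREATE INDEX currency_values_ts_brin
    ON currency_values USING brin (timestamp);

-- Range queries for graphing then scan only the matching block ranges
-- (the interval below is just an example):
SELECT cv.timestamp, cv.value
FROM currency_values AS cv
WHERE cv.currency = 'USD'
  AND cv.timestamp BETWEEN 1700000000 AND 1700086400;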
In my application, users can create custom tables with three column types: Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema, using nvarchar(430) for text, decimal(38,6) for numeric and datetime for dates, along with an Identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
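A hedged T-SQL sketch of that load cycle for one generated table, with a hypothetical table name and file path; a format file may be needed if the CSV columns do not line up with the table (for example, because of the Identity Id column):

TRUNCATE TABLE dbo.UserTable_123;        -- hypothetical generated table

BULK INSERT dbo.UserTable_123
FROM 'C:\uploads\user123.csv'            -- hypothetical path to the uploaded CSV
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,                 -- skip the header row
    BATCHSIZE       = 50000,             -- commit in batches, as described above
    TABLOCK                              -- helps get a minimally logged load
);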
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL Server, one table with just over a million rows and a WHERE condition on 6 of its columns will take around 8 seconds the first time it runs, then under one second for subsequent runs. With indexes on every column it will run in under one second the first time the query is run. This performance issue is amplified when we test on an SQL Azure database, where the same query will take over a minute the first time it's run and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add an index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
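A hedged sketch of what "an index per column" could look like for one of these generated tables (hypothetical table and column names):

-- One nonclustered index per user-defined column, created when the column is added.
CREATE NONCLUSTERED INDEX IX_UserTable_123_Col01 ON dbo.UserTable_123 (Col01);
CREATE NONCLUSTERED INDEX IX_UserTable_123_Col02 ON dbo.UserTable_123 (Col02);
-- ... and so on for each of the up-to-20 columns.
-- All of these have to be maintained (or disabled and rebuilt) around the
-- truncate-and-bulk-insert cycle, which is the insert overhead mentioned above.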
If by 'updated frequently' you mean that data is added frequently via uploads rather than existing records being modified, you might consider one of the various non-SQL databases (like Apache Lucene or variants), which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.
When creating indexes for an SQL table, if I had an index on 2 columns in the table and I changed the index to be on 4 columns, what would be a reasonable increase to expect in the time taken to save, say, 1 million rows?
I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc, but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQLServer 2005.
EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit
At the moment there is one Unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique. We are not doing any searching on these four columns together. I can however normalise this table and make it into three tables:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
Unique Index on ParentPriceID and DependantPriceID
and
ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)
Unique Index on CostDependencyID and DistributionID
and
ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit
Unique Index on ExtensionID and LocationID
I am trying to work out if normalising this table and thus reducing the number of columns in the indexes will mean speed improvements when adding a large number of rows (i.e. 1 million).
Thanks, Dane.
With all the new info available, I'd like to suggest the following:
1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few distinct values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we may only see a few new stores every day), it would be profitable to spin these off into separate lookup table(s) mapping GUID -> LocalId (an INT or some other small type), and to use this LocalId in the main table.
==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.
2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the one for the 4-column unique composite key. This would avoid replicating that much data in a separate non-clustered index and, counter-intuitive as it seems, it would save time for the initial load and for new inserts. The trick would be to use a relatively low fill factor so that page splitting and rebalancing happen infrequently. BTW, if we make the main record narrower by using local IDs, we can more readily afford to "waste" space via the fill factor, and more new records will fit in this space before page splits are required. (A sketch of points 1 and 2 follows after this list.)
3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number expected daily/weekly/whenever updates are scheduled. These two parameters could confirm the validity of the approach suggested above, as well as provide hints as to the possibility of dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to keep the search service available while new data is inserted.
4) Other considerations such as SQL partitioning, storage architecture etc. can also be put to work to improve load and/or retrieval performance.
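A hedged T-SQL sketch of points 1) and 2), with hypothetical index and lookup-table names, assuming the existing primary key on CostDependancyID is created as nonclustered so the clustered index can go on the composite key:

-- 1) Lookup table replacing a low-cardinality GUID with a small INT surrogate.
CREATE TABLE dbo.LocationLookup (
    LocalLocationID int IDENTITY(1,1) PRIMARY KEY,
    LocationID      uniqueidentifier NOT NULL UNIQUE
);
-- The main table would then store LocalLocationID (4 bytes) instead of the 16-byte GUID.

-- 2) Clustered unique index directly on the composite key, with a low fill factor
--    (70 is just an illustrative value) to leave free space for new rows.
CREATE UNIQUE CLUSTERED INDEX UX_CostDependency_Composite
    ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID)
    WITH (FILLFACTOR = 70);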
It depends pretty much on whether the wider index forms a covering index for your queries (and, to a lesser extent, on the ratio of reads to writes on that table). I suggest you post your execution plan(s) for the query workload you are trying to improve.
I'm a bit confused over your goals. The (post-edit) question reads that you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound primary key against a "normalized" set of three tables of three or four columns each, and each of the three with a two-column compound key. Is this your issue?
My first question is, what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data, does it do so here?
Inserting 1M rows into a four-column-PK table will take more time than into a two-column-PK table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-PK tables will be slower than the four column one. Testing is called for.
One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored in, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the table into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
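A hedged T-SQL sketch of the drop-load-recreate work-around, using the composite unique index from the question (the index name is hypothetical):

-- Drop the unique index, bulk load, then build the index once at the end.
DROP INDEX UX_CostDependency ON dbo.CostDependency;

-- ... bulk insert the ~1 million rows here, ideally pre-sorted in
--     (ParentPriceID, DependantPriceID, LocationID, DistributionID) order ...

CREATE UNIQUE INDEX UX_CostDependency
    ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID);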
By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.
The query optimizer will look at the index and determine if it can use the leading column. If the first column is not in the query, then it won't be used, period. If the index can be used, then it will check to see if the second column can be used. If your query contains 'where A=? and C=?' and your index is on A,B,C,D, then only the 'A' column will be used in the query plan.
Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where a=? and b=? and c=?', then column 'D' will be returned from the index, saving you a bit of IO by not having to go to the data page.
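A small sketch of both points, with hypothetical table, column and index names:

-- Compound index on (A, B, C, D).
CREATE INDEX IX_SomeTable_ABCD ON dbo.SomeTable (A, B, C, D);

-- Seeks on the leading columns A, B, C; D is returned straight from the index,
-- so no trip to the data page is needed (a covering index for this query).
SELECT D FROM dbo.SomeTable WHERE A = 1 AND B = 2 AND C = 3;

-- Only 'A' can be used as the seek predicate here, because B is missing
-- from the WHERE clause.
SELECT D FROM dbo.SomeTable WHERE A = 1 AND C = 3;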