SQL - Optimizing performance of bulk inserts and large joins?

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique constraints
Import the data (~100 million records)
Re-create the constraints and run analyze on the fact table (a sketch of the constraint handling is shown below).
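As a rough illustration of that first and last step, dropping and re-creating a single foreign key might look like this; the constraint name is an assumption, while the event/host tables and the fk_host column come from the fact-table insert shown further down:
ALTER TABLE event DROP CONSTRAINT event_fk_host_fkey;
-- ... bulk load the data ...
ALTER TABLE event ADD CONSTRAINT event_fk_host_fkey
    FOREIGN KEY (fk_host) REFERENCES host (id);
ANALYZE event;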
Importing the data is done by loading from files. For each file:
1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk upload tool)
2) Update each of the 9 dimension tables with any new data, using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The ANALYZE is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates. (Is this advisable or necessary? At minimum it does not seem to significantly reduce performance.)
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?

I've tried several different approaches to normalizing incoming data from a source like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and using "select distinct except select" works well for me. Other folks may know better, but I've found your current method to be my preferred one.

During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is now reduced to a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ... etc.)
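As a minimal sketch of one such getOrInsert routine, written as a PL/pgSQL function for the host dimension (the id/name columns come from the question; the function name and the lack of any concurrency handling are assumptions):
CREATE OR REPLACE FUNCTION get_or_insert_host_dim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- look up the surrogate key for an existing dimension row
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        -- not found: insert the new dimension value and capture its key
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
The fact-table sproc then calls one of these per dimension and inserts the returned keys directly.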
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.

Related

Oracle select statement caching during large insert has erratic performance

In a mid-tier server that implements an API with granular store methods, we map each store call to either an insert or an update statement, based on the existence of the entity. To determine that, we issue a select across 3 tables joining on natural keys. (Our 2nd implementation uses a zero-caching architecture in the mid-tier server, for easy parallelism, so we always go back to the DB.)
For a given import, we load for example 180,000 entities into one of our leaf tables, each entity has a parent foreign-key into a parent table (roughly 9,000 entities), and each parent row is owned by a 3rd top-level table with only 200 rows. All 3 tables use a GUID surrogate key (as the primary key), and they have a (project, parent, name) composite natural key (top-level table has no parent column/key).
The parent columns are proper indexed foreign-keys. And the natural keys all have UNIQUE indexes. We are using Oracle 12c, using OCI in C/C++.
The ideal plan for that join top-parent-leaf select is to use the NK indexes, since they always yield a single row, but in the past we've seen bad performance when the query planner would choose the less selective FK indexes.
The scalar select normally takes 0.25 milliseconds on average (1/4 ms). All those select statements are cached with bind variables, and we just update the bound values and execute() to get the result. This takes, for example, 500s to import 240,000 rows across 6 tables when working normally (in schema A). But in a different schema (B), it's taking forever, and under the debugger (Release mode) I can see that 1/4 ms query taking around 80 ms, way more than normal. (180,000 * 80 ms is over 4 hours... just for the selects that map NKs to the SK/PK and tell us whether we need an insert or an update.)
So obviously the query planner is using a wrong plan, not the normal highly selective one using the NK indexes.
My question is how do we avoid the terrible plan?
In particular, those select statements (6 of them, all cached) are prepared at the beginning of the import, and at that time the schema could be empty (though it may have held plenty of rows before). In any case, the schema will be filled with rows as the import proceeds, sometimes in the millions. So a plan decided on an empty schema may be bad once the schema starts to be quite full, no?
The import is transactional, and all part of a single transaction.
Are we supposed to re-prepare the cached select statements as more rows are added? Would the plan be any different, i.e. would the stats be any different, given that we are inside a large transaction anyway?
Should we force the use of the NK indexes (always the best ones) with query hints, even though Tom Kyte says never to use them?
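For reference, forcing the natural-key index with a hint would look something like the sketch below; the table name, alias, index name, and selected column are hypothetical stand-ins, not the actual schema:
SELECT /*+ INDEX(l leaf_nk_idx) */ l.guid   -- force the unique natural-key index
FROM   leaf_table l
WHERE  l.project = :project
  AND  l.parent  = :parent
  AND  l.name    = :name;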
In general, when doing a mix of inserts/updates and selects in a large transaction that can significantly affect schema statistics, how do we keep the select plans optimal?
I guess updating stats is DDL and thus cannot be performed inside that transaction either. Is there a known strategy to avoid falling into the performance chasm we sometimes fall into? (From 500s / 9 min to several hours...)
Thanks for any insight about the above.

SQL insert caching

I have a question regarding an SQL insert. The problem is that if you have a big table with a lot of indexes and a lot of inserts, then inserting the data is slow. Is it a good approach if I have table A without any indexes and table B with indexes (A and B have the exact same schema - they are identical), I insert everything into table A first, and a separate service works in the background and moves all the data to table B, where it will be indexed?
Using a staging table like this is not inconceivable. In fact, operational data stores essentially do this.
However, such database structures are usually used for storing incoming data close to the "transactional" format. The final load into history involves more than just building indexes.
Before going down that path, you need to investigate why the inserts are so slow and what the volume of inserts is. Can you replace some of the inserts with bulk inserts? Reducing the number of transactions can improve performance. Are all the indexes needed? Do you have the optimal data model for your problem?
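If you do go the staging-table route, the background service can move rows in batches rather than one at a time. A minimal sketch, assuming PostgreSQL syntax and a monotonically increasing id column (both assumptions - the question does not name the RDBMS):
-- atomically move one batch of rows from the unindexed staging table to the indexed table
WITH moved AS (
    DELETE FROM table_a
    WHERE id IN (SELECT id FROM table_a ORDER BY id LIMIT 10000)
    RETURNING *
)
INSERT INTO table_b
SELECT * FROM moved;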

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
    A
) as sub;
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
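If redistribution turns out to be the culprit, the distribution and sort settings can be stated explicitly in the CTAS. A hedged sketch (the query body, table names, and column names are placeholders standing in for query A, not the actual SQL):
CREATE TABLE tbl
DISTSTYLE ALL
SORTKEY (event_time)
AS
SELECT t.event_time, t.status, d.name   -- stands in for the 14 columns of query A
FROM fact_table t
LEFT JOIN dim_table d ON d.id = t.dim_id;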
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk because it realized that very few columns of data (or perhaps no columns) were required to read from disk to display the count. This really depends upon the contents of that Query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all the data returned by the query to disk; the count query does not, and that explains the difference. Writing all the rows is a more expensive operation than just reading a row count.

UPDATE large number of columns in SQL

In a database on the cloud, I have a table with about ten thousand columns. I'm going to update it every few minutes with some local data, which is the output of local code (my_col_val[] below). My questions are:
1- What is the best and fastest way to update each row? (A for loop?)
2- Is using a char array to hold the SQL query (szSQL[]) the best approach when the query is on the order of 1 MB in size?
My code (in C) now roughly looks like:
char szSQL[?];          // buffer for the query text (what is the best size?)
char *my_col[?];        // column names
char *my_col_val[?];    // new values produced by the local code
SQLHSTMT hStmt = NULL;

// there should be ~8000 %s='%s' pairs in the SET list (n = 8000)
sprintf(szSQL, "UPDATE my_table SET %s='%s', ..., %s='%s' WHERE ID = my_ID",
        my_col[0], my_col_val[0], ..., my_col[n], my_col_val[n]);
SQLExecDirect(hStmt, (SQLCHAR *)szSQL, SQL_NTS);
I like @Takarii's solution using three tables. The best strategy depends on 1) how the new rows of measurements are inserted and 2) what they will be used for. The latter is of particular interest because it may require additional indexes, which must be maintained by the database when executing the insert statements; the fewer indexes are required, the faster the inserts will be. For example, although there is a relation between the three tables, the measurement table could omit declaring foreign keys to the other tables, reducing that index overhead.
As the table grows and grows, the database will get slower and slower. Then it can be beneficial to create a new table for each day of measurements.
As the sensor data is of different types, the data could be inserted as string data and only parsed by the retrieving program.
Another option: if the recorded data is only retrieved periodically, the measurements could be written to a flat file and inserted in a batch periodically, say every hour.
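As a sketch of that last idea, assuming the cloud database is SQL Server and using a hypothetical table name and file path (neither is stated in the question):
-- hourly batch load of the flat file into the measurement table
BULK INSERT measurement
FROM 'C:\data\measurements_last_hour.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);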
Maybe these ideas can be of help.
Based on your comments and your request above, here are my suggestions:
1) As you suggested, an individual table for each machine (not ideal, but will work)
Working on that assumption, you will want an individual row for each sensor, but the problem comes when you need to add additional machines - generally table create privileges are restricted by sysadmins
2) Multiple tables to identify sensor information and assignment, along with a unified results table.
Table 1 - Machine
Table 2 - Sensor
Table 3 - Results
Table 1 would contain the information about the machine with which your sensors are assigned (machine_id, **insert extra columns as needed**)
Table 2 contains the sensor information - this is where your potential 10,000 columns would go, however they are now rows with ID's (sensor_id, sensor_name)
Table 3 contains the results of the sensor readings, with an assignment to a sensor and then to a machine (result_id, machine_id(fk), sensor_id(fk), result)
Then using joins, you can pull out the data for each machine as needed. This will be far more efficient than your current 10k column design
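For illustration, the three-table layout might look roughly like the sketch below; the column names follow the (machine_id, sensor_id, result_id, result) outline above, while the data types and extra columns are assumptions:
CREATE TABLE machine (
    machine_id   INT PRIMARY KEY,
    machine_name VARCHAR(100)          -- plus extra machine columns as needed
);

CREATE TABLE sensor (
    sensor_id    INT PRIMARY KEY,
    sensor_name  VARCHAR(100)
);

CREATE TABLE result (
    result_id    BIGINT PRIMARY KEY,
    machine_id   INT REFERENCES machine (machine_id),
    sensor_id    INT REFERENCES sensor (sensor_id),
    result       VARCHAR(255)          -- the sensor reading itself
);
A per-machine report is then a join of result to sensor and machine filtered on machine_id.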

How to merge 500 million table with another 500 million table

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
create view myview as select * from table1 union select * from table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days, and if the machine reboots the database will go into recovery mode and I have to start from the beginning again.
Why am I merging these tables?
I have a website which provides a search on the person table.
This table has columns like Name, Address, Age, etc.
We received about 500 million similar records in .txt files, which we loaded into another table.
Now we want the website search page to query both tables to see if a person exists in either table.
We keep getting similar .txt files of 100 million or 20 million records, which we load into this huge table.
How are we currently doing it?
We import the .txt files into separate staging tables (some columns are different in the .txt files).
Then we arrange the columns and do the data type conversions.
Then we insert the staging table into the liveCopy huge table (in a test environment).
We have SQL Server 2008 R2.
Can we use table partitioning for performance benefits?
Is it OK to create small monthly tables and create a view on top of them?
How can indexing be done in this case? We only load new data once a month and then run selects.
Does replication help?
The biggest issue I am facing is managing these huge tables.
I hope I explained the situation.
Thanks & Regards
1) Usually, to get more performance, developers split large tables into smaller ones and call this partitioning (horizontal, to be more precise, because there is also vertical partitioning). Your view is an example of such partitions joined together. Of course, it is mostly used to split a large amount of data into ranges of values (for example, table1 contains records with column [col1] < 0, while table2 contains those with [col1] >= 0). But even for unsorted data it is fine, because you get more room for speed improvements - for example, parallel reads if you put the tables on different storage. So this is a good choice.
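As a hedged illustration of that range split (the column, constraint, and view names are made up), CHECK constraints on the member tables let SQL Server skip the table that cannot contain the searched value:
ALTER TABLE table1 ADD CONSTRAINT ck_table1_col1 CHECK (col1 < 0);
ALTER TABLE table2 ADD CONSTRAINT ck_table2_col1 CHECK (col1 >= 0);
GO

CREATE VIEW myview AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2;
GO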
2) Another way is to use MERGE statement supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course, you can copy using INSERT+DELETE, but in that case, or when using the MERGE command, do it in small batches. Something like:
SET ROWCOUNT 10000
DECLARE @Count [int] = 1
WHILE @Count > 0
BEGIN
    ... INSERT+DELETE/MERGE transaction ...
    SET @Count = @@ROWCOUNT
END
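Filled in for one batch, the MERGE variant might look roughly like this; the staging/target table names and the natural-key columns are guesses based on the question:
MERGE INTO liveCopy AS target
USING (SELECT TOP (10000) * FROM stagingTable) AS source
    ON  target.Name = source.Name
    AND target.Address = source.Address
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Name, Address, Age)
    VALUES (source.Name, source.Address, source.Age);
-- rows already merged should be deleted from (or flagged in) stagingTable
-- between batches so the next TOP (10000) picks up fresh rows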
If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your T-Log got full, although that's just speculation. Make sure to throw in a checkpoint after each batch if you are in Full recovery mode.
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.
You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the CRUD operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
If the two tables are linked one-to-one, then you are wasting a lot of CPU time on each read, especially since you mentioned that the tables don't change at all. You should have only one table in this case.
Try creating a new table including (at least) the two columns from the two tables.
You can do this by:
SELECT *
INTO newTable
FROM A
LEFT JOIN B ON A.x = B.y
or (if some people don't have the information from the text file):
SELECT *
INTO newTable
FROM A
INNER JOIN B ON A.x = B.y
And note that you should at least have indexes on the join fields (to speed up the process).
More details about the fields may help giving more precise answer as well.