I need to create a table based on another table in MySQL, including the constraints and indices.
I have following scenario:
Table A exists, probably with millions of rows.
I want to create table B exactly the same as table A (including constraints and indices).
Process data from A and some other source and insert it into B.
At the end of processing, drop table A (along with its indices) and rename table B to A, including its indices.
What is the best way to do this? Performance is my real concern.
Thanks
In cases like this, we assume you know the structure of the table. In other words, you are not asking "how do I find out what all of these columns, indexes and constraints are".
Second, we tend to assume that all data in table A is valid, so you do not need to enforce constraints while copying from A to B.
Your "some other source" is a wildcard. I'm assuming you do not know whether this other source contains valid data, and would suggest:
1) Create B without indexes or constraints
2) Copy/bulk insert from "other source" to B
3) Enforce constraints by issuing SELECTs to find invalid rows. Skip this step if you know the data is valid. Once it is OK to proceed:
4) Copy A to B in "chunks". The issue here is that a straight SELECT...INTO... of all X million rows will take forever (because of the explosion of resources required to do it in a single implied transaction), but row-by-row will also take forever (because it's just plain slow to do one row at a time). So you process chunks of 1,000 or 10,000 rows at a time (see the sketch after this list).
5) When all data is copied over, add the indexes
6) Add the constraints
7) Drop A
8) Rename B
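Here is a minimal sketch of steps 1, 4, 5/6 and 7/8 in MySQL. The id and some_column names and the 10,000-row chunk size are assumptions; adjust them to your actual schema. CREATE TABLE ... AS SELECT is used because, unlike CREATE TABLE ... LIKE, it copies only the column definitions, not the indexes.

-- 1) Create B with A's columns but no indexes or constraints.
CREATE TABLE B AS SELECT * FROM A WHERE 1 = 0;

-- 4) Copy A to B in chunks, keyed on an assumed integer primary key id.
--    Re-run this from application code, advancing @last_id to the largest
--    id copied so far, until no more rows are inserted.
SET @last_id = 0;
INSERT INTO B
SELECT * FROM A
WHERE id > @last_id
ORDER BY id
LIMIT 10000;

-- 5/6) Add the indexes and constraints once the data is in place.
ALTER TABLE B ADD INDEX idx_some_column (some_column);

-- 7/8) Swap the tables; here done as one atomic rename rather than a
--      separate drop and rename, so readers never see a missing table.
RENAME TABLE A TO A_old, B TO A;
DROP TABLE A_old;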
Problem description
In an ETL pipeline, we update a table in an SQL database with a pandas dataframe. The table has about 2 million rows, and the dataframe updates approximately 1 million of them. We do it with SQLAlchemy in Python, and the database is SQL Server (I think this is not too relevant to the question, but I'm including it for the sake of completeness).
At the moment, the code is "as found", the update consisting of the following steps:
Split the dataframe into many sub-dataframes (the number appears to be fixed and arbitrary; it does not depend on the dataframe size).
For each sub-dataframe, do an update query.
As it is, the process takes what (in my admittedly very limited SQL experience) appears to be too much time, about 1-2 hours. The table schema consists of 4 columns:
An id as the primary key
2 columns that are foreign keys (primary keys in their respective tables)
A 4th column
Questions
What can I do to make the code more efficient and faster? Since the UPDATE is done in blocks, I'm unsure whether the index is recalculated every time (since the id value is not changed, I don't see why that would be the case). I also don't know how the foreign key values (which could change for a given row) enter into the complexity calculation.
At what point does it make sense, if at all, to insert all the rows into a new auxiliary table, recalculate the index only at the end, truncate the original table and copy the auxiliary table into it (a sketch of this idea follows)? Are there any subtleties with this approach regarding the indices, foreign keys, etc.?
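As a point of reference for question 2, here is a rough sketch of the auxiliary-table idea in T-SQL: bulk-load the changed rows into a staging table, then apply them with a single set-based UPDATE instead of many chunked ones. The table and column names (target_table, fk_one, fk_two, value) are placeholders standing in for the real four-column schema.

-- Staging table mirroring the key plus the columns that can change.
CREATE TABLE #staging (
    id     INT PRIMARY KEY,
    fk_one INT,
    fk_two INT,
    value  NVARCHAR(100)
);

-- (bulk insert the ~1 million changed rows into #staging here)

-- One set-based UPDATE joining on the primary key.
UPDATE t
SET    t.fk_one = s.fk_one,
       t.fk_two = s.fk_two,
       t.value  = s.value
FROM   dbo.target_table AS t
JOIN   #staging         AS s ON s.id = t.id;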
Background
I have a huge table (Table A) in my database. I apply some filters to it based on business rules, and insert the filtered rows into another table (Table B).
Therefore, Table B will always contain data from A and will always be much smaller. Table A contains 500,000 entries and Table B contains 3000 entries.
Both tables have the same structure.
Issue
The huge table (Table A) can be updated at any moment without notice. Therefore, to ensure that Table B contains the most up-to-date business data, it needs to be refreshed regularly. In this instance, I do this once a week.
Solution
How I go about this is by setting up a simple stored procedure that does the following:
TRUNCATE Table B
Apply filters on Table A and INSERT data into Table B
The issue with this is that I have to truncate the table every week. I was wondering if there is a more efficient way of doing this, rather than deleting all the data and inserting it into the table all over again?
Is there a way to check which values are in A that are missing in B and add them accordingly?
Thanks
Instead of doing that in a stored procedure, create a View. A view stays in sync with its underlying table: as the data changes in Table A, the view automatically reflects it.
Here are some details you may want to read about views and how they work:
How to create Views
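A minimal sketch, assuming the business rules can be expressed as a WHERE clause (table_a, table_b and the filter shown are placeholders):

CREATE VIEW table_b AS
SELECT *
FROM   table_a
WHERE  status = 'ACTIVE';   -- stand-in for the real business-rule filters

Queries against table_b then always reflect the current contents of Table A, with no weekly truncate-and-reload.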
In a database on the cloud, I have a table with about ten thousand columns. I'm going to update it every few minutes with some local data, which is the output of a local program (my_col_val[] below). My questions are:
1- What is the best and fastest way to update each row? (A for loop?)
2- Is using a char buffer to hold the SQL query (szSQL[]) the best approach when the query is on the order of 1 MB in size?
My code (in C) now roughly looks like:
char szSQL[?];         // buffer holding the query text (what is the best size?)
char * my_col[?];      // column names
char * my_col_val[?];  // new values
SQLHSTMT hStmt = NULL;
// there should be about 8000 %s='%s' pairs, where n = 8000
sprintf(szSQL, "UPDATE my_table SET %s='%s', ..., %s='%s' WHERE ID = my_ID",
        my_col[0], my_col_val[0], ..., my_col[n], my_col_val[n]);
SQLExecDirect(hStmt, (SQLCHAR *)szSQL, SQL_NTS);
I like @Takarii's solution using three tables. The best strategy depends on 1) how the new rows of measurements are inserted and 2) what they will be used for. The latter is of particular interest, as that may require additional indexes, and these must be maintained by the db when executing the insert statements. The fewer indexes required, the faster the inserts will be. For example, although there is a relation between the three tables, the measurement table could omit declaring its foreign keys to the other tables, reducing that index overhead.
As the table grows and grows, the db will get slower and slower. It can then be beneficial to create a new table for each day of measurements.
As the sensor data is of different types, the data could be inserted as string data and only parsed by the retrieving program.
Another option: if the recorded data is only retrieved periodically, the measurements could be written to a flat file and inserted in batches periodically, say every hour.
Maybe these ideas can be of help.
Based on your comments and your request above, here are my suggestions:
1) As you suggested, an individual table for each machine (not ideal, but it will work)
Working on that assumption, you would want an individual row for each sensor, but the problem comes when you need to add additional machines - table create privileges are generally restricted by sysadmins.
2) Multiple tables to identify sensor information and assignment, along with a unified results table.
Table 1 - Machine
Table 2 - Sensor
Table 3 - Results
Table 1 would contain the information about the machine to which your sensors are assigned (machine_id, plus extra columns as needed)
Table 2 contains the sensor information - this is where your potential 10,000 columns would go; however, they are now rows with IDs (sensor_id, sensor_name)
Table 3 contains the results of the sensor readings, with an assignment to a sensor and then to a machine (result_id, machine_id(fk), sensor_id(fk), result)
Then, using joins, you can pull out the data for each machine as needed (see the sketch below). This will be far more efficient than your current 10k-column design.
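A rough sketch of that three-table layout and the join used to read one machine's measurements; the column names and types are illustrative, not prescriptive:

CREATE TABLE machine (
    machine_id INT PRIMARY KEY
    -- extra machine columns as needed
);

CREATE TABLE sensor (
    sensor_id   INT PRIMARY KEY,
    sensor_name VARCHAR(100)
);

CREATE TABLE result (
    result_id  INT PRIMARY KEY,
    machine_id INT REFERENCES machine (machine_id),
    sensor_id  INT REFERENCES sensor (sensor_id),
    result     VARCHAR(100)   -- stored as text and parsed by the reader, as suggested above
);

-- All readings for one machine:
SELECT m.machine_id, s.sensor_name, r.result
FROM   result  r
JOIN   sensor  s ON s.sensor_id  = r.sensor_id
JOIN   machine m ON m.machine_id = r.machine_id
WHERE  m.machine_id = 42;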
First off let me say I am running on SQL Server 2005 so I don't have access to MERGE.
I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file I need to delete them from the database and if they change or are new I need to update/insert accordingly.
After some testing I've found that, performance-wise, it is dramatically faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert as appropriate. However, I recently came across some posts discussing mimicking the MERGE functionality of SQL Server 2008 using a temp table and the OUTPUT clause of the UPDATE statement.
I was interested in this because I am looking into how I can eliminate the time in my Delete/Bulk Insert method when the table has no rows. I still think that this method will be the fastest so I am looking for the best way to solve the empty table problem.
Thanks
I think your fastest method would be to:
1) Drop all foreign keys and indexes from your table.
2) Truncate your table.
3) Bulk insert your data.
4) Recreate your foreign keys and indexes.
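A sketch of those four steps in T-SQL; the constraint, index, column, and file names are placeholders:

-- 1) Drop foreign keys and indexes (repeat per constraint/index).
ALTER TABLE dbo.target_table DROP CONSTRAINT FK_target_other;
DROP INDEX IX_target_col ON dbo.target_table;

-- 2) Truncate.
TRUNCATE TABLE dbo.target_table;

-- 3) Bulk insert from the daily text file.
BULK INSERT dbo.target_table
FROM 'C:\daily_file.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');

-- 4) Recreate the foreign keys and indexes.
ALTER TABLE dbo.target_table
    ADD CONSTRAINT FK_target_other FOREIGN KEY (other_id) REFERENCES dbo.other (id);
CREATE INDEX IX_target_col ON dbo.target_table (some_column);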
Is the problem that Joe's solution is not fast enough, or that you cannot have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will take a table lock that is held for the duration of the transaction, like so:
begin tran;
truncate table stage_table
bulk insert stage_table
from N'C:\datafile.txt'
commit tran;
An alternative solution which would satisfy your requirement for not having "down time" for the table you are updating.
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach that does not involve clearing down the table is as follows:
1) bulk load the file into a new, separate table (no indexes)
2) then create the PK on it
3) Run 3 statements to update the original table from this new (temporary) table:
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements
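A sketch of the three statements, assuming both tables are keyed on id and the staging table is called staging; the column names are placeholders:

-- 1) Delete rows that fell out of the text file.
DELETE m
FROM   dbo.main_table AS m
WHERE  NOT EXISTS (SELECT 1 FROM dbo.staging AS s WHERE s.id = m.id);

-- 2) Update rows that still exist in the file.
UPDATE m
SET    m.col1 = s.col1,
       m.col2 = s.col2
FROM   dbo.main_table AS m
JOIN   dbo.staging    AS s ON s.id = m.id;

-- 3) Insert rows that are new in the file.
INSERT INTO dbo.main_table (id, col1, col2)
SELECT s.id, s.col1, s.col2
FROM   dbo.staging AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.main_table AS m WHERE m.id = s.id);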
There is a way to update the table with zero downtime: keep two days' data in the table, and delete the old rows after loading the new ones!
Add a DataDate column representing the date for which your ~150K rows are valid.
Create a one-row, one-column table with "today's" DataDate.
Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
Update the DataDate table. The view updates instantly!
Delete yesterday's rows at your leisure.
SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.
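A sketch of the moving parts; the table, view, and column names are placeholders, and DATETIME is used because SQL Server 2005 has no DATE type:

-- The ~150K-row table gains a DataDate column.
ALTER TABLE dbo.daily_data ADD DataDate DATETIME NOT NULL DEFAULT GETDATE();

-- One-row, one-column pointer table holding "today's" DataDate.
CREATE TABLE dbo.current_datadate (DataDate DATETIME NOT NULL);
GO

-- Readers use this view instead of the table.
CREATE VIEW dbo.daily_data_current
AS
SELECT d.*
FROM   dbo.daily_data       AS d
JOIN   dbo.current_datadate AS c ON c.DataDate = d.DataDate;
GO

-- After bulk inserting the new day's rows (stamped with their DataDate):
UPDATE dbo.current_datadate SET DataDate = '20240102';  -- placeholder date; readers switch instantly

-- Then, at your leisure:
DELETE dbo.daily_data WHERE DataDate < '20240102';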
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes of course that having a period of time when the table is empty/doesn't exist is acceptable which it does sound like could be the case.
I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique constraints
Import the data (~100 million records)
Re-create the constraints and run ANALYZE on the fact table.
Importing the data is done by loading from files. For each file:
1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk upload tool)
2) Update each of the 9 dimension tables with any new data using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The ANALYZE is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates. (Is this advisable or necessary? At minimum it does not seem to significantly reduce performance.)
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?
I've tried several different approaches to normalizing data coming in from a source like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things, and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and the "SELECT DISTINCT ... EXCEPT SELECT" approach works well for me. Other folks may know better, but I've found your current method to be my preferred method.
During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead, I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is reduced to a call of the form INSERT INTO XXXFact VALUES (DimPKey1, DimPKey2, ... etc.)
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleansing process to identify and enrich these values later on.
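A rough sketch of one such get-or-insert function for the host dimension, written in PL/pgSQL against the host(id, name) table from the question; the function name, dummy-value handling, and concurrency behaviour are assumptions, not part of the original answer:

CREATE OR REPLACE FUNCTION get_or_insert_host_dim(p_name TEXT)
RETURNS INTEGER AS $$
DECLARE
    v_id INTEGER;
BEGIN
    -- Look the dimension row up first.
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NOT NULL THEN
        RETURN v_id;
    END IF;

    -- Not found: insert it and hand back the new surrogate key.
    INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;

The fact-table insert then becomes a call of the form INSERT INTO event (...) VALUES (..., get_or_insert_host_dim(...), ...). Note that concurrent loaders inserting the same dimension value would need a unique constraint plus a retry (or an upsert) to stay safe.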