How to load csv file to multiple tables in postgres (mainly concerned about best practice) - sql

I'm new to DB/postgres SQL.
Scenario:
Need to load an csv file into postgres DB. This CSV data needs to loaded into multiple tables according DB schema. I'm looking for better design using python script.
My thought:
1. Load CSV file to intermediate table in postgres
2. Write a trigger on intermediate table to insert data into multiple tables on event of insert
3. Trigger includes truncate data at end
Any suggestions for better design/other ways without any ETL tools, and also any info on modules in Python 3.
Thanks.

Rather than using a trigger, use an explicit INSERT or UPDATE statement. That is probably faster, since it is not invoked per row.
Apart from that, your procedure is fine.

Related

Updating/Inserting Oracle table from csv data file

I am still learning Oracle SQL, and I've been trying to find the best way to update/insert records in an OracleSQL table with the data from a CSV file.
So far, I've figured out how to load the csv into a temporary table using External Tables in Oracle, but I'm having difficulty finding a detailed guide on how to update/insert (UPSERT) the loaded data into an existing table.
What is the best way to do this, when I have 30+ fields in the table? For example, is it best to read the csv line by line with something like pandas and update each record one by one, or is it best to do it with a sql script using something like a merge statement? Not all records in the csv have a value for the primary key, in which case I need to insert rather than update. Thanks for the help!
That looks like a MERGE, indeed.
Data from external table would then be used to
update values in existing rows
create new rows in the target table
Pandas and row-by-row processing? I wouldn't do that. If you already have a powerful database, then use its capabilities. Row-by-row is usually slow-by-slow and there's rarely some benefit in doing it that way.

Do while loop with GPDB using talend

I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flatfile which is then used to load to different tables. I use Talend, and do a select * from table using the tgreenpluminput component and feed that to a tfileoutputdelimited. However due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack the permissions of a super user and unable to do a \copy to output it to a csv file. I think something like a do while or a tloop with more limited number of rows might work for me. But my table doesnt have any row_id or uid to distinguish the rows.
Please help me with suggestions how to solve this. Appreciate any ideas. Thanks!
If your requirement is to load data into different tables from one table, then you do not need to go for load into file and then from file to table.
There is a component named tGreenplumRow which allows you to write direct sql queries (DDL and DML queries) in it.
Below is a sample job,
If you notice, there are three insert statements inside this component. It will be executed one by one separated by semicolon.

Bteq Scripts to copy data between two Teradata servers

How do I copy data from multiple tables within one database to another database residing on a different server?
Is this possible through a BTEQ Script in Teradata?
If so, provide a sample.
If not, are there other options to do this other than using a flat-file?
This is not possible using BTEQ since you have mentioned both the databases are residing in different servers.
There are two solutions for this.
Arcmain - You need to use Arcmain Backup first, which creates files containing data from your tables. Then you need to use Arcmain restore which restores the data from the files
TPT - Teradata Parallel Transporter. This is a very advanced tool. This does not create any files like Arcmain. It directly moves the data between two teradata servers.(Wikipedia)
If I am understanding your question, you want to move a set of tables from one DB to another.
You can use the following syntax in a BTEQ Script to copy the tables and data:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH DATA AND STATS;
Or just the table structures:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH NO DATA AND NO STATS;
If you get real savvy you can create a BTEQ script that dynamically builds the above statement in a SELECT statement, exports the results, then in turn runs the newly exported file all within a single BTEQ script.
There are a bunch of other options that you can do with CREATE TABLE <...> AS <...>;. You would be best served reviewing the Teradata Manuals for more details.
There are a few more options which will allow you to copy from one table to another.
Possibly the simplest way would be to write a smallish program which uses one of their communication layers (ODBC, .NET Data Provider, JDBC, cli, etc.) and use that to take a select statement and an insert statement. This would require some work, but it would have less overhead than trying to learn how to write TPT scripts. You would not need any 'DBA' permissions to write your own.
Teradata also sells other applications which hides the complexity of some of the tools. Teradata Data Mover handles provides an abstraction layer between tools like arcmain and tpt. Access to this tool is most likely restricted to DBA types.
If you want to move data from one server to another server then
We can do this with the flat file.
First we have fetch data from source table to flat file through any utility such as bteq or fastexport.
then we can load this data into target table with the help of mload,fastload or bteq scripts.

SSIS storing logging variables in a derived column

I am developing SSIS packages that consist of 2 main steps:
Step 1: Grab all sorts of data from existing legacy systems and dump them into a series of staging tables in my database.
Step 2: Move the data from my staging tables into a more relational set of tables that I'm using specifically for my project.
In step 1 I'm just doing a bulk SELECT and a bulk INSERT; however, in step 2 I'm doing row-by-row inserts into my tables using OLEDB Command tasks so that I can log very specific row-level activity of everything that's happening. Here is my general layout for step 2 processes.
alt text http://dl.dropbox.com/u/2468578/screenshots/step_1.png
You'll notice 3 OLEDB tasks: 1 for the actual INSERT, and 2 for success/fail INSERTs into our logging table.
The main thing I'm logging is source table/id and destination table/id for each row that passes through this flow. I'm storing this stuff in variables and adding them to the data flow using a Derived Column so that I can easily map them to the query parameters of the stored procedures.
alt text http://dl.dropbox.com/u/2468578/screenshots/step_3.png
I've decided to store these logging values in variables instead of hard-coding the values in the SqlCommand field on the task, because I'm pretty sure you CAN'T put variable expressions in that field (i.e. exec storedproc #[User::VariableName],... ,... ,...). So, this is the best solution I've found.
alt text http://dl.dropbox.com/u/2468578/screenshots/step_2.png
Is this the best solution? Probably not.
Is it good performance wise to add 4 logging columns to a data flow that consists of 500,000 records? Probably not.
Can you think of a better way?
I really don't think calling an OLEDBCommand 500,000 times is going to be performant.
If you are already going to staging tables - load it all to a staging table and take it from there in T-SQL or even another dataflow (or to a raw file and then something else depending on your complete operation). A Bulk insert is going to be hugely more efficient.
to add to Cade's answer if you truly need the logging info on a row by row basis, your best best is to leverage the oledb destination and use one or both of the following transformations to add columns to the dataflow:
Derived Column Transformation
Audit Transformation
This should be your best bet and should't add much overhead

How do I handle large SQL SERVER batch inserts?

I'm looking to execute a series of queries as part of a migration project. The scripts to be generated are produced from a tool which analyses the legacy database then produces a script to map each of the old entities to an appropriate new record. THe scripts run well for small entities but some have records in the hundreds of thousands which produce script files of around 80 MB.
What is the best way to run these scripts?
Is there some SQLCMD from the prompt which deals with larger scripts?
I could also break the scripts down into further smaller scripts but I don't want to have to execute hundreds of scripts to perform the migration.
If possible have the export tool modified to export a BULK INSERT compatible file.
Barring that, you can write a program that will parse the insert statements into something that BULK INSERT will accept.
BULK INSERT uses BCP format files which come in traditional (non-XML) or XML. Does it have to get a new identity and use it in a child and you can't get away with using SET IDENTITY INSERT ON because the database design has changed so much? If so, I think you might be better off using SSIS or similar and doing a Merge Join once the identities are assigned. You could also load the data into staging tables in SQL using SSIS or BCP and then use regular SQL (potentially within SSIS in a SQL task) with the OUTPUT INTO feature to capture the identities and use them in the children.
Just execute the script. We regularly run backup / restore scripts that are 100's Mb in size. It only takes 30 seconds or so.
If it is critical not to block your server for this amount to time, you'll have to really split it up a bit.
Also look into the -tab option of mysqldump with outputs the data using TO OUTFILE, which is more efficient and faster to load.
It sounds like this is generating a single INSERT for each row, which is really going to be pretty slow. If they are all wrapped in a transaction, too, that can be kind of slow (although the number of rows doesn't sound that big that it would cause a transaction to be nearly impossible - like if you were holding a multi-million row insert in a transaction).
You might be better off looking at ETL (DTS, SSIS, BCP or BULK INSERT FROM, or some other tool) to migrate the data instead of scripting each insert.
You could break up the script and execute it in parts (especially if currently it makes it all one big transaction), just automate the execution of the individual scripts using PowerShell or similar.
I've been looking into the "BULK INSERT" from file option but cannot see any examples of the file format. Can the file mix the row formats or does it have to always be consistent in a CSV fashion? The reason I ask is that I've got identities involved across various parent / child tables which is why inserts per row are currently being used.