Import Oracle data dump and overwrite existing data - sql

I have an oracle dmp file and I need to import data into a table.
The data in the dump contains new rows and few updated rows.
I am using import command and IGNORE=Y, so it imports all the new rows well. But it doesn't import/overwrite the existing rows (it shows a warning of unique key constraint violated).
Is there some option to make the import UPDATE the existing rows with new data?

No. If you were using data pump then you could use the TABLE_EXISTS_ACTION=TRUNCATE option to remove all existing rows and import everything from the dump file, but as you want to update existing rows and leave any rows not in the new file alone - i.e. not delete them (I think, since you only mention updating, though that isn't clear) - that might not be appropriate. And as your dump file is from the old exp tool rather than expdp that's moot anyway, unless you can re-export the data.
If you do want to delete existing rows that are not in the dump then you could truncate all the affected tables before importing. But that would be a separate step that you'd have to perform yourself, its not something imp will do for you; and the tables would be empty for a while, so you'd have to have downtime to do it.
Alternatively you could import into new staging tables - in a different schema sinceimp doesn't support renaming either - and then use those to merge the new data into the real tables. That may be the least disruptive approach. You'd still have to design and write all the merge statements though. There's no built-in way to do this automatically.

You can import into temp table and then do record recon by joining with it.
Use impdp option REMAP_TABLE to load existing file into temp table.
impdp .... REMAP_TABLE=TMP_TABLE_NAME
when load is done run MERGE statement on existing table from temp table.

Related

How to append new and overwrite existing to SQL from PySpark?

So I have a table in an SQL database and I want to use Synapse (PySpark) to add new records and overwrite existing records. However in PySpark I can either use overwrite mode (which will delete old records that I am not pushing in the iteration) or append mode (which will not overwrite existing records).
So now I wonder what the best approach would be. I think these my options;
Option A: Load the old records first, combine in PySpark and then overwite everything. Downside is I have to load the whole table first.
Option B: Delete the records I will overwrite and then use append mode.
Downside is it requires extra steps that might fail.
Option C: A better way, I did not think of.
Thanks in advance.
Spark drivers don't really support that. But you can load the data into a staging table and then perform a MERGE or INSERT/UPDATE with TSQL through pyodbc (python) or jdbc (Scala).

Should I use SSIS or the SQL Server Import Export tool for a large bulk insert operation?

I will soon need to import millions of records into into a single SQL Server Database table which we use in production. The data to import will be available in the form of about 40 csv files, each having hundreds of thousands of records.
For each row, some of the column values are supplied by the csv files, whereas other rows will require values that I must specify.
I am trying to determine which tool to use. I noticed that SQL Server Management Studio comes with the Import Export Wizard. Is that tool advisable for this type of job? Or should I use SSIS instead?
Some other questions I have:
Should I "lock" the table during the operation?
Should I perform the insert into a copy of the production table and
then once the operation is validated, should I make the copy the
official version of the production table?
As you are having some logic to handle for the rows from CSV (some rows, you will insert and some rows require you to supply some values), you cannot have these kinds of logic in the Import Export Wizard. It is straightforward load. So, you have to go for SSIS only.
You need to have conditional branching to split the rows and supply values to the target table.
For the second question, If possible, I would suggest you to load to separate table and then rename them later. That way, production system users are not impacted by this loading.

Using SSIS Package, How to validate the source records for duplicate before inserting?

SQL Server 2012: using a SSIS package, how to validate the source records for duplicate before inserting?
Our source file is a .csv. We are facing duplicate records loaded in the staging table.
At present , we are following manual process of loading data.
How to validate the source file data against the destination table before loading and load only the valid records? Possibility of loading duplicate records not only because of the source file having duplicate records in it but also reloading the same file to the staging table.
We are not Truncate the staging table. We are keeping records as is.
Second question : How to pick the name of the source file and pass it in the loading ? Possibly having a derived column as "FileName" which will get loaded along with raw data to the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
(PrimaryKey,Column1,Column2,Column3)
SELECT
PrimaryKey,Column1,Column2,Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey=SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic, and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component but I strongly discourage it's use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'
Note in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? Most of the data loads I have dealt with, I designed for-each-loop-container where the file name/path would be populated in a user variable. As you said, you could just use a derived column transform to add a new column which gets the value from a variable. If you don't have the file name in a user variable, you could use expression task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.
Two years back I have faced the same problem with importing TSV files.
I tried many other solutions but best I could design is C# code script for such validation at its best.
What I did as a solution
Create one C# DataTable object in memory with Primary Key constraints,
like:-
DataColumn[] keyColumn = new DataColumn[30];
keyColumn[intJ] = dtFilterdPK.Columns["Column name"];
Then try to add one by one row from your CSV to this DataTables.
Whenever your data will get Duplication based on Primary Key will have an error
Handle this error code in (TRY)..CATCH block and make this duplication error as per your logging requirement.
Avoid those error records importing in DataTable object.
Atlast import your CSV file into your table as BulkImport
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
bulkCopy.DestinationTableName = "Your DB Table Name"; //Assign table name
bulkCopy.WriteToServer(dtToBeImport); //Write into Actual table.
}
Hope this will help you.

Extract and recreate DDL of database schema elsewhere

I guess I just cannot formulate the search query appropriately, but I cannot find an answer to the following simple question: how to use extracted DDL pieces to recreate tables, views etc. in a different database or a different schema?
For example, when I extract table DDL with
SELECT dbms_metadata.get_dependent_ddl ('TABLE', TABLE-NAME, SCHEMA) FROM dual
I get output with FOREIGN KEY there. If I now naively issue the resulting CREATE TABLE statements on a different database in e.g. alphabetical order of table names, I get "table or view doesn't exist" error, because constraints reference to non-yet-created tables.
What is the normal procedure of using DDL? Is it (easily) possible to recreate full scheme structure (short of full database dump) without using external tools?
You can use datapump export CONTENT option to only export the metadata for a schema:
CONTENT=[ALL | DATA_ONLY | METADATA_ONLY]
ALL unloads both data and metadata. This is the default.
DATA_ONLY unloads only table row data; no database object definitions are unloaded.
METADATA_ONLY unloads only database object definitions; no table row data is unloaded. Be aware that if you specify CONTENT=METADATA_ONLY, then when the dump file is subsequently imported, any index or table statistics imported from the dump file will be locked after the import.
The import process will create the objects and constraints, taking the dependencies into account.
If you want to see the DDL, and optionally run it manually, you can use the datapump import SQLFILE option to put the DDL into a file instead of executing it:
Specifies a file into which all of the SQL DDL that Import would have executed, based on other parameters, is written.
You can do similar things through SQL Developer and other clients, but those are 'external tools', whereas datapump might not fall into that category, even if you have to run it from the command line. There is a datapump API so you can even avoid the command line if you want to, though in some ways it's more complicated than using the expdp and impdp utilities.

mysql query to import data from dump to existing data and overwrite new entry

I want to import new DUMP file to my old database which has same schema. I want to overwrite if there is anychange in record with dump file record I am importing and want to add new records. I dont want to delete the existing records in the DB I am gonna import.
If you create the dump file using mysqldump, you can give it the option --replace,which generates a dump file containing replace statements instead of insert statements.
When this dump file is loaded into MySQL, records that matches primary or unique keys in the database will replace the old ones, while records not matching existing keys will be inserted.