How to append new and overwrite existing records to SQL from PySpark?

So I have a table in an SQL database and I want to use Synapse (PySpark) to add new records and overwrite existing records. However, in PySpark I can either use overwrite mode (which will delete old records that I am not pushing in this iteration) or append mode (which will not overwrite existing records).
So now I wonder what the best approach would be. I think these are my options:
Option A: Load the old records first, combine in PySpark and then overwrite everything. Downside is I have to load the whole table first.
Option B: Delete the records I will overwrite and then use append mode.
Downside is it requires extra steps that might fail.
Option C: A better way I did not think of.
Thanks in advance.

Spark drivers don't really support that, but you can load the data into a staging table and then perform a MERGE or INSERT/UPDATE with T-SQL through pyodbc (Python) or JDBC (Scala).
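A rough sketch of that staging-table route, assuming a plain JDBC write plus pyodbc; the connection strings, dbo.staging_table, dbo.target_table and the id/col1/col2 columns are all hypothetical, and df stands for the DataFrame holding this iteration's records:

import pyodbc

jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<db>"  # hypothetical
odbc_str = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<pwd>"  # hypothetical

# 1) Overwrite the staging table with just this iteration's records
(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.staging_table")
   .option("user", "<user>")
   .option("password", "<pwd>")
   .mode("overwrite")
   .save())

# 2) Upsert from staging into the target with a single T-SQL MERGE
merge_sql = """
MERGE dbo.target_table AS t
USING dbo.staging_table AS s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.col1 = s.col1, t.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2);
"""
with pyodbc.connect(odbc_str) as conn:
    conn.execute(merge_sql)
    conn.commit()

Existing rows you did not push stay untouched, matching rows are updated, and the rest are inserted; only the small staging table ever gets overwritten.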

Related

Updating/Inserting Oracle table from csv data file

I am still learning Oracle SQL, and I've been trying to find the best way to update/insert records in an Oracle SQL table with the data from a CSV file.
So far, I've figured out how to load the CSV into a temporary table using External Tables in Oracle, but I'm having difficulty finding a detailed guide on how to update/insert (UPSERT) the loaded data into an existing table.
What is the best way to do this when I have 30+ fields in the table? For example, is it best to read the CSV line by line with something like pandas and update each record one by one, or is it best to do it with an SQL script using something like a MERGE statement? Not all records in the CSV have a value for the primary key, in which case I need to insert rather than update. Thanks for the help!
That looks like a MERGE, indeed.
Data from the external table would then be used to:
update values in existing rows
create new rows in the target table
Pandas and row-by-row processing? I wouldn't do that. If you already have a powerful database, then use its capabilities. Row-by-row is usually slow-by-slow and there's rarely any benefit in doing it that way.
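A hedged sketch of what that MERGE could look like, with made-up names (target_table, csv_ext_table as the external table, id as the key, and only two of the 30+ columns shown); it is issued here through the python-oracledb driver, but the same statement runs fine from SQL*Plus or SQL Developer:

import oracledb

merge_sql = """
MERGE INTO target_table t
USING csv_ext_table s
   ON (t.id = s.id)
WHEN MATCHED THEN
    UPDATE SET t.col1 = s.col1,
               t.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (id, col1, col2)
    VALUES (s.id, s.col1, s.col2)
"""

# Connection details are placeholders -- adjust to your environment.
with oracledb.connect(user="scott", password="tiger", dsn="dbhost/service") as conn:
    with conn.cursor() as cur:
        cur.execute(merge_sql)
    conn.commit()

Rows whose key matches take the UPDATE branch and everything else takes the INSERT branch, all in one set-based statement.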

How to load csv file to multiple tables in postgres (mainly concerned about best practice)

I'm new to DB/postgres SQL.
Scenario:
Need to load a CSV file into a Postgres DB. This CSV data needs to be loaded into multiple tables according to the DB schema. I'm looking for a better design using a Python script.
My thought:
1. Load CSV file to intermediate table in postgres
2. Write a trigger on the intermediate table to insert data into multiple tables on insert
3. The trigger truncates the intermediate table at the end
Any suggestions for a better design or other ways to do this without any ETL tools, and also any info on modules in Python 3.
Thanks.
Rather than using a trigger, use an explicit INSERT or UPDATE statement. That is probably faster, since it is not invoked per row.
Apart from that, your procedure is fine.
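Along those lines, a rough Python sketch with psycopg2 (one common driver choice): COPY the CSV into the intermediate table, run set-based INSERT ... SELECT statements into the real tables, then truncate the intermediate table. The table and column names (staging_orders, customers, orders) are made up.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser password=secret host=localhost")  # hypothetical
with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # 1) Bulk-load the CSV into the intermediate table
        with open("input.csv") as f:
            cur.copy_expert(
                "COPY staging_orders FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )

        # 2) One set-based statement per target table, instead of a per-row trigger
        cur.execute("""
            INSERT INTO customers (customer_id, customer_name)
            SELECT DISTINCT customer_id, customer_name FROM staging_orders
            ON CONFLICT (customer_id) DO NOTHING
        """)
        cur.execute("""
            INSERT INTO orders (order_id, customer_id, amount)
            SELECT order_id, customer_id, amount FROM staging_orders
        """)

        # 3) Clear the intermediate table for the next load
        cur.execute("TRUNCATE staging_orders")
conn.close()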

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the ways I have seen so far to merge multiple ORC files suggest using ALTER TABLE with the CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that, so that my original table remains unchanged. But I can't do that either, for space and data redundancy reasons.
The thing I'm trying to achieve (ideally) is this: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way that I can merge the ORCs on the fly during the transfer to the cloud? Can this be achieved with or without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure the merge task; see details here: https://stackoverflow.com/a/45266244/2700344
Alternatively, you can force a single reducer. This method is only really suitable for files that are not too big. You can overwrite the same table with an ORDER BY; this forces a single reducer on the last ORDER BY stage. It will run slowly, or even fail, with big files, because all the data is passed through a single reducer:
INSERT OVERWRITE TABLE your_table
SELECT * FROM your_table
ORDER BY some_col; -- this will force a single reducer
As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in the ORDER BY.
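For the first method (the merge task), the idea is to enable Hive's hive.merge.* settings and then rewrite the table onto itself, so the final merge step produces fewer, larger ORC files. A sketch through PyHive; the host, user and table name are hypothetical and the thresholds are only illustrative, so check the exact setting names and defaults for your Hive version:

from pyhive import hive

conn = hive.Connection(
    host="hiveserver2.example.com",
    port=10000,
    username="etl_user",
    database="default",
    configuration={
        "hive.merge.mapfiles": "true",
        "hive.merge.mapredfiles": "true",
        "hive.merge.tezfiles": "true",
        "hive.merge.orcfile.stripe.level": "true",     # fast stripe-level ORC merge
        "hive.merge.smallfiles.avgsize": "134217728",  # ~128 MB
        "hive.merge.size.per.task": "268435456",       # ~256 MB
    },
)
cur = conn.cursor()
# Rewrite the table onto itself; the merge task then combines the small ORC files
cur.execute("INSERT OVERWRITE TABLE my_table SELECT * FROM my_table")
cur.close()
conn.close()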

Import Oracle data dump and overwrite existing data

I have an oracle dmp file and I need to import data into a table.
The data in the dump contains new rows and a few updated rows.
I am using the import command with IGNORE=Y, so it imports all the new rows well, but it doesn't import/overwrite the existing rows (it shows a warning that a unique key constraint was violated).
Is there some option to make the import UPDATE the existing rows with new data?
No. If you were using data pump then you could use the TABLE_EXISTS_ACTION=TRUNCATE option to remove all existing rows and import everything from the dump file, but as you want to update existing rows and leave any rows not in the new file alone - i.e. not delete them (I think, since you only mention updating, though that isn't clear) - that might not be appropriate. And as your dump file is from the old exp tool rather than expdp that's moot anyway, unless you can re-export the data.
If you do want to delete existing rows that are not in the dump then you could truncate all the affected tables before importing. But that would be a separate step that you'd have to perform yourself; it's not something imp will do for you. And the tables would be empty for a while, so you'd have to have downtime to do it.
Alternatively you could import into new staging tables - in a different schema, since imp doesn't support renaming either - and then use those to merge the new data into the real tables. That may be the least disruptive approach. You'd still have to design and write all the merge statements though; there's no built-in way to do this automatically.
You can import into a temp table and then reconcile the records by joining with it.
Use the impdp option REMAP_TABLE to load the dump into a temp table (the format is old_name:new_name):
impdp .... REMAP_TABLE=ORIGINAL_TABLE:TMP_TABLE_NAME
When the load is done, run a MERGE statement on the existing table from the temp table.
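A short sketch of that flow, with placeholder names (original_table, tmp_table_name, key column id, data column col1); the merge is issued here through python-oracledb, but SQL*Plus works just as well:

# Step 1 (shell, not Python): remap the table on import, e.g.
#   impdp ... REMAP_TABLE=ORIGINAL_TABLE:TMP_TABLE_NAME
# Step 2: merge the temp table into the existing table.
import oracledb

merge_sql = """
MERGE INTO original_table t
USING tmp_table_name s
   ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
WHEN NOT MATCHED THEN INSERT (id, col1) VALUES (s.id, s.col1)
"""

with oracledb.connect(user="app_user", password="secret", dsn="dbhost/service") as conn:  # hypothetical
    conn.cursor().execute(merge_sql)
    conn.commit()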

remove source file from Hive table

When I load a (csv) file into a Hive table, I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. user/warehouse/dbname/tablName/datafile1.csv). And probably some metadata is updated.
After a few loads I want to remove the contents of one specific file from the table.
I am sure I cannot simply delete the file, because the metadata needs to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? I mean, Hive was developed to serve as a warehouse where you put lots and lots and lots of data, not to delete data every now and then. Such a need seems to point to a poorly thought out schema or a poor use of Hive, at least to me.
And if you really have this kind of need, why don't you create partitioned tables? If you need to delete some specific data, just delete that particular partition using either TRUNCATE or ALTER.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data in the bad file, or just copy the good files back to the OS with "hadoop fs -copyToLocal" and move them back into HDFS as a new table.
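For the CTAS route, one hedged sketch: Hive exposes an INPUT__FILE__NAME virtual column, so you can copy everything except the rows that came from the unwanted file into a new table. Connection details, table names and the file name are hypothetical, and it is run here through PyHive, though any Hive client will do:

from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl_user")  # hypothetical
cur = conn.cursor()

# New table with every row except those read from the unwanted source file
cur.execute("""
    CREATE TABLE dbname.tablename_clean AS
    SELECT * FROM dbname.tablename
    WHERE INPUT__FILE__NAME NOT LIKE '%datafile1.csv'
""")
cur.close()
conn.close()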