I have an external application which creates CSV files. I would like to load these files into SQL automatically, but incrementally.
I was looking into BULK INSERT, but I do not think it is incremental. The CSV files can get huge, so incremental loading will be the way to go.
Thank you.
The usual way to handle this is to bulk insert the entire CSV into a staging table, and then do the incremental merge of the data in the staging table into the final destination table with a stored procedure.
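A minimal T-SQL sketch of that pattern (the names dbo.StagingData, dbo.TargetData, the key column Id, and the file path are placeholders; the MERGE assumes Id uniquely identifies a row):
-- load the entire CSV into the staging table
BULK INSERT dbo.StagingData
FROM 'C:\data\export.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
-- upsert the staged rows into the destination table
MERGE dbo.TargetData AS tgt
USING dbo.StagingData AS src
ON tgt.Id = src.Id
WHEN MATCHED THEN
UPDATE SET tgt.SomeColumn = src.SomeColumn
WHEN NOT MATCHED THEN
INSERT (Id, SomeColumn) VALUES (src.Id, src.SomeColumn);
-- clear the staging table for the next load
TRUNCATE TABLE dbo.StagingData;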
If you are still concerned that the CSV files are too big for this, the next step is to write a program that reads the CSV, produces a smaller file containing only the new/changed data you want to import, and then bulk insert that smaller CSV instead of the original one.
Create a text or CSV file listing the names of all the CSV files you want to load into the table (include the file paths if they are not all the same). You can generate this list using shell scripting.
Then, using a procedure, load all of those CSV file names into a temporary table.
Using that temporary table, loop over its rows and load each file into the target table (without truncating inside the loop). If a truncate is required, do it before the loop. You can load the data into a staging table first if any transformation is required (use a procedure for the transformation); a rough sketch follows below.
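A rough T-SQL sketch of that loop, assuming the generated file list is loaded into a temp table #csv_files and dbo.TargetTable is the destination (all names are placeholders):
-- table holding the CSV file paths produced by the shell script
CREATE TABLE #csv_files (file_path NVARCHAR(500));
-- ...load the generated list of file names into #csv_files here...
DECLARE @file NVARCHAR(500), @sql NVARCHAR(MAX);
DECLARE file_cursor CURSOR FOR SELECT file_path FROM #csv_files;
OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @file;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- build and run a BULK INSERT for each file in the list
    SET @sql = N'BULK INSERT dbo.TargetTable FROM ''' + @file +
               N''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM file_cursor INTO @file;
END;
CLOSE file_cursor;
DEALLOCATE file_cursor;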
We had the same problem and were using this method. Recently, we switched to Python, which does all of this and loads the data into a staging table; after transformations, the data is finally loaded into the target table.
I'm loading data from a named external stage (S3) using COPY INTO, and this S3 bucket keeps all of the old files.
Here's what I want:
When a new file comes in, truncate the table and load only the new file; if no new file comes in, just keep the old data without truncating.
I understand that I can set an option like FORCE = FALSE to avoid loading old files again, but how do I truncate the table only when a new file comes in?
I would likely do this a bit differently, since there isn't a way to truncate/delete records in the target table from the COPY command. This will be a multi-step process, but can be automated via Snowflake:
Create a transient table. For the sake of description, I'll call it STG_TABLE. You will also keep your existing target table, called TABLE.
Modify your COPY command to load to STG_TABLE.
Create a STREAM called STR_STG_TABLE over STG_TABLE.
Create a TASK called TSK_TABLE with the following statement
This statement will execute only if your COPY command actually loaded any new data.
CREATE OR REPLACE TASK TSK_TABLE
WAREHOUSE = warehouse_name
SCHEDULE = '5 MINUTE' -- a schedule (or a parent task) is needed for the task to run on its own; adjust the interval
WHEN SYSTEM$STREAM_HAS_DATA('STR_STG_TABLE')
AS
INSERT OVERWRITE INTO TABLE (fields)
SELECT fields FROM STR_STG_TABLE;
The other benefit of using this method is that your transient table will have the full history of your files, which can be nice for debugging issues.
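For reference, a rough sketch of steps 1-3 (the stage name @my_s3_stage and the CSV file format are placeholders; TABLE stands for the existing target table, as above):
-- step 1: transient staging table mirroring the target table's structure
CREATE OR REPLACE TRANSIENT TABLE STG_TABLE LIKE TABLE;
-- step 2: point the existing COPY command at the staging table
COPY INTO STG_TABLE
FROM @my_s3_stage
FILE_FORMAT = (TYPE = CSV)
FORCE = FALSE;
-- step 3: stream that tracks new rows arriving in the staging table
CREATE OR REPLACE STREAM STR_STG_TABLE ON TABLE STG_TABLE;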
I'm new to DB/postgres SQL.
Scenario:
I need to load a CSV file into a Postgres DB. The CSV data needs to be loaded into multiple tables according to the DB schema. I'm looking for a good design using a Python script.
My thought:
1. Load the CSV file into an intermediate table in Postgres
2. Write a trigger on the intermediate table that inserts the data into the multiple target tables on each insert event
3. The trigger truncates the intermediate table at the end
Any suggestions for a better design or other ways to do this without any ETL tools, and any info on useful modules in Python 3?
Thanks.
Rather than using a trigger, use an explicit INSERT or UPDATE statement. That is probably faster, since it is not invoked per row.
Apart from that, your procedure is fine.
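A minimal sketch of that approach, assuming a staging table stg_orders feeding two hypothetical target tables (all table and column names are placeholders):
-- 1. load the CSV into the intermediate/staging table
COPY stg_orders (customer_name, product_name, quantity)
FROM '/path/to/example.csv' DELIMITER ',' CSV HEADER;
-- 2. distribute the staged rows into the schema tables with explicit statements
INSERT INTO customers (name)
SELECT DISTINCT customer_name FROM stg_orders;
INSERT INTO order_items (product_name, quantity)
SELECT product_name, quantity FROM stg_orders;
-- 3. clear the staging table for the next file
TRUNCATE stg_orders;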
I know that by doing:
COPY test FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
I can import csv data to postgresql.
However, I do not have a static csv file. My csv file gets downloaded several times a day and it includes data which has previously been imported to the database. So, to get a consistent database I would have to leave out this old data.
My best-case idea for handling this would be something like the above; the worst case would be a Java program that manually checks each entry of the DB against the CSV file. Any recommendations for the implementation?
I really appreciate your answer!
You can dump the latest data into a temp table using the COPY command and then MERGE the temp table with the live table.
If you are using a Java program to execute the COPY command, try the CopyManager API.
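A minimal SQL sketch of the temp-table approach, assuming the target table test has a unique constraint on an id column (adjust names to your schema):
-- stage the latest CSV in a temporary table with the same structure
CREATE TEMP TABLE test_stage (LIKE test INCLUDING ALL);
COPY test_stage FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
-- copy over only the rows that are not already in the live table
INSERT INTO test
SELECT * FROM test_stage
ON CONFLICT (id) DO NOTHING;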
I am making regular backups of my MySQL database with mysqldump. This gives me a .sql file with CREATE TABLE and INSERT statements, allowing me to restore my database on demand. However, I have yet to find a good way to extract specific data from this backup, e.g. extract all rows from a certain table matching certain conditions.
Thus, my current approach is to restore the entire file into a new temporary database, extract the data I actually want with a new mysqldump call, delete the temporary database and then import the extracted lines into my real database.
Is this really the best way to do this? Is there some sort of script that can directly parse the .sql file and extract the relevant lines? I don't think there is an easy solution with grep and friends unfortunately, as mysqldump generates INSERT statements that insert many values per line.
The solution to this just ended up being to import the whole file, extract the data I needed and drop the database again.
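A rough sketch of that workflow (scratch_restore, real_db, target_table, and the WHERE condition are placeholders):
-- create a scratch database and restore the dump into it
CREATE DATABASE scratch_restore;
-- from the shell: mysql scratch_restore < backup.sql
-- copy just the rows you need into the real database
INSERT INTO real_db.target_table
SELECT * FROM scratch_restore.target_table
WHERE some_condition;
-- throw the scratch copy away
DROP DATABASE scratch_restore;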
Is it possible for me to write an SQL query from within PhpMyAdmin that will search for matching records from a .csv file and match them to a table in MySQL?
Basically I want to do a WHERE IN query, but I want the WHERE IN to check records in a .csv file on my local machine, not a column in the database.
Can I do this?
I'd load the .csv content into a new table, do the comparison/merge and drop the table again.
Loading .csv files into MySQL tables is easy:
LOAD DATA INFILE 'path/to/industries.csv'
INTO TABLE `industries`
FIELDS TERMINATED BY ';'
IGNORE 1 LINES (`nogaCode`, `title`);
There are a lot more options you can pass to the LOAD DATA command, like which character wraps the entries (ENCLOSED BY), etc.
I would do the following:
Create a temporary or MEMORY table on the server
Copy the CSV file to the server
Use the LOAD DATA INFILE command
Run your comparison
There is no way to have the CSV file on the client and the table on the server and be able to compare the contents of both using only SQL.
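Putting those steps together, a minimal sketch (csv_values, my_table, some_column, and the file path are placeholders):
-- scratch table to hold the values from the CSV (after copying the file to the server)
CREATE TEMPORARY TABLE csv_values (value VARCHAR(255));
LOAD DATA INFILE '/path/on/server/values.csv'
INTO TABLE csv_values
LINES TERMINATED BY '\n';
-- the WHERE IN comparison against the loaded values
SELECT t.*
FROM my_table AS t
WHERE t.some_column IN (SELECT value FROM csv_values);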
Short answer: no, you can't.
Long answer: you'll need to build the query locally, maybe with a script (Python/PHP), or just upload the CSV into a table and do a JOIN query (or just a WHERE x IN (SELECT y FROM my_tmp_table ...)).
For anyone asking this now, there is a new tool that I used: Write SQL on CSV file