Writing data flow to postgresql - sql

I know that by doing:
COPY test FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
I can import csv data to postgresql.
However, I do not have a static csv file. My csv file gets downloaded several times a day and it includes data which has previously been imported to the database. So, to get a consistent database I would have to leave out this old data.
My bestcase idea to realize this would be something like above. However, worstcase would be a java program manually checks each entry of the db with the csv file. Any recommendations for the implementation?
I really appreciate your answer!

You can dump latest data to the temp table using COPY command and MERGE temp table with the live table.
If you are using JAVA program for execute COPY command, then try CopyManager API.

Related

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databticks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
my suggestion is to use the Azure SQL Spark library, as also mentioned in documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to have good performances. Just load your file into a DataFrame and bulk copy it to Azure SQL
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!

CSV file to SQL Table incremental

I have an external application which creates CSV files. I would like to write these files into SQL automatically but as incremental.
I was looking into Bulk Insert, but i do not think this is incremental. The CSV files can get huge so incremental will be the way to go.
Thank you.
The usual way to handle this is to bulk insert the entire CSV into a staging table, and then do the incremental merge of the data in the staging table into the final destination table with a stored procedure.
If you are still concerned that the CSV files are too big for this, the next step is to write a program that reads the CSV, and produces a truncated file with only the new/changed data that you want to import, and then bulk insert that smaller CSV instead of the original one.
Create a text or csv file with names of all the csv files which you want to load in the table. You can include file path if not repeated. You can do this using shell scripting.
Then make a temporary table which loads all the csv file names which need to be inserted. Using a procedure.
Using above temporary table, loop by the number of rows and load it to the target table(without truncating in the loop). If truncate is required then do it before the loop. You can load data to staging if any transformation is required(use procedure for transformation)
We also had the same problem and we were using this method. Recently, we switched to using Python which does all the task and loads data into a staging table. After transformations, it is finally loaded into the target table.

SQL Server - Copying data between tables where the Servers cannot be connected

We want some of our customers to be able to export some data into a file and then we have a job that imports that into a blank copy of a database at our location. Note: a DBA would not be involved. This would be a function within our application.
We can ignore table schema differences - they will match. We have different tables to deal with.
So on the customer side the function would ran somethiug like:
insert into myspecialstoragetable select * from source_table
insert into myspecialstoragetable select * from source_table_2
insert into myspecialstoragetable select * from source_table_3
I then run a select * from myspecialstoragetable and get a .sql file they can then ship to me which we can then use some job/sql script to import into our copy of the db.
I'm thinking we can use XML somehow, but I'm a little lost.
Thanks
Have you looked at the bulk copy utility bcp? You can wrap it with your own program to make it easier for less sophisticated users.
Since it is a function within your application, in what language is the application front-end written ? If it is .NET, you can use Data Transformation Services in SQL Server to do a sample export. In the last step, you could save the steps into a VB/.NET module. If necessary, modify this file to change table names etc. Integrate this DTS module into your application. While doing the sample export, export it to a suitable format such as .CSV, .Excel etc, whichever format from which you will be able to import into a blank database.
Every time the user wants do an export, he will have to click on a button that would invoke the DTS module integrated into your application, that will dump the data to the desired format. He can mail such file to you.
If your application is not written in .NET, in whichever language it is written, it will have options to read data from SQL Server and dump them to a .CSV or text file with delimiters. If it is a primitive language, you may have to do it by concatenating the fields of every record, by looping through the records and writing to a file.
XML would be too far-fetched for this, though it's not impossible. At your end, you should have the ability to parse the XML file and import it into your location. Also, XML is not really suited if the no. of records are too large.
You probably think of a .sql file, as in MySql. In SQL Server, .sql files, that are generated by the 'Generate Scripts' function of SQL Server's interface, are used for table structures/DDL rather than the generation of the insert statements for each of the record's hard values.

Import Excel Data into PostgreSQL 9.3

I've developed a huge table in excel and now facing problem in transferring it into the postgresql database. I've downloaded the odbc software and I'm able to open table created in postgresql with excel. However, I'm not able to do it in a reverse manner which is creating a table in excel and open it in the postgresql. So I would like to know it is can be done in this way or is there any alternative ways that can create a large table with pgAdmin III cause inserting the data raw by raw is quite tedious.
The typical answer is this:
In Excel, File/Save As, select CSV, save your current sheet.
transfer to a holding directory on the Pg server the postgres user can access
in PostgreSQL:
COPY mytable FROM '/path/to/csv/file' WITH CSV HEADER; -- must be superuser
But there are other ways to do this too. PostgreSQL is an amazingly programmable database. These include:
Write a module in pl/javaU, pl/perlU, or other untrusted language to access file, parse it, and manage the structure.
Use CSV and the fdw_file to access it as a pseudo-table
Use DBILink and DBD::Excel
Write your own foreign data wrapper for reading Excel files.
The possibilities are literally endless....
For python you could use openpyxl for all 2010 and newer file formats (xlsx).
Al Sweigart has a full tutorial from automate the boring parts on working with excel spreadsheets its very indepth and the whole book and accompanying Udemy course are great resources.
From his example
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> wb.get_sheet_names()
['Sheet1', 'Sheet2', 'Sheet3']
>>> sheet = wb.get_sheet_by_name('Sheet3')
>>> sheet
<Worksheet "Sheet3">
Understandably once you have this access you can now use psycopg to parse the data to postgres as you normally would do.
This is a link to a list of python resources at python-excel also xlwings provides a large array of features for using python in place of vba in excel.
You can also use psql console to execute \copy without need to send file to Postgresql server machine. The command is the same:
\copy mytable [ ( column_list ) ] FROM '/path/to/csv/file' WITH CSV HEADER
A method that I use is to load the table into R as a data.frame, then use dbWriteTable to push it to PostgreSQL. These two steps are shown below.
Load Excel data into R
R's data.frame objects are database-like, where named columns have explicit types, such as text or numbers. There are several ways to get a spreadsheet into R, such as XLConnect. However, a really simple method is to select the range of the Excel table (including the header), copy it (i.e. CTRL+C), then in R use this command to get it from the clipboard:
d <- read.table("clipboard", header=TRUE, sep="\t", quote="\"", na.strings="", as.is=TRUE)
If you have RStudio, you can easily view the d object to make sure it is as expected.
Push it to PostgreSQL
Ensure you have RPostgreSQL installed from CRAN, then make a connection and send the data.frame to the database:
library(RPostgreSQL)
conn <- dbConnect(PostgreSQL(), dbname="mydb")
dbWriteTable(conn, "some_table_name", d)
Now some_table_name should appear in the database.
Some common clean-up steps can be done from pgAdmin or psql:
ALTER TABLE some_table_name RENAME "row.names" TO id;
ALTER TABLE some_table_name ALTER COLUMN id TYPE integer USING id::integer;
ALTER TABLE some_table_name ADD PRIMARY KEY (id);
As explained here http://www.postgresonline.com/journal/categories/journal/archives/339-OGR-foreign-data-wrapper-on-Windows-first-taste.html
With ogr_fdw module, its possible to open the excel sheet as foreign table in pgsql and query it directly like any other regular tables in pgsql.
This is useful for reading data from the same regularly updated table
To do this, the table header in your spreadsheet must be clean, the current ogr_fdw driver can't deal with wide-width character or new lines etc. with these characters, you will probably not be able to reference the column in pgsql due to encoding issue. (Major reason I can't use this wonderful extension.)
The ogr_fdw pre-build binaries for windows are located here http://winnie.postgis.net/download/windows/pg96/buildbot/extras/
change the version number in link to download corresponding builds.
extract the file to pgsql folder to overwrite the same name sub-folders.
restart pgsql. Before the test drive, the module needs to be installed by executing:
CREATE EXTENSION ogr_fdw;
Usage in brief:
use ogr_fdw_info.exe to prob the excel file for sheet name list
ogr_fdw_info -s "C:/excel.xlsx"
use "ogr_fdw_info.exe -l" to prob a individual sheet and generate a table definition code.
ogr_fdw_info -s "C:/excel.xlsx" -l "sheetname"
Execute the generated definition code in pgsql, a foreign table is created and mapped to your excel file. it can be queried like regular tables.
This is especially useful, if you have many small files with the same table structure. Just change the path and name in definition, and update the definition will be enough.
This plugin supports both XLSX and XLS file.
According to the document it also possible to write data back into the spreadsheet file, but all the fancy formatting in your excel will be lost, the file is recreated on write.
If the excel file is huge. This will not work. which is another reason I didn't use this extension. It load data in one time.
But this extension also support ODBC interface, it should be possible to use windows' ODBC excel file driver to create a ODBC source for the excel file and use ogr_fdw or any other pgsql's ODBC foreign data wrapper to query this intermediate ODBC source. This should be fairly stable.
The downside is that you can't change file location or name easily within pgsql like in the previous approach.
A friendly reminder. The permission issue applies to this fdw extensions. since its loaded into pgsql service. pgsql must have access privileged to the excel files.
It is possible using ogr2ogr:
C:\Program Files\PostgreSQL\12\bin\ogr2ogr.exe -f "PostgreSQL" PG:"host=someip user=someuser dbname=somedb password=somepw" C:/folder/excelfile.xlsx -nln newtablenameinpostgres -oo AUTODETECT_TYPE=YES
(Not sure if ogr2ogr is included in postgres installation or if I got it with postgis extension.)
I have used Excel/PowerPivot to create the postgreSQL insert statement. Seems like overkill, except when you need to do it over and over again. Once the data is in the PowerPivot window, I add successive columns with concatenate statements to 'build' the insert statement. I create a flattened pivot table with that last and final column. Copy and paste the resulting insert statement into my EXISTING postgreSQL table with pgAdmin.
Example two column table (my table has 30 columns from which I import successive contents over and over with the same Excel/PowerPivot.)
Column1 {a,b,...} Column2 {1,2,...}
In PowerPivot I add calculated columns with the following commands:
Calculated Column 1 has "insert into table_name values ('"
Calculated Column 2 has CONCATENATE([Calculated Column 1],CONCATENATE([Column1],"','"))
...until you get to the last column and you need to terminate the insert statement:
Calculated Column 3 has CONCATENATE([Calculated Column 2],CONCATENATE([Column2],"');"
Then in PowerPivot I add a flattened pivot table and have all of the insert statement that I just copy and paste to pgAgent.
Resulting insert statements:
insert into table_name values ('a','1');
insert into table_name values ('b','2');
insert into table_name values ('c','3');
NOTE: If you are familiar with the power pivot CONCATENATE statement, you know that it can only handle 2 arguments (nuts). Would be nice if it allowed more.
You can handle loading the excel file content by writing Java code using Apache POI library (https://poi.apache.org/). The library is developed for working with MS office application data including Excel.
I have recently created the application based on the technology that will help you to load Excel files to the Postgres database. The application is available under http://www.abespalov.com/. The application is tested only for Windows, but should work for Linux as well.
The application automatically creates necessary tables with the same columns as in the Excel files and populate the tables with content. You can export several files in parallel. You can skip the step to convert the files into the CSV format. The application handles the xls and xlsx formats.
Overall application stages are :
Load the excel file content. Here is the code depending on file extension:
{
fileExtension = FilenameUtils.getExtension(inputSheetFile.getName());
if (fileExtension.equalsIgnoreCase("xlsx")) {
workbook = createWorkbook(openOPCPackage(inputSheetFile));
} else {
workbook =
createWorkbook(openNPOIFSFileSystemPackage(inputSheetFile));
}
sheet = workbook.getSheetAt(0);
}
Establish Postgres JDBC connection
Create a Postgres table
Iterate over the sheet and inset rows into the table. Here is a piece of Java code :
{
Iterator<Row> rowIterator = InitInputFilesImpl.sheet.rowIterator();
//skip a header
if (rowIterator.hasNext()) {
rowIterator.next();
}
while (rowIterator.hasNext()) {
Row row = (Row) rowIterator.next();
// inserting rows
}
}
Here you can find all Java code for the application created for exporting excel to Postgres (https://github.com/palych-piter/Excel2DB).
the simplest answer is to use the psql command:
it's free and is include////
psql -U postgres -p 5432 -f sql-command-file.sql
I recently discovered https://sqlizer.io, it creates insert statements from an Excel file, supports MySQL and PostgreSQL. Not sure about if it supports large files though.
You can do that easily by DataGrip .
First save your excel file as csv formate . Open the excel file then SaveAs as csv format
Go to datagrip then create the table structure according to the csv file . Suggested create the column name as the column name as Excel column
right click on the table name from the list of table name of your database then click of the import data from file . Then select the converted csv file .
.

Extract specific data from full mysqldump backup

I am making regular backups of my MySQL database with mysqldump. This gives me a .sql file with CREATE TABLE and INSERT statements, allowing me to restore my database on demand. However, I have yet to find a good way to extract specific data from this backup, e.g. extract all rows from a certain table matching certain conditions.
Thus, my current approach is to restore the entire file into a new temporary database, extract the data I actually want with a new mysqldump call, delete the temporary database and then import the extracted lines into my real database.
Is this really the best way to do this? Is there some sort of script that can directly parse the .sql file and extract the relevant lines? I don't think there is an easy solution with grep and friends unfortunately, as mysqldump generates INSERT statements that insert many values per line.
The solution to this just ended up being to import the whole file, extract the data I needed and drop the database again.