Match multiline SQL statement in pgdump - sql

I have a PostgreSQL database dump produced by pg_dump version 9.5.2, which contains DDL statements and also INSERT INTO statements for each table in the given database. The dump looks like this:
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
INSERT INTO unimportant_table VALUES (123456, 'some data split into
- multiple
- lines
just for fun');
INSERT INTO important_table VALUES (987654321, 'some important data', 'another crap split into
- lines');
...
-- thousands of inserts into both tables
The dump file is really large and it is produced by another company, so I am not able to influence the export process. I need to create 2 files from this dump:
All DDL statements (all statements that don't start with INSERT INTO)
All INSERT INTO important_table statements (I want to restore only some tables from the dump)
If every statement were on a single line, without newline characters inside the data, it would be very easy to create the 2 SQL scripts with grep, for example:
grep -v '^INSERT INTO .*;$' my_dump.sql > ddl.sql
grep -o '^INSERT INTO important_table .*;$' my_dump.sql > important_table.sql
# Create empty structures
psql < ddl.sql
# Import only one table for now
psql < important_table.sql
At first I was thinking about using grep, but I did not find how to process multiple lines at once. Then I tried sed, but it returns only the single-line inserts. I also used https://regex101.com/ to work out the right regular expressions, but I don't know how to combine them with grep or sed:
^(?!(INSERT INTO)).*$ -- for ddl
^INSERT INTO important_table(\s|[[:alnum:]])*;$ -- for inserts
I found the similar question pcregrep multiline SQL match, but there is no answer there. I don't mind whether the solution works with grep, sed or whatever you suggest, but it should work on Ubuntu 18.04.4 LTS.

Here is a bash-based solution that uses perl one-liners to prepare your SQL dump data for the subsequent grep statements.
In my approach, the goal is to get one SQL statement per line via a script that I called prepare.sh. It got a little more complicated because I wanted to accommodate semicolons and quotes within your insert data strings (these, along with the line breaks, are represented by their hex codes in the intermediate output).
EDIT: In response to @32cupo's comment, below is a modified set of scripts that avoids xargs with large data sets (although I don't have huge dump files to test it with):
#!/bin/bash
# prepare.sh: read the dump on stdin and emit one SQL statement per line.
# Statement-terminating semicolons are tagged first; then backslashes, line
# breaks, double quotes and single quotes inside the data are replaced by
# their hex escapes (\x5c, \x0a, \x22, \x27) so that the later xargs/echo -e
# step can restore them; finally each tag becomes ";" plus a real newline.
perl -pne 's/;(?=\s*$)/__ENDOFSTATEMENT__/g' \
| perl -pne 's/\\/\\\\x5c/g' \
| perl -pne 's/\n/\\\\x0a/g' \
| perl -pne 's/"/\\\\x22/g' \
| perl -pne 's/'\''/\\\\x27/g' \
| perl -pne 's/__ENDOFSTATEMENT__/;\n/g'
Then, a separate script (called ddl.sh) includes your grep statement for the DDL (and, with the help of the loop, only feeds smaller chunks (lines) into xargs):
#!/bin/bash
# ddl.sh: drop the prepared INSERT lines, then let xargs/echo -e expand the
# hex-escaped characters back into the original multi-line statements.
while read -r line; do
<<<"$line" xargs -I{} echo -e "{}"
done < <(grep -viE '^(\\\\x0a)*insert into')
Another separate script (called important_table.sh) includes your grep statement for the inserts into important_table:
#!/bin/bash
# important_table.sh: keep only the prepared INSERT INTO important_table
# lines, then expand the hex-escaped characters with xargs/echo -e.
while read -r line; do
<<<"$line" xargs -I{} echo -e "{}"
done < <(grep -iE '^(\\\\x0a)*insert into important_table')
Here is the set of scripts in action (please also note that I spiced up your insert data with some semicolons and quotes):
~/$ cat dump.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
INSERT INTO unimportant_table VALUES (123456, 'some data split into
- multiple
- lines
;just for fun');
INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
- lines;');
...
-- thousands of inserts into both tables
~/$ cat dump.sql | ./prepare.sh | ./ddl.sh >ddl.sql
~/$ cat ddl.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
...
-- thousands of inserts into both tables
~/$ cat dump.sql | ./prepare.sh | ./important_table.sh > important_table.sql
~/$ cat important_table.sql
INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
- lines;');
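With the two files in place, restoring works the same way as in the question; here is a minimal sketch, assuming your default psql connection already points at the target database:
# Create the empty structures first, then load only the important table
psql -v ON_ERROR_STOP=1 < ddl.sql
psql -v ON_ERROR_STOP=1 < important_table.sql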

Related

integer expression expected shell scripting - BASH

I am trying to get the LAG between the Primary and Standby databases using the shell script below. The script works fine, printing "DATABASE IS OUTOFSYNC" or "DATABASE IS INSYNC", for an instance that has 1 node and therefore returns a single value, but I get the error "[: 0 1: integer expression expected" for an instance that has two nodes, which returns two LAG values, one for the first node and one for the second node.
So here is the code:
#!/bin/bash
get_status=$(sqlplus -s "/as sysdba" <<EOF
set pagesize 0 feedback off verify off heading off echo off;
SELECT prim.seq - tgt.seq seq_gap
FROM
(
SELECT thread#, MAX(sequence#) seq, MAX(completion_time) tm
FROM
v\$archived_log
GROUP BY
thread#
)
prim,
(
SELECT thread#, MAX(sequence#) seq, MAX(completion_time) tm
FROM
v\$archived_log
WHERE
dest_id IN
(
SELECT
dest_id
FROM
v\$archive_dest
WHERE
target = 'STANDBY'
)
AND
applied = 'YES'
GROUP BY
thread#
)
tgt
WHERE
prim.thread# = tgt.thread#;
exit;
EOF
)
if [ "$get_status" -ge 5 ]; then
echo "DATABASE IS OUTOFSYNC"
else
echo "DATABASE IS INSYNC"
fi
Is there a better way to write this script?
After adding typeset -p get_status after the query and before the if, I get the results below:
declare -- get_status=" 1
0"
./dgtest2.sh: line 41: [: 1
0: integer expression expected
DATABASE IS INSYNC
The query is returning more than one value/string (for 2 nodes or threads), and it seems my script is only coded to handle a single value/string from the query.
Is there a way to modify the script to handle multiple values/strings returned by the query?
The logic should be: report "DATABASE IS OUTOFSYNC" when the returned values are -ge 5, and "DATABASE IS INSYNC" when the returned values are -lt 5.
Hard-coding the case of one value -lt 5 and one value -ge 5 would not suffice, as the values constantly change on the database.
Any value from 0 to 4 returned by either node should be reported as "DATABASE IS INSYNC", and any value of 5 or more returned by either node should be reported as "DATABASE IS OUTOFSYNC".
One idea would be to capture the status values (returned by the sqlplus script) into an array and then loop through the array testing said status values.
Instead of:
variable=$(sqlplus ...)
We want:
variable=( $(sqlplus ...) )
For OP's current scripting, with a name change for the variable, we will replace this:
get_status=$(sqlplus -s "/as sysdba" <<EOF
set pagesize 0 feedback off verify off heading off echo off;
SELECT prim.seq - tgt.seq seq_gap
...
exit;
EOF
)
With this:
status_array=( $(sqlplus -s "/as sysdba" <<EOF
set pagesize 0 feedback off verify off heading off echo off;
SELECT prim.seq - tgt.seq seq_gap
...
exit;
EOF
) )
One idea for the follow-on logic testing:
default database status is INSYNC
if any status values are -ge 5 then set database status to OUTOFSYNC
The code for this looks like:
db_status='INSYNC'
for status in "${status_array[@]}"
do
[[ "${status}" -ge 5 ]] && db_status='OUTOFSYNC' && break
done
echo "DATABASE IS ${db_status}"
I'm not set up to run the sqlplus script, but I should be able to simulate the results with the following array assignments:
status_array=(1)
status_array=(7)
status_array=(0 1)
status_array=(5 7)
status_array=(5 3)
Running our code for each of these array assignments gives us:
##################### status_array=(1)
DATABASE IS INSYNC
##################### status_array=(7)
DATABASE IS OUTOFSYNC
##################### status_array=(0 1)
DATABASE IS INSYNC
##################### status_array=(5 7)
DATABASE IS OUTOFSYNC
##################### status_array=(5 3)
DATABASE IS OUTOFSYNC
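For reference, this is a minimal, self-contained sketch of the harness used for the runs above; the hard-coded test sets simply stand in for the sqlplus output:
#!/bin/bash
# Simulate the sqlplus results with a few hard-coded status sets.
for test_values in "1" "7" "0 1" "5 7" "5 3"; do
    status_array=( $test_values )   # stand-in for: status_array=( $(sqlplus ...) )
    echo "##################### status_array=(${test_values})"
    db_status='INSYNC'
    for status in "${status_array[@]}"; do
        [[ "${status}" -ge 5 ]] && db_status='OUTOFSYNC' && break
    done
    echo "DATABASE IS ${db_status}"
done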

Why is my script not printing output on one line?

I am using the following echo in a script, and after I execute it the output format is as shown below:
echo -e "UPDATE table1 SET table1_f1='$Fname' ,table1_f2='$Lname' where table1_f3='$id';\ncommit;" >> $OutputFile
output: UPDATE table1 SET table1_f1='Fname' ,table1_f2='Lname' where table1_f3='id
';
The '; is appearing on a new line. Why is that happening?
The variable $id in your shell script actually contains that newline (\n or \r\n) at the end; so there isn't really anything wrong in the part of the script you've shown here.
This effect is pretty common when the variable is created from the output of external commands or, as in your case, by reading external files.
For simple values, one way to strip the newline off the end of the value, prior to using it in your echo, is:
id=$( echo "${id}" | tr -d '\r' | tr -d '\n' );
or for scripts that already rely on a particular bash IFS value:
OLDIFS="${IFS}";
IFS=$'\n\t ';
id=$( echo "${id}" | tr -d '\r' | tr -d '\n' );
IFS="${OLDIFS}";

Postgres copy to TSV file with header

I have a function like so -
CREATE OR REPLACE FUNCTION ind (bucket text) RETURNS table (
middle character varying (100),
last character varying (100)
) AS $body$ BEGIN return query
select
fname as first,
lname as last
from all_records
; END;
$body$ LANGUAGE PLPGSQL;
How do I output the results of select ind ('Mob') into a tsv file?
I want the output to look like this -
first last
MARY KATHERINE
You can use the COPY command.
Example:
COPY (select * from ind('Mob')) TO '/tmp/ind.tsv' CSV HEADER DELIMITER E'\t';
The file '/tmp/ind.tsv' will contain your data.
Postgres doesn't allow COPY with a header for its plain-text (tab-separated) format, for some reason.
If you're using a linux based system you can do it with a script like this:
#create file with tab delimited column list (use \t between each column name)
echo -e "user_id\temail" > user_output.tsv
#now you can append the results of your query to that file by copying to STDOUT
psql -h your_host_name -d your_database_name -c "\copy (SELECT user_id, email FROM my_user_table) to STDOUT;" >> user_output.tsv
Alternatively, if your script is long and you don't want to pass it in with the -c option, you can use the same approach from a .sql file; use --quiet to avoid notices being written into your file:
psql --quiet -h your_host_name -d your_database_name -f your_sql_file.sql >> user_output.tsv
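Applied to the function from the question, the same pattern would look roughly like this (a sketch only: host, database and output file names are placeholders, and the header row matches the output the question asks for):
# Write the tab-separated header, then append the query output
echo -e "first\tlast" > ind.tsv
psql -h your_host_name -d your_database_name -c "\copy (SELECT * FROM ind('Mob')) to STDOUT;" >> ind.tsv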

string substitution from text file to another string

I have a text file with three columns of text (strings) per line. I want to create an SQL insert command by substituting each of the three strings into a skeleton SQL command. I have put place markers in the skeleton script and used sed s/placemarker1/first string/, but with no success. Is there an easier way to accomplish this task? I used pipes to repeat the process for the second string and so on. I actually used awk to get the fields but could not convert them to the actual values.
for i in [ *x100* ]; do
if [ -f "$i" ]; then {
grep -e "You received a payment" -e "Transaction ID:" -e "Receipt No: " $i >> ../temp
cat ../temp | awk 'NR == 1 {printf("%s\t",$9)} NR == 2 {printf("%s\t",$9)} NR == 3 {printf("%s\n",$3)}' | awk '{print $2,$1,$3}' | sed 's/(/ /' | sed 's/)./ /' >> ../temp1
cat temp1 | awk 'email="$1"; transaction="$2"; ccreceipt="$3";'
cat /home/linux014/opt/skeleton.sql | sed 's/EMAIL/"$email"/' | sed 's/TRANSACTION/"$transaction"/' | sed 's/CCRECEIPT/"$ccreceipt"/' > /home/linux014/opt/new-member.sql
rm -f ../temp
} fi
done
I cannot figure out how to get the values instead of the names of the variables inserted into my string.
Sample input (one line only):
catdog@gmail.com 2w4e5r6t7y8u9i8u7 1111-2222-3333-4444
Sample actual output:
INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('"$email"','"$transaction"','"$ccreceipt"');
Preferred output:
INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('catdog@gmail.com','2w4e5r6t7y8u9i8u7','1111-2222-3333-4444');
awk '{print "INSERT INTO users (email,paypal_tran,CCReceipt) VALUES"; print "(\x27"$1"\x27,\x27"$2"\x27,\x27"$3"\x27);"}' input.txt
Converts your sample input to preferred output. It should work for multi line input.
EDIT
The variables you are using in this line:
cat temp1 | awk 'email="$1"; transaction="$2"; ccreceipt="$3";'
are only visible to awk, and only within that command; they are not shell variables.
Also, in your sed commands, use double quotes instead of single quotes so the shell expands the values:
sed "s/EMAIL/$email/"
You can try this with bash:
while read email transaction ccreceipt; do echo "INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('$email','$transaction','$ccreceipt');"; done<inputfile
inputfile:
catdog@gmail.com 2w4e5r6t7y8u9i8u7 1111-2222-3333-4444
dog@gmail.com 2dsdsda53563u9i8u7 3333-4444-5555-6666
Test:
sat:~$ while read email transaction ccreceipt; do echo "INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('$email','$transaction','$ccreceipt')"; done<inputfile
INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('catdog@gmail.com','2w4e5r6t7y8u9i8u7','1111-2222-3333-4444')
INSERT INTO users (email,paypal_tran,CCReceipt) VALUES ('dog@gmail.com','2dsdsda53563u9i8u7','3333-4444-5555-6666')
You can write a small procedure for this
CREATE PROCEDURE [dbo].[appInsert]
@string VARCHAR(500)
AS
BEGIN
DECLARE @I INT
DECLARE @SubString VARCHAR(500)
SET @String = 'catdog@gmail.com 2w4e5r6t7y8u9i8u7 1111-2222-3333-4444'
SET @I = 1
SET @String = REPLACE(@String, ' ', '`~`')
WHILE @I > 0
BEGIN
SET @SubString = SUBSTRING (REVERSE(@String), 1, ( CHARINDEX( '`~`', REVERSE(@String)) - 1))
SET @String = SUBSTRING(@String, 1, LEN(@String) - CHARINDEX( '`~`', REVERSE(@String)) - 2 )
print REVERSE(@SubString) + ' === ' + @String
SET @I = CHARINDEX( '`~`', @String)
END
END

BCP/ Bulk Insert Fails (tab delimited file)

I have been trying to import data (tab delimited) into SQL server. The source data is exported from IBM Cognos. Data can be downloaded from: sample data
I have tried BCP / Bulk Insert, but it did not help. The original datafile contains a header row (which needs to be skipped).
==================================
Schema:
CREATE TABLE [dbo].[DIM_Assessment](
[QueryType] [nvarchar](4000) NULL,
[QueryDate] [nvarchar](4000) NULL,
[APUID] [nvarchar](4000) NULL,
[AssessmentID] [nvarchar](4000) NULL,
[ICDCode] [nvarchar](4000) NULL,
[ICDName] [nvarchar](4000) NULL,
[LoadDate] [nvarchar](4000) NULL
) ON [PRIMARY]
GO
=============================
The format file was generated using the following command:
bcp [dbname].dbo.dim_assessment format nul -c -f C:\config\dim_assessment.Fmt -S <IP> -U sa -P Pwd
Content of the format file:
11.0
7
1 SQLCHAR 0 8000 "\t" 1 QueryType SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 8000 "\t" 2 QueryDate SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 8000 "\t" 3 APUID SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 8000 "\t" 4 AssessmentID SQL_Latin1_General_CP1_CI_AS
5 SQLCHAR 0 8000 "\t" 5 ICDCode SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 0 8000 "\t" 6 ICDName SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 0 8000 "\r\n" 7 LoadDate SQL_Latin1_General_CP1_CI_AS
=============================
I tried importing the data using BCP / Bulk Insert; however, none of them worked.
bcp [dbname].dbo.dim_assessment IN C:\dim_assessment.dat -f C:\config\dim_assessment.Fmt -S <IP> -U sa -P Pwd
BULK INSERT dim_assessment FROM '\\dbserver\DIM_Assessment.dat'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\r\n'
);
GO
Thank you in advance for your help!
Your input file is in a terrible format.
Your format file and your BULK INSERT command both state that the end of a row should be a carriage return and line feed combination, and that there are seven columns of data. However if you open your CSV file in Notepad you will quickly see that the carriage returns and line feeds are not observed correctly in Windows (meaning they must be something other than precisely \r\n). You can also see that there aren't actually seven columns of data, but five:
QueryType QueryDate APUID AssessmentID ICDCode ICDName LoadDate
PPIC 2013-11-20 10:23:14 11431 10963 Tremors
PPIC 2013-11-20 10:23:14 11431 11299 THUMB PAIN
PPIC 2013-11-20 10:23:14 11431 11348 Environmental allergies
...
Just looking at it visually you can tell it isn't right, and you need to get a better source file before throwing it over the wall at SQL Server and expecting it to handle it smoothly.
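If you can get at the raw file from a Unix-like shell (Git Bash, WSL, or similar), one quick way to confirm what the terminators really are is to hex-dump the first few lines; the file name below is just the local copy mentioned in the question:
# Control characters are spelled out: tabs as \t, carriage returns as \r,
# line feeds as \n, so a wrong row terminator is immediately visible.
od -c dim_assessment.dat | head -n 20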
I just saved your file as .CSV and bulk inserted it with the following statement.
BULK INSERT dim_assessment FROM 'C:\Blabla\TestFile.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
);
GO
Returned Message
(22587 row(s) affected)
Loaded Data
Just notice that some data from ICDName has overflowed into the LoadDate column. Just use the pipe character | as the delimiter, with the same bulk insert statement and FIELDTERMINATOR = '|', and happy days.
Opening the file via Excel shows the following:
There are indeed 7 column headers
Only the first six of them are populated
Columns 1, 2 and 3 hold identical values
There is some confusing data, where the fifth column can be either empty, or filled with numbers, or filled with text.
I guess that, in these conditions, bulk insert might not work properly. As Excel seems to manage your file in quite a clean way, you should think about an extra step, from CSV to Excel and then to your database.
OK, so, this was a seemingly simple task: push delimited data from a flat file to SQL Server. I thought BCP was the way to go (I used it earlier and was successful).
A quick rundown of what was suggested:
a. fix the source file
b. save the source data in native Excel format
c. save the source data as pipe-delimited data
I tried all the options; each added extra steps to my process, but all were doable.
I stumbled upon the Invoke-Sqlcmd and Import-Csv cmdlets from PowerShell. It turns out I can import the data using PowerShell directly. It is a bit slow at this time, but I can live with that for now.
$DATA=IMPORT-CSV dim_assessment.CSV -Delimiter "`t"
FOREACH ($LINE in $DATA)
{
$QueryType="`'"+$Line.QueryType+"`'"
$QueryDate="`'"+$Line.QueryDate+"`'"
$APUID="`'"+$Line.APUID+"`'"
$AssessmentID="`'"+$Line.AssessmentID+"`'"
$ICDCode="`'"+$Line.ICDCode+"`'"
$ICDName=$Line.ICDName
$ICDName = $ICDName.replace("'","''")
$ICDName="`'"+$ICDName+"`'"
$LoadDate="`'"+$Line.LoadDate+"`'"
$SQLHEADER="INSERT INTO [dim_assessment] ([QueryType],[QueryDate],[APUID],[AssessmentID],[ICDCode],[ICDName],[LoadDate])"
$SQLVALUES="VALUES ($QueryType,$QueryDate,$APUID,$AssessmentID,$ICDCode,$ICDName,$LoadDate)"
$SQLQUERY=$SQLHEADER+$SQLVALUES
Invoke-Sqlcmd -Query $SQLQuery -ServerInstance HA -U sa -P Pwd
}
Thanks for all your help!