BCP / Bulk Insert fails (tab-delimited file) - sql

I have been trying to import tab-delimited data into SQL Server. The source data is exported from IBM Cognos. Data can be downloaded from: sample data
I have tried BCP and BULK INSERT, but neither helped. The original data file contains a header row (which needs to be skipped).
==================================
Schema:
CREATE TABLE [dbo].[DIM_Assessment](
[QueryType] [nvarchar](4000) NULL,
[QueryDate] [nvarchar](4000) NULL,
[APUID] [nvarchar](4000) NULL,
[AssessmentID] [nvarchar](4000) NULL,
[ICDCode] [nvarchar](4000) NULL,
[ICDName] [nvarchar](4000) NULL,
[LoadDate] [nvarchar](4000) NULL
) ON [PRIMARY]
GO
=============================
Format File generated using the following command
bcp [dbname].dbo.dim_assessment format nul -c -f C:\config\dim_assessment.Fmt -S <IP> -U sa -P Pwd
Content of the format file:
11.0
7
1 SQLCHAR 0 8000 "\t" 1 QueryType SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 8000 "\t" 2 QueryDate SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 8000 "\t" 3 APUID SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 8000 "\t" 4 AssessmentID SQL_Latin1_General_CP1_CI_AS
5 SQLCHAR 0 8000 "\t" 5 ICDCode SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 0 8000 "\t" 6 ICDName SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 0 8000 "\r\n" 7 LoadDate SQL_Latin1_General_CP1_CI_AS
=============================
I tried importing the data using both BCP and BULK INSERT; however, neither of them worked.
bcp [dbname].dbo.dim_assessment IN C:\dim_assessment.dat -f C:\config\dim_assessment.Fmt -S <IP> -U sa -P Pwd
BULK INSERT dim_assessment FROM '\\dbserver\DIM_Assessment.dat'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\r\n'
);
GO
Thank you in advance for your help.

Your input file is in a terrible format.
Your format file and your BULK INSERT command both state that a row ends with a carriage return/line feed combination, and that there are seven columns of data. However, if you open your data file in Notepad you will quickly see that the carriage returns and line feeds are not being honored correctly on Windows (meaning they must be something other than precisely \r\n). You can also see that there aren't actually seven columns of data, but five:
QueryType QueryDate APUID AssessmentID ICDCode ICDName LoadDate
PPIC 2013-11-20 10:23:14 11431 10963 Tremors
PPIC 2013-11-20 10:23:14 11431 11299 THUMB PAIN
PPIC 2013-11-20 10:23:14 11431 11348 Environmental allergies
...
Just looking at it visually you can tell it isn't right, and you need a better source file before throwing it over the wall at SQL Server and expecting it to be handled smoothly.
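As a quick diagnostic (a hedged sketch, not a fix for the underlying file quality): if the rows in the export turn out to end with a bare line feed rather than CRLF, you can tell BULK INSERT that explicitly with a hex row terminator, and skip the header row at the same time. The UNC path below is just the one from the question; adjust it to wherever the file actually lives.
BULK INSERT dbo.DIM_Assessment
FROM '\\dbserver\DIM_Assessment.dat'
WITH (
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '0x0a',  -- bare LF, in case the file is not really CRLF-terminated
    FIRSTROW = 2             -- skip the header row
);
Even if that loads, the five-versus-seven column mismatch still means values will land in the wrong columns, so fixing the export remains the real answer.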

I just saved your file as .CSV and bulk inserted it with the following statement.
BULK INSERT dim_assessment FROM 'C:\Blabla\TestFile.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
);
GO
Returned message:
(22587 row(s) affected)
The data loaded, but notice that some of the ICDName data has overflowed into the LoadDate column. Just use the pipe character | as the delimiter and run the same BULK INSERT statement with FIELDTERMINATOR = '|', and happy days.
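For reference, a sketch of that pipe-delimited variant (assuming the file is re-exported with | as the field separator; the path is the same placeholder used above):
BULK INSERT dim_assessment FROM 'C:\Blabla\TestFile.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);
GO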

Opening the file via Excel shows the following:
There are indeed 7 column headers
Only the first six of them are populated
Columns 1, 2 and 3 hold identical values
There is some confusing data, where the fifth column can be empty, filled with numbers, or filled with text.
I guess that, under these conditions, BULK INSERT might not work properly. Since Excel seems to handle your file quite cleanly, you should consider an extra step: from CSV to Excel, and then from Excel to your database.
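If you go the Excel route, one way to pull the cleaned-up sheet straight into the table is an OPENROWSET query against the ACE OLE DB provider. This is only a hedged sketch: it assumes the provider is installed on the server, that 'Ad Hoc Distributed Queries' is enabled, and that the workbook path and sheet name (C:\data\DIM_Assessment.xlsx, Sheet1) match what you actually saved.
INSERT INTO dbo.DIM_Assessment
    (QueryType, QueryDate, APUID, AssessmentID, ICDCode, ICDName, LoadDate)
SELECT QueryType, QueryDate, APUID, AssessmentID, ICDCode, ICDName, LoadDate
FROM OPENROWSET(
        'Microsoft.ACE.OLEDB.12.0',
        'Excel 12.0;Database=C:\data\DIM_Assessment.xlsx;HDR=YES',
        'SELECT * FROM [Sheet1$]');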

OK, so, this was a seemingly simple task: push delimited data from a flat file to SQL Server. I thought BCP was the way to go (I had used it successfully before).
A quick rundown of what was suggested:
a. Fix the source file
b. Save the source data in native Excel format
c. Save the source data as pipe-delimited data
I tried all of these options; they were doable, but each added extra steps to my process.
I then stumbled upon the Invoke-Sqlcmd and Import-Csv cmdlets in PowerShell. It turns out I can import the data using PowerShell directly. It is a bit slow at the moment, but I can live with that for now.
# Read the tab-delimited export, then build and run one INSERT per row
$DATA = Import-Csv dim_assessment.CSV -Delimiter "`t"
FOREACH ($LINE in $DATA)
{
    $QueryType    = "'" + $Line.QueryType + "'"
    $QueryDate    = "'" + $Line.QueryDate + "'"
    $APUID        = "'" + $Line.APUID + "'"
    $AssessmentID = "'" + $Line.AssessmentID + "'"
    $ICDCode      = "'" + $Line.ICDCode + "'"
    # Escape embedded single quotes in the free-text ICDName field
    $ICDName      = $Line.ICDName.Replace("'", "''")
    $ICDName      = "'" + $ICDName + "'"
    $LoadDate     = "'" + $Line.LoadDate + "'"

    $SQLHEADER = "INSERT INTO [dim_assessment] ([QueryType],[QueryDate],[APUID],[AssessmentID],[ICDCode],[ICDName],[LoadDate])"
    $SQLVALUES = "VALUES ($QueryType,$QueryDate,$APUID,$AssessmentID,$ICDCode,$ICDName,$LoadDate)"
    $SQLQUERY  = $SQLHEADER + $SQLVALUES

    Invoke-Sqlcmd -Query $SQLQUERY -ServerInstance HA -Username sa -Password Pwd
}
Thanks for all your help!

Related

Match multiline SQL statement in pgdump

I have a PostgreSQL database dump produced by pg_dump version 9.5.2, which contains the DDL and also INSERT INTO statements for each table in the database. The dump looks like this:
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
INSERT INTO unimportant_table VALUES (123456, 'some data split into
- multiple
- lines
just for fun');
INSERT INTO important_table VALUES (987654321, 'some important data', 'another crap split into
- lines');
...
-- thousands of inserts into both tables
The dump file is really large and it is produced by another company, so I am not able to influence the export process. I need to create 2 files from this dump:
All DDL statements (all statements that don't start with INSERT INTO)
All INSERT INTO important_table statements (I want to restore only some tables from the dump)
If every statement were on a single line, with no newline characters in the data, it would be very easy to create the 2 SQL scripts with grep, for example:
grep -v '^INSERT INTO .*;$' my_dump.sql > ddl.sql
grep -o '^INSERT INTO important_table .*;$' my_dump.sql > important_table.sql
# Create empty structures
psql < ddl.sql
# Import only one table for now
psql < important_table.sql
At first I was thinking about using grep, but I did not find out how to process multiple lines at once. Then I tried sed, but it returns only the single-line inserts. I also used https://regex101.com/ to work out the right regular expressions, but I don't know how to combine them with grep or sed:
^(?!(INSERT INTO)).*$ -- for ddl
^INSERT INTO important_table(\s|[[:alnum:]])*;$ -- for inserts
I found the similar question pcregrep multiline SQL match, but it has no answer. Also, I don't mind whether the solution uses grep, sed or whatever you suggest, but it should work on Ubuntu 18.04.4 LTS.
Here is a bash-based solution that uses perl one-liners to prepare your SQL dump data for the subsequent grep statements.
In my approach, the goal is to get one SQL statement onto one line through a script that I called prepare.sh. It got a little more complicated because I wanted to accommodate semicolons and quotes within your insert data strings (these, along with the line breaks, are represented by their hex codes in the intermediate output).
EDIT: In response to @32cupo's comment, below is a modified set of scripts that avoids xargs with large data sets (although I don't have huge dump files to test it with):
#!/bin/bash
# Mark statement-ending semicolons, hex-escape backslashes, line breaks and quotes,
# then put each complete statement back onto a single line of its own.
perl -pne 's/;(?=\s*$)/__ENDOFSTATEMENT__/g' \
| perl -pne 's/\\/\\\\x5c/g' \
| perl -pne 's/\n/\\\\x0a/g' \
| perl -pne 's/"/\\\\x22/g' \
| perl -pne 's/'\''/\\\\x27/g' \
| perl -pne 's/__ENDOFSTATEMENT__/;\n/g'
Then, a separate script (called ddl.sh) includes your grep statement for the DDL (and, with the help of the loop, only feeds smaller chunks (lines) into xargs):
#!/bin/bash
while read -r line; do
<<<"$line" xargs -I{} echo -e "{}"
done < <(grep -viE '^(\\\\x0a)*insert into')
Another separate script (called important_table.sh) includes your grep statement for the inserts into important_table:
#!/bin/bash
while read -r line; do
<<<"$line" xargs -I{} echo -e "{}"
done < <(grep -iE '^(\\\\x0a)*insert into important_table')
Here is the set of scripts in action (please also note that I spiced up your insert data with some semicolons and quotes):
~/$ cat dump.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
INSERT INTO unimportant_table VALUES (123456, 'some data split into
- multiple
- lines
;just for fun');
INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
- lines;');
...
-- thousands of inserts into both tables
~/$ cat dump.sql | ./prepare.sh | ./ddl.sh >ddl.sql
~/$ cat ddl.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
CREATE TABLE unimportant_table (
id integer NOT NULL,
col1 character varying
);
CREATE TABLE important_table (
id integer NOT NULL,
col2 character varying NOT NULL,
unimportant_col character varying NOT NULL
);
...
-- thousands of inserts into both tables
~/$ cat dump.sql | ./prepare.sh | ./important_table.sh > important_table.sql
~/$ cat important_table.sql
INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
- lines;');

Postgres copy to TSV file with header

I have a function like so -
CREATE OR REPLACE FUNCTION ind (bucket text)
RETURNS table (
    middle character varying (100),
    last character varying (100)
) AS $body$
BEGIN
    return query
    select
        fname as first,
        lname as last
    from all_records;
END;
$body$ LANGUAGE PLPGSQL;
How do I output the results of select ind ('Mob') into a tsv file?
I want the output to look like this -
first last
MARY KATHERINE
You can use the COPY command. Example:
COPY (select * from ind('Mob')) TO '/tmp/ind.tsv' CSV HEADER DELIMITER E'\t';
The file '/tmp/ind.tsv' will contain your data.
Postgres doesn't allow COPY with a header for plain tab-separated (text format) output for some reason.
If you're using a linux based system you can do it with a script like this:
#create file with tab delimited column list (use \t between each column name)
echo -e "user_id\temail" > user_output.tsv
#now you can append the results of your query to that file by copying to STDOUT
psql -h your_host_name -d your_database_name -c "\copy (SELECT user_id, email FROM my_user_table) to STDOUT;" >> user_output.tsv
Alternatively, if your script is long and you don't want to pass it in with the -c option, you can use the same approach from a .sql file; use --quiet to avoid notices being written into your output file:
psql --quiet -h your_host_name -d your_database_name -f your_sql_file.sql >> user_output.tsv
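If I remember correctly, PostgreSQL 15 and later also accept HEADER with the plain text format (which is tab-delimited by default), so on a new enough server the two-step workaround above shouldn't be needed; something like:
COPY (SELECT user_id, email FROM my_user_table)
TO '/tmp/user_output.tsv'
WITH (FORMAT text, HEADER);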

How to identify the rows with missing data in the column due to hidden # in the .txt file

I have .txt files like the one below, exported from the source system. Because of a # character in one of the fields in the source system, some of the fields after that # have no data in the .txt file when it is exported.
For example:
LINE|PANO| INOW|DEL|EASLN|EBSAP|LIM1IT|NOMIT|VALUE|KTE1|
1|7870|1000000||40500369|10|25624.0||0.00|SERVI TORNG|33277|
2|294|1000000||500324|10|590.84 ||0.00|REFUDIAL GATNGWAM|30448|
3|9410|1000000||200500325|10|5905.61||0.00|SUPLIVER EXTRACNS|37478|
4|573|1000000||600004075|10||||||||
5|739|1000000||700500290|10|40917.37|||||||
6|741|1000000||50500289|10|2782.53 ||0.00|SECUERVIC LUWE|29161|
7|948|1000000||||||||||||
8|996|1000000||960050035|10|7497.3||0.00|SCOUOUT URBISH IDM647 |38271|
9|1320|1000000||800500319|10|1395.93||0.00|TUATO AIRS|36427|
10|12054|1000000||9000287|10|458.42||0.00|SECURICE GOLA|||||
In the above example, lines 4, 5, 7 and 10 are missing data after certain fields due to the # in the source system field, but the data does exist in the source system for these line items.
How can I recognize these line items as having the missing-information/records issue when I have a large volume of .txt data, around 10 million line items?
Please share a SQL query or any other way to identify these line items with the missing data.
Another example:
LINE|PANO| INOW|DEL|EASLN|EBSAP|LIM1IT|NOMIT|VALUE|KTE1|
1|7870|1000000||40500369|10|25624.0||0.00|SERVI TORNG|33277|
2|294|1000000||500324|10|590.84 ||0.00|REFUDIAL GATNGWAM|30448|
3|9410|1000000||200500325|10|5905.61||0.00|SUPLIVER EXTRACNS|37478|
4|573|1000000||600004075|10
5|739|1000000||700500290|10|40917.37
6|741|1000000||50500289|10|2782.53 ||0.00|SECUERVIC LUWE|29161|
7|948|1000000
8|996|1000000||960050035|10|7497.3||0.00|SCOUOUT URBISH IDM647 |38271|
9|1320|1000000||800500319|10|1395.93||0.00|TUATO AIRS|36427|
10|12054|1000000||9000287|10|458.42||0.00|SECURICE GOLA
The data is truncated wherever a # exists.
Would the following do what you require?
I created a temporary table #HiddenHash and populated it with some of your example data; you will obviously have the data from a BULK INSERT or whatever mechanism you are using.
CREATE TABLE
#HiddenHash
(
LINE VARCHAR (2)
,PANO VARCHAR (25)
,INOW VARCHAR (25)
,DEL VARCHAR (25)
,EASLN VARCHAR (25)
,EBSAP VARCHAR (25)
,LIM1IT VARCHAR (25)
,NOMIT VARCHAR (25)
,VALUE VARCHAR (25)
,KTE1 VARCHAR (25)
)
INSERT INTO #HiddenHash
VALUES
('1','7870','1000000','','40500369','10','25624.0','0.00','SERVI TORNG','33277')
,('2','294','1000000','',' 500324','10','590.84 ','0.00','REFUDIAL GATNGWAM','30448')
,('3','9410','1000000','','200500325','10','5905.61','0.00','SUPLIVER EXTRACNS','37478')
,('4','573','1000000','','600004075','10','','','','')
,('5','739','1000000','','700500290','10','40917.37','','','')
,('6','741','1000000','','50500289','10','2782.53 ','0.00','SECUERVIC LUWE','29161')
,('7','948','1000000','','','','','','','')
,('8','996','1000000','','960050035','10','7497.3','0.00','SCOUOUT URBISH IDM647 ','38271')
,('9','1320','1000000','','800500319','10','1395.93','0.00','TUATO AIRS','36427')
,('10','12054','1000000','','9000287','10','458.42','0.00','SECURICE GOLA','')
Then I count how many columns there are in the table.
DECLARE #CountColumns INT
SET #CountColumns = (SELECT COUNT (*)
FROM TEMPDB.SYS.COLUMNS
WHERE NAME <> 'DEL' AND
object_id = object_id('tempdb.dbo.#HiddenHash')
)
Then count the non-blank columns in each row and show the rows where that count does not match the number of columns held in the variable.
SELECT LINE,PANO,INOW,EASLN,EBSAP,LIM1IT,NOMIT,VALUE,KTE1
FROM (
SELECT
LINE,PANO,INOW,EASLN,EBSAP,LIM1IT,NOMIT,VALUE,KTE1,
(
SELECT COUNT(*)
FROM (VALUES (LINE),(PANO),(INOW),(EASLN),(EBSAP),(LIM1IT),(NOMIT),
(VALUE),(KTE1)) AS Cnt(col)
WHERE Cnt.Col <> ''
) AS NotBlank
FROM #HiddenHash)cc
WHERE cc.NotBlank <> #CountColumns
Which gives the following result:
LINE PANO INOW EASLN EBSAP LIM1IT NOMIT VALUE KTE1
4 573 1000000 600004075 10
5 739 1000000 700500290 10 40917.37
7 948 1000000
10 12054 1000000 9000287 10 458.42 0.00 SECURICE GOLA
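A hedged alternative, in case you would rather catch the truncation before the file is split into columns at all: load each raw line into a one-column staging table and count the pipe characters. The staging table name, file path and the delimiter count of 11 below are assumptions based on the sample; check the count against your real layout.
CREATE TABLE #RawLines (RawLine VARCHAR(4000));

BULK INSERT #RawLines
FROM 'C:\data\export.txt'
WITH (FIRSTROW = 2, ROWTERMINATOR = '\n', FIELDTERMINATOR = '\0');  -- '\0' so each whole line lands in one column

SELECT RawLine
FROM #RawLines
WHERE LEN(RawLine) - LEN(REPLACE(RawLine, '|', '')) < 11;  -- complete lines in the sample carry 11 pipes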

SQL: Loading a CSV file with BULK statement causing problems with hebrew strings

I'm trying to insert a very large CSV file into a table in SQL Server.
In the table itself the fields are defined as nvarchar, but when I try to use the BULK statement to load that file, all the Hebrew fields come out as gibberish.
When I use an INSERT statement everything is OK, but the BULK one gets it all wrong. I even tried putting the strings in the CSV file in the N'string' form, but they just ended up in the table as N'gibberish'.
The reason I'm not just using INSERT is that the file contains more than 250K rows.
This is the statement I'm using. The delimiter is '|' on purpose:
BULK INSERT [dbo].[SomeTable]
FROM 'C:\Desktop\csvfilesaved.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\Desktop\Error.csv',
TABLOCK
)
And this is a two-row sample of the CSV file:
2017-03|"מחוז ש""ת דן"|בני 18 עד 24|זכר|א. לא למד|ב. קלה|יהודים|ב. בין 31 ל-180 יום||הנדסאים, טכנאים, סוכנים ובעלי משלח יד נלווה|1|0|0|1|0|0
2017-03|"מחוז ש""ת דן"|בני 18 עד 24|זכר|א. לא למד|ג. בינונית|יהודים|ב. בין 31 ל-180 יום||עובדי מכירות ושירותים|1|0|0|1|0|0
Thanks!
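For what it's worth, the usual suspect with BULK INSERT and Hebrew text is the encoding of the file itself rather than the nvarchar columns. A hedged sketch of the options that typically matter, depending on how the CSV was actually saved (CODEPAGE = '65001' assumes a UTF-8 file and SQL Server 2016 or later; DATAFILETYPE = 'widechar' assumes a UTF-16 file):
BULK INSERT [dbo].[SomeTable]
FROM 'C:\Desktop\csvfilesaved.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    CODEPAGE = '65001',           -- if the file is UTF-8 (SQL Server 2016+)
    -- DATAFILETYPE = 'widechar', -- use this instead if the file is UTF-16
    TABLOCK
);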

Oracle PL-SQL : Import multiple delimited files into table

I have multiple files (f1.log, f2.log, f3.log, etc.).
Each file has data in a ;- and =-delimited format (records are delimited by ; and fields are delimited by =), e.g.:
data of f1:
1=a;2=b;3=c
data of f2:
1=p;2=q;3=r
I need to read all these files and import the data into a table in this format:
filename number data
f1 1 a
f1 2 b
f1 3 c
f2 1 p
[...]
I am new to SQL. Can you please guide me on how I can do this?
Use SQL*Loader to get the files into a table. Assuming you have a table created a bit like:
create table FLOG
(
FILENAME varchar2(1000)
,NUM varchar2(1000)
,DATA varchar2(1000)
);
Then you can use the following control file:
LOAD DATA
INFILE 'f1.log' "str ';'"
truncate INTO TABLE flog
fields terminated by '=' TRAILING NULLCOLS
(
filename constant 'f1'
,num char
,data char
)
However, you will need a different control file for each data file. This can be handled by generating the control file dynamically using a shell script. A sample shell script could be:
cat >flog.ctl <<_EOF
LOAD DATA
INFILE '$1.log' "str ';'"
APPEND INTO TABLE flog
fields terminated by '=' TRAILING NULLCOLS
(
filename constant '$1'
,num char
,data char
)
_EOF
sqlldr <username>/<password>@<instance> control=flog.ctl data=$1.log
Saved as flog.sh, it can then be run like:
./flog.sh f1
./flog.sh f2
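If generating one control file per log file gets unwieldy, a hedged alternative (a different technique from the SQL*Loader approach above, sketched here on the assumption that a DIRECTORY object named data_dir points at the folder holding the logs) is an external table, which you can repoint at each file with ALTER TABLE ... LOCATION and then copy from with plain SQL:
-- External table that parses a ;-delimited, =-separated log file
CREATE TABLE flog_ext (
  num  VARCHAR2(1000),
  data VARCHAR2(1000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY ';'
    FIELDS TERMINATED BY '='
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('f1.log')
);

-- Copy the current file's rows into FLOG, tagging them with the file name
INSERT INTO flog (filename, num, data)
SELECT 'f1', num, data FROM flog_ext;

-- Point the external table at the next file and repeat
ALTER TABLE flog_ext LOCATION ('f2.log');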