BCP file format for SQL bulk insert of CSV file - sql

I'm trying to bulk insert a CSV file into a SQL table using BCP, but I can't get past this error: "The column is too long in the data file for row 1, column 2. Verify that the field terminator and row terminator are specified correctly." Can anyone help, please?
Here's my SQL code:
BULK INSERT UKPostCodeStaging
FROM 'C:\Users\user\Desktop\Data\TestFileOf2Records.csv'
WITH (
DATAFILETYPE='char',
FIRSTROW = 1,
FORMATFILE = 'C:\Users\User\UKPostCodeStaging.fmt');
Here's my test data contained in TestFileOf2Records.csv:
"HS1 2AA",10,14,93,"S923","","S814","","S1213","S132605"
"HS1 2AD",10,14,93,"S923","","S814","","S1213","S132605"
And here's my BCP file that I have attempted to edit appropriately:
10.0
11
1 SQLCHAR 0 0 "\"" 0 FIRST_QUOTE SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 8 0 "\"," 1 PostCode SQL_Latin1_General_CP1_CI_AS
3 SQLINT 1 0 "," 2 PositionalQualityIndicator ""
4 SQLINT 1 0 "," 3 MetresEastOfOrigin ""
5 SQLINT 1 0 ",\"" 4 MetresNorthOfOrigin ""
6 SQLCHAR 8 0 "\",\"" 5 CountryCode SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 8 0 "\",\"" 6 NHSRegionalHACode SQL_Latin1_General_CP1_CI_AS
8 SQLCHAR 8 0 "\",\"" 7 NHSHACode SQL_Latin1_General_CP1_CI_AS
9 SQLCHAR 8 0 "\",\"" 8 AdminCountyCode SQL_Latin1_General_CP1_CI_AS
10 SQLCHAR 8 0 "\",\"" 9 AdminDistrictCode SQL_Latin1_General_CP1_CI_AS
11 SQLCHAR 8 0 "\"\r\n" 10 AdminWardCode SQL_Latin1_General_CP1_CI_AS
Any ideas where I am going wrong?
thanks
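Not a definitive fix, but one thing stands out in the format file: with a character-format data file, every field is normally described as SQLCHAR with a prefix length of 0 (the third column), while the file above uses prefix lengths of 8 and 1 and SQLINT host types. Nonzero prefix lengths make BCP expect binary length prefixes in the data, which typically produces exactly this "column is too long" error. Here is a sketch of what the format file might look like instead (the host field lengths are guesses; SQLCHAR host fields are fine even where the server columns are integers, since SQL Server converts the character data on load):

```
10.0
11
1   SQLCHAR  0  0    "\""      0   FIRST_QUOTE                 SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR  0  8    "\","     1   PostCode                    SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR  0  12   ","       2   PositionalQualityIndicator  ""
4   SQLCHAR  0  12   ","       3   MetresEastOfOrigin          ""
5   SQLCHAR  0  12   ",\""     4   MetresNorthOfOrigin         ""
6   SQLCHAR  0  12   "\",\""   5   CountryCode                 SQL_Latin1_General_CP1_CI_AS
7   SQLCHAR  0  12   "\",\""   6   NHSRegionalHACode           SQL_Latin1_General_CP1_CI_AS
8   SQLCHAR  0  12   "\",\""   7   NHSHACode                   SQL_Latin1_General_CP1_CI_AS
9   SQLCHAR  0  12   "\",\""   8   AdminCountyCode             SQL_Latin1_General_CP1_CI_AS
10  SQLCHAR  0  12   "\",\""   9   AdminDistrictCode           SQL_Latin1_General_CP1_CI_AS
11  SQLCHAR  0  12   "\"\r\n"  10  AdminWardCode               SQL_Latin1_General_CP1_CI_AS
```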

SQL Does it make sense to re-order database instead of using ORDER BY to increase performance?

I have a database with around 120,000 entries and I need to do substring comparisons (where ... like 'test%') for an autocomplete function. The database won't change.
I have a column called "relevance", and for my searches I want the results ordered by relevance DESC. I noticed that as soon as I add ORDER BY relevance DESC to my queries, the execution time increases by about 100%; since my queries already take around 100 ms on average, this causes significant lag.
Does it make sense to re-order the whole database by relevance once so I can remove the ORDER BY? Can I be certain that when searching through the table with SQL it will always go through the database in the order that I added the rows?
This is what my query looks like right now:
select *
from hao2_dict
where definitions like 'ba%'
or searchable_pinyin like 'ba%'
ORDER BY relevance DESC
LIMIT 100
UPDATE: For context, here is my DB structure (posted as an image, not reproduced here) and some time measurements:
Using an index on (relevance DESC) for the search term 'b%' gives me 50 ms, which is faster than not using an index. But the search term 'banana%' takes over 1700 ms, which is way slower than not using an index. These are the results from EXPLAIN:
b%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 b% 0
29 String8 0 5 0 b% 0
30 Goto 0 1 0 0
banana%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 banana% 0
29 String8 0 5 0 banana% 0
30 Goto 0 1 0 0
Can I be certain, that when searching through the table with SQL it will always go through the database in the order that I added the rows?
No. SQL results have no inherent order. They might come out in the order you inserted them, but there is no guarantee.
Instead, put an index on the column. Indexes keep their values in order.
However, this will only deal with the sorting. In the query above, it still has to search the whole table for rows with matching definitions and searchable_pinyin values. In general, SQL will only use one index per table at a time; usually trying to use two is inefficient. So you need one multi-column index to let this query avoid searching the whole table and still get the results in sorted order. Make sure relevance comes first: the index columns need to be in the same order as your ORDER BY.
An index on (relevance, definitions, searchable_pinyin) will make that query use only the index for searching and sorting. Adding (relevance, searchable_pinyin) as well will handle searching by definitions, by searchable_pinyin, or by both.
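To make the multi-column index suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is assumed (only the three columns from the query are modeled), so treat it as an illustration rather than the asker's real table:

```python
import sqlite3

# Assumed minimal schema for hao2_dict: just the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hao2_dict (
    definitions TEXT, searchable_pinyin TEXT, relevance REAL)""")

# The multi-column index suggested above, with relevance first so the
# index order matches ORDER BY relevance DESC.
conn.execute("""CREATE INDEX idx_rel_def_pin ON hao2_dict
    (relevance DESC, definitions, searchable_pinyin)""")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM hao2_dict
    WHERE definitions LIKE 'ba%' OR searchable_pinyin LIKE 'ba%'
    ORDER BY relevance DESC LIMIT 100""").fetchall()
for row in plan:
    print(row[-1])  # inspect which index, if any, the planner chose
```

On this toy schema the index contains every queried column, so SQLite can satisfy the whole query from the index; with a wider real table the planner's choice may differ.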

invalid column number in Formatfile sql

I looked at this one (Bulk inserting a csv in SQL using a formatfile to remove double quotes) but my situation is just different enough.
First, how would I upload this very lengthy format file? I've been looking at GitHub, but it's just not clear enough.
Here's my statement and the error I get:
bulk insert equi2022a
From 'C:\Users\someone\Desktop\equi.txt'
WITH (FORMATFILE = 'C:\Users\someone\Desktop\formatfileequi-2.txt'
);
Msg 4823, Level 16, State 1, Line 1
Cannot bulk load. Invalid column number in the format file
"C:\Users\someone\Desktop\formatfileequi-2.txt".
I created this manually and here is a small snippet of it. I painstakingly went through every row to make sure the numbering was in perfect order: 1, 2, 3, 4, ... and so on up to 122, in both of the columns designated for this.
11.0
122
1 SQLCHAR 0 01 "" 1 transcode ""
2 SQLCHAR 0 02 "" 2 stfips ""
3 SQLCHAR 0 04 "" 3 year ""
4 SQLCHAR 0 01 "" 4 qtr ""
5 SQLCHAR 0 10 "" 5 uiacct ""
6 SQLCHAR 0 05 "" 6 run ""
7 SQLCHAR 0 09 "" 7 ein ""
8 SQLCHAR 0 10 "" 8 presesaid ""
9 SQLCHAR 0 05 "" 9 predrun ""
10 SQLCHAR 0 10 "" 10 succuiacct ""
11 SQLCHAR 0 05 "" 11 succrun ""
12 SQLCHAR 0 35 "" 12 legalname ""
and then the ending
115 SQLCHAR 0 10 "" 115 wrlargestcontribsucc""
116 SQLCHAR 0 06 "" 116 wrcountlargestcontrib""
117 SQLCHAR 0 06 "" 117 wrhires ""
118 SQLCHAR 0 06 "" 118 wrseparations""
119 SQLCHAR 0 06 "" 119 wrnewentrants""
120 SQLCHAR 0 60 "" 120 wrexits ""
121 SQLCHAR 0 06 "" 121 wrcontrecords""
122 SQLCHAR 0 78 "" 122 Blank7 ""
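Msg 4823 usually means the field numbering, the declared field count, or a server column number doesn't line up. One thing worth checking: several of the final lines shown run the column name into the collation (e.g. wrseparations""); if that's in the real file and not just a copy-paste artifact, those rows parse with too few fields. A quick sanity check can be scripted. This is a rough sketch, not a full parser of the format-file grammar, and it assumes a simple non-XML format file whose terminators contain no embedded spaces:

```python
def check_bcp_format_file(text):
    """Rough sanity check for a non-XML BCP format file.

    Assumes whitespace-separated fields and terminators without embedded
    spaces; a real format file can be more complicated than this handles.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[1])            # line 2: declared number of fields
    rows = [ln.split() for ln in lines[2:]]
    problems = []
    if len(rows) != declared:
        problems.append(f"declared {declared} fields, found {len(rows)}")
    for i, row in enumerate(rows, start=1):
        if len(row) != 8:               # field#, type, prefix, len, term, col#, name, collation
            problems.append(f"row {i}: expected 8 columns, found {len(row)}")
        elif int(row[0]) != i:
            problems.append(f"row {i}: host field number is {row[0]}")
    return problems


sample = """11.0
2
1 SQLCHAR 0 10 "," 1 colA ""
2 SQLCHAR 0 10 "\\n" 2 colB ""
"""
print(check_bcp_format_file(sample))   # -> []
```

A row like `118 SQLCHAR 0 06 "" 118 wrseparations""` would be flagged as having only 7 columns, since the name and collation have fused into one token.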

Pad column with n zeros and trim excess values

For example, the original data file
file.org :
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Insert three data points (0) at the top of column 2, shifting the existing values down and trimming the excess.
The output file should look like this
file.out :
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Please help.
The following awk will do the trick:
awk -v n=3 '{a[NR]=$2; $2=a[NR-n]+0}1' file
$ awk -v n=3 '{x=$2; $2=a[NR%n]+0; a[NR%n]=x} 1' file
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
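The same idea, sketched in Python for readability: a ring buffer of n zeros, which is what the a[NR%n] trick above implements. Names here are illustrative, not from the original answers:

```python
def shift_column_down(rows, col=1, n=3):
    """Shift one column down by n rows: the top gets n zeros,
    and the last n values fall off the end (they are trimmed)."""
    buf = ["0"] * n                 # n zeros feed the first n rows
    out = []
    for fields in rows:
        buf.append(fields[col])
        fields = fields[:]          # copy so the caller's rows stay intact
        fields[col] = buf.pop(0)
        out.append(fields)
    return out


table = [line.split() for line in
         ["1 2 3 4 5", "6 7 8 9 0", "11 12 13 14 15",
          "16 17 18 19 20", "21 22 23 24 25"]]
for fields in shift_column_down(table):
    print(" ".join(fields))        # matches file.out above
```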
If you want to try Perl,
$ cat file.orig
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
$ perl -lane ' BEGIN { push(@t,0,0,0) } push(@t,$F[1]);$F[1]=shift @t; print join(" ",@F) ' file.orig
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
$
EDIT: Since the OP has edited the question, here is a solution for the new version.
awk -v count=3 '++val<=count{a[val]=$2;$2=0} val>count{if(++re_count<=count){$2=a[re_count]}} 1' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Could you please try the following.
awk -v count=5 '
BEGIN{
OFS="\t"
}
$2{
val=(val?val ORS OFS:OFS)$2
$2=0
occ++
$1=$1
}
1
END{
while(++occ<=count){
print OFS 0
}
print val
}' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
0
0
2
7
12

BCP error if last column is empty

My bcp app gets an error if the last column in the data file is empty. This row causes an error:
XX,YY,42,0,2,201501,652,,
This doesn't:
XX,YY,42,0,2,201501,652,,0
Unfortunately I can't specify a zero instead of a null. The destination table allows null on every column. The datatype is float (the last three columns are floats in fact). Here's the format file:
8.0
9
1 SQLCHAR 0 10 "," 1 NOT SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 10 "," 2 VIOLATING SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 10 "," 3 COMPANY SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 2 "," 4 POLICY SQL_Latin1_General_CP1_CI_AS
5 SQLCHAR 0 2 "," 5 ON SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 0 6 "," 6 INFORMATION SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 0 25 "," 7 SECURITY ""
8 SQLCHAR 0 25 "," 8 QTY2 ""
9 SQLCHAR 0 25 "\n" 9 QTY1 ""
The error:
Row 1, Column 9: Invalid character value for cast specification
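A hedged guess at the cause, since a populated last field loads fine: if the file has Windows line endings but the last field's terminator is "\n", the carriage return stays attached to field 9, so an empty field arrives as just "\r", which can't be cast to float (a value like "0\r" may survive because trailing whitespace is tolerated). If that's the case, changing the last terminator should help:

```
9       SQLCHAR       0       25      "\r\n"    9     QTY1            ""
```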

Ordering several tables in the same file using awk

My workflow creates files containing simple tables with a two-line header (see end of post). I want to sort these tables numerically using:
(head -n 2 && tail -n +3 | sort -n -r) > ordered.txt
That works fine, but I don't know how to split the file so that I can order every table and print it in ONE file. My approach is:
awk '/^TARGET/ {(head -n 2 && tail -n +3 | sort -n -r) >> ordered.txt}' output.txt
However, this causes an error message. I want to avoid any intermediate output files. What is missing in my awk command?
The input files look like that:
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
147 171 49 Z..........................Z........................................Z.........Z...Z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
2 0 1 z..........................z........................................z.........Z...Z
21 21 7 z..........................Z........................................Z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 Z..........................Z........................................Z.........Z....
1 0 0 z..........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 3 0 z..........................z........................................Z.........Z...Z
1 1 0 Z..........................Z........................................Z.............z
1 0 0 Z..........................Z........................................Z.............Z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
My output should look like that - no dropping of any line:
TARGET 1
Sample1 Sample2 Sample3 Pattern
147 171 49 Z..........................Z........................................Z.........Z...Z
75 64 32 Z..........................Z........................................Z.........z...Z
27 28 13 z..........................Z........................................z.........z...z
3 3 3 z..........................Z........................................z.........Z...z
TARGET 2
Sample1 Sample2 Sample3 Pattern
21 21 7 z..........................Z........................................Z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 z..........................z........................................z.........Z...Z
2 0 1 z..........................z........................................z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
This requires GNU awk for the array traversal sorting:
gawk '
BEGIN {PROCINFO["sorted_in"] = "@val_num_asc"}
function output_table() {
for (key in table) print table[key]
delete table
i=0
}
/TARGET/ {print; getline; print; next}
/^$/ {output_table(); print; next}
{table[++i] = $0}
END {output_table()}
' file
outputs
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
147 171 49 Z..........................Z........................................Z.........Z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
1 0 0 ...........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
1 0 0 z..........................Z........................................Z.............Z
2 0 1 Z..........................Z........................................Z.........Z....
2 0 1 z..........................z........................................z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
21 21 7 z..........................Z........................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
0 0 1 z..........................z........................................z..............
0 1 2 ...........................Z........................................Z.........Z...Z
1 0 0 Z..........................Z........................................Z.............Z
1 0 0 z..........................Z........................................z.............z
1 1 0 Z..........................Z........................................Z.............z
1 3 0 z..........................z........................................Z.........Z...Z
This is a bit of a mess, but assuming you don't want to lose records when you sort, this should work:
awk 'function sortit(){
x=asort(a)
for(i=1;i<=x;i++)print b[a[i]" "d[a[i]]++]
delete a; delete b; delete c; delete d
}
/^[0-9]/{a[$0]=$1;b[$1" "c[$1]++]=$0}
/TARGET/{print;getline;print}
!NF{sortit();print}
END{sortit()}' file
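For comparison, the same grouping idea is easy to express in Python. This sketch (not a translation of either awk answer) keeps every record, sorts each table's data rows in descending numeric order of the first column, and assumes the layout shown in the question: a TARGET line, a header line, then data rows:

```python
def sort_tables(lines):
    out, block = [], []

    def flush():
        # sort the buffered data rows by the first field, descending
        block.sort(key=lambda ln: int(ln.split()[0]), reverse=True)
        out.extend(block)
        block.clear()

    for line in lines:
        if line.startswith("TARGET"):
            flush()
            out.append(line)        # TARGET line; its header follows
        elif line.split() and line.split()[0].isdigit():
            block.append(line)      # data row: buffer for sorting
        else:
            flush()
            out.append(line)        # header or blank line
    flush()
    return out


demo = ["TARGET 1", "Sample1 Sample2 Sample3 Pattern",
        "3 3 3 z...", "147 171 49 Z...", "27 28 13 z..."]
for line in sort_tables(demo):
    print(line)                     # data rows come out 147, 27, 3
```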