Bulk insert from csv file - Ignore rows with errors - SQL Server - sql

I am trying to import data from a CSV file into SQL Server. There are thousands of entries in the CSV file, and a lot of the rows contain incorrect data.
Some of the rows in the CSV file are:
`"ID"|"EmpID"|"FName"|"LName"|"Gender"|"DateOfBirth"
"1"|"90043041961"|"ABCD"|"TEST"|"F"|"1848-05-05 00:00:00.000"
"1"|"10010161961"|"XYZ"|"TEST"|"F"|"1888-12-12 00:00:00.000"
.
.
..
..
....
"4"|"75101141821PPKKLL"|"LLKK"|"F"|"1925-09-09 00:00:00.000"|""
"4"|"32041401961UUYYTT"|"PPLL"|"M"|"1920-01-01 00:00:00.000"|""
.
.....
"25"|"00468132034"|"FGTT"|"OOOO"|"F"|"1922-11-11 00:00:00.000"
"25"|"00468132034"|"KKKK"|"PPPP"|"F"|"1922-11-11 00:00:00.000"
Creating the TestTable and trying to insert data (from csv file) into it:
create table TestTable
(
ID varchar(5),
EmpID varchar(25),
FName varchar(25),
LName varchar(25),
Gender varchar(5),
DateOfBirth varchar(30)
);
I am using the following script to import data from the csv file into the TestTable in SQL Server:
bulk insert TestTable
from 'C:\TestData.csv'
with
(firstrow = 2,
DATAFILETYPE='char',
FIELDTERMINATOR= '"|"',
ROWTERMINATOR = '\n',
ERRORFILE ='C:\ImportErrors.csv',
MAXERRORS = 0,
TABLOCK
);
Errors:
Msg 4863, Level 16, State 1, Line 1
Bulk load data conversion error (truncation) for row 32763, column 5 (Gender).
Msg 4863, Level 16, State 1, Line 1
Bulk load data conversion error (truncation) for row 32764, column 5 (Gender).
Is there any way to ignore the rows (in the csv file) which cannot be added for one reason or another, and insert the ones which have the correct syntax?
Thanks
PS: I cannot use SSIS. I am only allowed to use SQL.

I deal with different CSV files that I receive from different sources on a weekly basis; some of the data is nice and clean and some is a nightmare. So this is how I handle the CSV files I receive, and I hope it helps you. You will still need to add some data validation to handle malformed data.
SET NOCOUNT ON
GO
-- Create Staging Table
IF OBJECT_ID(N'TempDB..#ImportData', N'U') IS NOT NULL
DROP TABLE #ImportData
CREATE TABLE #ImportData(CSV NVARCHAR(MAX))
-- Insert the CSV Data
BULK INSERT #ImportData
FROM 'C:\TestData.csv'
-- Add Control Columns
ALTER TABLE #ImportData
ADD ID INT IDENTITY(1, 1)
ALTER TABLE #ImportData
ADD Malformed BIT DEFAULT(0)
-- Declare Variables
DECLARE @Delimiter NVARCHAR(5) = '|', @ID INT = 0, @DDL NVARCHAR(MAX)
DECLARE @NumberCols INT = (SELECT LEN(CSV) - LEN(REPLACE(CSV, @Delimiter, '')) FROM #ImportData WHERE ID = 1)
-- Flag Malformed Rows
UPDATE #ImportData
SET Malformed = CASE WHEN LEN(CSV) - LEN(REPLACE(CSV, @Delimiter, '')) != @NumberCols THEN 1 ELSE 0 END
-- Create Second Staging Table
IF OBJECT_ID(N'TestTable', N'U') IS NOT NULL
DROP TABLE TestTable
CREATE table TestTable
(ID varchar(4000),
EmpID varchar(4000),
FName varchar(4000),
LName varchar(4000),
Gender varchar(4000),
DateOfBirth varchar(4000));
-- Insert CSV Rows
WHILE(1 = 1)
BEGIN
SELECT TOP 1
@ID = ID
,@DDL = 'INSERT INTO TestTable(ID, EmpID, FName, LName, Gender, DateOfBirth)' + CHAR(13) + CHAR(10) + REPLICATE(CHAR(9), 1)
+ 'VALUES' -- + CHAR(13) + CHAR(10) + REPLICATE(CHAR(9), 2)
+ '(' + DDL + ')'
FROM
(
SELECT
ID
,DDL = '''' + REPLACE(REPLACE(REPLACE(CSV, '''', ''''''), @Delimiter, ''','''), '"', '') + ''''
FROM
#ImportData
WHERE
ID > 1
AND Malformed = 0) D
WHERE
ID > @ID
ORDER BY
ID
IF @@ROWCOUNT = 0 BREAK
EXEC sp_executesql @DDL
END
-- Clean Up
IF OBJECT_ID(N'TempDB..#ImportData', N'U') IS NOT NULL
DROP TABLE #ImportData
-- View Results
SELECT * FROM dbo.TestTable

Since the OP stated "[...] insert the ones which have the correct syntax", I wonder why nobody suggested modifying the MAXERRORS clause. Although not all errors can be masked this way, it works well for conversion errors.
Therefore, my suggestion is to use MAXERRORS = 999 in place of MAXERRORS = 0 (as in the original example).
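Applied to the statement from the question, that looks like this; rows that fail conversion are written to the ERRORFILE and the load continues, as long as fewer than 999 rows fail in total:
bulk insert TestTable
from 'C:\TestData.csv'
with
(firstrow = 2,
DATAFILETYPE='char',
FIELDTERMINATOR= '"|"',
ROWTERMINATOR = '\n',
ERRORFILE ='C:\ImportErrors.csv',
MAXERRORS = 999,
TABLOCK
);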


Error when converting data to XML in SQL Server

I need to retrieve the content in position 10 of a comma-separated string in a view.
Row 1 N,l,S,T,A,,<all>,,N,A,N,N,N,Y,Y,,Y,Y,Y,,AA,SA,Enterprise,
Row 2 M,,A,S,AS,SS,AS,N,N,N,N,Y,Y,Y,ENTERPRISE,S,,A
Row 3 L,,A,D,S,A,A,AA,Y,Y,Y,YNN,N,N,N,N,A,AA,AD,D,D
Div1 is the name of my column, Div2 is the name of the result column. I use the following code:
SELECT TOP (2000)
[Id],
CONVERT(XML,'<x>' + REPLACE(REPLACE(REPLACE(Div1, '>', ''), '<', ''), ',', '</x><x>') + '</x>').value('/x[10]', 'VARCHAR(MAX)') [Div2],
Div1
FROM
[dbo].[database]
I use VARCHAR(MAX) because that is the type of Div1 in my database. The code works if I run fewer than 20,000 rows, but the data set I use has more than 100,000 rows. If I run the whole data set, it stops and the following error occurs:
Msg 9421, Level 16, State 1, Line 1.
XML parsing: line 1, character 218, illegal name character
Is there a way to work this around?
XML has a CDATA section to treat content as-is, without parsing. There is no need for multiple REPLACE() function calls. Check it out.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY(1,1) PRIMARY KEY, Div1 VARCHAR(MAX));
INSERT INTO @tbl (Div1)
VALUES
('N,l,S,T,A,,<all>,,N,A,N,N,N,Y,Y,,Y,Y,Y,,AA,SA,Enterprise')
, ('M,,A,S,AS,SS,AS,N,N,N,N,Y,Y,Y,ENTERPRISE,S,,A')
, ('L,,A,D,S,A,A,AA,Y,Y,Y,YNN,N,N,N,N,A,AA,AD,D,D');
-- DDL and sample data population, end
SELECT [Id],
CAST('<x><![CDATA[' + REPLACE(Div1, ',', ']]></x><x><![CDATA[') + ']]></x>' AS XML).value('(/x/text())[10]', 'VARCHAR(MAX)') [Div2],
Div1
FROM @tbl;
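One thing worth noting about indexing text() nodes: an empty list position becomes an element with no text() node, so for rows that contain empty entries (like the first sample row) the [10] index counts only the non-empty values rather than the literal 10th position.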
You can create a function to split string like below:
CREATE FUNCTION dbo.split_delimited_string
(
@list varchar(max),
@delimiter varchar(5)
)
RETURNS @items TABLE
(
pos_id int identity(1,1),
item varchar(255)
)
AS
BEGIN
DECLARE @pos int, @delimiter_len tinyint;
SET @pos = CHARINDEX(@delimiter, @list);
SET @delimiter_len = LEN(@delimiter);
WHILE (@pos > 0)
BEGIN
INSERT INTO @items (item)
SELECT LEFT(@list, @pos - 1)
SET @list = RIGHT(@list, LEN(@list) - @pos - @delimiter_len + 1);
SET @pos = CHARINDEX(@delimiter, @list);
END
IF @list <> N''
BEGIN
INSERT INTO @items (item)
SELECT @list;
END
RETURN;
END
The following query will return the content in the 10th position:
SELECT
t.[Id],
l.item AS Div2,
t.Div1
FROM [dbo].[database] t
CROSS APPLY dbo.split_delimited_string(t.Div1,',') l
WHERE l.pos_id = 10;
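A quick standalone check of the function (illustrative):
SELECT pos_id, item
FROM dbo.split_delimited_string('N,l,S,T,A', ',');
-- returns (1,'N'), (2,'l'), (3,'S'), (4,'T'), (5,'A')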

SQL CSV as Query Results Column

I have the following SQL which queries a single table, single row, and returns the results as a comma-separated string, e.g.
Forms
1, 10, 4
SQL :
DECLARE @tmp varchar(250)
SET @tmp = ''
SELECT @tmp = @tmp + Form_Number + ', '
FROM Facility_EI_Forms_Required
WHERE Facility_ID = 11 AND EI_Year=2012 -- single Facility, single year
SELECT SUBSTRING(@tmp, 1, LEN(@tmp) - 1) AS Forms
The Facility_EI_Forms_Required table has three records for Facility_ID = 11
Facility_ID EI_Year Form_Number
11 2012 1
11 2012 10
11 2012 4
Form_number is a varchar field.
And I have a Facility table with Facility_ID, Facility_Name, and other columns.
How do I create a query to query all Facilites for a given year and produce the CSV output field?
I have this so far:
DECLARE @tmp varchar(250)
SET @tmp = ''
SELECT TOP 100 A.Facility_ID, A.Facility_Name,
(
SELECT @tmp = @tmp + B.Form_Number + ', '
FROM B
WHERE B.Facility_ID = A.Facility_ID
AND B.EI_Year=2012
)
FROM Facility A, Facility_EI_Forms_Required B
But it gets syntax errors on the use of @tmp.
My guess is that this is too complex a task for a query and a stored procedure may be needed, but I have little knowledge of SPs. Can this be done with a nested query?
I tried a Scalar Value Function
ALTER FUNCTION [dbo].[sp_func_EI_Form_List]
(
-- Add the parameters for the function here
@p1 int,
@pYr int
)
RETURNS varchar
AS
BEGIN
-- Declare the return variable here
DECLARE @Result varchar
-- Add the T-SQL statements to compute the return value here
DECLARE @tmp varchar(250)
SET @tmp = ''
SELECT @tmp = @tmp + Form_Number + ', '
FROM OIS..Facility_EI_Forms_Required
WHERE Facility_ID = @p1 AND EI_Year = @pYr -- single Facility, single year
SELECT @Result = @tmp -- SUBSTRING(@tmp, 1, LEN(@tmp) - 1)-- @p1
-- Return the result of the function
RETURN @Result
END
The call
select Facility_ID, Facility.Facility_Name,
dbo.sp_func_EI_Form_List(Facility_ID,2012)
from facility where Facility_ID=11
returns
Facility_ID Facility_Name Form_List
11 Hanson Aggregates 1
so it is only returning the first record instead of all three. What am I doing wrong?
Try the following approach, which is analogous to the SO answer Concatenate many rows into a single text string. (As an aside, the immediate bug in your function is that RETURNS varchar without a length defaults to varchar(1), so only the first character of the concatenated list is returned.) I hope it is correct, as I cannot try it out without the schema and some demo data (maybe you can add schema and data to your question):
Select distinct A.Facility_ID, A.Facility_Name,
substring(
(
Select ',' + B.Form_Number AS [text()]
From Facility_EI_Forms_Required B
Where B.Facility_ID = A.Facility_ID
AND B.EI_Year=2012
ORDER BY B.Facility_ID
For XML PATH ('')
), 2, 1000) [Form_List]
From Facility A
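With the three sample rows for Facility 11, the inner query produces ',1,10,4' and the SUBSTRING starting at position 2 strips the leading comma, so Form_List comes out as 1,10,4 (note there is no space after each comma, since the subquery concatenates ',' rather than ', ').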

Convert unknown number of comma separated varchars within 1 column into multiple columns

Let me say upfront that I'm a brand-spanking-new SQL Developer. I've researched this and haven't been able to find the answer.
I'm working in SSMS 2012 and I have a one-column table (axis1) with values like this:
axis1
296.90, 309.4
296.32, 309.81
296.90
300.11, 309.81, 311, 313.89, 314.00, 314.01, V61.8, V62.3
I need to convert this column into multiple columns like so:
axis1 axis2 axis3 axis4
296.90 309.4 null null
296.32 309.81 null null
296.90 null null null
300.11 309.81 311 313.89...
So far I've tried/considered:
select case when charindex(',',Axis1,1)>0
then substring(Axis1,1,CHARINDEX(',',Axis1,1)-1)
else Axis1
end as Axis1
from tablex
That works fine for a known number of column values, but there could be 0, 1, or 20+ values in this column.
Is there any way to split an unknown quantity of comma-separated values that are in one column into multiple single-value columns?
Thanks in advance for any help everyone!
I made one assumption while creating this answer, which is that you need this as a separate stored proc.
Step 1
Create a data type to enable the use of passing a table-valued parameter (TVP) into a stored proc.
use db_name
GO
create type axisTable as table
(
axis1 varchar(max)
)
GO
Step 2
Create the procedure to parse out the values.
USE [db_name]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[usp_util_parse_out_axis]
(
@axis_tbl_prelim axisTable readonly
)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
declare @axis_tbl axisTable
--since TVP's are readonly, moving the data in the TVP to a local variable
--so that the update statement later on will work as expected
insert into @axis_tbl
select *
from @axis_tbl_prelim
declare @comma_cnt int
, @i int
, @sql_dyn nvarchar(max)
, @col_list nvarchar(max)
--dropping the global temp table if it already exists
if object_id('tempdb..##axis_unpvt') is not null
drop table ##axis_unpvt
create table ##axis_unpvt
(
axis_nbr varchar(25)
, row_num int
, axis_val varchar(max)
)
--getting the most commas
set @comma_cnt = (select max(len(a.axis1) - len(replace(a.axis1, ',', '')))
from @axis_tbl as a)
set @i = 1
while @i <= @comma_cnt + 1
begin --while loop
--insert the data into the "unpivot" table one parsed value at a time (all rows)
insert into ##axis_unpvt
select 'axis' + cast(@i as varchar(3))
, row_number() over (order by (select 100)) as row_num --making sure the data stays in the right row
, case when charindex(',', a.axis1, 0) = 0 and len(a.axis1) = 0 then NULL
when charindex(',', a.axis1, 0) = 0 and len(a.axis1) > 0 then a.axis1
when charindex(',', a.axis1, 0) > 0 then replace(left(a.axis1, charindex(',', a.axis1, 0)), ',', '')
else NULL
end as axis1
from @axis_tbl as a
--getting rid of the value that was just inserted from the source table
update a
set a.axis1 = case when charindex(',', a.axis1, 0) = 0 and len(a.axis1) > 0 then NULL
when charindex(',', a.axis1, 0) > 0 then rtrim(ltrim(right(a.axis1, (len(a.axis1) - charindex(',', a.axis1, 0)))))
else NULL
end
from @axis_tbl as a
where 1=1
and (charindex(',', a.axis1, 0) = 0 and len(a.axis1) > 0
or charindex(',', a.axis1, 0) > 0)
--incrementing toward terminating condition
set @i += 1
end --while loop
--getting list of what the columns will be after pivoting
set @col_list = (select stuff((select distinct ', ' + axis_nbr
from ##axis_unpvt as a
for xml path ('')),1,1,''))
--building the pivot statement
set @sql_dyn = '
select '
+ @col_list +
'
from ##axis_unpvt as a
pivot (max(a.axis_val)
for a.axis_nbr in ('
+ @col_list +
')) as p'
--executing the pivot statement
exec(@sql_dyn);
END
Step 3
Make a procedure call using the data type created in Step 1 as the parameter.
use db_name
go
declare @tvp as axisTable
insert into @tvp values ('296.90, 309.4')
insert into @tvp values ('296.32, 309.81')
insert into @tvp values ('296.90')
insert into @tvp values ('300.11, 309.81, 311, 313.89, 314.00, 314.01, V61.8, V62.3')
exec db_name.dbo.usp_util_parse_out_axis @tvp
Results from your example are as follows:
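(The original answer showed a screenshot here. Reconstructed from the sample input above, the pivoted output should look like this, though the row order is not guaranteed:)
axis1   axis2   axis3   axis4   axis5   axis6   axis7   axis8
296.90  309.4   NULL    NULL    NULL    NULL    NULL    NULL
296.32  309.81  NULL    NULL    NULL    NULL    NULL    NULL
296.90  NULL    NULL    NULL    NULL    NULL    NULL    NULL
300.11  309.81  311     313.89  314.00  314.01  V61.8   V62.3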

How to UPDATE all columns of a record without having to list every column

I'm trying to figure out a way to update a record without having to list every column name that needs to be updated.
For instance, it would be nice if I could use something similar to the following:
// the parts inside braces are what I am trying to figure out
UPDATE Employee
SET {all columns, without listing each of them}
WITH {this record with id of '111' from other table}
WHERE employee_id = '100'
If this can be done, what would be the most straightforward/efficient way of writing such a query?
It's not possible.
What you're trying to do is not part of the SQL specification and is not supported by any database vendor. See the specifications of the SQL UPDATE statement for MySQL, PostgreSQL, MSSQL, Oracle, Firebird, and Teradata. Every one of them supports only the syntax below:
UPDATE table_reference
SET column1 = {expression} [, column2 = {expression}] ...
[WHERE ...]
This is not possible, but you can do it this way:
begin tran
delete from table where CONDITION
insert into table select * from EqualDesingTabletoTable where CONDITION
commit tran
Be careful with identity fields.
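A minimal sketch of that pattern against the question's Employee table (EmployeeSource here is a hypothetical table with an identical column layout):
begin tran
delete from Employee where employee_id = '100'
insert into Employee
select * from EmployeeSource where employee_id = '111'
commit tran
Note that the copied row keeps its source key ('111'), and the SELECT * will fail if Employee has an IDENTITY column unless IDENTITY_INSERT is enabled, which is the identity caveat mentioned above.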
Here's a hardcore way to do it with SQL Server. Carefully consider security and integrity before you try it, though.
This uses INFORMATION_SCHEMA to get the names of all the columns and then puts together a big update statement to update all columns except the ID column, which it uses to join the tables.
This only works for a single column key, not composites.
usage: EXEC UPDATE_ALL 'source_table','destination_table','id_column'
CREATE PROCEDURE UPDATE_ALL
@SOURCE VARCHAR(100),
@DEST VARCHAR(100),
@ID VARCHAR(100)
AS
DECLARE @SQL VARCHAR(MAX) =
'UPDATE D SET ' +
-- Google 'for xml path stuff'. This gets the rows from the query results and
-- turns them into a comma-separated list.
STUFF((SELECT ', D.'+ COLUMN_NAME + ' = S.' + COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @DEST
AND COLUMN_NAME <> @ID
FOR XML PATH('')),1,1,'')
+ ' FROM ' + @SOURCE + ' S JOIN ' + @DEST + ' D ON S.' + @ID + ' = D.' + @ID
--SELECT @SQL
EXEC (@SQL)
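For illustration, with hypothetical tables Employee_Staging and Employee sharing the key employee_id, EXEC UPDATE_ALL 'Employee_Staging','Employee','employee_id' would build and execute a statement along the lines of:
UPDATE D SET D.first_name = S.first_name, D.last_name = S.last_name
FROM Employee_Staging S JOIN Employee D ON S.employee_id = D.employee_id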
In Oracle PL/SQL, you can use the following syntax:
DECLARE
r my_table%ROWTYPE;
BEGIN
r.a := 1;
r.b := 2;
...
UPDATE my_table
SET ROW = r
WHERE id = r.id;
END;
Of course that just moves the burden from the UPDATE statement to the record construction, but you might already have fetched the record from somewhere.
How about using Merge?
https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
It gives you the ability to run INSERT, UPDATE, and DELETE in one statement. One other piece of advice: if you're going to be updating a large data set with indexes, and the source subset is smaller than your target but both tables are very large, move the changes to a temporary table first. I tried to merge two tables that were nearly two million rows each, and updating just 20 records took 22 minutes. Once I moved the deltas over to a temp table, it took seconds.
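A minimal sketch of the MERGE pattern (Employee as the target and Employee_Staging as the source are hypothetical tables keyed on employee_id; note that MERGE still requires listing the columns):
MERGE Employee AS T
USING Employee_Staging AS S
ON T.employee_id = S.employee_id
WHEN MATCHED THEN
UPDATE SET T.first_name = S.first_name, T.last_name = S.last_name
WHEN NOT MATCHED BY TARGET THEN
INSERT (employee_id, first_name, last_name)
VALUES (S.employee_id, S.first_name, S.last_name);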
If you are using Oracle, you can use rowtype
declare
var_x TABLE_A%ROWTYPE;
Begin
select * into var_x
from TABLE_B where rownum = 1;
update TABLE_A set row = var_x
where ID = var_x.ID;
end;
/
given that TABLE_A and TABLE_B have the same schema
It is possible. As npe said, it's not a standard practice. But if you really have to:
1. First a scalar function
CREATE FUNCTION [dte].[getCleanUpdateQuery] (@pTableName varchar(40), @pQueryFirstPart VARCHAR(200) = '', @pQueryLastPart VARCHAR(200) = '', @pIncludeCurVal BIT = 1)
RETURNS VARCHAR(8000) AS
BEGIN
DECLARE @pQuery VARCHAR(8000);
WITH cte_Temp
AS
(
SELECT
C.name
FROM SYS.COLUMNS AS C
INNER JOIN SYS.TABLES AS T ON T.object_id = C.object_id
WHERE T.name = @pTableName
)
SELECT @pQuery = (
CASE @pIncludeCurVal
WHEN 0 THEN
(
STUFF(
(SELECT ', ' + name + ' = ' + @pQueryFirstPart + @pQueryLastPart FROM cte_Temp FOR XML PATH('')), 1, 2, ''
)
)
ELSE
(
STUFF(
(SELECT ', ' + name + ' = ' + @pQueryFirstPart + name + @pQueryLastPart FROM cte_Temp FOR XML PATH('')), 1, 2, ''
)
) END)
RETURN 'UPDATE ' + @pTableName + ' SET ' + @pQuery
END
2. Use it like this
DECLARE @pQuery VARCHAR(8000) = dte.getCleanUpdateQuery(<your table name>, <query part before current value>, <query part after current value>, <1 if current value is used, 0 if updating everything to a static value>);
EXEC (@pQuery)
Example 1: make all employee columns 'Unknown' (you need to make sure the column type matches the intended value):
DECLARE @pQuery VARCHAR(8000) = dte.getCleanUpdateQuery('employee', '', '''Unknown''', 0);
EXEC (@pQuery)
Example 2: Remove an undesired text qualifier (e.g. #)
DECLARE @pQuery VARCHAR(8000) = dte.getCleanUpdateQuery('employee', 'REPLACE(', ', ''#'', '''')', 1);
EXEC (@pQuery)
This query can be improved. This is just the one I saved and sometimes use. You get the idea.
Similar to an upsert, you could check if the item exists in the table; if so, delete it and insert it with the new values (technically updating it). But you would lose your rowid, if that's something sensitive to keep in your case.
Behold, the updelsert
IF NOT EXISTS (SELECT * FROM Employee WHERE ID = @SomeID)
INSERT INTO Employee VALUES(@SomeID, @Your, @Vals, @Here)
ELSE
BEGIN
DELETE FROM Employee WHERE ID = @SomeID
INSERT INTO Employee VALUES(@SomeID, @Your, @Vals, @Here)
END
You could do it by deleting the column from the table, then adding the column back with a default value of whatever you need it to be. Saving this will require the table to be rebuilt.

Trigger to build comma separated list of values for insert & update

I have a table in which I want to store all the insert, update and delete operations performed against the database. The table should look like this:
ID Created Operation
1 2012-05-01 INSERT [SomeTable] (ID, Col1) VALUES (1, 'xyz')
2 2012-05-01 UPDATE [SomeTable] SET Col1 = 'abc' WHERE ID = 1
And so on...
Using a trigger on each table, how do I construct the Operation query? Assume that you will not always know the columns in the table causing the trigger, as you may add columns to the table at a later date and don't want to have to go back and rebuild the trigger.
Here might be the start of it:
CREATE TRIGGER tr_MyTable_RowInserted ON MyTable
FOR INSERT
AS
DECLARE @columnList nvarchar(4000)
SET @columnList = NULL
SELECT @columnList = COALESCE(@columnList + ', ', '') + [name] FROM sys.columns WHERE Object_ID = Object_ID(N'MyTable')
print @columnList
-- now we have the columns, need to spit out the values that are in the trigger's inserted table
-- then we need to concatenate all of that into a single, syntactically correct SQL string
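One possible way to continue is an untested sketch like the following (the AuditLog table and the quoting expression are illustrative, not from the original post). It copies inserted into a local temp table, which dynamic SQL can see, builds a quoted-value expression per column, and logs one reconstructed INSERT statement per row:
-- assumes an audit table AuditLog(Created datetime, Operation nvarchar(max))
SELECT * INTO #ins FROM inserted  -- local temp tables are visible to dynamic SQL
DECLARE @values nvarchar(max)
-- for each column, build an expression that renders its value as a quoted
-- literal (or the keyword NULL); this list and @columnList both read
-- sys.columns in the same default order
SELECT @values = COALESCE(@values + ' + '', '' + ', '')
+ 'ISNULL('''''''' + REPLACE(CAST(' + QUOTENAME([name]) + ' AS nvarchar(4000)), '''''''', '''''''''''') + '''''''', ''NULL'')'
FROM sys.columns WHERE Object_ID = Object_ID(N'MyTable')
DECLARE @sql nvarchar(max) =
N'INSERT INTO AuditLog (Created, Operation)
SELECT GETDATE(), N''INSERT [MyTable] (' + @columnList + N') VALUES ('' + ' + @values + N' + N'')'' FROM #ins'
EXEC sp_executesql @sql
-- caveat: the CAST to nvarchar will fail for non-castable types such as varbinary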