CSV Import with Validation - sql

I have a need to import a number of CSV files into corresponding tables in SQL. I am trying to write a stored procedure that will import any of these CSV files, using a number of parameters to set things like file name, destination name etc.
Simple so far. The problem comes from the structure of this DB. Each data table has a number of columns (usually 5) that follow a set format, and then however many data columns you want. There is then a set of data validation tables that contain the specific sets of values those 5 columns are allowed to contain. So the problem is that when I do the import from CSV, I need to validate that each imported row meets the criteria in these validation tables: essentially, that there is a row in the validation table whose data matches the 5 columns in the imported data.
If a row does not match, the procedure needs to write an error to the log and not import it; if it does match, it should be imported.
Here is an example of what I mean:
Data Table (where the imported data will go)
|datatype|country|currency| datacolumn1 | datacolumn2 |
|1 | 2 | GBP | 10000 | 400 |
|3 | 4 | USD | 10000 | 400 |
Validation table
|datatype|country|currency|
|1 |2 |GBP |
|2 |3 |USD |
So the first line is valid: it has a matching record in the validation table for the first 3 columns. The second does not, and should be rejected.
The added problem is that each table can reference a different validation table (although many reference the same one), so the columns that have to be checked often vary in number and name.
My first problem is really how to do a row-by-row check when importing from CSV. Is there any way to do so without importing into a temporary table first?
After that, what is the best way to check that the columns match, in a generic way, despite the fact that the name and number of columns change depending on which table is being imported?

You can import the contents of a CSV into a temporary table by using this:
SELECT * into newtable FROM
OPENROWSET ('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir={Directory Path of the CSV File};',
'SELECT * from yourfile.csv');
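One thing to check before relying on OPENROWSET: on many SQL Server installations ad hoc distributed queries are disabled by default, so this call can fail until the option is turned on (requires sysadmin rights). A quick sketch of enabling it:
-- Enable ad hoc OPENROWSET/OPENDATASOURCE calls (server-wide setting).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;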
Once you have your data in a SQL table, you can use an inner join against the validation table on all of the validated columns at once to narrow it down to the valid rows.
SELECT A.*
FROM newtable A
INNER JOIN validation_table B
    ON A.Datatype = B.Datatype
   AND A.Country  = B.Country
   AND A.Currency = B.Currency
This should give you the valid rows according to your validation rules.
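To cover the logging half of the question, the rejected rows are just the ones with no matching validation record, which you can pick out with NOT EXISTS. A rough sketch, where import_error_log is a hypothetical logging table and the column names come from the example above:
-- Log rows that have no matching row in the validation table (illustrative names).
INSERT INTO import_error_log (datatype, country, currency, error_message)
SELECT A.datatype, A.country, A.currency, 'No matching validation row'
FROM newtable A
WHERE NOT EXISTS (
    SELECT 1
    FROM validation_table V
    WHERE V.datatype = A.datatype
      AND V.country  = A.country
      AND V.currency = A.currency
);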

SSIS would let you check, filter, and process data while it was being loaded. I'm not aware of any other native SQL tool that does this. Without SSIS (or a third-party tool), you'd have to first load all the data from a file into some kind of "staging" table (#temp or a dedicated permanent table) and work from there.
@Pavan Reddy's OPENROWSET solution should work. I've used views, where I first determined the rows in the source file, built a "mapping" view on the target table, and then BULK INSERTED into the view (which also lets you play games with defaults on "skipped columns").
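As a rough illustration of that view + BULK INSERT idea (all names and the file path are made up, and the view must expose exactly the columns present in the file, in file order):
-- Hypothetical mapping view over the target table, then a bulk load into it.
CREATE VIEW dbo.vw_DataTable_Import AS
SELECT datatype, country, currency, datacolumn1, datacolumn2
FROM dbo.DataTable;
GO
BULK INSERT dbo.vw_DataTable_Import
FROM 'C:\imports\yourfile.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);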
(Just to mention, you can launch an SSIS package from a stored procedure, using xp_cmdshell to call DTEXEC. It's complex and requires a host of parameters, but it can be done.)
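For what it's worth, that xp_cmdshell + DTEXEC call might look roughly like this; the package path, the variable name, and the fact that xp_cmdshell is enabled are all assumptions:
-- Illustrative only: run a file-system SSIS package and pass it a file name.
DECLARE @cmd VARCHAR(1000);
SET @cmd = 'dtexec /F "C:\Packages\ImportCsv.dtsx" '
         + '/SET \Package.Variables[User::FileName].Properties[Value];"C:\imports\yourfile.csv"';
EXEC master..xp_cmdshell @cmd;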


BigQuery Create Table Query from Google Sheet with Variable item string field into Repeated Field

I hope I explain this adequately.
I have a series of Google Sheets with data from an Airtable database. Several of the fields are stringified arrays with recordIds to another table.
These fields can have between 0 and n comma-separated values.
I run a create/overwrite table SELECT statement to create native BigQuery tables for reporting. This works great.
Now I need to add the recordIds to a Repeated field.
I've manually written to a repeated field using:
INSERT INTO `robotic-vista-339622.Insurly_dataset.zzPOLICYTEST` (policyID, locations, carrier)
VALUES ('12334556',[STRUCT('recordId1'),STRUCT('recordId2')], 'name of policy');
However, I need to know how to do this using a SELECT statement rather than an INSERT. I also need to know how to do this when you do not know the number of recordIds that have been retrieved from Airtable. One record could have none and another record could have 10 or more.
Any given sheet will look like the following, where "locations" contains the recordIds I want to add to a repeated field.
SHEETNAME: POLICIES
|policyId |carrier | locations |
|-----------|-----------|---------------------------------|
|recrTkk |Workman's | |
|rec45Yui |Workman's |recL45x32,recQz70,recPrjE3x |
|recQb17y |ABC Co. |rec5yUlt,recIrW34 |
In the above, the first row/record has no location Id's. And then three and two on the subsequent rows/records.
Any help is appreciated.
Thanks.
I'm unsure if answering my own question is the correct way to show that it was solved... but here is what it took.
I created a native table in BigQuery; the field for locations is a STRING with mode REPEATED.
Then I just run an overwrite table SELECT statement.
SELECT recordId, Name, Amount, SPLIT(locations) AS locations FROM `projectid.datasetid.googlesheetsdatatable`;
Tested, and I can run linked queries on the locations with UNNEST.
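For anyone else landing here, querying that repeated column afterwards can look something like the following; the table name is a placeholder for the native table created above:
-- Flatten the repeated STRING column back out to one row per recordId.
-- Rows with an empty locations array drop out here; use LEFT JOIN UNNEST(...) to keep them.
SELECT t.recordId, t.Name, location
FROM `projectid.datasetid.nativetable` AS t,
     UNNEST(t.locations) AS location;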

Redshift IN condition on thousands of values

What's the best way to get data that matches any one of ~100k values?
For this question, I'm using an Amazon Redshift database and have a table something like this with hundreds of millions of rows:
--------------------
| userID | c1 | c2 |
| 101000 | 12 | 'a'|
| 101002 | 25 | 'b'|
____________________
There are also millions of unique userIDs. I have a CSV list of 98,000 userIDs that I care about, and I want to do math on the columns for those specific users.
select c1, c2 from table where userID in (10101, 10102, ...)
What's the best solution to match against a giant list like this?
My approach was to make a Python script that read in the result of all users in our condition set, then filtered against the CSV in Python. It was dead slow and wouldn't work in all scenarios, though.
A coworker suggested uploading the 98k users into a temporary table, then joining against it in the query. This seems like the smartest way, but I wanted to ask if you all had ideas.
I also wondered whether printing an insanely long SQL query containing all 98k users to match against and running it would work. Out of curiosity, would that even have run?
As your coworker suggests, put your IDs into a temporary table by uploading a CSV to S3 and then using COPY to import the file into a table. You can then use an INNER JOIN condition to filter your main data table on the list of IDs you're interested in.
An alternative option, if uploading a file to S3 isn't possible for you, could be to use CREATE TEMP TABLE to set up a table for your list of IDs and then use a spreadsheet to generate a whole set of INSERT statements to populate the temp table. 100K inserts could be quite slow, though.
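A rough sketch of the S3 + COPY + join route (bucket, IAM role, and table names are placeholders):
-- Stage the ~98k IDs, load them from S3, then join to the big table.
CREATE TEMP TABLE wanted_users (userID BIGINT);

COPY wanted_users
FROM 's3://your-bucket/user_ids.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
CSV;

SELECT t.userID, t.c1, t.c2
FROM your_big_table t
INNER JOIN wanted_users w ON w.userID = t.userID;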

VB.NET Access Database 255 Columns Limit

I'm currently developing an application for a client using Visual Basic .NET. It's a rewrite of an application that accessed an Oracle database, filtered the columns and performed some actions on the data. Now, for reasons beyond my control, the client wants to use an Access (.mdb) database for the new application. The problem with this is that the tables have more than the 255 columns Access supports, so the client suggested splitting the data into multiple databases/tables.
Well, even when the tables are split, at some point I have to query all columns simultaneously (I did an INNER JOIN on both tables), which, of course, yields an error. The limit apparently applies to the number of columns you can query simultaneously, not to the total number of columns.
Is there a possibility to circumvent the 255-column limit somehow? I was thinking in the direction of using LINQ to combine queries of both tables, i.e. have an adapter that emulates a single table I can perform queries on. A drawback of this is that .mdb is not a first-class citizen of LINQ-to-SQL (i.e. no insert/update supported, etc.).
As a workaround, I might be able to rewrite my stuff so as to only need all columns at one point (I dynamically create control elements depending on the column names in the table). Therefore I would need to query, say, the first 250 columns and after that the following 150.
Is there an Access SQL query that can achieve something like this? I thought of something like SELECT TOP 255 * FROM dbname or SELECT * FROM dbname LIMIT 1,250, but these are not valid.
Do I have other options?
Thanks a lot for your suggestions.
The ADO.NET DataTable object has no real limitations on the number of columns that it could contain.
So, once you have split the big table into two tables and set the same primary key in both sub-tables with fewer columns, you can use, on the VB.NET side, the DataTable.Merge method.
In the MSDN example they show two tables with the same schema merged together, but it also works if you have two totally different schemas with just the primary key in common:
Dim firstPart As DataTable = LoadFirstTable()
Dim secondPart As DataTable = LoadSecondTable()
firstPart.Merge(secondPart)
I have tested this just with only one column of difference, so I am not very sure that this is a viable solution in terms of performance.
As far as I know, there is no way to directly bypass this problem using Access.
If you cannot change the DB, the only way I can think of is to make a wrapper that understands where the fields are, automatically splits the query into multiple queries, and then regroups the results into a custom class containing all the columns for every row.
For example, you can split every table into multiple tables, duplicating the fields you make the conditions on.
TABLEA
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN |
in
TABLEA_1
Id | ConditionFieldOne | ConditionFieldTwo | Data1 | Data2 | ... | DataN/2 |
TABLEA_2
Id | ConditionFieldOne | ConditionFieldTwo | Data(N/2)+1 | Data(n/2)+2 | ... | DataN |
and a query like
SELECT * FROM TABLEA WHERE CONDITION1 = 'condition'
becomes, with the wrapper,
SELECT * FROM TABLEA_1 WHERE ConditionFieldOne = 'condition'
SELECT * FROM TABLEA_2 WHERE ConditionFieldOne = 'condition'
and then join the results.

Merge only Missing Data

I am working on an HR project that provides data to me in the form of an Excel document.
I have created a package that captures the data from the Spreadsheet and imports it into SQL. The customer then wanted to create a data connection and place the data into Pivot Tables to manipulate and run calculations on.
This brought to light a small issue that I have tried to get fixed at the source, but it looks like it cannot be resolved on the system side (working with an SAP backend).
What I have is information that comes into SQL from the import that is either missing the Cost Center Name or both the cost center number and the cost center name.
EXAMPLE:
EmpID EmployeeName CostCenterNo CostCenterName
001 Bob Smith 123456 Sales
010 Adam Eve 543211 Marketing
050 Thomas Adams 121111
121 James Avery
I worked with HR to get the appropriate information for these employees, I have added the information to a separate table.
What I would like to do is figure out a way to insert the missing information as the data is imported into the Staging table.
Essentially completing the data.
EmpID EmployeeName CostCenterNo CostCenterName
001 Bob Smith 123456 Sales
010 Adam Eve 543211 Marketing
050 Thomas Adams 121111 Supply Chain
121 James Avery 555316 Human Resources
Is there an issue with a basic update like
Update <tablename> set CostCenterNo = (SELECT CostCenterNo from <hr_sourced_table> where EmpID =x) where EmpID = x
In case it's needed, you can add
Where CostCenterNo is null
because even if you did not add it, it would update all the rows, which should still be correct. If you need to, you can also update both fields in a single query like this:
Update <tablename>
set CostCenterNo = (SELECT CostCenterNo from <hr_sourced_table> where EmpID = x),
    CostCenterName = (SELECT CostCenterName from <hr_sourced_table> where EmpID = x)
where EmpID = x
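If the staging table and the HR-sourced table can be joined on EmpID, a set-based version avoids running the statement once per employee. This is just a sketch; StagingTable and HRSource stand in for <tablename> and <hr_sourced_table> above:
-- Fill in both columns for every staged row that has a match in the HR table.
UPDATE s
SET s.CostCenterNo   = h.CostCenterNo,
    s.CostCenterName = h.CostCenterName
FROM StagingTable AS s
INNER JOIN HRSource AS h
    ON h.EmpID = s.EmpID
WHERE s.CostCenterNo IS NULL
   OR s.CostCenterName IS NULL;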
If your data source table and the extra mapping information are both accessible from the same place, you don't have to update anything with SSIS. Just build a view that joins the two tables and populate the pivot table from the view. You will have to decide what to do if the data source and the mapping table disagree, but that is a business rule question.
Select e.EMPLID, e.EmployeeName, cc.CostCenterNo, cc.CostCenterName
From Employees e
Left Join CCMapping cc on e.emplid=cc.emplid
OR
Select e.EMPLID, e.EmployeeName,
coalesce(e.CostCenterNo, cc.CostCenterNo) as CostCenterNo,
coalesce(e.CostCenterName, cc.CostCenterName) as CostCenterName
From Employees e
Left Join CCMapping cc on e.emplid=cc.emplid
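If the pivot table needs a single object to point at, either query can be wrapped in a view; the view name here is just an example:
CREATE VIEW dbo.vw_EmployeeCostCenters AS
SELECT e.EMPLID, e.EmployeeName,
       COALESCE(e.CostCenterNo, cc.CostCenterNo) AS CostCenterNo,
       COALESCE(e.CostCenterName, cc.CostCenterName) AS CostCenterName
FROM Employees e
LEFT JOIN CCMapping cc ON e.emplid = cc.emplid;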
I would use a lookup transformation in your data flow that sources the missing data you got from HR. Then join this lookup data on a mutual field in the data coming from your sources (EmpID?). You can then add the cost center no and cost center name from the missing data table to the data flow. In a derived column transformation you can test to see if the data from the source is null and if so, use the columns that came from the missing data table to store in the destination table.
As I see it, your options are to complete the data in flight or update the data after it has landed. Which route I would chose would be dependent on the level of complexity.
In flight
Generally speaking, this is my preference. I'd rather have all the scrubbing take place while the data is moving versus applying a series of patches afterward to shine the data.
In your Data Flow, I would have a Conditional Split to funnel the data into 2 to 3 streams: Has all data, has cost center and has nothing.
"Has all data" would route directly into a Union All
"Has cost center" would lead to a Lookup Component which would use the supplied Cost Center to lookup against the reference table to acquire the text associated to the existing value. The Lookup Component expects to find matches so if the possibility exists that a Cost Center will not exist in your reference table, you will need to handle that situation. Depending on what version of SSIS you are using will determine whether you can just use the Unmatched Output column (2008+) or whether you have to commandeer the Error Output (2005). Either way, you will need to indicate to the Lookup that failure to match should not result in a package level failure. Once you've handled this lookup and handled the no-match option, join that stream to the Union All.
"has nothing" might behave as the "has cost center" stream where you will perform some lookup on other columns to determine cost center or you might simply apply a default/known-unknown value for the missing entities. How that works will depend on the rules your business owners have supplied.
Post processing
This keeps your data flow exactly as it is. You would simply add an Execute SQL Task after the Data Flow to polish any tarnished data. Whether I would do this entirely in-line in the Execute SQL Task or create a dedicated clean-up stored procedure would be based in part on the level of effort it takes to get code changed. Some places, pushing an SSIS package change is a chip-shot activity. Other places, it takes an act of the SOX deities to get a package change pushed, but they were fine with proc changes.
My gut would be to push the scrubber logic into a stored procedure. Then your package wouldn't have to change every time they come up with scenarios that the original queries didn't satisfy.
You would have 2 statements in the proc, much as we described in the In flight section. One query will update rows that are only missing the Cost Center name; the other will apply both the cost center number and the name. If you need help with the specifics of the actual query, let me know and I can update this answer.
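As a minimal sketch of what that proc could contain (table and column names here are placeholders, not the OP's actual schema):
-- Hypothetical clean-up proc: fill missing names first, then rows missing both values.
CREATE PROCEDURE dbo.usp_ScrubCostCenters
AS
BEGIN
    -- Rows that have a cost center number but no name: look the name up by number.
    UPDATE s
    SET s.CostCenterName = m.CostCenterName
    FROM dbo.StagingEmployee AS s
    INNER JOIN dbo.CostCenterLookup AS m ON m.CostCenterNo = s.CostCenterNo
    WHERE s.CostCenterName IS NULL;

    -- Rows missing both: match on the employee instead.
    UPDATE s
    SET s.CostCenterNo   = m.CostCenterNo,
        s.CostCenterName = m.CostCenterName
    FROM dbo.StagingEmployee AS s
    INNER JOIN dbo.EmployeeCostCenterFixes AS m ON m.EmpID = s.EmpID
    WHERE s.CostCenterNo IS NULL;
END;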
I worked with another developer to create a solution, here is what we came up with.
I Created an "Execute SQL Task" to run after the data flow that has this script in it.
MERGE [Staging].[HRIS_EEMaster] AS tgt
USING (
SELECT PersNo AS EmpID,
CostCenterNo AS CCNo,
CostCenterName AS CCName
FROM [dbo].[MissingTermedCC]
) AS src ON src.EmpID = tgt.PersNo
WHEN NOT MATCHED BY TARGET
THEN INSERT (
PersNo,
CostCenterNo,
CostCenterSubDiv
)
VALUES (
src.EmpID,
src.CCNo,
src.CCName
)
WHEN MATCHED
THEN UPDATE
SET tgt.CostCenterNo = CASE
WHEN src.CCNo > '' THEN src.CCNo
ELSE tgt.CostCenterNo
END,
tgt.CostCenterSubDiv = CASE
WHEN src.CCName > '' THEN src.CCName
ELSE tgt.CostCenterSubDiv
END;
I wanted to share in case anyone else runs into a similar issue. Thanks again for all of the help everyone.

Update existing database values from spreadsheet

I have an existing MSSQL database where the values in some columns need updating according to a spreadsheet which contains the mappings of old data and new data.
The spreadsheet is like this:
| OLD DATA | NEW DATA |
RECORD | A | B | C | D | A | B | C | D |
1 |OLD|OLD|OLD|OLD|NEW|NEW|NEW|NEW|
2 |OLD|OLD|OLD|OLD|NEW|NEW|NEW|NEW|
Where ABCD are the column names, which relate to the database, and OLD / NEW relates to the data.
Thus, for each line (approx. 2,500 rows), the database values that match OLD in each column need to be changed to NEW.
My current thoughts are to do it in a similar way to this:
SQL Statement that Updates an Oracle Database Table from an Excel Spreadsheet
Essentially getting Excel to formulate a list of replace statements, though this feels like a horribly convoluted way to deal with the problem!
Is there a way to have SQL cycle through each row of the spreadsheet, check all records for a=old, b=old2, c=old3, d=old4 and then replace those values with the appropriate a=new, b=new2, c=new3, d=new4?
You shouldn't need to loop through each row in the spreadsheet. You can use the OPENROWSET command, like in the answer you linked to, to load the spreadsheet data into a sort of temporary table. You can then run a regular UPDATE statement against that table.
It would look something like this
UPDATE YourTable
SET YourTable.A = ExcelTable.NewDataA,
YourTable.B = ExcelTable.NewDataB,
YourTable.C = ExcelTable.NewDataC,
YourTable.D = ExcelTable.NewDataD
FROM YourTable
INNER JOIN OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=C:\foldername\spreadsheetname.xls;',
'SELECT column1name, column2name, column3name, column4name
FROM [worksheetname$]') AS ExcelTable
ON YourTable.ID = ExcelTable.ID
WHERE (YourTable.A = ExcelTable.OldDataA
AND YourTable.B = ExcelTable.OldDataB
AND YourTable.C = ExcelTable.OldDataC
AND YourTable.D = ExcelTable.OldDataD)
Looks like Jeff got you the answer you needed, but for anyone looking to update a database from a Google Sheet, here's an alternative, using the SeekWell desktop app. For a version of this answer with screen shots, see this article.
Get the right rows and columns into the spreadsheet (looks like #sbozzie already had this)
Write a SQL statement that SELECTs all the columns you want to be able to update in your sheet. You can add filters as normal in the WHERE clause.
Select 'Sheets' at the top of the app and open a Sheet. Then click the destination icon in the code cell and select "Sync with DB".
Add your table and primary key
In the destination inputs, add your table name and the primary key for the table.
Running the code cell will add a new Sheet with the selected data and your table name as the Sheet name. Please note that you must start your table in cell A1. It's a good idea to include an ORDER BY.
Add an action column
Add a "seekwell_action" column to your Sheet with the action you'd like performed for each row. Possible actions are:
Update - updates all columns in the row (unique primary key required)
Insert - adds the row to your database (you need to include all columns required for your database)
Sync - An Update action will be taken every time the query runs, on a schedule (see "5. Set Schedule" below)
Complete - status after the schedule has run (see below) and the actions have been taken. The new data should now be in your database. Note that 'Sync' actions will never show complete, as they run every time the schedule runs. To stop a 'Sync' action, change it manually.
Set Schedule
To execute the actions, select the clock icon at the top of the application, indicate the frequency and exact time, and click 'Save.' You can manage your schedule from the inserted 'RunSheet' (do not delete this sheet or you will need to reset your schedule) or from seekwell.io/profile. If you need to run the actions immediately, you can do so from /profile.
Gotchas
You need to start your table in cell A1.
Snowflake column names are case sensitive. Be sure to respect this when specifying the primary key, etc.
If your server is behind a firewall, you will need to whitelist SeekWell's static IP address to use scheduling. See more about whitelisting here.