Reading sparse columns from a CSV - sql

I get a CSV that I need to read into a SQL table. Right now it's manually uploaded through a web application, but I want to move this into SQL Server. Rather than port my import script straight across into a script in SSIS, I wanted to check and see if there is a better way to do it.
The issue with this particular CSV is that the first few columns are known, and have appropriate headers. However, after that group, the rest of the columns are sparsely populated and might not even have headers.
Example:
Col1,Col2,Col3,,,,,,
value1,value2,value3,,value4
value1,value2,value3,value4,value5
value1,value2,value3,,value4,value5
value1,value2,value3,,,value4
What makes this tolerable is that everything after Col3 can be concatenated together. The script checks each row for these trailing columns and puts them together into a "misc" column. It has to do this somewhat blindly, because there is no way of knowing ahead of time how many of these columns there will be.
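For reference, the concatenation logic is roughly equivalent to the following (a minimal Python sketch of the idea, not my actual script; the file name and column positions are just placeholders):

import csv

# Keep the three known columns, then concatenate any non-empty trailing cells
# into a single "misc" value, however many trailing columns there happen to be.
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        known = row[:3]                               # Col1, Col2, Col3
        trailing = [c for c in row[3:] if c.strip()]  # whatever else is populated
        misc = ",".join(trailing)
        print(known + [misc])  # the real script inserts this row into the SQL table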
Is there a way to do this with SSIS tools, or should I just port my existing import script to an SSIS script task?

Another option outside of SSIS is using BULK INSERT with format files.
Format files allow you to describe the format of the incoming data.
For example:
9.0
4
1 SQLCHAR 0 100 "," 1 Header1 SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "," 2 Header2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "," 3 Header3 SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 100 "\r\n" 4 Misc SQL_Latin1_General_CP1_CI_AS
Bulk Insert: http://msdn.microsoft.com/en-us/library/ms188365.aspx
Format Files: http://msdn.microsoft.com/en-us/library/ms178129.aspx
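For completeness, the BULK INSERT statement that consumes a format file like the one above looks roughly like this. Here it is issued from Python via pyodbc, just as one way to script it; the table name, file paths, and connection string are placeholders, and the same statement can be run directly in SSMS:

import pyodbc  # assumes a SQL Server ODBC driver is installed; adjust the driver name if needed

# Placeholder connection details; swap in your own server, database, and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=ImportDb;Trusted_Connection=yes",
    autocommit=True,
)

# BULK INSERT runs on the server, so both paths must be readable by the SQL Server service account.
conn.cursor().execute("""
    BULK INSERT dbo.ImportTarget
    FROM 'C:\\imports\\data.csv'
    WITH (FORMATFILE = 'C:\\imports\\data.fmt', FIRSTROW = 2);
""")
conn.close()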

Step 0. My test file, with an additional line that ends with a comma:
Col1,Col2,Col3,,,,,,
value1,value2,value3,,value4
value1,value2,value3,value4,value5
value1,value2,value3,,value4,value5
value1,value2,value3,,,value4
ends,with,comma,,,value4,
Drag a DFT onto the Control Flow surface.
Inside the DFT, on the data flow surface, drag a Flat File Source.
Let it map by itself to start with, and check "Column names in the first data row".
You will see Col1, Col2, and Col3, which are your known fields.
You will also see Column 3 through Column 8. These are the columns
that need to be lumped into one Misc column.
Go to the Advanced section of the Flat File Connection Manager Editor.
Rename Column 3 to Misc and set its field size to 4000.
Note: for anything longer than that, you would need the text data type.
That will pose some challenges, so be ready for fun ;-)
Delete Columns 4 through 8.
Now add a Script Component.
Input Columns: select only the Misc field, with Usage Type ReadWrite.
Code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split the lumped Misc field on commas, replace empty entries with "NA",
    // and stitch the values back together into a single comma-separated string.
    string sMisc = Row.Misc;
    string sManipulated = string.Empty;
    string temp = string.Empty;

    string[] values = sMisc.Split(',');
    foreach (string value in values)
    {
        temp = value;
        if (temp.Trim().Equals(string.Empty))
        {
            temp = "NA";
        }
        sManipulated = string.Format("{0},{1}", sManipulated, temp);
    }

    // Drop the leading comma added by the first pass through the loop.
    Row.Misc = sManipulated.Substring(1);
}
Destination: nothing different from usual.
Hope I have understood your problem and that the solution works for you.

Related

psycopg2 copy_from Problems in Python 3

I'm new to Python (and coding) and bit off more than I can chew trying to use copy_from.
I am reading rows from a CSV, manipulating them a bit, then writing them into SQL. Using normal INSERT commands takes a very long time with hundreds of thousands of rows, so I want to use copy_from. The process does work when I use INSERT, though.
The example at https://www.psycopg.org/docs/cursor.html#cursor.copy_from uses tabs as column separators and a newline at the end of each row, so I made each IO line accordingly:
43620929 2018-04-11 11:38:14 30263506 30263503 30262500 0 0 0 0 0 1000 1000 0
That's what the code below outputs with the first print statement:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    cursor.copy_from(thisOutput, 'hands_new')
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
hands_new is an existing, empty SQL table. The second print statement is just [], so it isn't writing to the db. What am I getting wrong?
Obviously if it worked, I could make thisOutput much longer, with lots of rows instead of just the one.
I think I figured it out, so here it is in case anyone comes across this in the future:
The 'thisOutput' format was wrong; I had built it from smaller pieces, adding '\t' etc. myself. It works if instead I do:
copyFromIO(io.StringIO('43620929\t2018-04-11 11:38:14\t30263506\t30263503\t30262500\t0\t0\t0\t0\t0\t1000\t1000\t0\n'))
I also needed to specify the right columns in the copy_from command:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    thisCol = ('pkey', 'created', 'gameid', 'tableid', 'playerid', 'bet', 'pot',
               'isout', 'outround', 'rake', 'endstack', 'startstack', 'stppaid')
    cursor.copy_from(thisOutput, 'hands_new', columns=thisCol)
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
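If you need the buffer to hold many rows rather than one hand-assembled line, it may be easier to let the csv module write the tab-separated text into the StringIO for you. A minimal sketch along the lines of the code above (connection details are placeholders, and the rows here are just the example values):

import csv
import io

import psycopg2

# Placeholder connection string; use your own database and credentials.
conn = psycopg2.connect("dbname=mydb user=myuser")
cursor = conn.cursor()

# In practice these tuples come from reading and manipulating the source CSV.
rows = [
    (43620929, "2018-04-11 11:38:14", 30263506, 30263503, 30262500,
     0, 0, 0, 0, 0, 1000, 1000, 0),
    # ... hundreds of thousands more ...
]

buf = io.StringIO()
# Note: the plain-text COPY format needs extra escaping for values containing
# tabs, newlines, or backslashes; simple numeric/timestamp data like this is fine.
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
buf.seek(0)  # copy_from reads from the current position, so rewind first

cols = ('pkey', 'created', 'gameid', 'tableid', 'playerid', 'bet', 'pot',
        'isout', 'outround', 'rake', 'endstack', 'startstack', 'stppaid')
cursor.copy_from(buf, 'hands_new', columns=cols)
conn.commit()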

Generating variable observations for one id to be observation for new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups) and thereby one can observe the characteristics of an individual's friends. What I have is an 8 digit identifier, id, each id's friend id's (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that record the foreign-born status of each of their friends.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Note that MF1 means male friend 1, and MF1id is the id number for male friend 1. Respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
Edit: I did a poor job of explaining the data structure. I have a cross-section, so there is one observation per unique id. Row 1 is the first 8-digit id number, with all of that person's variables following across the row. The ids repeat only in the sense that the friend ids listed for each person (mf1id, for example) also appear in the id column as those friends' own observations. I hope that is a bit clearer.
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except that it will not work for me. After trying vlookup with my data set, the UCLA data that Dimitriy used for his example, and a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is not unique within `key';"
            di in red /*
               */ "there are multiple observations with different `value'" /*
               */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an "invalid file" error at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why that would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close
input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end
program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)
sort aid
list
Well, this fixed my problem. Thanks to all who responded; I would not have figured this out without you.

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID; in that case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row is now the unique id in the sqlite table.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix, so I guess I could just name them col1 ... col1000? Or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type in col1 ... col1000 manually, which doesn't sound very smart. This is where I am mostly stuck; some code snippet would help me.
Then, I need to dump the text files into the SQLite database. Again, I'm unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for how to dump this matrix into the database...
Finally,
How do I quickly retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there the id1, id2, id30 are hardcoded in the query, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3, and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.

Using bulk insert to import .csv file

I have a CSV file that I want to import into SQL Server 2008 using BULK INSERT. It has 80 columns, and some of them contain commas in the data; for example, the state column has values like NY,NJ,AZ,TX,AR,VA,MA, and there are a few million rows.
So I enclosed the state column in double quotes using a custom format in Excel, so that the column would be treated as a single column and not split at the commas inside it. But the import is still not successful; it still splits at the commas. Can anyone suggest how to successfully import columns containing commas using BULK INSERT?
I am using this code
bulk insert test from 'C:\test.csv'
with (
fieldterminator=',', rowterminator='\n'
)
go
I saw a similar question asked here previously, but I don't know Visual Basic, so I can't apply that code. Is there any other option to modify the file in Excel?
Is there any other option to modify the file in Excel?
It turns out there is, at least in Windows.
Go to Start Menu > Control Panel > Regional and Language Options.
In the Regional Options tab, click the Customize Button.
In the List Separator field, replace the , with a |. Click OK.
Saving a file as a .CSV through Excel will now create a pipe-separated value file. Be sure to undo this change to the Regional Options setting afterwards, as Excel uses the list separator for other things, like separating function arguments.
Then you can do as datagod suggests and bulk upload the file using | as the column delimiter.
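If changing the regional settings is not practical, the same conversion can be scripted. A minimal Python sketch (file names are placeholders) that rewrites the comma-separated file as pipe-separated while respecting the double quotes around fields:

import csv

# The csv module keeps quoted fields (e.g. "NY,NJ,AZ") intact, so embedded
# commas survive the conversion to a pipe-delimited file.
with open("test.csv", newline="") as src, open("test_pipe.csv", "w", newline="") as dst:
    reader = csv.reader(src)               # default: comma delimiter, double-quote quoting
    writer = csv.writer(dst, delimiter="|")
    writer.writerows(reader)

The converted file can then be bulk inserted with fieldterminator='|', or with a format file as described in the next answer.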
You should create a format file: http://msdn.microsoft.com/en-us/library/ms191516.aspx
If your data contains commas, I would choose a different delimiter. You can specify "|" as the delimiter in the format file.
Example:
10.0
4
1 SQLCHAR 0 100 "|" 1 Col1 SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "|" 2 Col2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "|" 3 Col3 SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 7000 "\r\n" 4 Col11 SQL_Latin1_General_CP1_CI_AS

How to control the range of rows while loading a csv file into a SQL table using SSIS

I have a csv file. The format of the csv file is something like this:
[A src dt]
[col1 col2 col3 col4 col5]
[1 2 3 4 5]
[1 2 3 4 5]
[n n n n n]
[z src dt]
I want to load data up to row n; I don't want the last row.
I can skip the first row in the flat file connection manager editor, but how can I skip the last row while inserting data into the SQL table?
Thanks in advance,
David
You could put all the rows into a staging table in your DB, then use some T-SQL to move all but the last row into the recipient table.
You could probably do something with a script transformation in your dataflow to do what you ask solely using SSIS, but it would be a lot more work than the above staging table method.
That is a job for the Script Task :) Write a simple C#/VB script that checks whether it's the last row in the flow (hasMoreRows, I think, is the property) and redirects the row to the appropriate output (or simply eats it :)
Luke
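Another option, if pre-processing the file before the SSIS load is acceptable: the trailer row can be stripped ahead of time with a small script. A minimal Python sketch, assuming the trailer is always exactly the last line (file names are placeholders):

# Drop the trailer (last) line before handing the file to SSIS; the header
# row can still be skipped in the flat file connection manager as usual.
with open("source.csv") as src:
    lines = src.readlines()

with open("source_no_trailer.csv", "w") as dst:
    dst.writelines(lines[:-1])  # everything except the last line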