Question about reading and saving a large txt-file via {RSQLite} line by line into a DB - sql

Since my hardware is very limited (a dual core with 32-bit Win7 and 4GB of RAM, so I need to make the best of it), I am trying to save a large text file (about 1.2GB) into a DB, which I can then query with SQL-like statements to do some analytics on particular subgroups.
To be honest I'm not familiar with this area, and since I could not find help for my issue by googling, I'll quickly show what I came up with and how I thought things would look:
First I check how many columns my txt-file has:
k <- length(scan("data.txt", nlines=1, sep="\t", what="character"))
Then I open a connection to the text file so that it does not need to be opened
again for every single line:
filecon<-file("data.txt", open="r")
Then I initialize a connection (dbcon) to an SQLite database
dbcon<- dbConnect(dbDriver("SQLite"), dbname="mydb.dbms")
I find out where the position of the first line is
pos<-seek(filecon, rw="r")
Since the first line contains the column-names I save them for later use
col_names <- unlist(strsplit(readLines(filecon, n=1), "\t"))
Next, as a test, I read the first 10 lines, line by line,
and save them into a DB table, which itself (should) contain k columns with column names = col_names.
for(i in 1:10) {
    # print the iteration number every hundred iterations
    if(i %% 100 == 0) {
        print(i)
    }
    # read one line into a variable tt
    tt <- readLines(filecon, n=1)
    # parse tt into a variable tt2, since tt is a single string
    tt2 <- unlist(strsplit(tt, "\t"))
    # every line read and parsed from the text file is immediately saved
    # in the SQLite database table "results" using dbWriteTable()
    dbWriteTable(conn=dbcon, name="results", value=as.data.frame(t(tt2[1:k]), stringsAsFactors=T), col.names=col_names, append=T)
    pos <- c(pos, seek(filecon, rw="r"))
}
If I run this I get the following error
Warning messages:
1: In value[[3L]](cond) :
RS-DBI driver: (error in statement: table results has 738 columns but 13 values were supplied)
Why should I supply 738 columns? If I change k (which is 12) to 738, the code works, but then I have to query the columns as V1, V2, V3, ... and not by the column names I intended to supply:
res <- dbGetQuery(dbcon, "select V1, V2, V3, V4, V5, V6 from results")
Any help or even a small hint is very much appreciated!
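For reference, a minimal sketch of the append step with the intended column names, assuming dbWriteTable() takes the column names from the data frame itself (not from a col.names argument) and that no "results" table with a different layout already exists:
# hypothetical rewrite of the body of the loop above
tt2 <- unlist(strsplit(readLines(filecon, n = 1), "\t"))
row_df <- as.data.frame(t(tt2[1:k]), stringsAsFactors = FALSE)
names(row_df) <- col_names   # header names instead of the default V1, V2, ...
# first call would create the table with these names if it does not already exist
dbWriteTable(conn = dbcon, name = "results", value = row_df, append = TRUE)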

Related

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A', 'B', 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])

process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid

    """
    echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
    """
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning", because the order in the table doesn't matter. I could extract the last part of the working folder, which seems to be unique per process, but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable (plus it's a bit long, but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's unfortunately no way to access the unique task hash without some hack to parse $PWD. It might, however, be possible to use Bash parameter substitution to avoid sed/awk/cut (assuming Bash is your shell, of course); you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same process. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted, for example:
process example {
    ...

    script:
    def idx = String.format("%06d", task.index)
    """
    echo "${idx}"
    """
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
    ...

    script:
    def uuid = UUID.randomUUID().toString()
    """
    echo "${uuid}"
    echo "${uuid.take(6)}"
    echo "${uuid.takeBefore('-')}"
    """
}

Issue automating CSV import to an RSQLite DB

I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, which is a list of data.frame variables stored in the environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(),"")
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = csvFiles[i], overwrite = T)
  i = i + 1
}
# EXAMPLE CODE THAT SUCCESSFULLY MANUAL IMPORTS INTO mydb
dbWriteTable(mydb,"DEPARTMENT",DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument, since you are passing a string value where a data frame object is expected. Notice your manual version does not have DEPARTMENT quoted for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = get(csvFiles[i]), overwrite = T)
}
Alternatively, consider building a list of named dataframes with mget and loop element-wise between list's names and df elements with Map:
dfs <- mget(csvFiles)
output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite = T), names(dfs), dfs)

unable to merge large files in r

I have run into a problem.
I have 10 large separate files, of file type File and without column headers, which total nearly 4GB and require merging. I have been told they are text files and pipe delimited, so I added the txt file extension to each file, which I hope is not the problem. RStudio crashes when I use the following code...
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){ read.csv(file = x, header = F, sep = "|") })
  Reduce(function(x, y) { merge(x, y, all = T) }, datalist)
}
mymergeddata = multmerge("C://FolderName//FolderName")
or when I try to do something like this...
temp1 <- read.csv(file="filename.txt", sep="|")
:
temp10 <- read.csv(file="filename.txt", sep="|")
SomeData = Reduce(function(x, y) merge(x, y), list(temp1..., temp10))
I am seeing errors such as
"Error: C stack usage is too close to the limit r" and
"In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
Reached total allocation of 8183Mb: see help(memory.size)"
Then I saw that someone asked a question on SO, here, as I was writing this question, so I was wondering whether an SQL command can be used in R Studio or SSMS to merge these large files? If it can, how would it be done? Please can you advise me how to do this; I will keep looking around on the net.
If it can't, then what is the best method to merge these rather large files? Can this be achieved in R Studio, or is there an open-source option?
I am working on a PC which has 64-bit Windows with 8GB of RAM. I have included the R and SQL tags to see what options there are.
Thanks in advance if anyone can help me.
Your machine doesn't have enough memory for your selected operations.
You have 10 files ~ 4GB in total.
When you merge the 10 files you create another object which is also about 4GB, putting you very close to your machine's limit.
Your operating system and R and whatever else you're running also consume RAM so it's no surprise you run out of RAM.
I'd suggest taking a stepwise approach if you don't have access to a bigger machine:
- take the first two files and merge them.
- delete those file objects from R and keep only the merged one.
- load the third file and merge it with the earlier merged object.
Repeat until done.
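A minimal sketch of that stepwise loop, assuming the same pipe-delimited, headerless files as in the question (the folder path is a placeholder):
filenames <- list.files("C://FolderName//FolderName", full.names = TRUE)
# start from the first file, then fold in the remaining files one at a time
merged <- read.csv(filenames[1], header = FALSE, sep = "|")
for (f in filenames[-1]) {
  nxt <- read.csv(f, header = FALSE, sep = "|")
  merged <- merge(merged, nxt, all = TRUE)
  rm(nxt)   # drop the intermediate object
  gc()      # and ask R to release its memory before the next file
}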

concatenate text files and import them into a SQLite DB

Let us say I have thousands of comma-separated text files with 1050 columns each (no header). Is there a way to concatenate and import all the text files into one table, in one SQLite database? (Ideally I'd use R and sqldf to communicate with SQLite.)
I.e.,
The files are called table1.txt, table2.txt, table3.txt; they all have a different number of rows, but the same column types, and different unique IDs in the ID column (the first column of each file).
table1.txt
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
table2.txt
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
table3.txt
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
The real example is pretty much the same but with more columns and more rows. As you can see, the first column in each file corresponds to a unique ID.
Now I'd like my new table, mysupertable, in the DB super.db to be (also uniquely indexed):
super.db - name of the DB
mysupertable - name of the table in the DB
myids,v1,v2,v3
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
For reference, I am using SQLite3, and I am looking for an SQL command that I can run in the background without logging interactively into the sqlite3 interpreter, i.e., IMPORT bla INTO,...
I could try in unix:
cat *.txt > allmyfiles.txt
and then a .sql file,
CREATE TABLE test (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import output.csv test
But this does not work, since I am using the R sqldf library with dbGetQuery(db, sql), and I have no idea how to create such a string in R without getting an error.
P.S. I asked a similar question about appending tables from a DB, but this time I need to append/import text files, not tables from a DB.
If you are using sqlite database files anyway, you might want to consider working with RSQLite.
install.packages( "RSQLite" ) # will install package "DBI"
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "super.db" )
You can still use the unix command from within R, which should be faster than any loop in R, via the system() command:
system( "cat *.txt > allmyfiles.txt" )
Provided that your allmyfiles.txt has a consistent format, you can import it as a data.frame into R
allMyFiles <- read.table( "allmyfiles.txt", header = FALSE, sep = "," )
and write it to your database, following #Martín Bel's advice, with something like
dbWriteTable( db, "mysupertable", allMyFiles, overwrite = TRUE, append = FALSE )
EDIT:
Or, if you don't want to route your data through R, you can again resort to using the system() command. This may get you started:
You have a file with the data you want to get into SQLite called allmyfiles.txt. Create a file called table.sql with this content (obviously the structure must match):
CREATE TABLE mysupertable (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import allmyfiles.txt mysupertable
and call it from R with
system( "sqlite3 super.db < table.sql" )
That should avoid routing the data through R but still do all the work from within R.
Take a look at termsql:
https://gitorious.org/termsql/pages/Home
cat *.txt | termsql -d ',' -t mysupertable -c 'myids,v1,v2,v3' -o mynew.db
This should do the job.

sybase update getting slower and slower

I have a big text file, about 4GB with more than 8 million lines. I'm writing a Perl script to read this file line by line, do some processing, and update the info to Sybase. I do this in a batched way, 1000 lines per update commit, but here comes the problem: at first a batch only costs 10 to 20 seconds, but as the processing goes on, updating a batch becomes slower and slower, until a batch costs 3 to 4 minutes. I definitely have no idea why this is happening! Can anybody help me analyse what the cause may be? Thanks in advance, on my knees...
==>I'm writing a perl script to read this file line by line, do some processing and update the info to sybase
Please do the entire processing in one go, meaning process your source file in one pass: prepare the data structures (hashes, arrays) as required, and then start inserting the data into the database.
Please keep the points below in mind while inserting large amounts of data into a database.
1- If each column's data is not too large, you can also insert the entire data set in one go (you may need plenty of RAM; I am not sure how much, because it depends on the data set you need to process).
2- You should use execute_array of Perl DBI so that you can insert the data in one go.
3- If you do not have sufficient RAM to insert the data in one go, then please divide your data (maybe into 8 parts, 1 million lines each time).
4- Also make sure that you prepare the statement only once; on every run you just execute it with a new data set.
5- Set AutoCommit off.
Here is sample code using execute_array of Perl DBI. I have used this to insert around 10 million rows into MySQL.
Keep your data in arrays, like below:
# @column1_data, @column2_data, @column3_data hold the values for each column
print {$logfile_handle} "Total records to insert--" . scalar(@column1_data);
print {$logfile_handle} "Inserting data into database";

# prepare the statement once; execute_array binds each array to one placeholder column
my $sth = $$dbh_ref->prepare("INSERT INTO $tablename (column1,column2,column3) VALUES (?,?,?)")
    or print {$logfile_handle} "ERROR- Couldn't prepare statement: " . $$dbh_ref->errstr and exit;

my $tuples = $sth->execute_array(
    { ArrayTupleStatus => \my @tuple_status },
    \@column1_data,
    \@column2_data,
    \@column3_data
);
$$dbh_ref->do("commit");
print {$logfile_handle} "Data Insertion Completed.";

if ($tuples) {
    print {$logfile_handle} "Successfully inserted $tuples records\n";
} else {
    # log those tuples which were not inserted
    for my $tuple (0 .. $#column1_data) {
        my $status = $tuple_status[$tuple];
        $status = [0, "Skipped"] unless defined $status;
        next unless ref $status;
        printf {$logfile_handle} "ERROR- Failed to insert (%s,%s,%s): %s\n",
            $column1_data[$tuple], $column2_data[$tuple], $column3_data[$tuple], $status->[1];
    }
}