store matrix data in SQLite for fast retrieval in R - sql

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, and in such case now the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns, and 30,0000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now its of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before but it was mostly to store tabular data where each column had a name, in this case each column is just one point of the data matrix, I guess I could just name it col1 ... to col1000? or there are better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I have to type in col1 to ... col1000 manually, that doesn't sound very smart. This is where I am mostly stuck. Some code snippet would help me.
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to now the coe on how to dump this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But it the id1,id2,id30 are hardcoded in the code and I need to dynamically obtain them. I.e., sometimes i may want id1, id2, id10, id100; and another time i may want id80, id90, id250000, etc.
Something like this would be more approp for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using Linux server, SQlite 3 and R 2.13 (I could update it as well).

In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.

Related

Compare 2 Datasets and get differences in Spark-Scala

I have a Dataset from Audit Table which is 24 hrs Old (Ds1) and current Day Changes as of Now (Ds2)
How to get differences of values (individual cells) in a 3rd - DiffDs
If the two dataset have the same schema you can do
val delta = ds2.except(ds1)
that from the doc Returns a new Dataset containing rows in this Dataset but not in another Dataset.
This will be the delta of the newest record and the oldest record.
If the schema of the dataset are different in my opinion all ds2 are the difference.
But this how you say on the comment return the difference on the entire Row.
I think that, to extract the cell difference you need to do something like this:
val diff = ds1.union(ds2).distinct() //that contains all different record
diff.rdd.keyBy(r(key_index_here)).groupByKey(a_simple_function_that_compute_column_difference)
Now you have to write a function that compute the difference in terms of cell from a sequence of Row that are grouped by a key.

R: Matching two tables on multiple columns and creating a matched/not matched flag

I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
create table
x as
select
a.*,
b.*,
case when a.first_column=b.column_first and
a.second_column=b.column_second
then 1 else 0 end as matched_flag
from table1 as a
left join
table2 as b
on a.first_column=b.column_first and a.second_column=b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and given them similar row names and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical of the same length of the first argument (in this case, rownames(mat.a)) that indicates if the ith element of the first argument was found anywhere in the elements of the second argument. This nature of %in% means that you have to be sensitive to how you order the arguments for your indexing.
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, you do not need to be sensitive to how you order the arguments, because the number of row names of mat.a in row names of mat.b is equivalent to the number of row names of mat.b in row names of mat.a. That is to say that this usage of %in% is commutative.
I hope this helps!
You will want to use dataframe objects. These are like datasets in SAS. You can use bind to put two dataframe objects together side by side. Then you can select rows based on conditions and set the flag based on this. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(list(a.first_column=c(1,2,3),a.second_column=c(4,5,6)))
table2 <- data.frame(list(b.first_column=c(1,3,6),b.second_column=c(4,5,9)))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when the first columns match)
x$matched_flag[x$a.first_column==x$b.first_column] <- 1
x$matched_flag[!x$a.first_column==x$b.first_column] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.

PIG Item Count and Histogram

This is a two part problem:
PART 1:
I am using the cloudera pig editor to transform my data. The data set is derived from the US Patents Citations data set. The first column is the "Cited" patent. The remaining data is the patents that cite the first patent.
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
I need to create PIG code that will count the number of citation that the first patent has. So, I need the output to be:
3858241 5
3858242 4
3858243 7
3858244 6
3858245 7
3858246 3
3858247 6
PART 2:
I need to create a histogram of the output from problem 1 using a PIG script.
Any help would be greatly appreciated.
Thanks
this script should work.
X = LOAD 'pigpatient.txt' using PigStorage(' ') AS (pid:int,str:chararray);
X1 = FOREACH X GENERATE pid,STRSPLIT(str, ',') AS (y:tuple());
X2 = FOREACH X1 GENERATE pid,SIZE(y) as numofcitan;
dump X2;
X3 = group X2 by numofcitan;
Histograms = foreach X3 GENERATE group as numofcitan,COUNT(X2.pid);
dump Histograms;
input:
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
Result:
(3858241,5)
(3858242,4)
(3858243,7)
(3858244,6)
(3858245,7)
(3858246,3)
(3858247,5)
Histogram output:
Number of citatatins,number of patients
(3,1)
(4,1)
(5,2)
(6,1)
(7,2)
#Sravan K Reddy's answer is good enough to be a solution, but it is essential to know what is histogram?
Histogram is frequency distribution of datasets and gives statistical information about data. Most commonly used histogram types are; Equi-width and equi-depth which is called equi-height or height-balanced.
In database tools, equi-depth histogram is prefered. ex: Oracle see
#Sravan K Reddy intends to create equi-width histogram of patent citations. However, in order to create histogram, data must be sorted. That is vital for histogram construction.
If you want to create histogram of your big data, read this paper and check Apache Pig Scripts.

apache spark sql query optimization and saving result values?

I have a large data in text files (1,000,000 lines) .Each line has 128 columns .
Here each line is a feature and each column is a dimension.
I have converted the txt files in json format and able to run sql queries on json file using spark.
Now i am trying to build a kd tree with this large data .
My steps :
1) calculate variance of each column pick the column with maximum variance and make it as key first node , mean of the column as the value of the node.
2) based on the first node value split the data into 2 parts an repeat the process until you reach a point.
my sample code :
import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * From people")
the people table has 128 columns
My questions :
1) How to save result values of a query into a list ?
2) How to calculate variance of a column ?
3) i will be runnig multiple queries on same data .Does spark has any way to optimize it ?
4) how to save the output as key value pairs in a text file ?
please help

Looping calculations from data frames

I have a large dataset coming in from SQLdf. I use split to order it by an index field from the query and list2env to split these into several data frames. These data frames will have names like 1 through 178. After splitting them, i want to do some calculations on all of them. How should i "call" a calculations for 1 through 178 (might change from day to day) ?
Simplification: one dataset becomes n data frames splitted on an index (like this):
return date return benchmark_returen index
28-03-2014 0.03 0.05 6095
with typically 252 * 5 obs (IE: 5 years)
then i want to split these on the index into (now 178 dfs)
and perform typically risk/return analytics from the PerformanceAnalytics package like for example chart.Histogram or charts.PerformanceSummary.
In the next step i would like to group these and insert them into a PDF for each Index. (the graphs/results that is).
As others have pointed out the question lacks a proper example but indexing of environments can be done as with lists. In order to construct a list the have digits as index values one needs to use backticks, and arguments to [[ when accessing environments need to be characters
> mylist <- list(`1`="a", `2`="b")
> myenv <- list2env(mylist)
> myenv$`1`
[1] "a"
> myenv[[as.character(1)]]
[1] "a"
If you want to extract values (and then possibly put them back into the environment:
sapply(1:2, function(n) get(as.character(n), envir=myenv) )
[1] "a" "b"
myenv$calc <- with(myenv, paste(`1`, `2`))