Store multiple disconnected datasets in SQL correctly

I have multiple datasets with identical schema and I am not sure how to design the SQL correctly here. The question is very simple, but I just do not have experience with SQL. Let's say there are 40 tables that store matrix data as row_num, col_num, val. Each such table has its own name. Because the tables have hundreds of millions of rows, fitting all of them into just one table seems wrong from a performance standpoint. So I am thinking of creating 40 tables, but I am not sure what the optimal schema should look like in this case. Each such matrix table, in turn, will have related tables with a different schema:
table_of_type_MATRIX_1 --> table_of_type_BIRDS (relevant for table_of_type_MATRIX_1 only!)
table_of_type_MATRIX_2 --> table_of_type_BIRDS (relevant for table_of_type_MATRIX_2 only!)
So basically there is a collection of essentially disconnected datasets that I want to store in one database, and I am not sure how to organize it. There will, of course, be queries that require looking into multiple tables with identical schema. Any suggestions would be greatly appreciated.
Example
The matrix looks like this:
gene cell_id expr
0 0610005C13Rik GCTAAGTATTTN_CTL-6_OPC 0.000000
1 0610007N19Rik GCTAAGTATTTN_CTL-6_OPC 0.000000
2 0610007P14Rik GCTAAGTATTTN_CTL-6_OPC 3.593143
3 0610009B22Rik GCTAAGTATTTN_CTL-6_OPC 3.593143
4 0610009D07Rik GCTAAGTATTTN_CTL-6_OPC 10.779429
...
and dozens of millions more rows
It is a matrix of gene expression: the first column holds the gene that is expressed in the cell shown in the second column, with the expression level in the third. The cells (second column) are also grouped into clusters after dimensionality-reduction and clustering algorithms are run, so we have a second table related to the first:
cell_id cluster
GCTAAGTATTTN_CTL-6_OPC 1
GCTGGGTATTTN_CTL-6_OPC 2
GCTAAGTATAAN_CTL-6_OPC 2
GCTAAGTATTTN_CTL-6_OPC 3
...
and so on for all of the cells
So these two related tables, the gene expression matrix and the cell-to-cluster assignment, form a disconnected dataset in themselves. There will be many such groups of two tables that need to be stored.
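For concreteness, here is a minimal sketch of one such table pair in generic SQL; all table and column names are placeholders, and whether to create 40 pairs like this or to fold everything into two shared tables keyed by a dataset ID is exactly the design question above:

-- one dataset = one matrix table plus its cluster table (names illustrative)
CREATE TABLE expression_matrix_1 (
    gene    TEXT NOT NULL,
    cell_id TEXT NOT NULL,
    expr    REAL NOT NULL,
    PRIMARY KEY (gene, cell_id)
);

CREATE TABLE cell_clusters_1 (
    cell_id TEXT NOT NULL,
    cluster INTEGER NOT NULL
);

-- a typical cross-table query: mean expression of one gene per cluster
SELECT c.cluster, AVG(m.expr) AS mean_expr
FROM expression_matrix_1 m
JOIN cell_clusters_1 c ON c.cell_id = m.cell_id
WHERE m.gene = '0610007P14Rik'
GROUP BY c.cluster;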

Related

A different merge

So I have two tables and these are the samples:
df1:
Element        Range     Family
Ae_aag2/0013F  5-2500    Chuviridae
Ae_aag2/0014F  300-2100  Flaviviridae
df2:
Element  Range      Family
0012F    30-720     Chuviridae
0013F    23-1200    Chuviridae
0013F    1300-2610  Xinmoviridae
And I need to join the tables in the following logic:
Element_df1    Element_df2                      Family_df1  Family_df2
Ae_aag2/0013F  "0013F:23-1200,0013F:1300-2610"  Chuviridae  "Chuviridae,Xinmoviridae"
I need the rows that the two dataframes have in common on the Element column to end up on one line, keeping the Element of the first and of the second, and also the Family of the first and of the second. If three elements match across the two dataframes, they should be joined into one single line.
I tried using merge in pandas, but it gives me two lines, not the one I need.
I searched and didn't find how to customize the way the two dataframes are merged. I tried using groupby afterwards, but that kind of made it worse :(
Unfortunately I don't have much experience working with pandas. Please be kind, I'm new to the subject.
Use:
df1.drop(columns='Range').merge(
    df2.assign(group=lambda d: d['Element'],                       # keep the bare element as the grouping key
               Element=lambda d: d['Element'] + ':' + d['Range'])  # rewrite each element as "element:range"
       .groupby('group')[['Element', 'Family']].agg(','.join),     # collapse duplicates to comma-separated strings
    left_on=df1['Element'].str.extract('/(.*)$', expand=False),    # join on the part of df1's Element after "/"
    right_index=True, suffixes=('_df1', '_df2')
)  # .drop(columns='key_0')  # uncomment to remove the key
Output:
key_0 Element_df1 Family_df1 Element_df2 Family_df2
0 0013F Ae_aag2/0013F Chuviridae 0013F:23-1200,0013F:1300-2610 Chuviridae,Xinmoviridae
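For reference, a sketch reconstructing the two sample frames above, so the answer can be run as-is:

import pandas as pd

df1 = pd.DataFrame({'Element': ['Ae_aag2/0013F', 'Ae_aag2/0014F'],
                    'Range':   ['5-2500', '300-2100'],
                    'Family':  ['Chuviridae', 'Flaviviridae']})

df2 = pd.DataFrame({'Element': ['0012F', '0013F', '0013F'],
                    'Range':   ['30-720', '23-1200', '1300-2610'],
                    'Family':  ['Chuviridae', 'Chuviridae', 'Xinmoviridae']})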

How to label a whole dataset?

I have a question. I have a pandas dataframe that contains 5000 columns and 12 rows. Each row represents the signal received from one electrocardiogram lead. I want to assign 3 labels to this dataset. These 3 labels belong to the entire dataset and are not related to a specific row. How can I do this?
I have attached a picture of my dataframe.
and my labels are: Atrial Fibrillation:0,
right bundle branch block:1,
T Wave Change:2
I tried to assign 3 labels to a large dataset
(Not for a specific row or column)
but I didn't find a solution.
As you see, it has 12 rows and 5000 columns. Each row holds 5000 samples from one specific lead, and overall we have 12 leads, which correspond to the 12 rows (I, II, III, aVR, ..., V6) in my dataframe. Professional experts have assigned 3 labels to this dataframe, which helps us train an ML model to detect different heart diseases. I have 10000 dataframes just like this, and each one has 3 or 4 specific labels. Here is my question: how can I assign these 3 labels to the dataset I mentioned? As I said before, these labels don't refer to specific rows; each dataframe has 3 or 4 labels for the whole thing. I mean, how can I assign 3 labels to a whole dataframe?
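One possible approach, sketched below under the assumption that each recording lives in its own 12x5000 dataframe: keep the per-recording labels outside the table itself, either in the frame's .attrs metadata dictionary or as an explicit (frame, labels) pair; the integer encoding follows the mapping given in the question.

import numpy as np
import pandas as pd

leads = ['I', 'II', 'III', 'aVR', 'aVL', 'aVF',
         'V1', 'V2', 'V3', 'V4', 'V5', 'V6']

# hypothetical stand-in for one 12-lead recording (12 rows x 5000 samples)
df = pd.DataFrame(np.random.randn(12, 5000), index=leads)

# option 1: attach the labels to the frame via its .attrs metadata dict;
# they describe the whole frame, not any particular row or column
df.attrs['labels'] = [0, 1, 2]   # Atrial Fibrillation, RBBB, T Wave Change

# option 2: keep explicit (frame, labels) pairs, e.g. to feed an ML pipeline
dataset = [(df, [0, 1, 2])]      # one entry per recording

Since pandas documents .attrs as experimental, the explicit pair in option 2 is the more conservative choice.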

Levenshtein for multiple words on multiple columns

I'm trying to make search a bit more friendly and wanted to exploit the Levenshtein distance. This works great, but if a value in a column is 25 characters long, the distance to an input of only 3 characters is too large; in that case it performs worse than the LIKE method. I solved this by splitting all words into their own rows using regexp_split_to_table. This is nice, but it still doesn't work if I have multiple words as input.
For example:
Say the data looks as follows:
id  col1     col2
1   one two  three
2   two      one
3   horse    tree
4   house    three
using regexp_split_to_table would transform this to
id  col
1   one
1   two
1   three
2   one
2   two
3   horse
3   tree
4   house
4   three
If I search for "one tree", I'd like to compare "one" with each word, but also compare "tree" with each word, and then order by the sum of both distances.
I have no idea where to start. I also do not know whether this is the best approach (it seems somewhat excessive, but I'm not an expert). Maybe I'm overthinking this. I'd appreciate a hint in the right direction :).
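A sketch of one way to do this in PostgreSQL, assuming the levenshtein() function from the fuzzystrmatch extension and a table named mytable shaped like the first example: for each search term take the best (minimum) distance to any word of a row, then order rows by the sum over terms.

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides levenshtein()

WITH row_words AS (    -- one row per (id, word), as with regexp_split_to_table above
    SELECT id, regexp_split_to_table(col1 || ' ' || col2, '\s+') AS word
    FROM mytable
), search_terms AS (   -- one row per word of the search input
    SELECT regexp_split_to_table('one tree', '\s+') AS term
), best_per_term AS (  -- best match of each term within each id
    SELECT rw.id, st.term, min(levenshtein(rw.word, st.term)) AS dist
    FROM row_words rw
    CROSS JOIN search_terms st
    GROUP BY rw.id, st.term
)
SELECT id, sum(dist) AS total_dist
FROM best_per_term
GROUP BY id
ORDER BY total_dist;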

recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example, I have a data frame where the columns a, b, c are names of survey items and the rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change the values of all observations for selected columns. For instance, value 1 of columns "a" and "c" should be replaced with the string "low", and values 2 and 3 of these columns should be replaced with "high".
I do this often with many columns, so I am looking for a function that can do it in a very simple way, like this:
recode2(data=d, columns=a,c, "1=low, 2,3=high").
The recode function from the car package is almost OK, but if I have 10 columns to recode I have to rewrite the call 10 times, which is not as efficient as I want.
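A minimal base-R sketch of such a recode2 helper; the function name mirrors the hypothetical call above, with the recoding passed as a named vector rather than a string:

d <- data.frame(a = c(1, 2, 3), b = c(1, 3, 2), c = c(1, 2, 1))

# apply the same value -> label mapping to every selected column
recode2 <- function(data, columns, map) {
  data[columns] <- lapply(data[columns], function(x) unname(map[as.character(x)]))
  data
}

recode2(d, c("a", "c"), c("1" = "low", "2" = "high", "3" = "high"))
#      a b    c
# 1  low 1  low
# 2 high 3 high
# 3 high 2  low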

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID; in that case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row now is the unique id in the table in sqlite.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but mostly to store tabular data where each column had a name; in this case each column is just one point of the data matrix. I guess I could just name them col1 through col1000, or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
            "CREATE TABLE School
             (SchID INTEGER,
              Location TEXT,
              Authority TEXT,
              SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
            "CREATE TABLE mymatrixdata
             (myid TEXT,
              col1 float,
              col2 float,
              .... etc.....
              col1000 float)")
That is, I would have to type col1 through col1000 manually, which doesn't sound very smart. This is where I am mostly stuck; a code snippet would help me.
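One way around typing 1,000 column names, sketched here, is to build the CREATE TABLE statement as a string (the table and column names just follow the example above):

# generate "col1 float, col2 float, ..., col1000 float" programmatically
cols <- paste(sprintf("col%d float", 1:1000), collapse = ", ")
dbSendQuery(conn = db, sprintf("CREATE TABLE mymatrixdata (myid TEXT, %s)", cols))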
Then I need to dump the text files into the SQLite database. Again, I'm unsure how to do this from R.
Seems I could do something like this:
library(RSQLite)
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname = "myDBname")
mymatrix.df <- read.table(<full name of the text file containing one of the matrices>)
mymatrix <- as.matrix(mymatrix.df)
Here I need to know the code for dumping this into the database...
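A sketch of that step, writing each file's transposed matrix into its own table (the table name is illustrative):

# transpose so each of the 300,000 IDs becomes a row, then write it out;
# depending on the RSQLite version, row.names = TRUE may be needed to keep the IDs
dbWriteTable(db, "matrix_file_1", as.data.frame(t(mymatrix)))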
Finally,
How can I quickly retrieve the values for some of the rows (by ID) using R, without having to read the entire matrices each time?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. That is, sometimes I may want id1, id2, id10, id100, and another time id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
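A sketch of one way to build that query dynamically, assuming the IDs live in the myid column from the CREATE TABLE sketch above:

ids   <- c("id1", "id2", "id30")
query <- sprintf("SELECT * FROM mymatrixdata WHERE myid IN (%s)",
                 paste(sprintf("'%s'", ids), collapse = ", "))
sqldf(query, dbname = "Test2.sqlite")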
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using Linux server, SQlite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write; the row names become a row_names column
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
                          order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
REVISED based on comments.