How to convert a large .csv file with "too many columns" into an SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let's call the first column ID1; each additional column then holds a value for each of these ID1s under a different condition. Let's call these conditions ID2s.
This is the first time I have worked with files this large. I wanted to process the .csv file in R and summarize the values: mean, standard deviation and coefficient of variation for each ID1.
My idea was to read the file directly (with data.table's fread), convert it into "long" data (with dplyr) so I have three columns: ID1, ID2 and value, then group by ID1 and ID2 and summarize. However, I do not seem to have enough memory to read the file (I assume R uses more memory than the file's size to store it).
I think it would be more efficient to first convert the file into a SQL database and then process it from there. I have tried to convert it using sqlite3, but I get an error message stating that the maximum number of columns is 4096.
I have no experience with SQL, so I was wondering what the best way of converting the .csv file into a database would be. I guess reading each column and storing it as a table, or something like that, would work.
I have searched for similar questions, but most of them just say that having so many columns is bad database design. I cannot regenerate the .csv file with a proper structure.
Any suggestions for an efficient way of processing the .csv file?
Edit: I was able to read the initial file in R, but I still have some problems:
1. I cannot write it into an SQLite db because of the "too many columns" limit.
2. I cannot pivot it inside R because I get the error:
Error: cannot allocate vector of size 7.8 Gb
even though my memory limit is high enough. I have 8.5 Gb of free memory and:
> memory.limit()
[1] 16222
I have used @danlooo's code, but the data is not in the format I would like it to be. Probably I was not clear enough in explaining its structure.
Here is an example of how I would like the data to look (ID1 = Sample, ID2 = name, value = value):
> test = input[1:5,1:5]
>
> test
Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322 35.19884
2: AT1G01020 12.166693 18.054251 12.075236 37.14983
3: AT1G01030 9.396695 9.704697 8.211935 4.36051
4: AT1G01040 25.278412 24.429031 22.484845 17.51553
5: AT1G01050 64.082870 66.022141 62.268711 58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
Sample name value
<chr> <chr> <dbl>
1 AT1G01010 DRX007662 12.1
2 AT1G01010 DRX007663 16.3
3 AT1G01010 DRX007664 14.5
4 AT1G01010 DRX014481 35.2
5 AT1G01020 DRX007662 12.2
6 AT1G01020 DRX007663 18.1
7 AT1G01020 DRX007664 12.1
8 AT1G01020 DRX014481 37.1
9 AT1G01030 DRX007662 9.40
10 AT1G01030 DRX007663 9.70
11 AT1G01030 DRX007664 8.21
12 AT1G01030 DRX014481 4.36
13 AT1G01040 DRX007662 25.3
14 AT1G01040 DRX007663 24.4
15 AT1G01040 DRX007664 22.5
16 AT1G01040 DRX014481 17.5
17 AT1G01050 DRX007662 64.1
18 AT1G01050 DRX007663 66.0
19 AT1G01050 DRX007664 62.3
20 AT1G01050 DRX014481 58.1
> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
Sample `mean(value)`
<chr> <dbl>
1 AT1G01010 19.5
2 AT1G01020 19.9
3 AT1G01030 7.92
4 AT1G01040 22.4
5 AT1G01050 62.6
How should I change the code to make it look that way?
Thanks a lot!

Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot fit in RAM. This code reads the text file in chunks, pivots the data into long format and puts it into the SQL database. Then you can access the database with dplyr verbs for summarizing. It uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID2) to have two name columns. A sketch adapted to the Sample/name/value layout from the edit is shown after the example output below.
library(tidyverse)
library(DBI)
library(vroom)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))

file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5

# start with offset 1 to ignore the header
for (chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
  # everything must be character to allow pivoting numeric and text columns
  vroom(file, skip = chunk_offset, n_max = chunk_size,
        col_names = FALSE, col_types = cols(.default = col_character())) %>%
    pivot_longer(everything()) %>%
    dbAppendTable(conn, "data", value = .)
}
data <- conn %>% tbl("data")
data
#> # Source: table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#> name value
#> <chr> <chr>
#> 1 X1 Mazda RX4
#> 2 X2 21
#> 3 X3 6
#> 4 X4 160
#> 5 X5 110
#> 6 X6 3.9
#> 7 X7 2.62
#> 8 X8 16.46
#> 9 X9 0
#> 10 X10 1
#> # … with more rows
data %>%
  # summarise only the 3rd column
  filter(name == "X3") %>%
  group_by(value) %>%
  count() %>%
  arrange(-n) %>%
  collect()
#> # A tibble: 3 × 2
#> value n
#> <chr> <int>
#> 1 8 14
#> 2 4 11
#> 3 6 7
Created on 2022-04-15 by the reprex package (v2.0.1)
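To get the Sample/name/value layout from the edit above, plus the per-ID1 summaries the question asks for, the chunk loop only needs to keep the first column when pivoting. Below is a sketch along those lines, assuming the first column of the file holds the Sample ids; the file path (input.csv), table name (expr_long), chunk size and row count are placeholders, the header is read once so the real column names are kept, and since SQLite has no built-in standard deviation function, sd and the coefficient of variation are derived from AVG after collecting the small per-Sample summary:
library(tidyverse)
library(DBI)
library(vroom)

conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "expr_long",
              tibble(Sample = character(), name = character(), value = character()))

file <- "input.csv"   # placeholder path to the wide 25k x 20k file
header <- names(vroom(file, n_max = 1, col_types = cols(.default = col_character())))
chunk_size <- 500     # rows per chunk; tune to the available memory
n_rows <- 25000       # the question mentions about 25k rows

# skip = 1 on the first pass skips the header line
for (chunk_offset in seq(1, n_rows, by = chunk_size)) {
  vroom(file, skip = chunk_offset, n_max = chunk_size,
        col_names = header, col_types = cols(.default = col_character())) %>%
    pivot_longer(-1, names_to = "name", values_to = "value") %>%  # keep the id column
    rename(Sample = 1) %>%
    dbAppendTable(conn, "expr_long", value = .)
}

stats <- conn %>%
  tbl("expr_long") %>%
  mutate(value = as.numeric(value)) %>%   # values were stored as text
  group_by(Sample) %>%
  summarise(n = n(),
            mean = mean(value, na.rm = TRUE),
            mean_sq = mean(value * value, na.rm = TRUE)) %>%
  collect() %>%   # only one row per Sample comes back into R
  mutate(sd = sqrt(n / (n - 1) * (mean_sq - mean^2)),
         cv = sd / mean)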

Related

How to speed up the loop in a dataframe

I would like to speed up my loop because I have to run it on 900,000 rows.
To simplify, I show a sample below.
I would like to add a column named 'Count' which counts the number of times the score was under the target score for the same player. But for each row the target changes.
Input:
index Nom player score target score
0 0 felix 3 10
1 1 felix 8 7
2 2 theo 4 5
3 3 patrick 12 6
4 4 sophie 7 6
5 5 sophie 3 6
6 6 felix 2 4
7 7 felix 2 2
8 8 felix 2 3
Result:
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
Below is the code I currently use, but is it possible to speed it up? I saw some articles about vectorization; is it possible to apply it to my calculation? If yes, how?
df2 = df.copy()
df2['Count'] = [np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
                for row in df.itertuples()]
print(df2)
Jérôme Richard's insights for an O(n log n) solution can be translated to pandas. The speed-up depends on the number and size of the groups in the dataframe.
df2 = df.copy()
gr = df2.groupby('Nom player')
lookup = gr.score.apply(np.sort).to_dict()
df2['count'] = gr.apply(
    lambda x: pd.Series(
        np.searchsorted(lookup[x.name], x['target score']),
        index=x.index)
).droplevel(0)
print(df2)
Output
index Nom player score target score count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
You can try:
df['Count'] = df.groupby("Nom player").apply(
    lambda x: pd.Series((sum(x["score"] < s) for s in x["target score"]), index=x.index)
).droplevel(0)
print(df)
Prints:
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
EDIT: Quick benchmark:
from timeit import timeit

def add_count1(df):
    df["Count"] = (
        df.groupby("Nom player")
        .apply(
            lambda x: pd.Series(
                ((x["score"] < s).sum() for s in x["target score"]), index=x.index
            )
        )
        .droplevel(0)
    )

def add_count2(df):
    df["Count"] = [
        np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
        for row in df.itertuples()
    ]

def add_count3(df):
    gr = df.groupby('Nom player')
    lookup = gr.score.apply(lambda x: np.sort(np.array(x))).to_dict()
    df['count'] = gr.apply(
        lambda x: pd.Series(
            np.searchsorted(lookup[x.name], x['target score']),
            index=x.index)
    ).droplevel(0)

df = pd.concat([df] * 1000).reset_index(drop=True)  # DataFrame of len=9000

t1 = timeit("add_count1(x)", "x=df.copy()", number=1, globals=globals())
t2 = timeit("add_count2(x)", "x=df.copy()", number=1, globals=globals())
t3 = timeit("add_count3(x)", "x=df.copy()", number=1, globals=globals())
print(t1)
print(t2)
print(t3)
Prints on my computer:
0.7540620159707032
6.63946107000811
0.004106967011466622
So my answer should be faster than the original, but Michael Szczesny's answer is the fastest.
There are two main issues in the current code. CPython string objects are slow, especially string comparison. Moreover, the current algorithm has quadratic complexity: for each row, it scans every other row to find the matching ones. The latter is the biggest issue for large dataframes.
Implementation
The first thing to do is to replace the string comparison with something faster. String objects can be converted to native strings using np.array. Then the unique strings can be extracted, along with their locations, using np.unique. This basically lets us replace the string-matching problem with an integer-matching problem. Comparing native integers is generally significantly faster, mainly because processors handle them well and NumPy can use efficient SIMD instructions to compare them. Here is how to convert the string column to label indices:
# 0.065 ms
labels, labelIds = np.unique(np.array(df.values[:,1], dtype='U'), return_inverse=True)
Now we can group the scores by label (player name) efficiently. The catch is that NumPy does not provide a group-by function. While it is possible to do this efficiently using multiple np.argsort calls, a basic pure-Python dict-based approach turns out to be pretty fast in practice. Here is the code grouping scores by label and sorting the set of scores associated with each label (useful for the next step):
# 0.014 ms
from collections import defaultdict

scoreByGroups = defaultdict(lambda: [])
labelIdsList = labelIds.tolist()
scoresList = df['score'].tolist()
targetScoresList = df['target score'].tolist()

for labelId, score in zip(labelIdsList, scoresList):
    scoreByGroups[labelId].append(score)

for labelId, scoreGroup in scoreByGroups.items():
    scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))
scoreByGroups can now be used to efficiently find the number of scores smaller than a given one for a given label. One just needs to read scoreByGroups[label] (constant time) and then do a binary search on the resulting array (O(log n)). Here is how:
# 0.014 ms
counts = [np.searchsorted(scoreByGroups[labelId], score)
          for labelId, score in zip(labelIdsList, targetScoresList)]

# Copies are slow, but adding a new column seems even slower
# 0.212 ms
df2 = df.copy()
df2['Count'] = np.fromiter(counts, np.int32)
Results
The resulting code takes 0.305 ms on my machine on the example input, while the initial code takes 1.35 ms. This means this implementation is about 4.5 times faster. Two thirds of the time is unfortunately spent in the creation of the new dataframe with the new column. Note that the provided code should be much faster than the initial code on large dataframes thanks to its O(n log n) complexity instead of an O(n²) one.
Faster implementation for large dataframes
On large dataframes, calling np.searchsorted for each item is expensive due to the overhead of NumPy. One solution to easily remove this overhead is to use Numba. The computation can be optimized using a list instead of a dictionary, since the labels are integers in the range 0..len(labelIds). The computation can also be partially done in parallel.
The string-to-int conversion can be made significantly faster using pd.factorize, though this is still an expensive process.
Here is the complete Numba-based solution:
import numba as nb
import numpy as np
import pandas as pd

@nb.njit('(int64[:], int64[:], int64[:])', parallel=True)
def compute_counts(labelIds, scores, targetScores):
    groupSizes = np.bincount(labelIds)
    groupOffset = np.zeros(groupSizes.size, dtype=np.int64)
    scoreByGroups = [np.empty(e, dtype=np.int64) for e in groupSizes]
    counts = np.empty(labelIds.size, dtype=np.int64)
    for labelId, score in zip(labelIds, scores):
        offset = groupOffset[labelId]
        scoreByGroups[labelId][offset] = score
        groupOffset[labelId] = offset + 1
    for labelId in nb.prange(len(scoreByGroups)):
        scoreByGroups[labelId].sort()
    for i in nb.prange(labelIds.size):
        counts[i] = np.searchsorted(scoreByGroups[labelIds[i]], targetScores[i])
    return counts

df2 = df.copy()  # Slow part
labelIds, labels = pd.factorize(df['Nom player'])  # Slow part
counts = compute_counts(  # Pretty fast part
    labelIds.astype(np.int64),
    df['score'].to_numpy().astype(np.int64),
    df['target score'].to_numpy().astype(np.int64)
)
df2['Count'] = counts  # Slow part
On my 6-core machine, this code is much faster on large dataframes. In fact, it is the fastest of the proposed answers. It is only about 2.5 times faster than @MichaelSzczesny's on a random dataframe with 9000 rows. The string-to-int conversion takes 40-45% of the time and the creation of the new Pandas dataframe (with the additional column) takes 25% of the time. The time taken by the Numba function is actually small in the end. Most of the time is finally lost in overheads.
Note that converting to categorical data can be done once (as a pre-computation), and it can be useful for other computations, so it may actually not be so expensive.

Iteratively get the max of a data frame column, add one and repeat for all rows in r

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in SQL or PySpark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
             descriptions = c("dodgers", "yankees", "giants", "orioles", "mets"))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
What I want is:
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
And I can't use arrange() with rowid_to_column(), because the existing ids would be lost.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace NA with that value and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1), but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function, but what I've tried returns a result identical to the input dataframe:
df %>%
  mutate(id = ifelse(is.na(id),
                     map_dbl(id, ~max(.) + 1, na.rm = FALSE),
                     id))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
Thanks in advance--now if someone can help me directly in sql, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
  with cte as (
    select max(id) as maxid from df
  )
  select cte.maxid + row_number() over () as id, df.descriptions
  from df
  left join cte where df.id is null
  union
  select * from df where id is not null")
# id descriptions
# 1 1 dodgers
# 2 2 yankees
# 3 3 giants
# 4 4 orioles
# 5 5 mets
Here is one method where we add the max value to the cumulative sum of a logical vector based on the NA values, and coalesce with the original column 'id':
library(dplyr)
df <- df %>%
  mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
Output:
df
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets

dbplyr, dplyr, and functions with no SQL equivalents [eg `slice()`]

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
I can create this mock SQL database above. And it's very cool that I can perform standard dplyr functions on this "database":
mtcars2 %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(desc(mpg))
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#> # Ordered by: desc(mpg)
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
It appears I'm unable to use dplyr functions that have no direct SQL equivalents (e.g. dplyr::slice()). In the case of slice() I can use the alternative combination of filter() and row_number() to get the same results as just using slice(). But what happens when there's not such an easy workaround?
mtcars2 %>% slice(1:5)
#>Error in UseMethod("slice_") :
#> no applicable method for 'slice_' applied to an object of class
#> "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"
When dplyr functions have no direct SQL equivalents can I force their use with dbplyr, or is the only option to get creative with dplyr verbs that do have SQL equivalents, or just write the SQL directly (which is not my preferred solution)?
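For reference, the filter()/row_number() workaround mentioned above might look like the sketch below (dbplyr translates row_number() to a SQL window function, warns that no explicit ordering is given, and needs an SQLite build with window-function support):
library(tidyverse)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

# roughly equivalent to slice(1:5); row order is whatever the database returns
mtcars2 %>%
  filter(row_number() <= 5) %>%
  collect()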
I understood this question: How can I make slice() work for SQL databases? This is different from "forcing their use" but still might work in your case.
The example below shows how to implement a "poor man's" variant of slice() that works on the database. We still need to do the legwork and implement it with verbs that work on the database, but then we can use it similarly to data frames.
Read more about S3 classes in http://adv-r.had.co.nz/OO-essentials.html#s3.
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
# mtcars2 has a class attribute
class(mtcars2)
#> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
#> [4] "tbl_lazy" "tbl"
# slice() is an S3 method
slice
#> function(.data, ..., .preserve = FALSE) {
#> UseMethod("slice")
#> }
#> <bytecode: 0x560a03460548>
#> <environment: namespace:dplyr>
# we can implement a "poor man's" variant of slice()
# for the particular class. (It doesn't work quite the same
# in all cases.)
#' @export
slice.tbl_sql <- function(.data, ...) {
  rows <- c(...)
  .data %>%
    mutate(...row_id = row_number()) %>%
    filter(...row_id %in% !!rows) %>%
    select(-...row_id)
}
mtcars2 %>%
slice(1:5)
#> # Source: lazy query [?? x 11]
#> # Database: sqlite 3.29.0 [:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Created on 2019-12-07 by the reprex package (v0.3.0)

Return rows whose elements are duplicates, not the logical vector

I know about the duplicated() function. The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get back the rows containing those duplicated elements, not the logical vector.
I want to get back all the observations of A and B because they have duplicated values for the key Name and Year.
I already have coded this:
>df %>% group_by(Name) %>% filter(any(( ?????)))
but I don't know how to write the last part of the code.
Anyone any ideas?
Thanks :)
An option using dplyr is to group on both Name and Year to calculate the count, then group on only Name and filter for groups having any count > 1 (meaning a duplicate):
library(dplyr)
df %>%
  group_by(Name, Year) %>%
  mutate(count = n()) %>%
  group_by(Name) %>%
  filter(any(count > 1)) %>%
  select(-count)
# # A tibble: 7 x 3
# # Groups: Name [2]
# Name Year Value
# <chr> <int> <int>
# 1 A 1990 5
# 2 A 1990 3
# 3 A 1991 5
# 4 A 1995 5
# 5 B 2000 0
# 6 B 2000 4
# 7 B 1998 5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)
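For the record, a slightly more compact way to express the same logic is the sketch below, using dplyr::add_count to fold the mutate(count = n()) step into a single call (not part of the original answer):
library(dplyr)

df %>%
  add_count(Name, Year) %>%   # adds n = group size per Name/Year combination
  group_by(Name) %>%
  filter(any(n > 1)) %>%      # keep Names with any duplicated Year
  ungroup() %>%
  select(-n)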

Dataframe, keep only one column

I can't find the pandas function that returns a one-column DataFrame from a multi-column DataFrame.
I need the exact opposite of the drop(['']) function.
Any ideas?
You can use the following notation to return a single column dataframe:
df = pd.DataFrame(data=np.random.randint(1, 100 ,(10, 5)), columns=list('ABCDE'))
df_out = df[['C']]
Output:
C
0 65
1 48
2 1
3 41
4 85
5 55
6 45
7 10
8 44
9 11
Note: df['C'] returns a Series. You can use the to_frame method to convert that Series into a DataFrame, or use the double brackets, [[]].
For completeness, I would like to show how we can use the parameter drop to obtain a one-column dataframe from a multi-column one. I also explain the result in terms of the tidyverse (paper).
Working with a minimal example for a dataframe DF:
library(tidyverse)
DF <- data.frame(a = 1:2, b = c("e", "f"))
str(DF)
#> 'data.frame': 2 obs. of 2 variables:
#> $ a: int 1 2
#> $ b: chr "e" "f"
By the way, note that in R versions lower than 4.0, column b would be a factor by default (unless we use stringsAsFactors = FALSE).
The operator [ returns a list (a data frame), as it preserves the original structure (data frame):
DF[1]
#> a
#> 1 1
#> 2 2
DF['a']
#> a
#> 1 1
#> 2 2
On the other hand, the operator [[ simplifies the result to the simplest structure possible: a vector, for a one-column dataframe. In all three of its equivalent forms you always get the simplified version (a vector):
DF[[1]]
#> [1] 1 2
DF[['a']]
#> [1] 1 2
DF$a
#> [1] 1 2
Finally, using [ with row and column dimension
DF[, 1]
#> [1] 1 2
also returns the simplified version, because the parameter drop is set to TRUE by default. Setting it to FALSE preserves the structure and you obtain a one-column dataframe:
DF[, 1, drop = FALSE]
#> a
#> 1 1
#> 2 2
A good explanation of this point can be found at: Advanced R by Hadley Wickham, CRC, 2015, section 3.2.1 or section 4.2.5 in the on-line version of the book (June 2021)
Finally, within the tidyverse (CRAN), you always obtain a dataframe (tibble) when selecting one column:
DF %>%
select(2)
#> b
#> 1 e
#> 2 f
DF %>%
select("a")
#> a
#> 1 1
#> 2 2
DF %>%
select(a)
#> a
#> 1 1
#> 2 2
Created on 2021-06-04 by the reprex package (v0.3.0)
It is very simple: just use double brackets to select the column.
It will return the result as a DataFrame; you can check it with type(df).
# First create a data frame to check this
column = df[['Risk']]
print(column)
print(type(column))