How to create multiple dataframes in R from different sql files - sql

I have 49 .db files.
I want to open them in R and store each file's contents in a dataframe for further use.
I am able to do it for one file, but I want to modify the code to do it for all 49 .db files in one go.
This is the code that I am using for one file:
sqlite <- dbDriver("SQLite")
dbname <- "en_Whole_Blood.db"
db = dbConnect(sqlite,dbname)
wholeblood_df <- dbGetQuery(db,"SELECT * FROM weights")
View(wholeblood_df)
I tried to use the list.files function to do it for all the .db files, but it's not working: it only creates a dataframe for the last file.
This is the code for it:
library("RSQLite")
sqlite <- dbDriver("SQLite")
sqlite <- dbDriver("SQLite")
dbname <- data_files
dbname
for (i in length(dbname){
db=dbConnect(sqlite,dbname[i])
df <- dbGetQuery(db,"SELECT * FROM weights")
}
## This only gives me the last .db file as a dataframe.
Does anyone know how I can edit this code to get 49 dataframes, one for each .db file?
Thank you.

Try replacing the for loop with lapply:
list_of_df <- lapply(dbname, function(x) {
db <- dbConnect(sqlite, x)
df <- dbGetQuery(db, "SELECT * FROM weights")
})
I'm not experienced in handling SQL and/or connections, but I think it might work.
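One small addition (a sketch; it assumes dbname holds the .db file paths): close each connection with on.exit and name the list elements after the files, so each dataframe is easy to look up later:
list_of_df <- lapply(dbname, function(x) {
db <- dbConnect(sqlite, x)
on.exit(dbDisconnect(db))
dbGetQuery(db, "SELECT * FROM weights")
})
# name each element after its source file for easy lookup
names(list_of_df) <- dbname
list_of_df[["en_Whole_Blood.db"]]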
Edit
Second alternative maintaining the for loop:
df <- list()
for (i in 1:length(dbname)) {
db <- dbConnect(sqlite,dbname[i])
df[[i]] <- dbGetQuery(db, "SELECT * FROM weights")
}
Note that df[[i]] keeps each result as its own list element; c() would flatten the query result's columns into the list instead of storing 49 dataframes.
Hope it helps

Another suggestion:
files <- list.files(pattern = "\\.db$")
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from weights")
})
oneframe <- do.call(rbind, list_of_frames)
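If you also need to know which file each row came from after rbind, a small variant (a sketch; the source column name is my own choice):
list_of_frames <- Map(function(df, fn) transform(df, source = fn),
list_of_frames, files)
oneframe <- do.call(rbind, list_of_frames)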
Reproducible example
Create data (you don't need this):
for (i in 1:3) {
db <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars%i.db", i))
DBI::dbWriteTable(db, "weights", mtcars[i * 5 + 1:3,], overwrite = TRUE)
DBI::dbDisconnect(db)
}
Working solution:
files <- list.files(pattern = "\\.db$")
files
# [1] "mtcars1.db" "mtcars2.db" "mtcars3.db"
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from mt")
})
list_of_frames
# [[1]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# [[2]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
# [[3]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 2 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 3 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
oneframe <- do.call(rbind, list_of_frames)
oneframe
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 5 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 6 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 9 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Tidyverse alternative:
library(dplyr) # just for %>%, could use magrittr as well
library(purrr) # map_dfr
oneframe <- files %>%
map_dfr(~ {
db <- DBI::dbConnect(RSQLite::SQLite(), .)
on.exit(DBI::dbDisconnect(db))
DBI::dbGetQuery(db, "select * from weights")
})
### same result
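If a file-of-origin column is useful here as well, map_dfr's .id argument can add one when the input vector is named (a sketch; the source column name is my own choice):
oneframe <- files %>%
setNames(files) %>%
map_dfr(~ {
db <- DBI::dbConnect(RSQLite::SQLite(), .)
on.exit(DBI::dbDisconnect(db))
DBI::dbGetQuery(db, "select * from weights")
}, .id = "source")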

Related

How to convert large .csv file with "too many columns" into SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let's call the first column ID1; each additional column is then a value for each of these ID1s in a different condition. Let's call these ID2s.
This is the first time I've worked with such large files. I wanted to process the .csv file in R and summarize the values: mean, standard deviation and coefficient of variation for each ID1.
My idea was to read the file directly (with data.table's fread) and convert it into "long" data (with dplyr) so I have three columns: ID1, ID2 and value. Then group them by ID1, ID2 and summarize. However, I do not seem to have enough memory to read the file (I assume R needs more memory than the file's size to store it).
I think it would be more efficient to first convert the file into a SQL database and then process it from there. I have tried to convert it using sqlite3, but it gives me an error stating that the maximum number of columns is 4096.
I have no experience with SQL, so I was wondering what the best way of converting the .csv file into a database would be. I guess reading each column and storing it as a table, or something like that, would work.
I have searched for similar questions but most of them just say that having so many columns is a bad db design. I cannot generate the .csv file with a proper structure.
Any suggestions for an efficient way of processing the .csv file?
Edit: I was able to read the initial file in R, but I still ran into some problems:
1. I cannot write it to a SQLite db because of the "too many columns" limit.
2. I cannot pivot it inside R because I get the error:
Error: cannot allocate vector of size 7.8 Gb
This happens even though my memory limit is high enough: I have 8.5 Gb of free memory and:
> memory.limit()
[1] 16222
I have used @danlooo's code, but the data is not in the format I would like it to be. Probably I was not clear enough in explaining its structure.
Here is an example of how I would like the data to look (ID1 = Sample, ID2 = name, value = value):
> test = input[1:5,1:5]
>
> test
Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322 35.19884
2: AT1G01020 12.166693 18.054251 12.075236 37.14983
3: AT1G01030 9.396695 9.704697 8.211935 4.36051
4: AT1G01040 25.278412 24.429031 22.484845 17.51553
5: AT1G01050 64.082870 66.022141 62.268711 58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
Sample name value
<chr> <chr> <dbl>
1 AT1G01010 DRX007662 12.1
2 AT1G01010 DRX007663 16.3
3 AT1G01010 DRX007664 14.5
4 AT1G01010 DRX014481 35.2
5 AT1G01020 DRX007662 12.2
6 AT1G01020 DRX007663 18.1
7 AT1G01020 DRX007664 12.1
8 AT1G01020 DRX014481 37.1
9 AT1G01030 DRX007662 9.40
10 AT1G01030 DRX007663 9.70
11 AT1G01030 DRX007664 8.21
12 AT1G01030 DRX014481 4.36
13 AT1G01040 DRX007662 25.3
14 AT1G01040 DRX007663 24.4
15 AT1G01040 DRX007664 22.5
16 AT1G01040 DRX014481 17.5
17 AT1G01050 DRX007662 64.1
18 AT1G01050 DRX007663 66.0
19 AT1G01050 DRX007664 62.3
20 AT1G01050 DRX014481 58.1
> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
Sample `mean(value)`
<chr> <dbl>
1 AT1G01010 19.5
2 AT1G01020 19.9
3 AT1G01030 7.92
4 AT1G01040 22.4
5 AT1G01050 62.6
How should I change the code to make it look that way?
Thanks a lot!
Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data cannot live in RAM. This code reads the text file in chunks, pivots the data into long format and writes it to the SQL database. Then you can access the database with dplyr verbs for summarizing. This uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID2) to have two name columns.
library(tidyverse)
library(DBI)
library(vroom)
conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))
file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5
# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
# everything must be character to allow pivoting numeric and text columns
vroom(file, skip = chunk_offset, n_max = chunk_size,
col_names = FALSE, col_types = cols(.default = col_character())
) %>%
pivot_longer(everything()) %>%
dbAppendTable(conn, "data", value = .)
}
data <- conn %>% tbl("data")
data
#> # Source: table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#> name value
#> <chr> <chr>
#> 1 X1 Mazda RX4
#> 2 X2 21
#> 3 X3 6
#> 4 X4 160
#> 5 X5 110
#> 6 X6 3.9
#> 7 X7 2.62
#> 8 X8 16.46
#> 9 X9 0
#> 10 X10 1
#> # … with more rows
data %>%
# summarise only the 3rd column
filter(name == "X3") %>%
group_by(value) %>%
count() %>%
arrange(-n) %>%
collect()
#> # A tibble: 3 × 2
#> value n
#> <chr> <int>
#> 1 8 14
#> 2 4 11
#> 3 6 7
Created on 2022-04-15 by the reprex package (v2.0.1)
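To get the OP's desired layout (Sample / name / value), a hedged adaptation of the chunk loop: keep the first column as the row id when pivoting, and give the table a Sample column. The X1 column name comes from vroom's col_names = FALSE numbering and is an assumption here, as is the final in-database summary (mean() translates to SQL AVG; sd() would additionally need RSQLite::initExtension(conn), since SQLite has no built-in stdev):
dbCreateTable(conn, "data",
tibble(Sample = character(), name = character(), value = character()))
# inside the chunk loop, pivot everything except the id column:
vroom(file, skip = chunk_offset, n_max = chunk_size,
col_names = FALSE, col_types = cols(.default = col_character())
) %>%
pivot_longer(-X1, names_to = "name", values_to = "value") %>%
rename(Sample = X1) %>%
dbAppendTable(conn, "data", value = .)
# per-Sample summary, computed in the database:
tbl(conn, "data") %>%
mutate(value = as.numeric(value)) %>%
group_by(Sample) %>%
summarise(mean = mean(value, na.rm = TRUE)) %>%
collect()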

Calculate mean-deviated values (subtract mean of all columns except one from this one column)

I have a dataset with the following structure:
df <- data.frame(id = 1:5,
study = c("st1","st2","st3","st4","st5"),
a_var = c(10,20,30,40,50),
b_var = c(6,5,4,3,2),
c_var = c(3,4,5,6,7),
d_var = c(80,70,60,50,40))
I would like to calculate the difference between each column that has _var in its name and the mean of all other columns containing _var in their names, like this:
mean_deviated_value <- function(data, variable) {
md_value = data[,variable] - rowMeans(data[,names(data) != variable])
md_value
}
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
df$b_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "b_var")
df$c_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "c_var")
df$d_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "d_var")
Which gives me my desired output:
id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
1 1 st1 10 6 3 80 -19.666667 -12.33333 -9.80 83.80000
2 2 st2 20 5 4 70 -6.333333 -16.91667 -10.35 70.76667
3 3 st3 30 4 5 60 7.000000 -21.50000 -10.90 57.73333
4 4 st4 40 3 6 50 20.333333 -26.08333 -11.45 44.70000
5 5 st5 50 2 7 40 33.666667 -30.66667 -12.00 31.66667
How do I do it in one go, without repeating the code, preferably with dplyr/purrr?
I tried this:
df %>%
mutate(across(contains("_var"), ~ list(md = .x - rowMeans(select(., contains("_var") & !.x)))))
And got this error:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'select' applied to an object of class "c('double', 'numeric')"
We can use map_dfc with transmute to create *_md columns, and glue syntax for the names.
library(tidyverse)
nms <- names(df) %>%
str_subset('^.*_')
bind_cols(df, map_dfc(nms, ~transmute(df, '{.x}_md' := mean_deviated_value(select(df, contains("_var")), .x))))
#> id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
#> 1 1 st1 10 6 3 80 -19.666667 -25.00000 -29.00000 73.66667
#> 2 2 st2 20 5 4 70 -6.333333 -26.33333 -27.66667 60.33333
#> 3 3 st3 30 4 5 60 7.000000 -27.66667 -26.33333 47.00000
#> 4 4 st4 40 3 6 50 20.333333 -29.00000 -25.00000 33.66667
#> 5 5 st5 50 2 7 40 33.666667 -30.33333 -23.66667 20.33333
Note that if you use sequential assignment, the first call to rowMeans will use b_var, c_var and d_var. But the second time, contains("_var") will also capture the previously created a_var_md and use it to compute the means. I don't know if this is intended behaviour, but it is worth mentioning.
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
select(df, contains("_var"))
#> a_var b_var c_var d_var a_var_md
#> 1 10 6 3 80 -19.666667
#> 2 20 5 4 70 -6.333333
#> 3 30 4 5 60 7.000000
#> 4 40 3 6 50 20.333333
#> 5 50 2 7 40 33.666667
We can avoid this by replacing contains("_var") with matches("^.*_var$")
Created on 2021-12-20 by the reprex package (v2.0.1)
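For completeness, a single-pass alternative with across() (a sketch; it assumes dplyr >= 1.1.0 for pick()):
library(dplyr)
vars <- grep("_var$", names(df), value = TRUE)
df %>%
mutate(across(all_of(vars),
~ .x - rowMeans(pick(all_of(vars))[setdiff(vars, cur_column())]),
.names = "{.col}_md"))
Because all four *_md columns are computed inside one across() call from the original columns, the sequential-assignment pitfall above does not arise.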

How to read all tables from a SQLite database and store as datasets/variables in R?

I have a large SQLite database with many tables. I have established a connection to this database in RStudio using RSQLite and DBI packages. (I have named this database db)
library(RSQLite)
library(DBI)
At the moment I have to read in all the tables and assign them a name manually. For example:
country <- dbReadTable(db, "country")
date <- dbReadTable(db, "date")
#...and so on
You can see this can be a very time-consuming process when there are many tables.
So I was wondering if it is possible to create a new function or using existing functions (e.g. lapply() ?) to complete this more efficiently and essentially speed up this process?
Any suggestions are much appreciated :)
Two mindsets:
All tables/data into one named-list:
alldat <- lapply(setNames(nm = dbListTables(db)), dbReadTable, conn = db)
The benefit of this is that if the tables have similar meaning, then you can use lapply to apply the same function to each. Another benefit is that all data from one database are stored together.
See How do I make a list of data frames? for working on a list-of-frames.
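For instance, applying one function to every table is then a one-liner (a minimal sketch):
# row counts for every table in the database
sapply(alldat, nrow)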
If you want them as actual variables in the global (or enclosing) environment, then take the previous alldat, and
ign <- list2env(alldat, envir = .GlobalEnv)
The return value from list2env is the environment we passed to it, so it's not terribly useful in this context (though it is at other times). The only reason I capture it in ign is to reduce clutter on the console, which is minor. list2env works primarily by side effect, so the return value is not critical here.
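An equivalent with an explicit loop, if list2env feels opaque (a sketch):
# assign each table to a variable named after it
for (nm in names(alldat)) assign(nm, alldat[[nm]])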
You can use dbListTables() to generate a character vector of all the table names in your SQLite database and use lapply() to import them into R efficiently. I would first check that you are able to fit all the tables in your database into memory.
Below is a reproducible example of this:
library(RSQLite)
library(DBI)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "mtcars", mtcars)
dbWriteTable(db, "iris", iris)
db_tbls <- dbListTables(db)
tbl_list <- lapply(db_tbls, dbReadTable, conn = db)
tbl_list <- setNames(tbl_list, db_tbls)
dbDisconnect(db)
> lapply(tbl_list, head)
$iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
$mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
df_temp = df.loc[df.Income_Quantile == level]
dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first-level index, and the "Income_Quantile" as the second-level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01

UTF-8 encoding with dplyr and SQLite

I have a table in SQLite and I'd like to open it with dplyr. I use SQLite Expert Version 35.58.2478 and RStudio Version 0.98.1062 on a PC with Windows 7.
After connecting to the database with src_sqlite() and reading the table with tbl(), I get the table, but the character encoding is wrong. Reading the same table from a csv file works by adding encoding = "UTF-8" to read.csv, but in that case another error occurs in the first column name (please consider the minimal example below).
Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.
I tried to change the encoding in the RStudio options with no success. Changing the region in Windows or in R doesn't have any effect either.
Is there any solution for getting the characters in the table correctly into R using dplyr?
Minimal Example
library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
##
## KH_ID EinschÃ¤tzung Prozess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
You can see the wrong encoding in the name of the second column. This issue occurs as well in columns containing ä, ö, ü, etc.
In the csv version, the name of the second column is displayed correctly, but the first column is wrong:
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
## X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
I had the same problem. I solved it like below. However, I do not guarantee that the solution is rock solid. Give it a try:
library(dplyr)
library(sqldf)
# Modifying built-in mtcars dataset
mtcars$test <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
mtcars$češćžä <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
names(mtcars) <-
iconv(names(mtcars), "cp1250", "utf-8")
# Connecting to sqlite database
my_db <- src_sqlite("my_db.sqlite3", create = T)
# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)
# dbSendQuery(my_db$con, "drop table mtcars")
# getting data from sqlite database
my_mtcars_from_db <-
collect(tbl(my_db, "mtcars"))
# disconnecting from database
dbDisconnect(my_db$con)
convert_to_encoding() function
# a function that encodes
# column names and values in character columns
# with specified encodings
convert_to_encoding <-
function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){
# names of columns are encoded in specified encoding
my_names <-
iconv(names(x), from_encoding, to_encoding)
# if any column name is NA, leave the names
# otherwise replace them with new names
if(any(is.na(my_names))){
names(x)
} else {
names(x) <- my_names
}
# get column classes
x_char_columns <- sapply(x, class)
# identify character columns
x_cols <- names(x_char_columns[x_char_columns == "character"])
# convert all string values in character columns to
# specified encoding
x <-
x %>%
mutate_each_(funs(iconv(., from_encoding, to_encoding)),
x_cols)
# return x
return(x)
}
# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
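Note that mutate_each_() has since been deprecated; in current dplyr the same conversion can be written with across() (a sketch, same behaviour assumed):
x <- x %>%
mutate(across(where(is.character),
~ iconv(.x, from_encoding, to_encoding)))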
Results
# before conversion
my_mtcars_from_db
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb češćžä test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ÄŤ ÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Ĺľ Ĺľ
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ÄŤ ÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Ĺľ Ĺľ
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb test češćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 č č
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ž ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 č č
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ž ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
Session information
devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, mingw32
ui RStudio (0.99.441)
language (EN)
collate Slovenian_Slovenia.1250
tz Europe/Prague
Packages -----------------------------------------------------------------------
package * version date source
assertthat * 0.1 2013-12-06 CRAN (R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN (R 3.2.0)
dplyr 0.4.1 2015-01-14 CRAN (R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN (R 3.2.0)
lazyeval * 0.1.10 2015-01-02 CRAN (R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
R6 * 2.0.1 2014-10-29 CRAN (R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN (R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)