I have a table in SQLite that I'd like to open with dplyr. I use SQLite Expert Version 35.58.2478 and RStudio Version 0.98.1062 on a PC with Windows 7.
After connecting to the database with src_sqlite() and reading with tbl(), I get the table, but the character encoding is wrong. Reading the same table from a CSV file works when I add encoding = "utf-8" to read.csv(), but then a different error appears in the first column name (please see the minimal example below).
Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.
I tried to change the encoding in the RStudio options without success. Changing the region in Windows or in R doesn't have any effect either.
Is there any way to get the characters in the table correctly into R using dplyr?
Minimal Example
library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
##
## KH_ID EinschÃ¤tzung Prozess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
You can see the wrong encoding in the name of the second column. The same issue occurs in columns containing ä, ö, ü, etc.
With read.csv, the name of the second column is displayed correctly, but the first column name is wrong:
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
## X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
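As an aside, the X.U.FEFF. prefix on the first column name is the UTF-8 byte-order mark leaking into the header. If reading from CSV is an option, R's "UTF-8-BOM" file encoding strips the mark while reading; a small sketch (untested against this particular file):
# "UTF-8-BOM" tells R to drop the byte-order mark from the first field
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv",
         fileEncoding = "UTF-8-BOM")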
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
I had the same problem and solved it as shown below. However, I can't guarantee that the solution is rock solid; give it a try:
library(dplyr)
library(sqldf)

# modify the built-in mtcars dataset: add two character columns,
# one of them with a non-ASCII name, and mark the values as UTF-8
mtcars$test <-
  c("č", "ž", "š", "č", "ž", "š", letters) %>%
  enc2utf8(.)
mtcars$češćžä <-
  c("č", "ž", "š", "č", "ž", "š", letters) %>%
  enc2utf8(.)
# convert the column names from the native cp1250 to UTF-8
names(mtcars) <-
  iconv(names(mtcars), "cp1250", "utf-8")

# connect to the SQLite database
my_db <- src_sqlite("my_db.sqlite3", create = TRUE)

# export the mtcars dataset to the database
copy_to(my_db, mtcars, temporary = FALSE)
# dbSendQuery(my_db$con, "drop table mtcars")

# read the data back from the SQLite database
my_mtcars_from_db <-
  collect(tbl(my_db, "mtcars"))

# disconnect from the database
dbDisconnect(my_db$con)
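To see what the round trip produced before fixing anything, one can inspect the declared encodings; a quick diagnostic, not part of the original solution:
# check how R has marked the strings that came back from SQLite
Encoding(names(my_mtcars_from_db))   # encoding declared for the column names
Encoding(my_mtcars_from_db$test)     # encoding declared for the character values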
convert_to_encoding() function
# a function that converts column names and values in character
# columns from one encoding to another
convert_to_encoding <-
  function(x, from_encoding = "UTF-8", to_encoding = "cp1250") {
    # convert the column names to the target encoding
    my_names <- iconv(names(x), from_encoding, to_encoding)
    # if any converted name is NA, leave the names untouched;
    # otherwise replace them with the converted names
    if (!any(is.na(my_names))) {
      names(x) <- my_names
    }
    # get column classes and identify the character columns
    x_char_columns <- sapply(x, class)
    x_cols <- names(x_char_columns[x_char_columns == "character"])
    # convert all string values in character columns to the
    # target encoding
    x <-
      x %>%
      mutate_each_(funs(iconv(., from_encoding, to_encoding)),
                   x_cols)
    return(x)
  }
# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
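A lighter-weight alternative may also work, though I have not verified it against this database: if the bytes coming back are already valid UTF-8 and merely mislabeled, it can be enough to re-declare the encoding instead of converting anything. A sketch:
# re-mark the strings as UTF-8 without converting bytes (assumption:
# the values are valid UTF-8 that R has marked as "unknown")
fix_marking <- function(x) {
  Encoding(names(x)) <- "UTF-8"
  is_chr <- vapply(x, is.character, logical(1))
  x[is_chr] <- lapply(x[is_chr], function(col) {
    Encoding(col) <- "UTF-8"
    col
  })
  x
}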
Results
# before conversion
my_mtcars_from_db
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb češćžä test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ÄŤ ÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Ĺľ Ĺľ
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ÄŤ ÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Ĺľ Ĺľ
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb test češćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 č č
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ž ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 č č
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ž ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
Session information
devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, mingw32
ui RStudio (0.99.441)
language (EN)
collate Slovenian_Slovenia.1250
tz Europe/Prague
Packages -----------------------------------------------------------------------
package * version date source
assertthat * 0.1 2013-12-06 CRAN (R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN (R 3.2.0)
dplyr 0.4.1 2015-01-14 CRAN (R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN (R 3.2.0)
lazyeval * 0.1.10 2015-01-02 CRAN (R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
R6 * 2.0.1 2014-10-29 CRAN (R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN (R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)
Related
I have 49 .db files.
I want to open them in R and store their contents in data frames for further use.
I am able to do it for one file, but I want to modify the code to handle all 49 .db files in one go.
This is the code I am using for one file:
sqlite <- dbDriver("SQLite")
dbname <- "en_Whole_Blood.db"
db = dbConnect(sqlite,dbname)
wholeblood_df <- dbGetQuery(db,"SELECT * FROM weights")
View(wholeblood_df)
I tried to use the list.files() function to do it for all the .db files, but it's not working; it only creates a data frame for the last file.
This is the code for it:
library("RSQLite")
sqlite <- dbDriver("SQLite")
sqlite <- dbDriver("SQLite")
dbname <- data_files
dbname
for (i in length(dbname){
db=dbConnect(sqlite,dbname[i])
df <- dbGetQuery(db,"SELECT * FROM weights")
}
## This only gives me the last .db file as a data frame.
Does anyone know how I can edit this code to get a data frame for each of the 49 .db files?
Thank you.
Try replacing the for loop with lapply:
list_of_df <- lapply(dbname, function(x) {
db <- dbConnect(sqlite, x)
df <- dbGetQuery(db, "SELECT * FROM weights")
})
I'm not experienced in handling SQL and/or connections, but I think it might work.
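If it helps to keep track of which file each element came from, the list can be named afterwards (assuming dbname holds the file paths):
names(list_of_df) <- dbname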
Edit
Second alternative maintaining the for loop:
df <- list()
for (i in 1:length(dbname)) {
  db <- dbConnect(sqlite, dbname[i])
  # use [[i]] so each data frame becomes one list element;
  # c(df, ...) would splice the data frame's columns into the list
  df[[i]] <- dbGetQuery(db, "SELECT * FROM weights")
}
Hope it helps
Another suggestion:
files <- list.files(pattern = "\\.db$")
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from weights")
})
oneframe <- do.call(rbind, list_of_frames)
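A small extension, not part of the suggestion above: naming the list before binding records which file each row came from.
# .id adds a provenance column built from the list names
names(list_of_frames) <- files
oneframe <- dplyr::bind_rows(list_of_frames, .id = "source_file")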
Reproducible example
Create data (you don't need this):
for (i in 1:3) {
db <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars%i.db", i))
  DBI::dbWriteTable(db, "weights", mtcars[i * 5 + 1:3, ], overwrite = TRUE)
DBI::dbDisconnect(db)
}
Working solution:
files <- list.files(pattern = "\\.db$")
files
# [1] "mtcars1.db" "mtcars2.db" "mtcars3.db"
list_of_frames <- lapply(files, function(fn) {
db <- dbConnect(RSQLite::SQLite(), fn)
on.exit(dbDisconnect(db))
dbGetQuery(db, "select * from mt")
})
list_of_frames
# [[1]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# [[2]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
# [[3]]
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 2 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 3 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
oneframe <- do.call(rbind, list_of_frames)
oneframe
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 3 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 5 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 6 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 9 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Tidyverse alternative:
library(dplyr) # just for %>%, could use magrittr as well
library(purrr) # map_dfr
oneframe <- files %>%
map_dfr(~ {
db <- DBI::dbConnect(RSQLite::SQLite(), .)
on.exit(DBI::dbDisconnect(db))
DBI::dbGetQuery(db, "select * from weights")
})
### same result
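purrr offers the same provenance trick as above: name the inputs and pass .id (a sketch, reusing files from above):
oneframe <- files %>%
  purrr::set_names() %>%
  map_dfr(~ {
    db <- DBI::dbConnect(RSQLite::SQLite(), .)
    on.exit(DBI::dbDisconnect(db))
    DBI::dbGetQuery(db, "select * from weights")
  }, .id = "source_file")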
Here is the original question:
With only being able to import numpy and pandas, I need to do the following: scale medianIncome to express the values in units of $10,000 (example: 150000 becomes 15, 30000 becomes 3, 15000 becomes 1.5, etc.).
Here's the code that runs:
temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float) / 10000
But when I display the df afterwards, it still shows the original amounts instead of the 15 or 1.5. I'm not sure what I'm missing here.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But when I then display the data frame with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
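For what it's worth, the scaling expression returns a new Series rather than modifying the frame in place, so its result is computed and then discarded. A minimal sketch of the likely fix, reusing the question's own names (hypothetical, since the full script isn't shown):
# assign the scaled values back to the column to persist the change
temp_housing['medianIncome'] = (
    temp_housing['medianIncome']
    .replace('[(]', '-', regex=True)
    .astype(float)
    / 10000
)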
I have a large SQLite database with many tables. I have established a connection to this database in RStudio using the RSQLite and DBI packages (I have named this connection db).
library(RSQLite)
library(DBI)
At the moment I read in each table and assign it a name manually. For example:
country <- dbReadTable(db, "country")
date <- dbReadTable(db, "date")
#...and so on
You can see this becomes very time-consuming with many tables.
So I was wondering: is it possible to write a new function, or use existing functions (e.g. lapply()), to do this more efficiently and speed up the process?
Any suggestions are much appreciated :)
Two mindsets:
All tables/data into one named-list:
alldat <- lapply(setNames(nm = dbListTables(db)), dbReadTable, conn = db)
The benefit of this is that if the tables have similar meaning, then you can use lapply to apply the same function to each. Another benefit is that all data from one database are stored together.
See How do I make a list of data frames? for working on a list-of-frames.
If you want them as actual variables in the global (or enclosing) environment, then take the previous alldat, and
ign <- list2env(alldat, envir = .GlobalEnv)
The return value from list2env is the environment we passed in, so it's not incredibly useful in this context (though it is useful at other times). The only reason I capture it in ign is to reduce clutter on the console, which is minor; list2env works primarily by side effect, so the return value here is not critical.
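For example, assuming the database held the country and date tables from the question, they are now ordinary objects in the workspace:
exists("country")   # TRUE, assuming a "country" table was in the database
head(country)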
You can use dbListTables() to generate a character vector of all the table names in your SQLite database and then use lapply() to import them into R efficiently. I would first check that you are able to fit all the tables in your database into memory.
Below is a reproducible example:
library(RSQLite)
library(DBI)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "mtcars", mtcars)
dbWriteTable(db, "iris", iris)
db_tbls <- dbListTables(db)
tbl_list <- lapply(db_tbls, dbReadTable, conn = db)
tbl_list <- setNames(tbl_list, db_tbls)
dbDisconnect(db)
> lapply(tbl_list, head)
$iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
$mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I am trying to learn data science and found this \n in my output. Can someone help me sort it out? My output and code are below:
Year\n Team\n GP\n GS\n MPG\n FG%\n 3P%\n FT%\n RPG\n APG\n SPG\n BPG\n PPG\n
0 1984–85\n Chicago\n 82 82 38.3 .515 .173 .845 6.5 5.9 2.4 .8 28.2\n
1 1985–86\n Chicago\n 18 7 25.1 .457 .167 .840 3.6 2.9 2.1 1.2 22.7\n
2 1986–87\n Chicago\n 82 82 40.0 .482 .182 .857 5.2 4.6 2.9 1.5 37.1*\n
3 1987–88\n Chicago\n 82 82 40.4* .535 .132 .841 5.5 5.9 3.2* 1.6 35.0*\n
4 1988–89\n Chicago\n 81 81 40.2* .538 .276 .850 8.0 8.0 2.9 .8 32.5*\n
5 1989–90\n Chicago\n 82 82 39.0 .526 .376 .848 6.9 6.3 2.8* .7 33.6*\n
6 1990–91†\n Chicago\n 82 82 37.0 .539 .312 .851 6.0 5.5 2.7 1.0 31.5*\n
7 1991–92†\n Chicago\n 80 80 38.8 .519 .270 .832 6.4 6.1 2.3 .9 30.1*\n
8 1992–93†\n Chicago\n 78 78 39.3 .495 .352 .837 6.7 5.5 2.8* .8 32.6*\n
9 1994–95\n Chicago\n 17 17 39.3 .411 .500 .801 6.9 5.3 1.8 .8 26.9\n
10 1995–96†\n Chicago\n 82 82 37.7 .495 .427 .834 6.6 4.3 2.2 .5 30.4*\n
11 1996–97†\n Chicago\n 82 82 37.9 .486 .374 .833 5.9 4.3 1.7 .5 29.6*\n
12 1997–98†\n Chicago\n 82 82 38.8 .465 .238 .784 5.8 3.5 1.7 .5 28.7*\n
13 2001–02\n Washington\n 60 53 34.9 .416 .189 .790 5.7 5.2 1.4 .4 22.9\n
14 2002–03\n Washington\n 82 67 37.0 .445 .291 .821 6.1 3.8 1.5 .5 20.0\n
15 Career\n 1,072 1,039 38.3 .497 .327 .835 6.2 5.3 2.3 .8 30.1\n None
16 All-Star\n 13 13 29.4 .472 .273 .750 4.7 4.2 2.8 .5 20.2\n None
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get(links)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', class_='wikitable sortable')
all_raws = table.find_all('tr')

data = []
for raw in all_raws:
    raw_list = raw.find_all('td')
    # printing raw_list after the loop shows only the last row of the table:
    # [<td colspan="2" style="text-align:center;">All-Star </td>, <td>13</td>,
    #  <td>13</td>, <td>29.4</td>, <td>.472</td>, <td>.273</td>, <td>.750</td>,
    #  <td>4.7</td>, <td>4.2</td>, <td>2.8</td>, <td>.5</td>, <td>20.2</td>]
    dataRaw = []
    for cell in raw_list:
        dataRaw.append(cell.text)
    # likewise, dataRaw after the loop holds only the last row:
    # ['All-Star\n', '13', '13', '29.4', '.472', '.273', '.750',
    #  '4.7', '4.2', '2.8', '.5', '20.2\n']
    data.append(dataRaw)

data = data[1:]

header_list = []
col_header = table.find_all('th')
for col in col_header:
    header_list.append(col.text)

df = pd.DataFrame(data)
df.columns = header_list
df
This should work:
df['<column name>'] = df['<column name>'].str.replace('\n', '')
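Since the scraped th headers carry the same trailing \n, the column labels can be cleaned the same way; a sketch using the data frame from the question:
# strip newlines and surrounding whitespace from every column label
df.columns = df.columns.str.replace('\n', '').str.strip()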
I am new to SparkR and am trying to split a SparkR DataFrame into a list of DataFrames based on columns.
The data has a billion records of sls_d (date), mdse_item_i (item ID), co_loc_i (location ID), and traffic_ti_8_00, traffic_ti_9_00, traffic_ti_10_00, traffic_ti_11_00 (each holds the traffic count for that hour).
Data Snapshot:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
1 2016-10-21 1592 4694620 1 113 156 209
2 2016-10-21 1273 4694620 1 64 152 249
3 2016-10-21 1273 15281024 1 64 152 249
4 2016-10-21 1498 4694620 2 54 124 184
5 2016-10-21 1498 15281024 2 54 124 184
Desired Output:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
2016-10-21 4 4694620 3 67 145 283
A list of Dataframes.
d.2 = split(data.2.2,list(data.2.2$mdse_item_i,data.2.2$co_loc_i,data.2.2$sls_d))
Error in x[ind[[k]]] : Expressions other than filtering predicates
are not supported in the first parameter of extract operator [ or
subset() method.
Is there any way to do this in SparkR apart from converting the SparkDataFrame to base R?
Converting the SparkDataFrame to base R results in a memory error and defeats the purpose of parallel processing.
Any help is greatly appreciated.
Your question is somewhat unclear; if you mean to split the columns of a Spark dataframe, you should use select. Here is an example using the iris data in SparkR 2.2:
df <- as.DataFrame(iris) # Spark dataframe
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
# separate the length-related & width-related columns into 2 Spark dataframes:
df_length = select(df, 'Sepal_Length', 'Petal_Length')
df_width = select(df, 'Sepal_Width', 'Petal_Width')
head(collect(df_width)) # for demonstration purposes only
# Sepal_Width Petal_Width
# 1 3.5 0.2
# 2 3.0 0.2
# 3 3.2 0.2
# 4 3.1 0.2
# 5 3.6 0.2
# 6 3.9 0.4
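If the column groups are computed rather than typed out, select should also accept a character vector of names (treat this as an assumption; I have only used the explicit form above):
# sketch: build the name vectors programmatically
length_cols <- grep("Length", columns(df), value = TRUE)
width_cols  <- grep("Width",  columns(df), value = TRUE)
df_length <- select(df, length_cols)
df_width  <- select(df, width_cols)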
Now, you can put these 2 Spark dataframes into an R list, but I'm not sure how useful this will be, since most list operations that might make sense are not usable [EDIT after comment]:
my_list = c(df_length, df_width)
head(collect(my_list[[1]]))
# Sepal_Length Petal_Length
# 1 5.1 1.4
# 2 4.9 1.4
# 3 4.7 1.3
# 4 4.6 1.5
# 5 5.0 1.4
# 6 5.4 1.7