Importing JSON data from SQL DB to an R dataframe

I would like to know whether there is a way of importing JSON data from a MySQL DB to an R dataframe.
I have a table like this:
id  created_at  json
1   2020-07-01  {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2   2020-07-01  {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3   2020-07-01  {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
I would like to get the columns 'id' and 'json'.
I am using the RMySQL package to get the data from the DB into an R dataframe, but this gives me only the column 'id'; the column 'json' contains only NAs in each row.
Is there any way to import/load the data so that the json column is populated? And possibly to extract the "sensor" part of the JSON values?
The result would be a dataframe (df) like this:
id json
1 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
Or with the extracted value:
id sensor
1 "sensor":34834834
2 "sensor":12342439
3 "sensor":83287699
Thank you very much for any suggestions.

Using unnest_wider from tidyr
library(dplyr)
con <- DBI::dbConnect(RMySQL::MySQL(), 'db_name', user = 'user', password = 'pass', host = 'hostname')
t <- tbl(con, 'table_name')
t %>%
  as_tibble() %>%
  transmute(j = purrr::map(json, jsonlite::fromJSON)) %>%
  tidyr::unnest_wider(j)
DBI::dbDisconnect(con)
Result:
# A tibble: 3 x 6
  name         group `age (y)` `height (cm)` `wieght (kg)`   sensor
  <chr>        <chr>     <int>         <int>         <int>    <int>
1 Dent, Arthur Green        43           187            89 34834834
2 Doe, Jane    Blue         23           172            67 12342439
3 Curt, Travis Red          13           128            47 83287699
If you only want to retrieve data from the last 24 hours (as the OP requested), change the tbl(con, 'table_name') statement to:
t <- DBI::dbGetQuery(con, 'SELECT * FROM `table_name` WHERE DATE(`created_at`) > NOW() - INTERVAL 1 DAY')
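If you only need the id and sensor columns (the OP's second desired output), you can keep id by using mutate() instead of transmute() and then select the columns afterwards; a minimal sketch along the same lines as the pipeline above:
t %>%
  as_tibble() %>%
  mutate(j = purrr::map(json, jsonlite::fromJSON)) %>%
  tidyr::unnest_wider(j) %>%
  select(id, sensor)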

Converting your JSON response to a data frame should be straightforward but, because the structure of a JSON response is essentially arbitrary and you haven't given us details of how you obtain it or the exact details of its content, it's impossible to give you code that will work in your specific case. However, this is the basic process that works in one of my applications, starting with the POST call to the API that provides access to the database.
library(httr)
library(jsonlite)
# Query the API
response <- POST(<your code here>)
# Extract the content of the response. Amend the format and encoding if necessary.
content <- content(response, as="text", encoding="UTF-8")
# Convert the content to an R object
content <- fromJSON(content, flatten=FALSE)
# Coerce to data.frame
df <- as.data.frame(content)
You should, of course, incorporate error and status checking throughout the process.
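As a minimal sketch of that checking (continuing from the response object above), httr provides stop_for_status() and http_status():
# Abort with an informative error if the request failed
stop_for_status(response)
# Or inspect the status yourself
if (http_status(response)$category != "Success") {
  warning("API request failed: ", http_status(response)$message)
}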
Note: your data contains a spelling mistake. "wieght" should be "weight".

Related

How to set flag value based on data that use one-hot-encoding

I have a database consisting of three tables like this:
I want to make a machine learning model in R using that database, and the data I need is like this:
I can use one-hot encoding to convert the categorical variable from t_pengolahan (such as "Pengupasan", "Fermentasi", etc.) into attributes. But how do I set a flag (yes or no) on the data values based on the "result (using SQL query)" data above?
We can combine two answers to previous related questions, each of which provides half of the solution; those answers are found here and here:
library(dplyr) ## dplyr and tidyr loaded for wrangling
library(tidyr)
options(dplyr.width = Inf) ## we want to show all columns of result

yes_fun <- function(x) { ## helps with pivot_wider() below
  if (length(x) > 0) {
    return("yes")
  }
}

sql_result %>%
  separate_rows(pengolahan) %>% ## add rows for unique words in pengolahan
  pivot_wider(names_from = pengolahan, ## spread to yes/no indicators
              values_from = pengolahan,
              values_fill = list(pengolahan = "no"),
              values_fn = list(pengolahan = yes_fun))
Data
id_pangan <- 1:3
kategori <- c("Daging", "Buah", "Susu")
pengolahan <- c("Penggilingan, Perebusan", "Pengupasan",
                "Fermentasi, Sterilisasi")
batas <- c(100, 50, 200)
sql_result <- data.frame(id_pangan, kategori, pengolahan, batas)
# A tibble: 3 x 8
  id_pangan kategori batas Penggilingan Perebusan Pengupasan
      <int> <fct>    <dbl> <chr>        <chr>     <chr>
1         1 Daging     100 yes          yes       no
2         2 Buah        50 no           no        yes
3         3 Susu       200 no           no        no
  Fermentasi Sterilisasi
  <chr>      <chr>
1 no         no
2 no         no
3 yes        yes
This seems unclear to me. What do you mean by "how to set flag (yes or no) to the data value based on 'result (using SQL query)' data"? Do you want to convert one of the columns to a boolean value? If so, you need to specify the decision rule.
That might look like this:
SELECT (... other columns),
  CASE case_expression
    WHEN when_expression_1 THEN 'yes'
    WHEN when_expression_2 THEN 'no'
    ELSE ''
  END
To help others help you:
- which SQL variant do you use? (e.g. would a SQLite solution work for you?)
- provide an SQL script of your table creation, plus a script to "use one hot encoding to convert categorical variable from t_pengolahan (such as "Pengupasan, Fermentasi, etc") into attributes"

Read multiple files, create a data frame and add a new column containing the name of each file in R

I am new to the dplyr package and I have been trying to read multiple files in R and then create a data frame by binding all the rows, but including the name of each file as a new column. This new column is the corresponding date, which is not included in the data.
My list of files (for example):
01012019.aps
02012019.aps
I would like to have my final dataframe like this:
x y file date
1 4 01012019 01-01-2019
2 5 01012019 01-01-2019
3 6 02012019 02-01-2019
4 7 02012019 02-01-2019
I've been trying this:
path_aps<- "C:/Users/.../.../APS"
files_aps <- list.files(path_aps, pattern = "*.aps")
data_aps <- files_aps %>%
  map(~ read.table(file.path(path_aps, .), sep = "\t")) %>%
  map(~ mutate(filename = files_aps, .)) %>%
  reduce(gtools::smartbind)
But I am getting this error:
Error: Column filename must be length 288 (the number of rows) or one, not 61
I understand that the list of files in files_aps has 61 elements, as this is the number of files in my directory, and that 288 is the number of rows of each .aps file; however, I haven't been able to make the filename apply at the level of each individual .aps file. I've been reading multiple answers to similar questions but still I am not getting the expected result.
I've solved it with the help of this other answer and I've got this:
data_aps <- list.files(path_aps, pattern = "*.aps", full.names = TRUE) %>%
  map_df(function(x) read.table(x, sep = "\t") %>%
           mutate(filename = gsub(".aps", "", basename(x))))
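To also get the date column from the desired output, the ddmmyyyy filename can be parsed with as.Date(); a minimal sketch, assuming filenames like 01012019.aps as in the example:
library(dplyr)
# Parse the ddmmyyyy filename into a Date and format it as dd-mm-yyyy
data_aps <- data_aps %>%
  mutate(date = format(as.Date(filename, format = "%d%m%Y"), "%d-%m-%Y"))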

Rename certain values of a row in a certain column if they meet the criteria

How can I rename certain values in a column if they meet a certain condition?
Example:
Date Type C1
2000 none 3
2000 love 4
2000 none 6
2000 none 2
2000 bad 8
So I want to rename "love" and "bad" in my column type into "xxx".
Date Type C1
2000 none 3
2000 xxx 4
2000 none 6
2000 none 2
2000 xxx 8
Is there a neat way of doing it?
Thank you :)
First, make sure it's not a factor, then rename:
df$Type = as.character(df$Type)
df$Type[df$Type %in% c("love", "bad")] = "xxx"
If the data is a factor, you want to rename the factor level. The easiest way to do that is with fct_recode() in the forcats package. If it's a character vector, ifelse works well if the number of changes is small. If it's large, case_when in the dplyr package works well.
library(forcats)
library(dplyr)
df <- within(df, { # if you use `dplyr`, you can replace this with mutate. You'd also need to change `<-` to `=` and add `,` at the end of each line.
  Type_fct1 <- fct_recode(Type, xxx = "love", xxx = "bad")
  # in base R, you can change the factor labels, but it's clunky
  Type_fct2 <- Type
  levels(Type_fct2)[levels(Type_fct2) %in% c("love", "bad")] <- "xxx"
  # methods using character vectors
  Type_chr1 <- ifelse(as.character(Type) %in% c("love", "bad"), "xxx", as.character(Type))
  Type_chr2 <- case_when(
    Type %in% c("love", "bad") ~ "xxx",
    Type == "none" ~ "something_else", # thrown in to show how to use `case_when` with many different criteria.
    TRUE ~ NA_character_
  )
})
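For the OP's exact desired output, a minimal dplyr sketch (assuming the data frame is called df, as above) could be:
library(dplyr)
df <- df %>%
  mutate(Type = case_when(
    Type %in% c("love", "bad") ~ "xxx",
    TRUE ~ as.character(Type) # keep all other values, e.g. "none", unchanged
  ))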

Reformat wide Excel table to more SQL-friendly structure

I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet that looks like this:
But I want to convert it to look like this:
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024 column limit)?
I have had to do this quite a bit. My solution was to write a Python script that would un-crosstab a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but don't know why it wouldn't work.
R has a function for exactly this (gather from the tidyr package, used below). You can also connect to, read from, and write to a database with R. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
id question_1 question_2
1 1 1 2
2 2 0 0
3 3 1 2
Code to transpose the data (the id column is excluded from the gathering so it is kept as an identifier):
library(tidyr)
df2 <- gather(df, key = question, value = values, -id)
df2
Output:
  id   question values
1  1 question_1      1
2  2 question_1      0
3  3 question_1      1
4  1 question_2      2
5  2 question_2      0
6  3 question_2      2
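Note that gather() is superseded in current tidyr; an equivalent call with pivot_longer() might look like this (a sketch, using the same df as above):
# Gather every column except id into question/values pairs
df2 <- pivot_longer(df, cols = -id, names_to = "question", values_to = "values")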
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv

with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)  # capture current field headers
    newHeaders = ['ResponseID', 'Question', 'Response']  # establish new header names
    with open('PythonOut.csv', 'w') as outputfile:
        writer = csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders)  # write new headers to output
        QuestionHeaders = headers[1:len(headers)]  # slice the question headers from original header list
        for row in reader:
            questionCount = 0  # counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1

Read only n-th column of a text file which has no header with R and sqldf

I have a similar problem to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, with only a 32-bit system available). These files have no header, therefore the code in the thread above didn't work, and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df <- sqldf("select column 1 100 1000 235612 from MyConnection",
            file.format = list(header = FALSE, sep = " "))
You can use substr to specify the start position and length of the columns you want to read in if they are fixed width:
library(sqldf)

x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
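As a sketch of that paste() construction (the starting positions and widths here are just the ones from the example above; substitute your own):
starts <- c(1, 3) # known starting positions of the wanted columns
widths <- c(1, 3) # known widths of the wanted columns
sel <- paste0("substr(V1, ", starts, ", ", widths, ") var", seq_along(starts),
              collapse = ", ")
sel
# [1] "substr(V1, 1, 1) var1, substr(V1, 3, 3) var2"
sqldf(paste("select", sel, "from myfile"))
Since the question also allows fread, note that data.table::fread() has a select argument that accepts column indices, which may be worth trying (memory permitting on a 32-bit system):
library(data.table)
# Read only the listed columns; with no header they are named V1, V2, ...
dt <- fread("path/file.txt", sep = " ", header = FALSE,
            select = c(1, 100, 1000, 235612))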