How to set flag value based on data that use one-hot-encoding - sql

I have a database consisting of three tables like this:
I want to make a machine learning model in R using that database, and the data I need is like this:
I can use one-hot encoding to convert the categorical values from t_pengolahan (such as "Pengupasan", "Fermentasi", etc.) into attributes. But how do I set a flag (yes or no) for each value based on the "result (using SQL query)" data above?

We can combine two answers to previous related questions, each of which provides half of the solution; those answers are found here and here:
library(dplyr) ## dplyr and tidyr loaded for wrangling
library(tidyr)
options(dplyr.width = Inf) ## we want to show all columns of result
yes_fun <- function(x) { ## helps with pivot_wider() below
  if ( length(x) > 0 ) {
    return("yes")
  }
}
sql_result %>%
  separate_rows(pengolahan) %>% ## add rows for unique words in pengolahan
  pivot_wider(names_from = pengolahan, ## spread to yes/no indicators
              values_from = pengolahan,
              values_fill = list(pengolahan = "no"),
              values_fn = list(pengolahan = yes_fun))
Data
id_pangan <- 1:3
kategori <- c("Daging", "Buah", "Susu")
pengolahan <- c("Penggilingan, Perebusan", "Pengupasan",
"Fermentasi, Sterilisasi")
batas <- c(100, 50, 200)
sql_result <- data.frame(id_pangan, kategori, pengolahan, batas)
# A tibble: 3 x 8
id_pangan kategori batas Penggilingan Perebusan Pengupasan
<int> <fct> <dbl> <chr> <chr> <chr>
1 1 Daging 100 yes yes no
2 2 Buah 50 no no yes
3 3 Susu 200 no no no
Fermentasi Sterilisasi
<chr> <chr>
1 no no
2 no no
3 yes yes
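For readers who prefer not to depend on tidyr, the same yes/no expansion can be sketched in base R. This is only an illustration on the toy data from the "Data" block; the names parts, processes, flags and one_hot are made up for this sketch, and it assumes the process names are separated by commas.

```r
# Same toy data as in the "Data" block
sql_result <- data.frame(
  id_pangan  = 1:3,
  kategori   = c("Daging", "Buah", "Susu"),
  pengolahan = c("Penggilingan, Perebusan", "Pengupasan",
                 "Fermentasi, Sterilisasi"),
  batas      = c(100, 50, 200),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into its individual process names
parts <- strsplit(sql_result$pengolahan, ",\\s*")

# One indicator column per distinct process, in order of first appearance
processes <- unique(unlist(parts))
flags <- sapply(processes, function(p) {
  ifelse(vapply(parts, function(x) p %in% x, logical(1)), "yes", "no")
})

# Attach the indicator columns to the remaining attributes
one_hot <- cbind(sql_result[c("id_pangan", "kategori", "batas")],
                 flags, stringsAsFactors = FALSE)
one_hot
```

Because the flags are built from a membership test per row, a process that never occurs for a row is filled with "no" without any extra fill step.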

This seems unclear to me. What do you mean by "how to set flag (yes or no) to the data value based on "result (using SQL query)" data"? Do you want to convert one of the columns to a boolean value? If so, you need to specify the decision rule.
It may look like this:
SELECT (... other columns),
CASE case_expression
WHEN when_expression_1 THEN 'yes'
WHEN when_expression_2 THEN 'no'
ELSE ''
END
To help others help you:
- which SQL variant do you use? (e.g. would a sqlite solution work for you?)
- provide an sql script of your table creation, plus a script to "use one hot encoding to convert categorical variable from t_pengolahan (such as "Pengupasan, Fermentasi, etc") into attributes"

Related

Importing data using R from SQL Server truncate leading zeros

I'm trying to import data from a table in SQL Server and then write it into a .txt file. I'm doing it in the following way. However, when I do that, all numbers with leading 0s seem to get trimmed.
For example, if I have 000124 in the database, it's shown as 124 in the .txt, and if I check x_1 it's 124 in there as well.
How can I avoid this? I want to keep the leading 0s in x_1 and also need them in the output .txt file.
library(RODBC)
library(lubridate)
library(data.table)
cn_1 <- odbcConnect('channel_name')
qry <- "
select
*
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE)
rm(qry)
setDT(x_1)
fwrite(x=x_1, file=paste0(export_location, "file_name", date_today, ".txt"), sep="|", quote=TRUE, row.names=FALSE, na="")
Assuming that the underlying data in the DBMS is indeed "string"-like ...
RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE, and when false and not a clear type like "date" or "timestamp", RODBC calls type.convert which will see the number-like field and convert it to integers or numbers.
Try:
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)
and that will stop auto-conversion of all columns.
That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:
as.is: controls conversion of character variables (insofar as they
are not converted to logical, numeric or complex) to factors,
if not otherwise specified by 'colClasses'. Its value is
either a vector of logicals (values are recycled if
necessary), or a vector of numeric or character indices which
specify which columns should not be converted to factors.
so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps
## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")
## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)
(Side note: while I use select * ... on occasion as well, the presumption of knowing columns by number is predicated on knowing all of the columns included in that table/query. If anything changes, perhaps because it's actually a SQL view and somebody updates it, or because somebody changes the order of columns, then your assumptions about column indices are a little fragile. All of my "production" queries in my internal packages have all columns spelled out, with no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)
If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like
## SQL Server uses TOP rather than LIMIT to restrict the row count
qry10 <- "
select top 10
*
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))
Caveat: I don't use RODBC, nor have I set up a temporary database with appropriately-fashioned values, so this is untested.
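The detection step can be tried without a database. The sketch below uses a made-up data frame, preview, standing in for the 10-row result fetched with as.is = TRUE (so every column is character); the column names are hypothetical.

```r
# Hypothetical stand-in for the 10-row preview fetched with as.is = TRUE:
# every column arrives as character
preview <- data.frame(
  code  = c("000124", "001300", "000007"),
  qty   = c("12", "7", "105"),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# TRUE for columns whose values all start with one or more zeros
# followed by a non-zero digit
leadzero <- sapply(preview, function(z) all(grepl("^0+[1-9]", z)))
leadzero

# Indices (with names) of the columns that should be read as-is
which(leadzero)
```

Note that all() only flags a column when every previewed value has a leading zero; a column that mixes "000124" with "124000" would be missed, in which case any() may be the better test.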
Let x_1 be the result data.table from your SQL query. Then you can convert numeric columns (e.g. value) to formatted strings using sprintf to get leading zeros:
library(data.table)
x_1 <- data.table(value = c(1,12,123,1234))
x_1
#> value
#> 1: 1
#> 2: 12
#> 3: 123
#> 4: 1234
x_1$value <- x_1$value |> sprintf(fmt = "%04d")
x_1
#> value
#> 1: 0001
#> 2: 0012
#> 3: 0123
#> 4: 1234
Created on 2021-10-08 by the reprex package (v2.0.1)
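If the original width is known, the padded strings also survive the trip to a text file. The sketch below assumes a fixed width of 6 (matching the 000124 example in the question) and uses base write.table rather than fwrite so that it runs without extra packages; the values are made up.

```r
# Hypothetical values that lost their zeros on import; the width of 6 is an assumption
value <- c(124L, 7L, 123456L)

# Pad on the left with zeros up to six characters
padded <- sprintf("%06d", value)
padded

# Writing the character vector keeps the zeros in the text file
out <- tempfile(fileext = ".txt")
write.table(data.frame(value = padded), out,
            sep = "|", quote = TRUE, row.names = FALSE)
readLines(out)
```

This only works when every value is supposed to have the same width; if the widths vary, the zeros cannot be reconstructed after the fact and the as.is approach is the safer route.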

How to split one SQL column into multiple columns in r

I create a tbl_SQLiteConnection object by dbplyr::memdb_frame() in R.
library(dbplyr)
library(dplyr)
tb=memdb_frame(data.frame(ch=c('1a1','2a3'),cq=c(1,2)))
tb
# Source: table<dbplyr_002> [?? x 2]
# Database: sqlite 3.33.0 [:memory:]
ch cq
<chr> <dbl>
1 1a1 1
2 2a3 2
I want to split column ch by a into two columns: ch1 and ch2 like follows:
# Source: table<dbplyr_004> [?? x 3]
# Database: sqlite 3.33.0 [:memory:]
ch1 ch2 cq
<dbl> <dbl> <dbl>
1 1 1 1
2 2 3 2
I try something like this
sq <- paste0('SELECT split_part(ch,"a",1)
FROM (',sql_render(tb),')')
tbl(tb$src$con,sql(sq))
It does not work!!!
Could you please help me how to conduct this in R ?
Does this work:
library(dplyr)
library(tidyr)
tb %>% separate(col = ch, into = c('ch1','ch2'), sep = 'a')
ch1 ch2 cq
1 1 1 1
2 2 3 2
If the question is how to separate the indicated column in a data frame into two using an SQLite select statement in R then the following select statement uses SQLite's instr and substr functions.
library(sqldf)
tb <- data.frame(ch = c('1a1', '2a3'), cq = c(1, 2))
sqldf("
select
substr(ch, 1, instr(ch, 'a') - 1) as ch1,
substr(ch, instr(ch, 'a') + 1) as ch2,
cq
from tb")
giving:
ch1 ch2 cq
1 1 1 1
2 2 3 2
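The instr/substr logic carries over to base R almost one to one, which may help when checking the SQL result locally. This is a sketch on an ordinary data frame, not on the database table; pos and res are names chosen for illustration.

```r
tb <- data.frame(ch = c("1a1", "2a3"), cq = c(1, 2), stringsAsFactors = FALSE)

# Position of the first 'a' in each string (the role instr() plays in the SQL)
pos <- as.integer(regexpr("a", tb$ch, fixed = TRUE))

# Everything before the split character, and everything after it
res <- data.frame(
  ch1 = as.numeric(substr(tb$ch, 1, pos - 1)),
  ch2 = as.numeric(substr(tb$ch, pos + 1, nchar(tb$ch))),
  cq  = tb$cq
)
res
```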
Building on my answer to this question, here are two approaches. Both assume you do not know where in the string the splitting character (in this case 'a') is, and so this is the first step. You can simplify the method if your splitting character is always in the same position (e.g. it is always the second character in the string, as per the example above).
SQL based
We can write an SQL query and pass it to the table definition, following the approach in the question:
query = paste0("
  SELECT ch, cq, split_index
    ,SUBSTRING(ch, 1, split_index - 1) AS ch1
    ,SUBSTRING(ch, split_index + 1, LEN(ch)) AS ch2
  FROM (
    SELECT ch, cq, CHARINDEX('a', ch) AS split_index
    FROM (", sql_render(input_tbl), ") AS src
  ) AS subquery")
tbl(input_tbl$src$con, sql(query))
Note that different flavors of SQL may have different names for the SUBSTRING, CHARINDEX, and LEN functions. The functions may also take different arguments (e.g. length of the substring or index of the end of the substring). You will need to ensure you have the right functions for your flavor of SQL.
dbplyr based
dbplyr can translate some standard R and dplyr commands into SQL. This lets you write R and have it automatically translated to SQL.
library(dplyr)
library(dbplyr)
library(stringr)
input_tbl %>%
  mutate(split_index = str_locate(ch, 'a')) %>%
  mutate(ch1 = substr(ch, 1, split_index - 1),
         ch2 = substr(ch, split_index + 1, nchar(ch)))
Note: str_locate finds the index of the splitting character. In my testing, not every flavor of SQL had a translation defined for this, so this approach may fail depending on your database.
However, you can use the equivalent SQL function directly. Because dbplyr does not have a translation defined for it, the function is passed untranslated to the server.
For example, in SQL Server CHARINDEX serves the same purpose as str_locate. Hence you could use CHARINDEX('a', ch) instead of str_locate(ch, 'a').

Importing JSON data from SQL DB to an R dataframe

I would like to know whether there is a way of importing JSON data from a MySQL DB to an R dataframe.
I have a table like this:
id created_at json
1 2020-07-01 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 2020-07-01 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 2020-07-01 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
I would like to get the columns 'id' and 'json'.
I am using RMySQL package for getting the data from the db to an R dataframe but this gives me only the column 'id', the column 'json' contains only NAs in each row.
Is there any way how to import/load the data and get the json column displayed? And possibly to extract the "sensor" part of the json values?
The result would be a dataframe (df) like this:
id json
1 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
Or with the extracted value:
id sensor
1 "sensor":34834834
2 "sensor":12342439
3 "sensor":83287699
Thank you very much for any suggestions.
Using unnest_wider from tidyr
library(dplyr)
con <- DBI::dbConnect(RMySQL::MySQL(), 'db_name', user = 'user', password = 'pass', host = 'hostname')
t <- tbl(con, 'table_name')
t %>%
as_tibble() %>%
transmute(j = purrr::map(json, jsonlite::fromJSON)) %>%
tidyr::unnest_wider(j)
DBI::dbDisconnect(con)
Result:
# A tibble: 3 x 6
name group `age (y)` `height (cm)` `wieght (kg)` sensor
<chr> <chr> <int> <int> <int> <int>
1 Dent, Arthur Green 43 187 89 34834834
2 Doe, Jane Blue 23 172 67 12342439
3 Curt, Travis Red 13 128 47 83287699
If you want to only retrieve data from the last 24 hours (as the OP requested) change the tbl(con, 'table_name') statement to:
t <- DBI::dbGetQuery(con, 'SELECT * FROM `table_name` WHERE DATE(`created_at`) > NOW() - INTERVAL 1 DAY')
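If only the sensor value is needed, a regular expression can pull it out of the raw string without parsing the whole document. The sketch below is a shortcut that assumes the json column has arrived as character and that "sensor" is a top-level key holding an integer, as in the sample rows; for anything nested, parsing with jsonlite::fromJSON as shown above is the safer choice. The data frame is a shortened, made-up copy of the sample.

```r
# Hypothetical stand-in for the query result, with json held as character
df <- data.frame(
  id = 1:2,
  json = c('{"name":"Dent, Arthur","group":"Green","sensor":34834834}',
           '{"name":"Doe, Jane","group":"Blue","sensor":12342439}'),
  stringsAsFactors = FALSE
)

# Capture the digits that follow the "sensor" key
m <- regmatches(df$json, regexec('"sensor"\\s*:\\s*([0-9]+)', df$json))

# Element 2 of each match is the captured group; NA when the key is absent
df$sensor <- vapply(m, function(x) {
  if (length(x) == 2) as.numeric(x[2]) else NA_real_
}, numeric(1))

df[c("id", "sensor")]
```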
Converting your JSON response to a data frame should be straightforward but, because the structure of a JSON response is essentially arbitrary and you haven't given us details of how you obtain it or of its exact content, it's impossible to give you code that will work in your specific case. However, this is the basic process that works in one of my applications, starting with the POST call to the API that provides access to the database.
library(httr)
library(jsonlite)
# Query the API
response <- POST(<your code here>)
# Extract the content of the response. Amend the format and encoding if necessary.
content <- content(response, as="text", encoding="UTF-8")
# Convert the content to an R object
content <- fromJSON(content, flatten=FALSE)
# Coerce to data.frame
df <- as.data.frame(content)
You should, of course, incorporate error and status checking throughout the process.
Note: your data contains a spelling mistake. "wieght" should be "weight".

Rename certain values of a row in a certain column if they meet the criteria

How can I rename certain values of a row from a column if they meet a certain if-statement?
Example:
Date Type C1
2000 none 3
2000 love 4
2000 none 6
2000 none 2
2000 bad 8
So I want to rename "love" and "bad" in my column type into "xxx".
Date Type C1
2000 none 3
2000 xxx 4
2000 none 6
2000 none 2
2000 xxx 8
Is there a neat way of doing it?
Thank you :)
First, make sure it's not a factor, then rename:
df$Type = as.character(df$Type)
df$Type[df$Type %in% c("love", "bad")] = "xxx"
If the data is a factor, you want to rename the factor level. The easiest way to do that is with fct_recode() in the forcats package. If it's a character vector, ifelse works well if the number of changes is small. If it's large, case_when in the dplyr package works well.
library(forcats)
library(dplyr)
df <- within(df, { # with `dplyr` you can replace this with mutate(); you would also need to change `<-` to `=` and add `,` at the end of each line.
  Type_fct1 <- fct_recode(Type, xxx = "love", xxx = "bad")
  # in base R you can change the factor levels, but it's clunky
  Type_fct2 <- Type
  levels(Type_fct2)[levels(Type_fct2) %in% c("love", "bad")] <- "xxx"
  # methods using character vectors
  Type_chr1 <- ifelse(as.character(Type) %in% c("love", "bad"), "xxx", as.character(Type))
  Type_chr2 <- case_when(
    Type %in% c("love", "bad") ~ "xxx",
    Type == "none" ~ "something_else", # thrown in to show how to use `case_when` with several criteria
    TRUE ~ NA_character_
  )
})
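To try the two base R routes on the data from the question, the following self-contained sketch rebuilds the example table (with Type stored as a factor, which is an assumption about how the data was read in) and applies both. The names type_chr and type_fct are illustrative only.

```r
# The example table from the question, with Type as a factor
df <- data.frame(
  Date = rep(2000, 5),
  Type = factor(c("none", "love", "none", "none", "bad")),
  C1   = c(3, 4, 6, 2, 8)
)

# Character route: convert, then overwrite the matching values
type_chr <- as.character(df$Type)
type_chr[type_chr %in% c("love", "bad")] <- "xxx"

# Factor route: rename the levels; levels that end up with the same name are merged
type_fct <- df$Type
levels(type_fct)[levels(type_fct) %in% c("love", "bad")] <- "xxx"

type_chr
levels(type_fct)
```

Both routes give the same values; the factor route additionally leaves a factor with only two levels, which matters if the column is later used for grouping or modelling.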

Create 20 unique bingo cards

I'm trying to create 20 unique cards with numbers, but I'm struggling a bit. Basically, I need to create 20 unique 3x3 matrices with numbers 1-10 in the first column, 11-20 in the second column and 21-30 in the third column. Any ideas? I'd prefer to do it in R, especially as I don't know Visual Basic. In Excel I know how to generate the cards, but I'm not sure how to ensure they are unique.
It seems quite precise and straightforward to me. Anyway, I need to create 20 matrices that look like this:
[,1] [,2] [,3]
[1,] 5 17 23
[2,] 8 18 22
[3,] 3 16 24
Each of the matrices should be unique and each of the columns should consist of three unique numbers (the 1st column from 1-10, the 2nd column from 11-20, the 3rd column from 21-30).
Generating random numbers is easy, but how do I make sure that the generated cards are unique? Please have a look at the post that I accepted as the answer, as it gives a thorough explanation of how to achieve it.
(N.B. : I misread "rows" instead of "columns", so the following code and explanation will deal with matrices with random numbers 1-10 on 1st row, 11-20 on 2nd row etc., instead of columns, but it's exactly the same just transposed)
This code should guarantee uniqueness and good randomness :
library(gtools)
# helper function
getKthPermWithRep <- function(k,n,r){
  k <- k - 1
  # after the shift, valid indices are 0 .. n^r - 1
  if(k >= n^r){
    stop('k is greater than possible permutations')
  }
  v <- rep.int(0,r)
  index <- length(v)
  while ( k != 0 )
  {
    remainder<- k %% n
    k <- k %/% n
    v[index] <- remainder
    index <- index - 1
  }
  return(v+1)
}
# get all possible permutations of 10 elements taken 3 at a time
# (singlerowperms = 720)
allperms <- permutations(10,3)
singlerowperms <- nrow(allperms)
# get 20 random and unique bingo cards
cards <- lapply(sample.int(singlerowperms^3,20),FUN=function(k){
perm2use <- getKthPermWithRep(k,singlerowperms,3)
m <- allperms[perm2use,]
m[2,] <- m[2,] + 10
m[3,] <- m[3,] + 20
return(m)
# if you want transpose the result just do:
# return(t(m))
})
Explanation
(disclaimer tl;dr)
To guarantee both randomness and uniqueness, one safe approach is to generate all the possible bingo cards and then choose randomly among them without replacement.
To generate all the possible cards, we should:
1. generate all the possibilities for each row of 3 elements
2. get the cartesian product of them
Step (1) can be easily obtained using the function permutations of package gtools (see the object allperms in the code). Note that we just need the permutations for the first row (i.e. 3 elements taken from 1-10), since the permutations of the other rows can be easily obtained from the first by adding 10 and 20 respectively.
Step (2) is also easy to get in R, but let's first consider how many possibilities will be generated. Step (1) returned 720 cases for each row, so, in the end we will have 720*720*720 = 720^3 = 373248000 possible bingo cards!
Generating all of them is not practical since the occupied memory would be huge, thus we need to find a way to get 20 random elements in this big range of possibilities without actually keeping them in memory.
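The counts can be checked quickly in R. The memory figure at the end is only a rough estimate that assumes each card is stored as nine 4-byte integers with no overhead; the variable names are illustrative.

```r
# Ordered triples without repetition from 10 numbers: 10 * 9 * 8
single_row_perms <- prod(10:8)
single_row_perms

# Three independent rows give the cube of that count
total_cards <- single_row_perms^3
total_cards

# The total still fits in a 32-bit integer, so it can be used as an index range
total_cards <= .Machine$integer.max

# Rough size in GB if every card were stored as 9 integers of 4 bytes each
total_cards * 9 * 4 / 1024^3
```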
The solution comes from the function getKthPermWithRep, which, given an index k, returns the k-th permutation with repetition of r elements taken from 1:n (note that in this case permutation with repetition corresponds to the cartesian product).
e.g.
# all permutations with repetition of 2 elements in 1:3 are
permutations(n = 3, r = 2,repeats.allowed = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 2
# [9,] 3 3
# using the getKthPermWithRep you can get directly the k-th permutation you want :
getKthPermWithRep(k=4,n=3,r=2)
# [1] 2 1
getKthPermWithRep(k=8,n=3,r=2)
# [1] 3 2
Hence now we just choose 20 random indexes in the range 1:720^3 (using sample.int function), then for each of them we get the corresponding permutation of 3 numbers taken from 1:720 using function getKthPermWithRep.
Finally, these triplets of numbers can be converted to actual card rows by using them as indexes to subset allperms and get our final matrix (after, of course, adding 10 and 20 to the 2nd and 3rd rows).
Bonus
Explanation of getKthPermWithRep
If you look at the example above (permutations with repetition of 2 elements in 1:3) and subtract 1 from every number in the result, you get this:
> permutations(n = 3, r = 2,repeats.allowed = T) - 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 0 2
[4,] 1 0
[5,] 1 1
[6,] 1 2
[7,] 2 0
[8,] 2 1
[9,] 2 2
If you consider each number of each row as a digit, you can notice that those rows (00, 01, 02...) are all the numbers from 0 to 8, represented in base 3 (yes, 3 as n). So, when you ask for the k-th permutation with repetition of r elements in 1:n, you are also asking to translate k-1 into base n and return the digits increased by 1.
Therefore, given the algorithm to change any number from base 10 to base n :
changeBase <- function(num,base){
v <- NULL
while ( num != 0 )
{
remainder = num %% base # assume K > 1
num = num %/% base # integer division
v <- c(remainder,v)
}
if(is.null(v)){
return(0)
}
return(v)
}
you can easily obtain getKthPermWithRep function.
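One way to make that last step explicit is sketched below: convert k - 1 with changeBase, pad the digits with leading zeros up to length r, and add 1. The wrapper name kthPermFromBase is made up for this sketch, and it assumes k is between 1 and n^r.

```r
# changeBase as described above: digits of num in the requested base
changeBase <- function(num, base) {
  v <- NULL
  while (num != 0) {
    v <- c(num %% base, v)   # prepend the next digit
    num <- num %/% base      # integer division
  }
  if (is.null(v)) 0 else v
}

# Hypothetical wrapper: pad the digits of k - 1 to length r,
# then shift from 0-based digits to 1-based elements
kthPermFromBase <- function(k, n, r) {
  digits <- changeBase(k - 1, n)
  c(rep(0, r - length(digits)), digits) + 1
}

kthPermFromBase(4, 3, 2)  # 2 1
kthPermFromBase(8, 3, 2)  # 3 2
kthPermFromBase(1, 3, 2)  # 1 1
```

The first two calls reproduce the rows shown for getKthPermWithRep(k=4,n=3,r=2) and getKthPermWithRep(k=8,n=3,r=2) in the example above.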
One 3x3 matrix with the desired value range can be generated with the following code:
mat <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30, 3)), nrow=3)
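Furthermore, you can use a loop to build a list of 20 matrices, redrawing whenever a card has already been generated so that all of them are unique:
cards <- list()
while (length(cards) < 20) {
  card <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30,3)), nrow=3)
  # keep the card only if an identical one has not been drawn already
  if (!any(vapply(cards, identical, logical(1), card))) cards[[length(cards) + 1]] <- card
}
cards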
Well OK I may fall on my face here but I propose a checksum (using Excel).
This is a unique signature for each bingo card which remains invariant if the order of numbers within any column is changed without changing the actual numbers. The formula is
=SUM(10^MOD(A2:A4,10)+2*10^MOD(B2:B4,10)+4*10^MOD(C2:C4,10))
where the bingo numbers for the first card are in A2:C4.
The idea is to generate a 10-digit number for each column, then multiply each by a constant and add them to get the signature.
So here I have generated two random bingo cards using a standard formula from here plus two which are deliberately made to be just permutations of each other.
Then I check if any of the signatures are duplicates using the formula
=MAX(COUNTIF(D5:D20,D5:D20))
which shouldn't give an answer greater than 1.
In the unlikely event that there were duplicates, then you would just press F9 and generate some new cards.
All formulae are array formulae and must be entered with Ctrl+Shift+Enter.
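For anyone working in R rather than Excel, the same signature can be sketched as a small function. This is a translation of the spreadsheet formula under the assumption that a card is a 3x3 numeric matrix; card_signature and the three sample cards are made up for illustration.

```r
# Signature of a 3x3 card, mirroring the spreadsheet formula:
# each column becomes a sum of powers of 10, weighted 1, 2 and 4
card_signature <- function(card) {
  weights <- c(1, 2, 4)
  sum(sapply(1:3, function(j) weights[j] * sum(10^(card[, j] %% 10))))
}

card_a <- matrix(c(5, 8, 3, 17, 18, 16, 23, 22, 24), nrow = 3)
card_b <- card_a[c(3, 1, 2), ]   # same numbers, order within columns changed
card_c <- card_a
card_c[1, 1] <- 6                # one number changed

card_signature(card_a) == card_signature(card_b)   # TRUE
card_signature(card_a) == card_signature(card_c)   # FALSE
```

With weights 1, 2 and 4, each decimal position of the signature can only take values from 0 to 7, so no carrying occurs and two cards share a signature only when they hold the same numbers in each column.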
Here is an inelegant way to do this. Generate all possible combinations and then sample without replacement. Strictly speaking these are permutations, not combinations: order does matter in bingo.
library(dplyr)
library(tidyr)
library(magrittr)
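generate_samples = function(n) {
  first = tibble(first = (n-9):n) # data_frame() is deprecated in favour of tibble()
  first %>%
    merge(first %>% rename(second = first)) %>%
    merge(first %>% rename(third = first)) %>%
    # drop triples that repeat a number within the column
    filter(first != second, first != third, second != third) %>%
    sample_n(20)
}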
suffix = function(df, suffix)
df %>%
setNames(names(.) %>%
paste0(suffix))
generate_samples(10) %>% suffix(10) %>%
bind_cols(generate_samples(20) %>% suffix(20)) %>%
bind_cols(generate_samples(30) %>% suffix(30)) %>%
rowwise %>%
do(matrix = t(.) %>% matrix(3)) %>%
use_series(matrix)