Rename certain values of a row in a certain column if the meet the criteria - dataframe

How can I rename certain values of a row from a column if they meet a certain if-statement?
Example:
Date Type C1
2000 none 3
2000 love 4
2000 none 6
2000 none 2
2000 bad 8
So I want to rename "love" and "bad" in my column type into "xxx".
Date Type C1
2000 none 3
2000 xxx 4
2000 none 6
2000 none 2
2000 xxx 8
Is there a neat way of doing it?
Thank you :)

First, make sure it's not a factor, then rename:
df$Type = as.character(df$Type)
df$Type[df$Type %in% c("love", "bad")] = "xxx"

If the data is a factor, you want to rename the factor level. The easiest way to do that is with fct_recode() in the forcats package. If it's a character vector, ifelse works well if the number of changes is small. If it's large, case_when in the dplyr package works well.
library(forcats)
library(dplyr)
df <- within(df, { # if you use `dplyr`, you can replace this with mutate. You'd also need to change `<-` to `=` and add `,` at the end of each line.
Type_fct1 <- fct_recode(Type, xxx = "love", xxx = "bad")
# in base R, you need can change the factor labels, but its clunky
Type_fct2 <- Type
levels(Type_fct2)[levels(Type_fct2) %in% c("love", "bad")] <- "xxx"
# methods using character vectors
Type_chr1 <- ifelse(as.character(Type) %in% c("love", "bad"), "xxx", as.character(Type))
Type_chr2 <- case_when(
Type %in% c("love", "bad") ~ "xxx",
Type == "none" ~ "something_else", # thrown in to show how to use `case_when` with many different criterion.
TRUE ~ NA_character_
)
})

Related

Importing data using R from SQL Server truncate leading zeros

I'm trying to import data from a table in SQL Server and then write it into a .txt file. I'm doing it in the following way. However when I do that all numbers having leading 0 s seems to get trimmed.
For example if I have 000124 in the database, it's shown as 124 in the .txt as well as if I check x_1 it's 124 in there as well.
How can I avoid this? I want to keep the leading 0 s in x_1 and also need them in the output .txt file.
library(RODBC)
library(lubridate)
library(data.table)
cn_1 <- odbcConnect('channel_name')
qry <- "
select
*
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE)
rm(qry)
setDT(x_1)
fwrite(x=x_1, file=paste0(export_location, "file_name", date_today, ".txt"), sep="|", quote=TRUE, row.names=FALSE, na="")
Assuming that the underlying data in the DBMS is indeed "string"-like ...
RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE, and when false and not a clear type like "date" or "timestamp", RODBC calls type.convert which will see the number-like field and convert it to integers or numbers.
Try:
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)
and that will stop auto-conversion of all columns.
That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:
as.is: controls conversion of character variables (insofar as they
are not converted to logical, numeric or complex) to factors,
if not otherwise specified by 'colClasses'. Its value is
either a vector of logicals (values are recycled if
necessary), or a vector of numeric or character indices which
specify which columns should not be converted to factors.
so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps
## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")
## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)
(Side note: while I use select * ... on occasion as well, the presumption of knowing columns by-number is predicated on know all of the columns included in that table/query. If anything changes, perhaps it's actually a SQL view and somebody updates it ... or if somebody changes the order of columns, than your assumptions of column indices is a little fragile. All of my "production" queries in my internal packages have all columns spelled out, no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)
If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like
qry10 <- "
select
*
from table_name
limit 10"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))
Caveat: I don't use RODBC nor have I set up a temporary database with appropriately-fashioned values, so this untested.
Let x_1 be the result data.table from your SQL query. Then you can convert numeric columns (e.g. value) to formatted strings using sprintf to get leading zeros:
library(data.table)
x_1 <- data.table(value = c(1,12,123,1234))
x_1
#> value
#> 1: 1
#> 2: 12
#> 3: 123
#> 4: 1234
x_1$value <- x_1$value |> sprintf(fmt = "%04d")
x_1
#> value
#> 1: 0001
#> 2: 0012
#> 3: 0123
#> 4: 1234
Created on 2021-10-08 by the reprex package (v2.0.1)

How to set flag value based on data that use one-hot-encoding

I have a database consisting of three tables like this:
I want to make a machine learning model in R using that database, and the data I need is like this:
I can use one hot encoding to convert categorical variable from t_pengolahan (such as "Pengupasan, Fermentasi, etc") into attributes. But, how to set flag (yes or no) to the data value based on "result (using SQL query)" data above?
We can combine two answers to previous related questions, each of which provides half of the solution; those answers are found here and here:
library(dplyr) ## dplyr and tidyr loaded for wrangling
library(tidyr)
options(dplyr.width = Inf) ## we want to show all columns of result
yes_fun <- function(x) { ## helps with pivot_wider() below
if ( length(x) > 0 ) {
return("yes")
}
}
sql_result %>%
separate_rows(pengolahan) %>% ## add rows for unique words in pengolahan
pivot_wider(names_from = pengolahan, ## spread to yes/no indicators
values_from = pengolahan,
values_fill = list(pengolahan = "no"),
values_fn = list(pengolahan = yes_fun))
Data
id_pangan <- 1:3
kategori <- c("Daging", "Buah", "Susu")
pengolahan <- c("Penggilingan, Perebusan", "Pengupasan",
"Fermentasi, Sterilisasi")
batas <- c(100, 50, 200)
sql_result <- data.frame(id_pangan, kategori, pengolahan, batas)
# A tibble: 3 x 8
id_pangan kategori batas Penggilingan Perebusan Pengupasan
<int> <fct> <dbl> <chr> <chr> <chr>
1 1 Daging 100 yes yes no
2 2 Buah 50 no no yes
3 3 Susu 200 no no no
Fermentasi Sterilisasi
<chr> <chr>
1 no no
2 no no
3 yes yes
This seems unclear to me. What do you mean with "how to set flag (yes or no) to the data value based on "result (using SQL query)" data"? Do you want to convert one of the column to a boolean value? If so you need to specify the decision rule.
This may look like this:
SELECT (... other columns),
CASE case_expression
WHEN when_expression_1 THEN 'yes'
WHEN when_expression_2 THEN 'no'
ELSE ''
END
To help others help you:
- which SQL variant do you use? (e.g. would a sqlite solution work for you?)
- provide an sql script of your table creation, plus a script to "use one hot encoding to convert categorical variable from t_pengolahan (such as "Pengupasan, Fermentasi, etc") into attributes"

R: Why is numeric data treated like non-numeric data?

my R script is:
aa <- data.frame("1","3","1.5","2.1")
mean(aa)
Then I get:
[1] NA
Warning message: In mean.default(aa) : argument is not numeric or logical: returning NA
Can this be solved without removing the quotation marks or changing data.frame to something else?
For a data.frame if you want to get mean value for each column, you can do it using colMeans function:
aa <- data.frame("1","3","1.5","2.1", stringsAsFactors = FALSE)
aa <- sapply(aa, as.numeric)
colMeans(aa)
mean(colMeans(aa)) # if you want average across all columns
You must create the data.frame using stringsAsFactors = FALSE otherwise all character values will be separate factors with ordinal 1 each. Numeric representation would be 1 for each value.

Create 20 unique bingo cards

I'm trying to create 20 unique cards with numbers, but I struggle a bit.. So basically I need to create 20 unique matrices 3x3 having numbers 1-10 in first column, numbers 11-20 in the second column and 21-30 in the third column.. Any ideas? I'd prefer to have it done in r, especially as I don't know Visual Basic. In excel I know how to generate the cards, but not sure how to ensure they are unique..
It seems to be quite precise and straightforward to me. Anyway, i needed to create 20 matrices that would look like :
[,1] [,2] [,3]
[1,] 5 17 23
[2,] 8 18 22
[3,] 3 16 24
Each of the matrices should be unique and each of the columns should consist of three unique numbers ( the 1st column - numbers 1-10, the 2nd column 11-20, the 3rd column - 21-30).
Generating random numbers is easy, though how to make sure that generated cards are unique?Please have a look at the post that i voted for as an answer - as it gives you thorough explanation how to achieve it.
(N.B. : I misread "rows" instead of "columns", so the following code and explanation will deal with matrices with random numbers 1-10 on 1st row, 11-20 on 2nd row etc., instead of columns, but it's exactly the same just transposed)
This code should guarantee uniqueness and good randomness :
library(gtools)
# helper function
getKthPermWithRep <- function(k,n,r){
k <- k - 1
if(n^r< k){
stop('k is greater than possibile permutations')
}
v <- rep.int(0,r)
index <- length(v)
while ( k != 0 )
{
remainder<- k %% n
k <- k %/% n
v[index] <- remainder
index <- index - 1
}
return(v+1)
}
# get all possible permutations of 10 elements taken 3 at a time
# (singlerowperms = 720)
allperms <- permutations(10,3)
singlerowperms <- nrow(allperms)
# get 20 random and unique bingo cards
cards <- lapply(sample.int(singlerowperms^3,20),FUN=function(k){
perm2use <- getKthPermWithRep(k,singlerowperms,3)
m <- allperms[perm2use,]
m[2,] <- m[2,] + 10
m[3,] <- m[3,] + 20
return(m)
# if you want transpose the result just do:
# return(t(m))
})
Explanation
(disclaimer tl;dr)
To guarantee both randomness and uniqueness, one safe approach is generating all the possibile bingo cards and then choose randomly among them without replacements.
To generate all the possible cards, we should :
generate all the possibilities for each row of 3 elements
get the cartesian product of them
Step (1) can be easily obtained using function permutations of package gtools (see the object allPerms in the code). Note that we just need the permutations for the first row (i.e. 3 elements taken from 1-10) since the permutations of the other rows can be easily obtained from the first by adding 10 and 20 respectively.
Step (2) is also easy to get in R, but let's first consider how many possibilities will be generated. Step (1) returned 720 cases for each row, so, in the end we will have 720*720*720 = 720^3 = 373248000 possible bingo cards!
Generate all of them is not practical since the occupied memory would be huge, thus we need to find a way to get 20 random elements in this big range of possibilities without actually keeping them in memory.
The solution comes from the function getKthPermWithRep, which, given an index k, it returns the k-th permutation with repetition of r elements taken from 1:n (note that in this case permutation with repetition corresponds to the cartesian product).
e.g.
# all permutations with repetition of 2 elements in 1:3 are
permutations(n = 3, r = 2,repeats.allowed = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 2
# [9,] 3 3
# using the getKthPermWithRep you can get directly the k-th permutation you want :
getKthPermWithRep(k=4,n=3,r=2)
# [1] 2 1
getKthPermWithRep(k=8,n=3,r=2)
# [1] 3 2
Hence now we just choose 20 random indexes in the range 1:720^3 (using sample.int function), then for each of them we get the corresponding permutation of 3 numbers taken from 1:720 using function getKthPermWithRep.
Finally these triplets of numbers, can be converted to actual card rows by using them as indexes to subset allPerms and get our final matrix (after, of course, adding +10 and +20 to the 2nd and 3rd row).
Bonus
Explanation of getKthPermWithRep
If you look at the example above (permutations with repetition of 2 elements in 1:3), and subtract 1 to all number of the results you get this :
> permutations(n = 3, r = 2,repeats.allowed = T) - 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 0 2
[4,] 1 0
[5,] 1 1
[6,] 1 2
[7,] 2 0
[8,] 2 1
[9,] 2 2
If you consider each number of each row as a number digit, you can notice that those rows (00, 01, 02...) are all the numbers from 0 to 8, represented in base 3 (yes, 3 as n). So, when you ask the k-th permutation with repetition of r elements in 1:n, you are also asking to translate k-1 into base n and return the digits increased by 1.
Therefore, given the algorithm to change any number from base 10 to base n :
changeBase <- function(num,base){
v <- NULL
while ( num != 0 )
{
remainder = num %% base # assume K > 1
num = num %/% base # integer division
v <- c(remainder,v)
}
if(is.null(v)){
return(0)
}
return(v)
}
you can easily obtain getKthPermWithRep function.
One 3x3 matrix with the desired value range can be generated with the following code:
mat <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30, 3)), nrow=3)
Furthermore, you can use a for loop to generate a list of 20 unique matrices as follows:
for (i in 1:20) {
mat[[i]] <- list(matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30,3)), nrow=3))
print(mat[[i]])
}
Well OK I may fall on my face here but I propose a checksum (using Excel).
This is a unique signature for each bingo card which will remain invariate if the order of numbers within any column is changed without changing the actual numbers. The formula is
=SUM(10^MOD(A2:A4,10)+2*10^MOD(B2:B4,10)+4*10^MOD(C2:C4,10))
where the bingo numbers for the first card are in A2:C4.
The idea is to generate a 10-digit number for each column, then multiply each by a constant and add them to get the signature.
So here I have generated two random bingo cards using a standard formula from here plus two which are deliberately made to be just permutations of each other.
Then I check if any of the signatures are duplicates using the formula
=MAX(COUNTIF(D5:D20,D5:D20))
which shouldn't given an answer more than 1.
In the unlikely event that there were duplicates, then you would just press F9 and generate some new cards.
All formulae are array formulae and must be entered with CtrlShiftEnter
Here is an inelegant way to do this. Generate all possible combinations and then sample without replacement. These are permutations, combinations: order does matter in bingo
library(dplyr)
library(tidyr)
library(magrittr)
generate_samples = function(n) {
first = data_frame(first = (n-9):n)
first %>%
merge(first %>% rename(second = first)) %>%
merge(first %>% rename(third = first)) %>%
sample_n(20)
}
suffix = function(df, suffix)
df %>%
setNames(names(.) %>%
paste0(suffix))
generate_samples(10) %>% suffix(10) %>%
bind_cols(generate_samples(20) %>% suffix(20)) %>%
bind_cols(generate_samples(30) %>% suffix(30)) %>%
rowwise %>%
do(matrix = t(.) %>% matrix(3)) %>%
use_series(matrix)

Read only n-th column of a text file which has no header with R and sqldf

I have a similiar problem like this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (table of 150rows, >500,000 columns, space separated, filled with numeric data and only a 32 bit system available). This file has no header, therefore the code in the thread above didn't work and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start and end position of the columns you want to read in if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.