R: Why is numeric data treated like non-numeric data? - dataframe

my R script is:
aa <- data.frame("1","3","1.5","2.1")
mean(aa)
Then I get:
[1] NA
Warning message: In mean.default(aa) : argument is not numeric or logical: returning NA
Can this be solved without removing the quotation marks or changing data.frame to something else?

For a data.frame, if you want the mean of each column, you can use the colMeans function:
aa <- data.frame("1","3","1.5","2.1", stringsAsFactors = FALSE)
aa <- sapply(aa, as.numeric)
colMeans(aa)
mean(colMeans(aa)) # if you want average across all columns
You must create the data.frame with stringsAsFactors = FALSE; otherwise each character column becomes a factor with a single level, and as.numeric would return the level code, which is 1 for every value, instead of the number itself.
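A quick illustration of that pitfall (a minimal sketch; the auto-generated column name will be something like X.1.5.):
aa_fct <- data.frame("1.5", stringsAsFactors = TRUE)
as.numeric(aa_fct[[1]]) # 1: the factor's level code, not the value
as.numeric(as.character(aa_fct[[1]])) # 1.5: convert via character to recover the number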

Related

Importing data using R from SQL Server truncates leading zeros

I'm trying to import data from a table in SQL Server and then write it to a .txt file, as follows. However, when I do that, all numbers with leading 0s seem to get trimmed.
For example, if I have 000124 in the database, it shows up as 124 in the .txt file, and if I check x_1 it's 124 there as well.
How can I avoid this? I want to keep the leading 0s in x_1 and also need them in the output .txt file.
library(RODBC)
library(lubridate)
library(data.table)
cn_1 <- odbcConnect('channel_name')
qry <- "
select
*
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE)
rm(qry)
setDT(x_1)
fwrite(x=x_1, file=paste0(export_location, "file_name", date_today, ".txt"), sep="|", quote=TRUE, row.names=FALSE, na="")
Assuming that the underlying data in the DBMS is indeed "string"-like ...
RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE; when it is false and the column is not an unambiguous type like "date" or "timestamp", RODBC calls type.convert, which sees the number-like field and converts it to integer or numeric.
Try:
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)
and that will stop auto-conversion of all columns.
That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:
as.is: controls conversion of character variables (insofar as they
are not converted to logical, numeric or complex) to factors,
if not otherwise specified by 'colClasses'. Its value is
either a vector of logicals (values are recycled if
necessary), or a vector of numeric or character indices which
specify which columns should not be converted to factors.
so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps
## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")
## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)
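Per the quoted documentation, a logical vector (recycled if necessary) also works; a hypothetical five-column example where only columns 1 and 4 are string-like:
## or a logical vector, one entry per column
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = c(TRUE, FALSE, FALSE, TRUE, FALSE))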
(Side note: while I use select * ... on occasion as well, the presumption of knowing columns by-number is predicated on knowing all of the columns included in that table/query. If anything changes, perhaps it's actually a SQL view and somebody updates it ... or somebody changes the order of columns, then your assumption about column indices is a little fragile. All of my "production" queries in my internal packages have every column spelled out, with no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)
If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like
qry10 <- "
select
*
from table_name
limit 10"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z))) # columns where every sampled value carries a leading zero
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))
Caveat: I don't use RODBC, nor have I set up a temporary database with appropriately-fashioned values, so this is untested.
Let x_1 be the result data.table from your SQL query. Then you can convert numeric columns (e.g. value) to formatted strings using sprintf to get leading zeros:
library(data.table)
x_1 <- data.table(value = c(1,12,123,1234))
x_1
#> value
#> 1: 1
#> 2: 12
#> 3: 123
#> 4: 1234
x_1$value <- x_1$value |> sprintf(fmt = "%04d")
x_1
#> value
#> 1: 0001
#> 2: 0012
#> 3: 0123
#> 4: 1234
Created on 2021-10-08 by the reprex package (v2.0.1)
Note that this assumes every ID should be padded to one known, fixed width; if the widths varied in the source, no format string can recover them after numeric conversion, and keeping the column as character (as.is above) is the safer route.

Rename certain values of a row in a certain column if they meet the criteria

How can I rename certain values in a column if they meet a certain condition?
Example:
Date Type C1
2000 none 3
2000 love 4
2000 none 6
2000 none 2
2000 bad 8
So I want to rename "love" and "bad" in my column type into "xxx".
Date Type C1
2000 none 3
2000 xxx 4
2000 none 6
2000 none 2
2000 xxx 8
Is there a neat way of doing it?
Thank you :)
First, make sure it's not a factor, then rename:
df$Type = as.character(df$Type)
df$Type[df$Type %in% c("love", "bad")] = "xxx"
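A self-contained check, building df from the question's example (column names assumed from the printout):
df <- data.frame(Date = rep(2000, 5),
                 Type = c("none", "love", "none", "none", "bad"),
                 C1 = c(3, 4, 6, 2, 8))
df$Type = as.character(df$Type) # no-op on R >= 4.0, where strings no longer default to factors
df$Type[df$Type %in% c("love", "bad")] = "xxx"
df
#   Date Type C1
# 1 2000 none  3
# 2 2000  xxx  4
# 3 2000 none  6
# 4 2000 none  2
# 5 2000  xxx  8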
If the data is a factor, you want to rename the factor level. The easiest way to do that is with fct_recode() in the forcats package. If it's a character vector, ifelse works well if the number of changes is small. If it's large, case_when in the dplyr package works well.
library(forcats)
library(dplyr)
df <- within(df, { # with dplyr you could use mutate() instead; you'd change `<-` to `=` and put `,` at the end of each line
  Type_fct1 <- fct_recode(Type, xxx = "love", xxx = "bad")
  # in base R you can change the factor levels directly, but it's clunky
  Type_fct2 <- Type
  levels(Type_fct2)[levels(Type_fct2) %in% c("love", "bad")] <- "xxx"
  # methods using character vectors
  Type_chr1 <- ifelse(as.character(Type) %in% c("love", "bad"), "xxx", as.character(Type))
  Type_chr2 <- case_when(
    Type %in% c("love", "bad") ~ "xxx",
    Type == "none" ~ "something_else", # thrown in to show how to use `case_when` with many different criteria
    TRUE ~ NA_character_
  )
})

R inner join different data types

I was wondering if there was a way or maybe another package that uses SQL queries to manipulate dataframes so that I don't necessarily have to convert numerical variables to strings/characters.
library(dplyr)
input_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
output_data <- data.frame(input_key)
answer_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", "Chocolate", "Donuts", "Juice")
answer_data <- data.frame(cbind(input_key, answer_product), stringsAsFactors = FALSE)
left_join(output_data,answer_data, by = "input_key")
The left_join function from dplyr also works with numeric keys.
I think your problem comes from the cbind function: its output is a matrix, and a matrix can only store one data type, so your numeric values are coerced to character.
Unlike a matrix, a data.frame can store columns of different types, like a list.
From your code, the key column is converted to character:
> str(answer_data)
'data.frame': 9 obs. of 2 variables:
$ input_key : chr "9061" "8680" "1546" "5376" ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
If instead you construct the data.frame with:
answer_data_2 <- data.frame(
input_key = input_key,
answer_product = answer_product,
stringsAsFactors = FALSE
)
the key column stays numeric
> str(answer_data_2)
'data.frame': 9 obs. of 2 variables:
$ input_key : num 9061 8680 1546 5376 9550 ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
and
left_join(output_data, answer_data_2, by = "input_key")
works with the numeric keys.
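Since the question asked for a package that takes SQL queries directly: the sqldf package runs SQL against data.frames in place, numeric keys included. A minimal sketch, assuming sqldf and its default SQLite backend are installed:
library(sqldf)
sqldf("select o.input_key, a.answer_product
       from output_data o
       left join answer_data_2 a on o.input_key = a.input_key")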

keeping leading zeros with sqldf

I am a total SQL ignoramus so I apologize if this is very simple..
I have data that contains an ID column consisting of numbers, and in many cases contains leading zeros. I would like to import the data using sqldf, but in doing so I lose the leading zeros for these. Is there a way to keep the leading zeros? Maybe by somehow specifying that all columns are character classes like in R's read.table?
I can't share my data due to the nature of my work, but I am doing something like this:
a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")
f2 <- file(f1)
mydat <- sqldf::sqldf("SELECT * FROM f2", dbname=tempfile(),
file.format=list(header=FALSE, sep="|", eol="\n", skip=1))
mydat
Also, I would like to add that the length is not the same for all of these IDs. If possible, I would like to avoid having to manually pad the data with zeros after the fact..
Use colClasses like this:
library(sqldf)
read.csv.sql(f1, header = FALSE, sep = "|", colClasses = c("character", "numeric"))
giving:
V1 V2
1 00000029 1.7150650
2 00000078 0.4609162
3 00000040 -1.2650612
4 00000085 -0.6868529
5 00000090 -0.4456620
6 00000005 1.2240818
7 00000050 0.3598138
8 00000083 0.4007715
9 00000051 0.1106827
10 00000042 -0.5558411
Note: We used the input file generated using this random seed:
set.seed(123)
a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")
One way to restore leading zeros is with SQL string functions: prepend more zeros than the width you need, concatenate them with your actual ID field, and keep only the rightmost characters at the required width. Below uses 8 characters as the string length; rightstr comes from the SQLite extension functions that sqldf loads, and since the file is read with header=FALSE the ID column is named V1:
mydat <- sqldf::sqldf("select rightstr('0000000000000' || ID, 8) As LeadZeroID,
* from f2;",
dbname=tempfile(),
file.format=list(header=FALSE, sep="|", eol="\n", skip=1))

R: Matching two tables on multiple columns and creating a matched/not matched flag

I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
  create table x as
  select
    a.*,
    b.*,
    case when a.first_column = b.column_first and
              a.second_column = b.column_second
         then 1 else 0 end as matched_flag
  from table1 as a
  left join table2 as b
    on a.first_column = b.column_first and a.second_column = b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and give them partially overlapping row and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical of the same length of the first argument (in this case, rownames(mat.a)) that indicates if the ith element of the first argument was found anywhere in the elements of the second argument. This nature of %in% means that you have to be sensitive to how you order the arguments for your indexing.
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, you do not need to be sensitive to how you order the arguments, because the number of row names of mat.a in row names of mat.b is equivalent to the number of row names of mat.b in row names of mat.a. That is to say that this usage of %in% is commutative.
I hope this helps!
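To get something closer to the SAS matched_flag across both key columns, here is a minimal sketch that pastes the two keys together before applying %in% (column names from the question, made-up example values):
table1 <- data.frame(first_column = c(1, 2, 3), second_column = c(4, 5, 6))
table2 <- data.frame(column_first = c(1, 3, 6), column_second = c(4, 5, 9))
keys_a <- paste(table1$first_column, table1$second_column)
keys_b <- paste(table2$column_first, table2$column_second)
table1$matched_flag <- as.integer(keys_a %in% keys_b) # 1 where both columns match some row of table2
table1
#   first_column second_column matched_flag
# 1            1             4            1
# 2            2             5            0
# 3            3             6            0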
You will want to use data.frame objects. These are like datasets in SAS. You can use cbind to put two data.frame objects together side by side. Then you can select rows based on conditions and set the flag based on this. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(list(a.first_column=c(1,2,3),a.second_column=c(4,5,6)))
table2 <- data.frame(list(b.first_column=c(1,3,6),b.second_column=c(4,5,9)))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when both key columns match, as in the SAS case-when)
x$matched_flag[x$a.first_column == x$b.first_column & x$a.second_column == x$b.second_column] <- 1
x$matched_flag[!(x$a.first_column == x$b.first_column & x$a.second_column == x$b.second_column)] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.
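For completeness, a hedged sketch of that sqldf route for this exact task (redefining the made-up tables with plain column names, since dots in names would collide with SQL's table.column syntax):
library(sqldf)
table1 <- data.frame(first_column = c(1, 2, 3), second_column = c(4, 5, 6))
table2 <- data.frame(column_first = c(1, 3, 6), column_second = c(4, 5, 9))
x <- sqldf("select a.*,
                   case when b.column_first is not null then 1 else 0 end as matched_flag
            from table1 a
            left join table2 b
              on a.first_column = b.column_first
             and a.second_column = b.column_second")
x
#   first_column second_column matched_flag
# 1            1             4            1
# 2            2             5            0
# 3            3             6            0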