Keeping leading zeros with sqldf

I am a total SQL ignoramus, so I apologize if this is very simple.
I have data with an ID column consisting of numbers, many of which have leading zeros. I would like to import the data using sqldf, but in doing so I lose the leading zeros. Is there a way to keep them, perhaps by specifying that all columns are character, as with colClasses in R's read.table?
I can't share my data due to the nature of my work, but I am doing something like this:
a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")
f2 <- file(f1)
mydat <- sqldf::sqldf("SELECT * FROM f2", dbname = tempfile(),
                      file.format = list(header = FALSE, sep = "|", eol = "\n", skip = 1))
mydat
Also, I would like to add that these IDs are not all the same length. If possible, I would like to avoid having to manually pad the data with zeros after the fact.

Use colClasses like this:
library(sqldf)
read.csv.sql(f1, header = FALSE, sep = "|", colClasses = c("character", "numeric"))
giving:
V1 V2
1 00000029 1.7150650
2 00000078 0.4609162
3 00000040 -1.2650612
4 00000085 -0.6868529
5 00000090 -0.4456620
6 00000005 1.2240818
7 00000050 0.3598138
8 00000083 0.4007715
9 00000051 0.1106827
10 00000042 -0.5558411
Note: We used the input file generated using this random seed:
set.seed(123)
a <- formatC(sample(1:99, 10), width = 8, format = "d", flag = "0")
fakeDF <- data.frame(v1=a, v2=rnorm(10, 0, 1))
f1 <- tempfile()
write.table(fakeDF, file=f1, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="|")

Another way to keep the leading zeros is with SQL string functions: prepend more zeros than the desired string length, concatenate with your actual ID field, and then take the rightmost characters up to the required length. The example below uses 8 characters as the string length:
mydat <- sqldf::sqldf("select rightstr('0000000000000' || V1, 8) as LeadZeroID, *
                       from f2",
                      dbname = tempfile(),
                      file.format = list(header = FALSE, sep = "|", eol = "\n", skip = 1))
(With header = FALSE the columns are named V1, V2, ..., so the ID column is referred to as V1.)
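If rightstr is not available in your SQLite build (it comes from RSQLite's extension functions, which sqldf normally loads), a sketch of the same idea using only core SQLite, where substr with a negative start position counts from the end of the string:
# substr(X, -8) in SQLite returns the last 8 characters of X
mydat <- sqldf::sqldf("select substr('0000000000000' || V1, -8) as LeadZeroID, *
                       from f2",
                      dbname = tempfile(),
                      file.format = list(header = FALSE, sep = "|", eol = "\n", skip = 1))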

Related

Importing data using R from SQL Server truncate leading zeros

I'm trying to import data from a table in SQL Server and then write it into a .txt file. I'm doing it in the following way. However, when I do that, all numbers having leading 0s seem to get trimmed.
For example, if I have 000124 in the database, it's shown as 124 in the .txt file, and if I check x_1 it's 124 in there as well.
How can I avoid this? I want to keep the leading 0s in x_1 and also need them in the output .txt file.
library(RODBC)
library(lubridate)
library(data.table)
cn_1 <- odbcConnect('channel_name')
qry <- "
select
*
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE)
rm(qry)
setDT(x_1)
fwrite(x=x_1, file=paste0(export_location, "file_name", date_today, ".txt"), sep="|", quote=TRUE, row.names=FALSE, na="")
Assuming that the underlying data in the DBMS is indeed "string"-like ...
RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE; when it is false and the column is not a clear type like "date" or "timestamp", RODBC calls type.convert, which sees the number-like field and converts it to integer or numeric.
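You can see the effect of that conversion in isolation with base R's type.convert (a minimal illustration):
# type.convert() treats "000124" as a number, so the zeros are lost
utils::type.convert(c("000124", "007"), as.is = TRUE)
#> [1] 124   7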
Try:
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)
and that will stop auto-conversion of all columns.
That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:
as.is: controls conversion of character variables (insofar as they
are not converted to logical, numeric or complex) to factors,
if not otherwise specified by 'colClasses'. Its value is
either a vector of logicals (values are recycled if
necessary), or a vector of numeric or character indices which
specify which columns should not be converted to factors.
so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps
## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")
## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)
(Side note: while I use select * ... on occasion as well, referring to columns by number is predicated on knowing all of the columns included in that table/query. If anything changes, perhaps it's actually a SQL view and somebody updates it ... or somebody changes the order of columns ... then your assumption about column indices is a little fragile. All of my "production" queries in my internal packages have all columns spelled out, no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)
If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like
qry10 <- "
select top 10 *
from table_name"  # top 10: SQL Server's equivalent of LIMIT 10
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))
Caveat: I don't use RODBC, nor have I set up a temporary database with appropriately-fashioned values, so this is untested.
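Although the full round trip needs a database, the detection heuristic itself can be sanity-checked offline on a toy data.frame (hypothetical values):
# Columns where *every* value has leading zeros are flagged for as.is
toy <- data.frame(id  = c("000124", "007"),
                  amt = c("1.5", "2"),
                  stringsAsFactors = FALSE)
sapply(toy, function(z) all(grepl("^0+[1-9]", z)))
#>    id   amt
#>  TRUE FALSE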
Let x_1 be the result data.table from your SQL query. Then you can convert numeric columns (e.g. value) to formatted strings using sprintf to get leading zeros:
library(data.table)
x_1 <- data.table(value = c(1,12,123,1234))
x_1
#> value
#> 1: 1
#> 2: 12
#> 3: 123
#> 4: 1234
x_1$value <- x_1$value |> sprintf(fmt = "%04d")
x_1
#> value
#> 1: 0001
#> 2: 0012
#> 3: 0123
#> 4: 1234
Created on 2021-10-08 by the reprex package (v2.0.1)
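Equivalently, formatC (the same function used to build the fake data in the first question) can do the padding; a sketch using data.table's in-place assignment:
library(data.table)
x_1 <- data.table(value = c(1, 12, 123, 1234))
x_1[, value := formatC(value, width = 4, format = "d", flag = "0")]
x_1
#>    value
#> 1:  0001
#> 2:  0012
#> 3:  0123
#> 4:  1234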

R: Why is numeric data treated like non-numeric data?

my R script is:
aa <- data.frame("1","3","1.5","2.1")
mean(aa)
Then I get:
[1] NA
Warning message: In mean.default(aa) : argument is not numeric or logical: returning NA
Can this be solved without removing the quotation marks or changing data.frame to something else?
For a data.frame, if you want the mean value of each column, you can use the colMeans function:
aa <- data.frame("1", "3", "1.5", "2.1", stringsAsFactors = FALSE)
aa[] <- lapply(aa, as.numeric)  # convert every column to numeric, keeping the data.frame
colMeans(aa)
mean(colMeans(aa))  # if you want the average across all columns
You must create the data.frame using stringsAsFactors = FALSE (the default since R 4.0.0); otherwise each character value becomes a separate one-level factor, and the numeric representation of such a factor is its underlying integer code, which is 1 for every value.
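A minimal illustration of that factor pitfall:
# In R < 4.0, data.frame("1.5") stores a one-level factor by default
f <- factor("1.5")
as.numeric(f)                # 1   -- the underlying integer code, not the value
as.numeric(as.character(f))  # 1.5 -- convert via character to get the value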

R inner join different data types

I was wondering if there is a way, or maybe another package that uses SQL queries to manipulate data frames, so that I don't have to convert numeric variables to strings/characters.
library(dplyr)  # for left_join
input_key <- c(9061, 8680, 1546, 5376, 9550, 9909, 3853, 3732, 9209)
output_data <- data.frame(input_key)
answer_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", "Chocolate", "Donuts", "Juice")
answer_data <- data.frame(cbind(input_key, answer_product), stringsAsFactors = FALSE)
left_join(output_data,answer_data, by = "input_key")
The left_join function from dplyr also works with numeric keys.
I think your problem comes from the cbind function, because its output is a matrix, which can only store one type of data. In your case the numeric values are coerced to character.
Unlike a matrix, a data.frame can store columns of different types, like a list.
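A quick way to see that coercion (a minimal illustration):
# cbind() on mixed types builds a matrix, which can hold only one type
m <- cbind(1:3, c("a", "b", "c"))
typeof(m)  # "character" -- the numbers were coerced to strings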
From your code, the key column is converted to character:
> str(answer_data)
'data.frame': 9 obs. of 2 variables:
$ input_key : chr "9061" "8680" "1546" "5376" ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
If instead you construct the data.frame with:
answer_data_2 <- data.frame(
input_key = input_key,
answer_product = answer_product,
stringsAsFactors = FALSE
)
the key column stays numeric:
> str(answer_data_2)
'data.frame': 9 obs. of 2 variables:
$ input_key : num 9061 8680 1546 5376 9550 ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
and
left_join(output_data, answer_data_2, by = "input_key")
works with the numeric keys.

SAS Index function?

Can anyone explain why the below piece of code gives two different values?
data _null_;
  length a b $14;
  a = 'ABC.DEF (X=Y)';
  b = 'X=Y';
  x = index(a,b);
  y = index('ABC.DEF (X=Y)','X=Y');
  put x y;
run;
0 10
Thanks.
It seems this is an exact copy of the example on the SAS website, so it would have been helpful if you had looked for an answer there first.
This is their explanation:
Example 2:
Removing Trailing Spaces When You Use the INDEX Function with the TRIM Function
The following example shows the results when you use the INDEX function with and without the TRIM function. If you use INDEX without the TRIM function, leading and trailing spaces are considered part of the excerpt argument. If you use INDEX with the TRIM function, TRIM removes trailing spaces from the excerpt argument as you can see in this example. Note that the TRIM function is used inside the INDEX function.
options nodate nostimer ls=78 ps=60;
data _null_;
  length a b $14;
  a='ABC.DEF (X=Y)';
  b='X=Y';
  q=index(a,b);
  w=index(a,trim(b));
  put q= w=;
run;
SAS writes the following output to the log:
q=0 w=10
Added based on mjsqu's comment:
data _null_;
  length a b $14 c $3;
  a='ABC.DEF (X=Y)';
  b='X=Y';
  c='X=Y';
  x=index(a,b);
  y=index(a,c);
  z=index(a,trim(b));
  d = "|" || a || "|";
  e = "|" || b || "|";
  f = "|" || c || "|";
  put d=;
  put e=;
  put f=;
  put x= y= z=;
run;
d=|ABC.DEF (X=Y) |
e=|X=Y |
f=|X=Y|
x=0 y=10 z=10
You can see that b has a trailing space, which is part of the string that the INDEX function looks for. Since in string a the X=Y is followed by ) and not a space, the padded search string is not found, which is why x = 0 here and q = 0 in the documentation example. You can also see that if you declare the variable with the actual length of the string you want to look for ($3 for c here), it gives you the same outcome as TRIM.

Read only n-th column of a text file which has no header with R and sqldf

I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, with only a 32-bit system available). The file has no header, so the code in the thread above didn't work, and I decided to write a new post.
Do you have an idea how to solve this problem?
I thought about something like the following, but answers using fread or read.table are also OK:
MyConnection <- file("path/file.txt")
df <- sqldf("select V1, V100, V1000, V235612 from MyConnection",
            file.format = list(header = FALSE, sep = " "))
You can use substr to specify the start position and length of the columns you want to read if they are fixed width:
library(sqldf)
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
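For instance, a minimal sketch of building the statement with paste (the starting positions and widths here are made up for illustration):
# Hypothetical fixed-width layout: (start, width) pairs for the wanted columns
cols <- list(var1 = c(1, 1), var2 = c(3, 3))
sel <- paste0("substr(V1, ", sapply(cols, `[`, 1), ", ",
              sapply(cols, `[`, 2), ") ", names(cols), collapse = ", ")
qry <- paste("select", sel, "from myfile")
qry
#> [1] "select substr(V1, 1, 1) var1, substr(V1, 3, 3) var2 from myfile"
Passing the constructed string to sqldf(qry) then reads only those slices of each row.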