I am rather new to using SQL statements, and am having a little trouble using them to select the desired columns from a large table and pull them into R.
I want to take a CSV file and read selected columns into R; in particular, every 9th and 10th column. Something like:
read.csv.sql("myfile.csv", sql(select * from file [EVERY 9th and 10th COLUMN])
My trawl of the internet suggests that selecting every nth row can be done with an SQL statement using MOD, something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv would be adequate for this:
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
To use sqldf, replace the last line with these:
nms <- names(DF1)
# comma-separated string of the wanted column names
vars <- toString(nms[is.na(colClasses)])
# fn$ (from gsubfn) substitutes the value of vars for $vars in the query
DF <- fn$read.csv.sql(myfile, "select $vars from file")
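To see what gets interpolated, here is a small self-contained sketch (assuming, hypothetically, a 30-column file with default names V1..V30):
nms <- paste0("V", 1:30)  # hypothetical column names
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length = 30)
toString(nms[is.na(colClasses)])
## [1] "V9, V10, V19, V20, V29, V30"
so the query actually run would be select V9, V10, V19, V20, V29, V30 from file.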
UPDATE: switched to read.csv.sql
UPDATE 2: correction.
I'm new to Python. I have multiple data frames and want to filter each one down to the rows where a given column contains the value XXX.
Below is my code:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
for d in MasterFiles:
    for c in ColumName:
        d = d.loc[d[c] == 'XXX']
It is not working; please help with this.
You need to gather the output and append it to a new DataFrame:
import pandas as pd

MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
res_df = pd.DataFrame()
for d in MasterFiles:
    for c in ColumName:
        # reassign: concatenation returns a new DataFrame rather than modifying in place
        res_df = pd.concat([res_df, d.loc[d[c] == 'XXX']])
# the results
res_df.head()
I am not sure whether I am understanding your question correctly, so let me rephrase it here.
You have three tasks:
first, to loop through each pandas data frame;
second, to loop through each column in your ColumName list; and
third, to return the data frame rows that contain the value Surabhi - DCL - Unsecured in the columns named in the ColumName list.
If I am interpreting this correctly, this is how I would work on your issue.
import pandas as pd

MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
## list to store the row-filtered data frames
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'Surabhi - DCL - Unsecured'])
## Assuming row-wise concatenation,
## i.e., using the same column names to join the data
df = pd.concat(df_temp, axis=0, ignore_index=True)
## df is the data frame you need
I have a list of IDs that are not associated with any actual data. I have a SQL database that has a table for each of these IDs, and those tables have data that I would like to combine together into one large data frame based on the list of IDs that I have. I figured a for loop would be needed for this, but I haven't been able to get it to work properly.
For example, I have a list of IDs:
1,2,3,4,5
I have a SQL database with tables for each of these, and they also have other data associated with the IDs. Each ID has multiple rows and columns.
I would like my end product to be the combination of those rows and columns for the list of IDs to be in a single data frame in r. How could I do this? What is the most efficient way to do so?
# Example data set
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)
for (i in 1:length(ID)) {
  ID[i] <- dbReadTable(mydb, ID[i])
}
Thank you so much for your time.
I'll expand on my comment to answer the question.
IDs <- lapply(setNames(nm=ID), function(i) dbReadTable(mydb, i))
and then one of:
## base R
IDs <- Map(function(x, nm) transform(x, id = nm), IDs, names(IDs))
DF <- do.call(rbind, IDs)
## dplyr
DF <- dplyr::bind_rows(IDs, .id = "id")
## data.table
DF <- data.table::rbindlist(IDs, idcol = "id")
The addition of the "id" column is to easily differentiate the rows based on the source ID. If the table already includes that, then you can omit the Map (base) and .id/idcol arguments.
(This assumes, btw, that all tables have the same exact structure: same column names and same data types.)
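As a self-contained illustration, here is a sketch against a hypothetical in-memory RSQLite database with tables named "1" through "5" standing in for your real connection:
library(DBI)
mydb <- dbConnect(RSQLite::SQLite(), ":memory:")
# create five toy tables, one per ID, all with the same structure
for (i in 1:5) dbWriteTable(mydb, as.character(i), data.frame(x = rnorm(3)))
ID <- as.character(1:5)
IDs <- lapply(setNames(nm = ID), function(i) dbReadTable(mydb, i))
DF <- data.table::rbindlist(IDs, idcol = "id")
dbDisconnect(mydb)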
I am very new to R. I wanted to use a SQL query to get data with R. (I am using Athena, but I think this doesn't matter.)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = 's3://bla/bla/'
)
df <- tbl(con, sql("SELECT * FROM db.my_data"))
My problem is that df is not a data frame. So when I do names(df), for example, I don't get the columns (as I would with Python) but get "src" "ops" instead. I don't know what this means (these are not columns). How can I convert df so it is a data frame in R?
You can use the function dbGetQuery from the package DBI.
From the documentation:
Returns the result of a query as a data frame. dbGetQuery() comes with a default implementation (which should work with most backends) that calls dbSendQuery(), then dbFetch(), ensuring that the result is always freed by dbClearResult().
Implementation for your example:
df <- dbGetQuery(con, "SELECT * FROM db.my_data")
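As an aside, tbl() gives you a lazy dbplyr table (which is why names() shows the internal "src" and "ops" components); if you prefer to keep that approach, dplyr::collect() will also materialise it into a local data frame:
library(dplyr)
df <- tbl(con, sql("SELECT * FROM db.my_data")) %>% collect()
names(df)  # now returns the actual column names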
I am trying to find all the stock records where the yield is greater than the median for all stocks using sqldf, but I am getting the error message below.
I have tried using the actual number 2.39 and it works, but I have not been successful substituting a variable to make it dynamic. Maybe a sub-select would be better?
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- sqldf("SELECT a.*
              FROM df3 a
              WHERE (a.Yield > mYd)
              ;")
Error in rsqlite_send_query(conn@ptr, statement) : no such column: mYd
The error stems from sqldf's inability to find a column in df3 called mYd. It needs to find a column in the data frame for every corresponding column referenced in your query. Try adding the mYd variable to your df3 data frame as a column proper:
df3$mYd <- median(df3$Yield, na.rm=TRUE)
df4 <- sqldf("SELECT * FROM df3 WHERE Yield > mYd;")
Note that you don't really need to alias df3 here since it is the only table in the query, and you aren't generating any computed columns.
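Alternatively, sqldf loads gsubfn, whose fn$ prefix interpolates the value of an R variable into the query string, so you don't need the extra column (a sketch using the same mYd as above):
library(sqldf)
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- fn$sqldf("SELECT * FROM df3 WHERE Yield > $mYd")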
How do I add another column with a moving difference of Column2?
For example: I want to add a column holding the following values: (0, -372706.6, -284087.1, -119883.7, etc.)
Here's a way to go about it.
## For a small dataset
x <- data.frame(matrix(c(0, 12, 1, 10, 2, 9.5, 3, 8, 4, 7, 5, 5, 6, 2),
                       nrow = 7, ncol = 2, byrow = TRUE))
names(x) <- c("Time", "Count")
# Diff[i] holds Count[i-1] - Count[i]; the first row has no predecessor
x[1, "Diff"] <- NA
x[2:nrow(x), "Diff"] <- rev(diff(rev(x$Count)))
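Since rev(diff(rev(v))) is just -diff(v), an equivalent one-liner (same data) is:
x$Diff <- c(NA, -diff(x$Count))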
There is a way to do it with the plyr package as well.