I am very new to R. I wanted to use a SQL query to get data with R. (I am using Athena, but I think this doesn't matter.)
con <- dbConnect(RAthena::athena(),
s3_staging_dir = 's3://bla/bla/'
)
df <- tbl(con, sql("SELECT * FROM db.my_data"))
My problem is that df is not a data frame. When I call names(df), for example, I don't get the columns (as I would in Python) but "src" "ops" instead. I don't know what these mean (they are not columns). How can I convert df into a data frame in R?
You can use the function dbGetQuery from the DBI package.
From the documentation:
Description: Returns the result of a query as a data frame.
dbGetQuery() comes with a default implementation (which should work
with most backends) that calls dbSendQuery(), then dbFetch(), ensuring
that the result is always freed by dbClearResult().
Implementation for your example:
df <- dbGetQuery(con, "SELECT * FROM db.my_data")
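For reference, the manual sequence that dbGetQuery() wraps (as the description above says) would look roughly like this, reusing the con object from the question:
res <- dbSendQuery(con, "SELECT * FROM db.my_data")
df <- dbFetch(res)   # returns a plain data.frame
dbClearResult(res)   # frees the result set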
I am attempting to write a script that will allow me to insert values from an uploaded dataframe into a table inside an Oracle DB, but my issue lies with:
too many columns to hard-code
columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they sync with the columns of my dataframe, and from there use an INSERT ... VALUES SQL statement to load the values from the dataframe into the ODS table.
So far, these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config
df = pd.read_excel("Employee_data.xlsx")
conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()  # cursor() is a method call
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])
#instead of using df.columns I use:
rows = [tuple(x) for x in df.values]
which prints my ODS column names and conveniently stores the rows from the df in a list, but I'm at a loss for how to insert these into the ODS table. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that would mean I'd have to hard-code everything, which wouldn't be scalable. I'm hoping I can get some insight here. It's just difficult since the columns aren't one-to-one and there is some compression/collapsing of columns from the DF to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I am always given the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is really strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of columns that share the same name; so after running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I was able to get a list of columns that the two datasets share.
I also tried to use this as my engine:
engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c:types.VARCHAR(df[c].str.len().max())
for c in df.columns[df.dtypes=='object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con = engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
I am trying to find all the stock records where the yield is greater than the median for all stocks using sqldf, but I am getting the error message shown below.
I have tried using the actual number 2.39 and it works, but I have not been successful substituting a variable to make the query dynamic. Maybe a sub-select would be better?
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- sqldf("SELECT a.*
FROM df3 a
WHERE (a.Yield > mYd)
;")
Error in rsqlite_send_query(conn@ptr, statement) : no such column: mYd
The error stems from sqldf's inability to find a column in df3 called mYd. It needs to find a column in the data frame for every corresponding column referenced in your query. Try adding the mYd variable to your df3 data frame as a proper column:
df3$mYd <- median(df3$Yield, na.rm=TRUE)
df4 <- sqldf("SELECT * FROM df3 WHERE Yield > mYd;")
Note that you don't really need to alias df3 here since it is the only table in the query, and you aren't generating any computed columns.
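If you would rather not add a column to df3, another option (a sketch using base R string building, not part of the answer above) is to interpolate the value into the query text before passing it to sqldf:
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- sqldf(sprintf("SELECT * FROM df3 WHERE Yield > %f", mYd))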
I want to know why passing a parameter to a SQL query doesn't work.
See the sample of my code below:
Vector <- c(123, 436, 765)
for (i in 1:length(Vector)) {
  Result <- sqldf("select * from DF where studentID =", Vector[i])
  print(Result)
}
Note that my DF has studentId with data type: integer
Thanks
I prefer plain R, but if you want to use SQL-like operations on a data.frame you should consider using dplyr, which is pretty much mainstream nowadays.
Here is a working example of an operation like the one you mentioned.
The pipe %>% simply takes what is on the left and passes it as the first argument to the function on the right, which makes nested calls easier to read.
require(dplyr)
iris
iris <- tbl_df(iris)
iris %>% filter(Sepal.Length<=5) %>% select(Species)
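Applied to the question's setup (a sketch assuming DF and Vector as defined above, with the column spelled studentID), the whole loop collapses to a single filter:
library(dplyr)
Result <- DF %>% filter(studentID %in% Vector)
print(Result)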
I am rather new to using SQL statements, and am having a little trouble using them to select the desired columns from a large table and pull them into R.
I want to take a csv file and read selected columns into R, in particular every 9th and 10th column. In R, something like:
read.csv.sql("myfile.csv", sql("select * from file [EVERY 9th and 10th COLUMN]"))
My trawl of the internet suggests that selecting every nth row could be done with an SQL statement using MOD something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv
read.csv would be adequate for this:
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
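To see what that pattern expands to, here is a sketch for a hypothetical 12-column file:
rep(rep(list("NULL", NA), c(8, 2)), length = 12)
# elements 1-8 are "NULL" (skip), 9-10 are NA (read), 11-12 recycle back to "NULL"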
sqldf
To use sqldf, replace the last line with these:
nms <- names(DF1)
vars <- toString(nms[is.na(colClasses)])
DF <- fn$read.csv.sql(myfile, "select $vars from file")
UPDATE: switched to read.csv.sql
UPDATE 2: correction.
Say I have a dataframe df with two or more columns. Is there an easy way to use unique() or another R function to create a subset of unique combinations of two or more columns?
I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.
It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:
uniques <- as.data.frame(ftable(df$var1, df$var2))
unique works on data.frames, so unique(df[c("var1","var2")]) should be what you want.
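For example, with a small invented data frame:
df <- data.frame(var1 = c(1, 1, 2, 2), var2 = c("a", "a", "b", "c"))
unique(df[c("var1", "var2")])
#   var1 var2
# 1    1    a
# 3    2    b
# 4    2    c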
Another option is distinct from dplyr package:
df %>% distinct(var1, var2) # or distinct(df, var1, var2)
Note:
For older versions of dplyr (< 0.5.0, 2016-06-24) distinct required additional step
df %>% select(var1, var2) %>% distinct
(or oldish way distinct(select(df, var1, var2))).
@Marek's answer is obviously correct, but may be outdated. The current dplyr version (0.7.4) allows for even simpler code:
Simply use:
df %>% distinct(var1, var2)
If you want to keep all columns, add .keep_all = TRUE:
df %>% distinct(var1, var2, .keep_all = TRUE)
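A quick sketch of the difference, using an invented third column:
library(dplyr)
df <- data.frame(var1 = c(1, 1, 2), var2 = c("a", "a", "b"), other = c(10, 20, 30))
df %>% distinct(var1, var2)                   # drops 'other', keeps unique var1/var2 pairs
df %>% distinct(var1, var2, .keep_all = TRUE) # keeps 'other' from the first matching row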
To KEEP all other variables in df use this:
unique_rows <- !duplicated(df[c("var1","var2")])
unique.df <- df[unique_rows,]
Another, less recommended method is using row.names() (see David's comment below):
unique_rows <- row.names(unique(df[c("var1","var2")]))
unique.df <- df[unique_rows,]
In addition to the answers above, the data.table version:
library(data.table)
setDT(df)
unique_dt = unique(df, by = c('var1', 'var2'))
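Note that setDT(df) converts df by reference, so the original df becomes a data.table too; if you want to leave it untouched, a copy-based sketch would be:
library(data.table)
unique_dt = unique(as.data.table(df), by = c('var1', 'var2'))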