Using SQLDF to select specific values from a column - sql

SQLDF newbie here.
I have a data frame which has about 15,000 rows and 1 column.
The data looks like:
cars
autocar
carsinfo
whatisthat
donnadrive
car
telephone
...
I wanted to use the package sqldf to loop through the column and
pick all values which contain "car" anywhere in their value.
However, the following code generates an error.
> sqldf("SELECT Keyword FROM dat WHERE Keyword="car")
Error: unexpected symbol in "sqldf("SELECT Keyword FROM dat WHERE Keyword="car"
There is no unexpected symbol, so I'm not sure whats wrong.
so first, I want to know all the values which contain 'car'.
then I want to know only those values which contain just 'car' by itself.
Can anyone help.
EDIT:
allright, there was an unexpected symbol, but it only gives me just car and not every
row which contains 'car'.
> sqldf("SELECT Keyword FROM dat WHERE Keyword='car'")
Keyword
1 car

Using = will only return exact matches.
You should probably use the like operator combined with the wildcards % or _. The % wildcard will match multiple characters, while _ matches a single character.
Something like the following will find all instances of car, e.g. "cars", "motorcar", etc:
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
And the following will match "car" or "cars":
sqldf("SELECT Keyword FROM dat WHERE Keyword like 'car_'")

This has nothing to do with sqldf; your SQL statement is the problem. You need:
dat <- data.frame(Keyword=c("cars","autocar","carsinfo",
"whatisthat","donnadrive","car","telephone"))
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
# Keyword
# 1 cars
# 2 autocar
# 3 carsinfo
# 4 car

You can also use regular expressions to do this sort of filtering. grepl returns a logical vector (TRUE / FALSE) stating whether or not there was a match or not. You can get very sophisticated to match specific items, but a basic query will work in this case:
#Using #Joshua's dat data.frame
subset(dat, grepl("car", Keyword, ignore.case = TRUE))
Keyword
1 cars
2 autocar
3 carsinfo
6 car

Very similar to the solution provided by #Chase. Because we do not use subset we do not need a logical vector and can use both grep or grepl:
df <- data.frame(keyword = c("cars", "autocar", "carsinfo", "whatisthat", "donnadrive", "car", "telephone"))
df[grep("car", df$keyword), , drop = FALSE] # or
df[grepl("car", df$keyword), , drop = FALSE]
keyword
1 cars
2 autocar
3 carsinfo
6 car
I took the idea from Selecting rows where a column has a string like 'hsa..' (partial string match)

Related

Passing Optional List argument from Django to filter with in Raw SQL

When using primitive types such as Integer, I can without any problems do a query like this:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(pk)s ISNULL OR id %(pk)s''', params={'pk': 1})
Which would either return row with id = 1 or it would return all rows if pk parameter was equal to None.
However, when trying to use similar approach to pass a list/tuple of IDs, I always produce a SQL syntax error when passing empty/None tuple, e.g. trying:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(ids)s ISNULL OR id IN %(ids)s''', params={'ids': (1,2,3)})
works, but passing () produces SQL syntax error:
psycopg2.ProgrammingError: syntax error at or near ")"
LINE 1: SELECT count(*) FROM account WHERE () ISNULL OR id IN ()
Or if I pass None I get:
django.db.utils.ProgrammingError: syntax error at or near "NULL"
LINE 1: ...LECT count(*) FROM account WHERE NULL ISNULL OR id IN NULL
I tried putting the argument in SQL in () - (%(ids)s) - but that always breaks one or the other condition. I also tried playing around with pg_typeof or casting the argument, but with no results.
Notes:
the actual SQL is much more complex, this one here is a simplification for illustrative purposes
as a last resort - I could alter the SQL in Python based on the argument, but I really wanted to avoid that.)
At first I had an idea of using just 1 argument, but replacing it with a dummy value [-1] and then using it like
cursor.execute(sql='''SELECT ... WHERE -1 = any(%(ids)s) OR id = ANY(%(ids)s)''', params={'ids': ids if ids else [-1]})
but this did a Full table scan for non empty lists, which was unfortunate, so a no go.
Then I thought I could do a little preprocessing in python and send 2 arguments instead of just the single list- the actual list and an empty list boolean indicator. That is
cursor.execute(sql='''SELECT ... WHERE %(empty_ids)s = TRUE OR id = ANY(%(ids)s)''', params={'empty_ids': not ids, 'ids': ids})
Not the most elegant solution, but it performs quite well (Index scan for non empty list, Full table scan for empty list - but that returns the whole table anyway, so it's ok)
And finally I came up with the simplest solution and quite elegant:
cursor.execute(sql='''SELECT ... WHERE '{}' = %(ids)s OR id = ANY(%(ids)s)''', params={'ids': ids})
This one also performs Index scan for non empty lists, so it's quite fast.
From the psycopg2 docs:
Note You can use a Python list as the argument of the IN operator using the PostgreSQL ANY operator.
ids = [10, 20, 30]
cur.execute("SELECT * FROM data WHERE id = ANY(%s);", (ids,))
Furthermore ANY can also work with empty lists, whereas IN () is a SQL syntax error.

Regular expression in dbGetQuery does not work

I want to get all rows in my database where a condition with regular expressions is met. The variable should start with "J12", "J13", "J14" or "J15".
This was my attempt:
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz' AND [xyz_DIAG] LIKE '^J1[2-5]' ")
Then a data.frame with 0 rows is returned.
When I send the query
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz'")
I get a quite large data.frame and then I call
Data %>% setDT %>% .[str_detect(xyz_DIAG, "^J1[2-5]")] and I get the expected result because in fact there are many rows that fulfill that regexp. Have I done something wrong?
For the time being, REGEXP operator has not been added to RSQLITE, see this pull request.
You thus need to "unwrap" the regex and use ORed LIKE:
Data <- dbGetQuery(db,
"SELECT * FROM 'XXX.XXXX.XXX'
WHERE TYPE = 'xyz' AND ([xyz_DIAG] LIKE 'J12%' OR [xyz_DIAG] LIKE 'J13%' OR [xyz_DIAG] LIKE 'J14%' OR [xyz_DIAG] LIKE 'J15%') ")

query using string in PyTables 3

I have a table:
h5file=open_file("ex.h5", "w")
class ex(IsDescription):
A=StringCol(5, pos=0)
B=StringCol(5, pos=1)
C=StringCol(5, pos=2)
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
table=h5file.root.table
rows=[
('abc', 'bcd', 'dse'),
('der', 'fre', 'swr'),
('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()
I am trying to query as per below:
find='swr'
creteria='B'
if creteria=='B':
condition='B'
else:
condition='C'
value=[x['A'] for x in table.where("""condition==find""")]
print(value)
It returns:
ValueError: there are no columns taking part in condition condition==find
Is there a way to use condition as a column name in above query?
Thanks in advance.
Yes, you can use Pytables .where() to search based on a condition. The problem is how you constructed your query for the table.where(condition). See Note about strings under Table.where() in the Pytables Users Guide:
A special care should be taken when the query condition includes string literals. ... Python 3 strings are unicode objects.
in Python 3, “condition” should be defined like this:
condition = 'col1 == b"AAAA"'
The reason is that in Python 3 “condition” implies a comparison between a string of bytes (“col1” contents) and an unicode literal (“AAAA”).
The simplest form of your query is shown below. It returns a subset of rows that match the condition. Note use of single and double quotes for string and unicode:
query_table = table.where('C=="swr"') # search in column C
I rewrote your example as best I could. See below. It shows several ways to enter the condition. I'm not smart enough to figure out how to combine your creteria and find variables into a single condition variable with string and unicode characters.
from tables import *
class ex(IsDescription):
A=StringCol(5, pos=0)
B=StringCol(5, pos=1)
C=StringCol(5, pos=2)
h5file=open_file("ex.h5", "w")
table=h5file.create_table('/', 'table', ex, "Passing string as column name")
## table=h5file.root.table
rows=[
('abc', 'bcd', 'dse'),
('der', 'fre', 'swr'),
('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()
find='swr'
query_table = table.where('C==find')
for row in query_table :
print (row)
print (row['A'], row['B'], row['C'])
value=[x['A'] for x in table.where('C == "swr"')]
print(value)
value=[x['A'] for x in table.where('C == find')]
print(value)
h5file.close()
Output shown below:
/table.row (Row), pointing to row #1
b'der' b'fre' b'swr'
[b'der']
[b'der']

Putting output from sql query into another query using R environment

I am wondering what approach should have been selected to perform action from title. I am using ODBC connection and what I get from first sql query are like 40-50 rows in one column. What I want is to put this output as a values in to search for.
How should i treat this? Like a array or separated variables? I still do not know R well so just need to know where to search for.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
and characters are organized like this after this operations:
'c(1234567890, , 1234567890, 1234567890)'
and almost all looks fine and fit into my query besides additional c character at the beginning and missing apostrophes.I try to use gsub function but did not work in way I want.
You may likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
Is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If yuo have space as first or last element of the string you can use this code:
`list<-c(" first element is a space","try","this","last element is a space ")` #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think You are expecting some thing like shown below code,
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"

Pig: Cast error while grouping data

This is the code that I am trying to run. Steps:
Take an input (there is a .pig_schema file in the input folder)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields
The code is as follows:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
When I do the grouped, it gives the following error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
keywords is a chararray and Pig should be able to group on a chararray. Any ideas?
EDIT:
Input file:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema file
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
Pig is not able to identify the value of keywords as chararray.Its better to go for field naming during initial load, in this way we are explicitly stating the field types.
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
UPDATE :
Tried the below snippet with updated .pig_schema to introduce score, used '\t' as separator and tried the below steps for the input shared.
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
Would suggest to use unique alias names for better readability and maintainability.
Output :
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})