Search for any of a list of strings inside another string - sql

I need to identify records with valid addresses by comparing the address fields against a list of street-like words.
So the code would look something like:
set street_list = 'STREET', 'ROAD', 'AVENUE', 'DRIVE', 'WAY', 'PLACE' (etc.)
;
create table [new table] as
select *
from [source table]
where [address line 1] (contains any word from STREET_LIST) or
[address line 2] (contains any word from STREET_LIST) or
[address line 3] (contains any word from STREET_LIST)
;
Is this possible?
Using LostReality's regexp suggestion, I got as far as:
select *
from [source table]
where upper([address line 1]) regexp '.* STREET.*|.* ST.*|.* ROAD.*|.* RD.*|.* CLOSE.*|.* LANE.*|.* LA.*|.* AVENUE.*|.* AVE.*|.* DRIVE.*|.* DR.*|.* HOUSE.*|.* WAY.*|.* PLACE.*|.* SQUARE.*|.* WALK.*|.* GROVE.*|.* GREEN.*|.* PARK.*|.* PK.*|.* CRESCENT.*|.* TERRACE.*|.* PARADE.*|.* GARDEN.*|.* GARDENS.*|.* COURT.*|.* COTTAGES.*|.* COTTAGE.*|.* MEWS.*|.* ESTATE.*|.* RISE.*|.* FARM.*'
;
and it seems to work.
But I have two small problems with it:
1) how do I write the regexp on more than one line so it's easier to read?
2) is there any way of putting that regexp into a macro variable because I want to check 5 address lines and I don't want 5 copies of the same expression.
Thanks

Here is a solution for Hive. You can put the regexp pattern in a variable and wrap the test in a macro. Your template, fixed:
set hivevar:street_list ='STREET|ST|ROAD|RD|CLOSE|LANE|LA|AVENUE|AVE|DRIVE|DR|HOUSE|WAY|PLACE|SQUARE|WALK|GROVE|GREEN|PARK|PK|CRESCENT|TERRACE|PARADE|GARDEN|GARDENS|COURT|COTTAGES|COTTAGE|MEWS|ESTATE|RISE|FARM';
--boolean macro for using in the WHERE
create temporary macro contains_word(s string) (upper(s) rlike ${hivevar:street_list} ) ;
with some_table as ( --use your table instead of this synthetic example
select stack(2,'some string containing STREET and WALK',
'some string containing something else') as str
) --use your table instead of this synthetic example
--use macro in your query
select str from some_table
where contains_word(str);
Result:
OK
some string containing STREET and WALK
Time taken: 0.229 seconds, Fetched: 1 row(s)
Combine the macro calls with OR, as in your question:
where contains_word(address_line_1) OR contains_word(address_line_2) ...
Hope that gives you the idea.
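One thing to watch with a plain alternation: rlike matches substrings, so short tokens such as ST or LA can also match inside longer words (STRING, for example). Below is a minimal sketch of a whole-word variant, written in Python for illustration rather than Hive. The names STREET_WORDS and contains_street_word are hypothetical, the word list is shortened, and the sample rows are made up.

```python
import re

# Hypothetical, shortened word list; extend it with the full set from the answer above.
STREET_WORDS = ["STREET", "ST", "ROAD", "RD", "LANE", "LA", "AVENUE", "AVE", "WALK"]

# \b anchors each alternative at word boundaries, so "ST" does not match inside "STRING".
STREET_PATTERN = re.compile(r"\b(?:" + "|".join(STREET_WORDS) + r")\b")


def contains_street_word(address_line):
    """Return True if the upper-cased line contains one of the words as a whole word."""
    if address_line is None:
        return False
    return STREET_PATTERN.search(address_line.upper()) is not None


# Made-up rows standing in for (address line 1, address line 2, address line 3).
rows = [
    ("12 High St", None, None),
    ("some string containing something else", None, None),
    ("Flat 3", "Rose Walk", None),
]

# Keep a row if any of its address lines matches, mirroring the OR in the WHERE clause.
valid = [row for row in rows if any(contains_street_word(line) for line in row)]
print(valid)
```

Java regular expressions also support \b, so the same idea should carry over to the Hive pattern, although the backslash may need escaping inside the Hive string literal.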

Related

Postgresql , updating existing table row with another tables data

I am trying to update a null column using a value from another table, but it doesn't seem to work. I tried the statements below:
SET
"Test name "= "Test"(
SELECT Transformertest.Test,Transformertest.TestID
FROM public.Transformertest WHERE TestID='Tes3')
WHERE test2table.Type='Oil Immersed Transformers'
UPDATE
public.test2table
SET
"Test name" = subquery."Test"
FROM
(
SELECT
"Test"
FROM Transformertest WHERE "TestID"='Tes2'
) AS subquery
WHERE
"Type"='Auto Transformer' AND "Phase"='3' AND "Rated Frequency"='60';
Don't use spaces in column names.
Integers don't need to be quoted.
What you need to do here (assuming Phase and Rated Frequency are integers) is
remove the unnecessary double quotes and the spaces in the column names:
UPDATE
public.test2table
SET
test_name = subquery.Test
FROM
(
SELECT
test
FROM Transformertest WHERE Test_ID='Tes2'
) AS subquery
WHERE
Type='Auto Transformer' AND Phase=3 AND Rated_Frequency=60;
This should work now.
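If you want to try the idea without a PostgreSQL server, here is a small sketch using Python's built-in sqlite3 module. It is not the UPDATE ... FROM form shown above: it swaps in a scalar subquery in the SET clause, which works the same way for a single looked-up value. The renamed columns follow the advice above, and the sample rows ('Ratio test', 'Oil test') are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names without spaces or quotes; the sample data is invented for illustration.
conn.executescript("""
    CREATE TABLE transformertest (test_id TEXT, test TEXT);
    CREATE TABLE test2table (type TEXT, phase INTEGER, rated_frequency INTEGER, test_name TEXT);
    INSERT INTO transformertest VALUES ('Tes2', 'Ratio test'), ('Tes3', 'Oil test');
    INSERT INTO test2table VALUES ('Auto Transformer', 3, 60, NULL),
                                  ('Oil Immersed Transformers', 1, 50, NULL);
""")

# Scalar subquery instead of UPDATE ... FROM: look up one value and assign it
# to every row that satisfies the WHERE clause. Integers are compared unquoted.
conn.execute("""
    UPDATE test2table
    SET test_name = (SELECT test FROM transformertest WHERE test_id = 'Tes2')
    WHERE type = 'Auto Transformer' AND phase = 3 AND rated_frequency = 60
""")

rows = conn.execute("SELECT type, test_name FROM test2table ORDER BY type").fetchall()
print(rows)
```

Only the row matching all three conditions gets a value; the other row keeps its NULL.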

Parsing SQL Queries by semicolon

I'm trying to read, using Scala, a sql file full of queries to be executed, however, I'm struggling to parse special cases that contain a semicolon that is not the terminator. For example, if the query is:
SELECT * FROM table WHERE name LIKE "%;%",
It separates this into two statements even though it should be one.
Assuming that the query terminator is always a ; at the end of a line, we can make good use of
.split(";\\s*\\n"), which matches the ; followed by zero or more whitespace characters and a newline character,
or .split("(?m);\\s*$"), which uses the inline (?m) multiline modifier so that $ matches the end of each line.
Sample Code:
val a = """SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return ... TEXT_I-WANT
) = 'TEXT_I-WANT'
;
SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return ... TEXT_I-WANT
) = 'TEXT_I-WANT'
;""".split(";\\s*\\n")
println(a.mkString("Next Query:"))
If you prefer to match, this pattern can do a good job too: "(?m)^[\\s\\S]*?;$"
(add additional whitespace \s as needed)
Full Sample:
import scala.util.matching.Regex
object Demo {
def main(args: Array[String]): Unit = {
val pattern = new Regex("(?m)^[\\s\\S]*?;\\s*$")
val str = """SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return ... TEXT_I-WANT
) = 'TEXT_I-WANT'
;
SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return ... TEXT_I-WANT
) = 'TEXT_I-WANT'
;"""
println((pattern findAllIn str).mkString("\n----------------\n"))
}
}
Try the regex ^.*?;$ with the m (multiline) option, so that ^ and $ match at the start and end of each line.
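For comparison, the same end-of-line splitting idea can be sketched in Python (not the Scala from the answers above). The sample script is made up, and the approach still assumes that a terminating ; is the last non-blank character on its line.

```python
import re

# Made-up script: some semicolons sit inside string literals, others end a statement.
sql_script = '''SELECT * FROM table WHERE name LIKE "%;%"
AND id > 10
;
SELECT 1;
SELECT 'a;b' AS x;
'''

# With re.MULTILINE, $ matches at the end of every line, so only a ";" that is the
# last non-blank character on its line is treated as a statement terminator.
parts = re.split(r";[ \t]*$", sql_script, flags=re.MULTILINE)

# Drop the empty remainder after the final terminator and trim surrounding whitespace.
statements = [part.strip() for part in parts if part.strip()]

for statement in statements:
    print(statement)
    print("----")
```

A ; that ends a line inside a multi-line string literal or a comment would still be treated as a terminator, so this is a heuristic rather than a real SQL parser.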

Putting output from sql query into another query using R environment

I am wondering which approach to take for what the title describes. I am using an ODBC connection, and the first SQL query returns about 40-50 rows in one column. I want to use this output as the values to search for in a second query.
How should I treat it? As an array or as separate variables? I still do not know R well, so I just need to know where to look.
Regards
------more explanation below----
I have list of 40-50 numbers of 10 digits each, organized in a column.
I am trying to do this:
list <- c(my_input)
sql_in <- paste0(list, collapse="")
After these operations the characters are arranged like this:
'c(1234567890, , 1234567890, 1234567890)'
Almost everything looks fine and fits into my query, apart from the additional c character at the beginning and the missing apostrophes. I tried to use the gsub function, but it did not work the way I wanted.
You can likely do this in one SQL call using a subquery. Notice in the call below that the result of
SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4)
is passed to the WHERE clause of the primary query. This is perfectly valid and will allow your query to execute entirely in SQL without having to do any intermediate steps in R.
(I use sqldf for simplicity of illustration, but this should work through just about any ODBC connection)
library(sqldf)
Gear <- data.frame(n_gear = 1:5)
sqldf(
"SELECT mpg, qsec, gear, wt
FROM mtcars
WHERE gear IN (SELECT n_gear
FROM Gear
WHERE n_gear IN (3,4))"
)
Try something like this:
list<-c("try","this") #The output from your first query
sql_in<-paste0(list, collapse="','")
The Output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
[1] "select * from table where table.var in ('try','this')"
If an element of the list has a space as its first or last character, you can use this code:
list<-c(" first element is a space","try","this","last element is a space ") #The output from your first query
Find space at first or last character
first_space<-substr(list, start = 1, stop = 1)==" "
last_space<-substr(list, start = nchar(list), stop = nchar(list))==" "
Remove spaces
list[first_space]<-substr(list[first_space], start = 2, stop = nchar(list[first_space]))
list[last_space]<-substr(list[last_space], start = 1, stop = nchar(list[last_space])-1)
sql_in<-paste0(list, collapse="','")
Your output
paste0("select * from table where table.var in ",paste("('",sql_in,"')",sep=''))
"select * from table where table.var in ('first element is a space','try','this','last element is a space')"
I think you are looking for something like the code below:
data <- dbGetQuery(con, "select column from yourfirsttable")
list <- paste(data$column, collapse="','")
result <- dbGetQuery(con, statement = sprintf("select * from yourresulttable where inv in ('%s')",list))
It's not entirely clear exactly what you're wanting to achieve here. For example, one use case just means you can do it all with a join. But I have cases where I don't know the values for the test without doing some computation. Then I do a separate query having created a query string thus:
> id <- 1:5
> paste0("SELECT * FROM table WHERE ID IN (", paste0(id, collapse = ","), ")")
[1] "SELECT * FROM table WHERE ID IN (1,2,3,4,5)"
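The answers above paste the values straight into the SQL text. An alternative worth knowing is to generate one placeholder per value and pass the values separately, so the driver handles the quoting. Here is a minimal sketch of that idea in Python with the built-in sqlite3 module (not R/ODBC); the table names invoices and wanted and all the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and data standing in for the two queries in the question.
conn.execute("CREATE TABLE invoices (inv TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("1234567890", 10.0), ("2345678901", 20.0), ("3456789012", 30.0)],
)
conn.execute("CREATE TABLE wanted (inv TEXT)")
conn.executemany("INSERT INTO wanted VALUES (?)", [("1234567890",), ("3456789012",)])

# First query: a single column of values to search for.
ids = [row[0] for row in conn.execute("SELECT inv FROM wanted ORDER BY inv")]

# Build one "?" per value and let the driver bind the values, so no manual quoting.
placeholders = ",".join("?" for _ in ids)
query = f"SELECT inv, amount FROM invoices WHERE inv IN ({placeholders}) ORDER BY inv"

result = conn.execute(query, ids).fetchall()
print(query)
print(result)
```

Whether your R database driver supports bound parameters in the same way is something to check in its documentation; the string-pasting versions above work regardless.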

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then extract from these matching lines two different fields that meet two separate field match criteria. Since the lines aren't structured well, I am loading them as a chararray.
When I try to run the following code - I get an error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple but that does not work
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note: my logs aren't structured well, so I can't change the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something other than tokenizing the line and iterating through the fields to find the ones I want.
Or maybe I should use joins?
Also, if I knew the structure of the line beforehand, with all fields as text, would loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try joins and post that code. Here is my working code, which gets me useful output, but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field from tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I will leave writing the regex part up to you.
data = FOREACH tokenized_lines
GENERATE
-- the third argument is the index of the capture group to return
REGEX_EXTRACT(tok_line, <first word regex goes here>, 1) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>, 1) as secondWord;
I hope this helps.
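To make the intended logic concrete, here is a rough sketch in Python (not Pig Latin) of "filter lines by a token, then pull two fields out of each matching line by pattern". The patterns, the function name extract_pair and the sample log lines are all made up, and the sketch assumes fields are separated by whitespace.

```python
import re

# Hypothetical tokens; replace them with the real line and field criteria.
LINE_TOKEN = "line_token1"
WORD1 = re.compile(r"\S*word_token1\S*")
WORD2 = re.compile(r"\S*word_token2\S*")


def extract_pair(line):
    """Return (field1, field2) for a line of interest, or None if the line is filtered out."""
    if LINE_TOKEN not in line:
        return None
    first = WORD1.search(line)
    second = WORD2.search(line)
    # A field that is absent from the line is reported as None rather than dropped.
    return (first.group(0) if first else None, second.group(0) if second else None)


# Made-up log lines for illustration.
log_lines = [
    "2020-01-01 line_token1 user=word_token1_alice file=word_token2_report",
    "2020-01-01 other entry word_token1_bob",
    "2020-01-02 line_token1 file=word_token2_summary",
]

pairs = [pair for pair in map(extract_pair, log_lines) if pair is not None]
print(pairs)
```

Working on the whole line with two patterns avoids tokenizing first, which is the same idea as the two REGEX_EXTRACT calls above.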
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?

SQL- Adding a condition

I am just starting to learn SQL.
How do you add a condition to a statement? I am trying to filter the results to the destination 'BNA', which is the airport code.
SELECT
CHARTER.CUS_CODE,
CHARTER.DESTINATION "AIRPORT",
CHARTER.CHAR_DATE,
CHARTER.CHAR_DISTANCE,
CHARTER.AC_NUMBER,
FROM C.CHARTER ;
WHERE DESTINATION = 'BNA' ;
Any hints in the right direction would be great.
The following is your query with the syntax corrected:
SELECT CHARTER.CUS_CODE,
CHARTER.DESTINATION "AIRPORT",
CHARTER.CHAR_DATE,
CHARTER.CHAR_DISTANCE,
CHARTER.AC_NUMBER
FROM CHARTER
WHERE DESTINATION = 'BNA';
The semicolon goes at the end only.
Get rid of "c." from the table name in your FROM clause. You might have been thinking of giving it an alias of "c"; if that's the case, you would put it after the table name (and then use it as a prefix for each field).
SELECT
CHARTER.CUS_CODE,
CHARTER.DESTINATION "AIRPORT",
CHARTER.CHAR_DATE,
CHARTER.CHAR_DISTANCE,
CHARTER.AC_NUMBER
FROM C.CHARTER
WHERE DESTINATION = 'BNA' ;
The ; character is a statement terminator; you only need one per SQL statement.
There is a ";" at the end of the FROM clause; remove it and try the SQL again. Pay attention to the double quotes around the AIRPORT text, too.
SELECT CHARTER.DESTINATION + 'AIRPORT '
FROM C.CHARTER
WHERE DESTINATION = 'BNA' ;
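If you want to experiment with this, here is a small self-contained sketch using Python's built-in sqlite3 module. It shows the corrected statement with the table alias placed after the table name and a single terminating semicolon. The column types and the sample rows are made up, and your own database may behave slightly differently from SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up schema and rows, only so the query has something to run against.
conn.execute(
    "CREATE TABLE CHARTER (CUS_CODE INTEGER, DESTINATION TEXT, CHAR_DATE TEXT, "
    "CHAR_DISTANCE REAL, AC_NUMBER TEXT)"
)
conn.executemany(
    "INSERT INTO CHARTER VALUES (?, ?, ?, ?, ?)",
    [
        (10011, "ATL", "2004-02-05", 936.0, "2289L"),
        (10016, "BNA", "2004-02-05", 320.0, "2778V"),
        (10014, "BNA", "2004-02-06", 1574.0, "4278Y"),
    ],
)

# The alias C goes after the table name; the WHERE clause follows FROM,
# and the only semicolon is at the very end of the statement.
query = """
SELECT C.CUS_CODE,
       C.DESTINATION AS AIRPORT,
       C.CHAR_DATE,
       C.CHAR_DISTANCE,
       C.AC_NUMBER
FROM CHARTER C
WHERE C.DESTINATION = 'BNA';
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only the rows whose DESTINATION is 'BNA' come back; the 'ATL' row is filtered out by the condition.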