Parsing SQL Queries by semicolon

I'm trying to read, using Scala, a SQL file full of queries to be executed; however, I'm struggling to parse special cases that contain a semicolon that is not the terminator. For example, if the query is:
SELECT * FROM table WHERE name LIKE "%;%"
it separates this into two statements even though it should be one.

Assuming that the query terminator is always a ; at the end of a line, we can make good use of
.split(";\\s*\\n"), matching the ; followed by zero or more whitespace characters and then a newline character,
or .split("(?m);\\s*$"), using the inline (?m) multiline modifier, which allows us to match the end of each line with $.
Sample Code:
val a = """SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return TEXT_I-WANT
) = 'TEXT_I-WANT'
;
SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return TEXT_I-WANT
) = 'TEXT_I-WANT'
;""".split(";\\s*\\n")
println(a.mkString("Next Query:"))
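For comparison, a minimal sketch (on a made-up two-statement input) showing that the second, multiline variant splits the same way:
val b = """SELECT 1;
SELECT 2 FROM t WHERE name LIKE "%;%"
;""".split("(?m);\\s*$")
println(b.mkString("Next Query:"))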
If you prefer to match rather than split, this pattern can do a good job too: "(?m)^[\\s\\S]*?;$"
(add additional whitespace \s as needed)
Full Sample:
import scala.util.matching.Regex
object Demo {
  def main(args: Array[String]): Unit = {
    val pattern = new Regex("(?m)^[\\s\\S]*?;\\s*$")
    val str = """SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return TEXT_I-WANT
) = 'TEXT_I-WANT'
;
SELECT * FROM table WHERE name LIKE "%;%"
AND regexp_replace(
'abcd1234df-TEXT_I-WANT' -- use your input column here instead
, '^[a-z0-9]{10}-(.*)\$' -- matches whole string, captures "TEXT_I-WANT" in \$1
, '\$1' -- inserts \$1 to return TEXT_I-WANT
) = 'TEXT_I-WANT'
;"""
    println((pattern findAllIn str).mkString("\n----------------\n"))
  }
}

Try the regex ^.*?;$ with the m option (so that $ also matches before each newline).
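A small Scala sketch of that suggestion (note: the s flag is added here as an assumption so that .*? can span multi-line statements; with m alone, . stops at each line break):
object LazyDemo {
  def main(args: Array[String]): Unit = {
    val pattern = "(?sm)^.*?;$".r
    val sql = """SELECT * FROM table WHERE name LIKE "%;%"
;
SELECT 2
;"""
    // prints each statement, including the one with the quoted semicolon
    pattern.findAllIn(sql).foreach(println)
  }
}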

Related

mismatched input 'from'. Expecting: ',', <expression>

I have a query that I am running on AWS Athena that should return all the filenames that are not contained in the second table. I am basically trying to find all the filenames that are not in the ejpos landing table.
The one table looks like this (item sales):

| origin_file | run_id |
| /datarite/ejpos/8023/20220706/filename1 | 8035 |
| /datarite/ejpos/8023/20220706/filename2 | 8035 |
| /datarite/ejpos/8023/20220706/filename3 | 8035 |

The other table looks like this (ejpos_files_landing):

| filename |
| filename1 |
| filename2 |
| filename3 |
| filename4 |
They don't have the same number of rows, hence I am trying to find the filenames that are in ejpos_files_landing but not in the item sales table.
I get this error when I run:
mismatched input 'from'. Expecting: ',', <expression>
The query is here:
SELECT trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where
run_id = '8035') as origin_file,
FROM "validated"."datarite_ejpos_itemsale"
LEFT JOIN "landing"."ejpos_landing_files" ON "landing"."ejpos_landing_files".filename =
"validated"."datarite_ejpos_itemsale".origin_file
WHERE "landing"."ejpos_landing_files".filename IS NULL;
The expected result would be:
|filename4|
because it is not in the other table.
Can anyone assist?
There is a lot of wrong stuff in your query based on the example data and declared goals.
- trim("/datarite/ejpos/8023/20220706/" from "validated"."datarite_ejpos_itemsale" where run_id = '8035') as origin_file is not valid SQL.
- ON "landing"."ejpos_landing_files".filename = "validated"."datarite_ejpos_itemsale".origin_file will not work because origin_file is prefixed with the directory path. You can use strpos if there should be only one instance of the filename in origin_file.
- Your join and filtering conditions are built to find items present in datarite_ejpos_itemsale and missing from ejpos_landing_files, while you state the reverse is needed.
- The extra comma mentioned in the comments.
Try this:
-- sample data
WITH item_sales(origin_file, run_id) AS (
VALUES ('/datarite/ejpos/8023/20220706/filename1', 8035),
('/datarite/ejpos/8023/20220706/filename2', 8035),
('/datarite/ejpos/8023/20220706/filename3', 8035),
('/datarite/ejpos/8023/20220706/filename4', 8036)
),
ejpos_files_landing(filename) AS (
VALUES ('filename1'),
('filename2'),
('filename3'),
('filename4')
)
-- query
select filename
from ejpos_files_landing l
left outer join item_sales s -- reverse the join
on strpos(s.origin_file, l.filename) >= 1 -- assuming that filename should be present only one time in the string
and s.run_id = 8035 -- if you need to filter out run id
where s.origin_file is null
Output:

| filename |
| filename4 |
Alternative approach you can try:
-- query
select filename
from ejpos_files_landing l
where filename not in (
select element_at(split(origin_file, '/'), -1) -- split by '/' and get last
from item_sales
where run_id = 8035
)

Pandas read_sql_query with parameters for a string with no quotes

I want to insert a string of identifiers into a piece of SQL code using
df = pd.read_sql_query(query, self.connection,params=sql_parameter)
my parameter dictionary looks like this
sql_parameter = {'itemids':itemids_str}
where itemids_str is a string like
282940499, 276686324, 2665846, 46875436, 530272885, 2590230, 557021480, 282937154, 46259344
The SQL code looks like
SELECT
xxx,
yyy,
zzz
FROM tablexyz
where some_column_name in ( %(itemids)s )
My current code gets the parameter inserted with its quotes:
where some_column_name in ( '282940499, 276686324, 2665846, 46875436, 530272885, 2590230, 557021480, 282937154, 46259344' )
How can I prevent the string from being inserted with the ' quotes? These are not part of my string, but I assume they come from the parameter being of type string when using %s.
I don't think there is a provision in params to send a list of numeric values for one condition. I always add such a condition directly to the query:
item_ids = [str(item_id) for item_id in item_ids]
where_str = ','.join(item_ids)
query = f"""SELECT
xxx,
yyy,
zzz
FROM tablexyz
where some_column_name in ({where_str})"""
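That said, if you would rather keep the values parameterized instead of pasting them into the SQL text, a sketch of one workaround (assuming a DB-API driver with the pyformat parameter style, e.g. psycopg2, and using connection to stand in for self.connection) is to expand the list into one named placeholder per value:
import pandas as pd

item_ids = [282940499, 276686324, 2665846]
# one named placeholder per value, e.g. "%(id0)s, %(id1)s, %(id2)s"
placeholders = ", ".join(f"%(id{i})s" for i in range(len(item_ids)))
sql_parameter = {f"id{i}": v for i, v in enumerate(item_ids)}
query = f"""SELECT
xxx,
yyy,
zzz
FROM tablexyz
where some_column_name in ({placeholders})"""
df = pd.read_sql_query(query, connection, params=sql_parameter)
The driver then quotes each value according to its type, so numeric ids arrive without surrounding quotes.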

Validating Phone Numbers in Batch with PostgreSQL

This is my SQL:
SELECT
countries.locl_ctry_id,
countries.icc,
countries.active,
networks.locl_ntwrk_id,
networks.locl_ctry_id,
numberings.locl_ntwrk_id,
numberings.ndc,
numberings.size
FROM countries
LEFT JOIN networks
ON networks.locl_ctry_id = countries.locl_ctry_id
LEFT JOIN numberings
ON numberings.locl_ntwrk_id = networks.locl_ntwrk_id
WHERE
countries.active = 'true'
AND numberings.locl_ntwrk_id NOTNULL
AND CONCAT(countries.icc, numberings.ndc)
LIKE LEFT('381645554330', CHAR_LENGTH(CONCAT(countries.icc, numberings.ndc)))
AND LENGTH('381645554330') = numberings.size
I would like to run this script for a batch of numbers, for example:
381645554330
381629000814
381644446555
38975300155
38975604099
38976330923
38977772090
38978250177
38970333730
38971388262
38972228855
Take a look at the database structure here: http://sqlfiddle.com/#!17/13ce29/27
It needs to validate the Prefix as well as the Length of the number.
Any suggestions how to achieve this?
Put the batch of numbers in a union all subquery.
SELECT
countries.locl_ctry_id,
countries.icc,
countries.active,
networks.locl_ntwrk_id,
networks.locl_ctry_id,
numberings.locl_ntwrk_id,
numberings.ndc,
numberings.size
FROM countries
LEFT JOIN networks
ON networks.locl_ctry_id = countries.locl_ctry_id
LEFT JOIN numberings
ON numberings.locl_ntwrk_id = networks.locl_ntwrk_id
JOIN ( select '381645554330' as num
union all
select '38976330923'
union all
select '38975300155' ) batch_numbers
ON CONCAT(countries.icc, numberings.ndc)
LIKE LEFT(batch_numbers.num, CHAR_LENGTH(CONCAT(countries.icc, numberings.ndc)))
AND LENGTH(batch_numbers.num) = numberings.size
WHERE
countries.active = 'true'
AND numberings.locl_ntwrk_id NOTNULL
It seems the objective is not the ability to return the set of values currently returned by the single query, but to make an evaluation of multiple values. The issue with the above is that it requires a priori knowledge of, and a modification to, the query for each set to evaluate. The following will attempt to remove that requirement.
Let's begin by developing a baseline query as an extension to Jakup's "union" solution.
--- create a baseline solution
with to_be_validated (test_num) as -- CTE used strictly as a data generator for the query
( values ('381645554330')
, ('381629000814')
, ('381644446555')
, ('38975300155')
, ('38975604099')
, ('38976330923')
, ('38977772090')
, ('38978250177')
, ('38970333730')
, ('38971388262')
, ('38972228855')
, ('81771388262')
, ('55572228855')
)
--- base query
select test_num
, case when icc is not null then 'Valid' else 'Invalid' end validation
from to_be_validated
left join(
select countries.icc, numberings.ndc, numberings.size
from countries
join networks on networks.locl_ctry_id = countries.locl_ctry_id
join numberings on numberings.locl_ntwrk_id = networks.locl_ntwrk_id
) base on ( concat(base.icc, base.ndc) = left( test_num, char_length(concat(base.icc, base.ndc)))
and length(test_num) = base.size
)
;
Notes on Query and Modifications:
1. The column countries.active is defined as binary, thus already providing a true/false value, so checking for = 'true' is unnecessary. Altered to just countries.active.
2. The column numberings.locl_ntwrk_id is restricted to being NOT NULL, so the predicate numberings.locl_ntwrk_id NOTNULL is always true. Removed the predicate.
3. The LEFT JOINs on networks and numberings will generate a result set with all countries, all networks, and all numberings, even when the combination is itself invalid. This results in validating each phone number against every combination of the 3 base tables. Altered these to inner joins.
4. Finally, I added a couple of extra numbers to your test data. These are intended to fail the desired validation. You should always test with considerable invalid data; otherwise you cannot know whether the procedure/query/whatever handles it gracefully and properly.
Now with a base query in hand, it's possible to just end here. But to be generally useful, you cannot edit the query each time it's wanted. Therefore, let's wrap a function definition around the base query, and provide either an array or a delimited string containing the phone numbers to be evaluated.
In each case the base query remains the same, and we keep the CTE, but the CTE is modified to build a row for each phone number provided.
-- SQL function with an Array input
create or replace function validate_phone_numbers( phone_numbers text[])
returns table ( phone_number text
, validation_status text
)
language sql
as $$
with to_be_validated as
( select unnest (phone_numbers) test_num )
-- Insert base query here --
$$ ;
-- Test with Array
select phone_number, validation_status
from validate_phone_numbers (ARRAY
[ ('381629000814')
, ('381644446555')
, ('38975300155')
, ('38975604099')
, ('38976330923')
, ('38977772090')
, ('38978250177')
, ('38970333730')
, ('38971388262')
, ('38972228855')
, ('81771388262')
, ('55572228855')
]
) ;
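For clarity, here is the array version fully assembled (nothing new, just the base query from above dropped in where the comment indicates):
create or replace function validate_phone_numbers( phone_numbers text[])
returns table ( phone_number text
, validation_status text
)
language sql
as $$
with to_be_validated as
( select unnest (phone_numbers) test_num )
select test_num
, case when icc is not null then 'Valid' else 'Invalid' end validation
from to_be_validated
left join(
select countries.icc, numberings.ndc, numberings.size
from countries
join networks on networks.locl_ctry_id = countries.locl_ctry_id
join numberings on numberings.locl_ntwrk_id = networks.locl_ntwrk_id
) base on ( concat(base.icc, base.ndc) = left( test_num, char_length(concat(base.icc, base.ndc)))
and length(test_num) = base.size
)
$$ ;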
With a minor extension we get a delimited-string version.
create or replace function validate_phone_numbers_with_string( phone_numbers text, delimiter text default ',')
returns table ( phone_number text
, validation_status text
)
language sql
as $$
with to_be_validated as
( select unnest (string_to_array (phone_numbers, delimiter)) test_num)
-- Insert base query here --
$$ ;
-- test with string
select phone_number, validation_status
from validate_phone_numbers_with_string('381629000814,381644446555,38975300155,38975604099,38976330923,38977772090,38978250177,38970333730,38971388262,38972228855,81771388262,55572228855');

Search for any of a list of strings inside another string

I need to identify records with valid addresses by comparing the address fields against a list of street-like words.
So the code would look something like:
set street_list = 'STREET', 'ROAD', 'AVENUE', 'DRIVE', 'WAY', 'PLACE' (etc.)
;
create table [new table] as
select *
from [source table]
where [address line 1] (contains any word from STREET_LIST) or
[address line 2] (contains any word from STREET_LIST) or
[address line 3] (contains any word from STREET_LIST)
;
Is this possible?
Using LostReality's regexp suggestion, I got as far as:
select *
from [source table]
where upper([address line 1]) regexp '.* STREET.*|.* ST.*|.* ROAD.*|.* RD.*|.* CLOSE.*|.* LANE.*|.* LA.*|.* AVENUE.*|.* AVE.*|.* DRIVE.*|.* DR.*|.* HOUSE.*|.* WAY.*|.* PLACE.*|.* SQUARE.*|.* WALK.*|.* GROVE.*|.* GREEN.*|.* PARK.*|.* PK.*|.* CRESCENT.*|.* TERRACE.*|.* PARADE.*|.* GARDEN.*|.* GARDENS.*|.* COURT.*|.* COTTAGES.*|.* COTTAGE.*|.* MEWS.*|.* ESTATE.*|.* RISE.*|.* FARM.*'
;
and it seems to work.
But I have two small problems with it:
1) how do I write the regexp on more than one line so it's easier to read?
2) is there any way of putting that regexp into a macro variable because I want to check 5 address lines and I don't want 5 copies of the same expression.
Thanks
Solution for Hive. You can put the regexp pattern in a variable, and you can also use a macro. Here is your template, fixed:
set hivevar:street_list ='STREET|ST|ROAD|RD|CLOSE|LANE|LA|AVENUE|AVE|DRIVE|DR|HOUSE|WAY|PLACE|SQUARE|WALK|GROVE|GREEN|PARK|PK|CRESCENT|TERRACE|PARADE|GARDEN|GARDENS|COURT|COTTAGES|COTTAGE|MEWS|ESTATE|RISE|FARM';
--boolean macro for using in the WHERE
create temporary macro contains_word(s string) (upper(s) rlike ${hivevar:street_list} ) ;
with some_table as ( --use your table instead of this synthetic example
select stack(2,'some string containing STREET and WALK',
'some string containing something else') as str
) --use your table instead of this synthetic example
--use macro in your query
select str from some_table
where contains_word(str);
Result:
OK
some string containing STREET and WALK
Time taken: 0.229 seconds, Fetched: 1 row(s)
Use OR like in your question:
where contains_word(address_line_1) OR contains_word(address_line_2) ...
Hope you have got the idea.

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then from these matching lines extract two different fields meeting two separate field-match criteria. Since the lines aren't structured well, I am loading them as a chararray.
When I try to run the following code, I get the error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple, but that does not work.
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note - my logs aren't structured well - so I can't mend the way I load them.
Post loading and initial filtering for lines of interest (which is straightforward), I guess I need to do something different rather than tokenizing the line and iterating through fields trying to find them.
Or maybe I should use joins?
Also, if I know the structure of the line beforehand to be all text fields, will loading it differently (not as a chararray) make it an easier problem?
For now I made a compromise - I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try with joins and post that code ... - here's my working code that gets me a useful output - but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field from tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a REGEX expert, so I will leave writing the REGEX part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?