SQL parsing using pyparsing - sql

I have been learning pyparsing over the last few weeks. I plan to use it to extract table names from SQL statements.
I have looked at http://pyparsing.wikispaces.com/file/view/simpleSQL.py, but I intend to keep the grammar simple: I am not trying to parse every part of a SELECT statement, only to find the table names. Defining a complete grammar for a modern commercial database such as Teradata would also be quite involved.
#!/usr/bin/env python
from pyparsing import *
import sys
semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')
# Keyword definition
update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
map(lambda x: Keyword(x, caseless=True), \
['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])
# Teradata SQL allows SELECT as well as the SEL keyword
select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)
# list of reserved keywords
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
                  select_kw | from_kw | where_kw | join_kw |
                  left_kw | right_kw | cross_kw | outer_kw | on_kw |
                  insert_kw | into_kw)
# Identifier can be used as table or column names. They can't be reserved words
ident = ~reserved_words + Word(alphas, alphanums + '_')
# Recursive definition for table
table = Forward()
# a simple table name can be an identifier or a qualified identifier, e.g. schema.table
simple_table = Combine(Optional(ident + Literal('.')) + ident)
# a table can also be a complete SELECT statement used as a table
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
# table can be simple table or nested table
table << (nested_table | simple_table)
# comma delimited list of tables
table_list = delimitedList(table)
# Build only the FROM clause, because table names always appear after it
from_clause = from_kw.suppress() + table_list
txt = """
SELECT p, (SELECT * FROM foo),e FROM a, d, (SELECT * FROM z), b
"""
for token, start, end in from_clause.scanString(txt):
    print(token)
One thing worth mentioning: I use SkipTo(from_kw) to jump over the column list in the SQL statement. This avoids defining a grammar for the column list, which can be a comma-delimited list of identifiers, function calls, DW analytical functions and so on. With this grammar I can parse the statement above, including any level of nesting in the SELECT column list or table list. The output is:
['foo']
['a', 'd', 'z', 'b']
The problem arises when the nested SELECT has a WHERE clause. The current definition is:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
With a WHERE clause, the same statement may look like:
SELECT ... FROM a,d , (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p
I thought of changing "nested_table" definition to:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + Optional(where_kw + SkipTo(rparen)) + rparen
But this does not work, because SkipTo(rparen) stops at the first right parenthesis, the one following "c1 = 1". What I would like to know is how to skip to the right parenthesis that matches the left parenthesis just before "SELECT * FROM z...". I don't know how to do this with pyparsing.
On a different note, I would also welcome advice on the best way to get table names from complex nested SQL. Any help is really appreciated.
Thanks
Abhijit

Considering that you are also trying to parse out nested SELECTs, I don't think you'll be able to avoid writing a fairly complete SQL parser. Fortunately, there is a more complete example on the Pyparsing wiki Examples page, select_parser.py. I hope that gets you further along.
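For the narrower question of skipping to the matching right parenthesis, the underlying idea is to count nesting depth instead of searching for the next ')'. Below is a minimal plain-Python sketch of that idea. It is not pyparsing code; the function name find_matching_paren is made up for illustration, and the quote handling is deliberately simplistic. pyparsing also ships a nestedExpr helper for balanced delimiters, which may be worth a look.

```python
def find_matching_paren(text, open_pos):
    """Return the index of the ')' that closes the '(' at open_pos.

    Hypothetical helper: tracks nesting depth and ignores parentheses
    that appear inside single-quoted SQL string literals.
    """
    if text[open_pos] != "(":
        raise ValueError("open_pos must point at '('")
    depth = 0
    in_quote = False
    for i in range(open_pos, len(text)):
        ch = text[i]
        if ch == "'":
            # A doubled quote ('') toggles twice, so the state nets out.
            in_quote = not in_quote
        elif in_quote:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


sql = "SELECT x FROM a, d, (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p"
start = sql.index("(")
end = find_matching_paren(sql, start)
subquery = sql[start + 1:end]
print(subquery)
```

The extracted subquery could then be fed back into the same table-name grammar, which is roughly what a recursive definition does.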

Related

How can I count all NULL values, without column names, using SQL?

I'm reading and executing SQL queries from a file, and I need to inspect the result sets to count all the NULL values across all columns. Because the SQL is read from a file, I don't know the column names and so can't refer to the columns by name when looking for NULL values.
I think using CTE is the best way to do this, but how can I call the columns when I don't know what the column names are?
WITH query_results AS
(
<sql_read_from_file_here>
)
select count_if(<column_name> is not null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this which uses pglast to parse the SQL query to get the columns for you:
import pglast
sql_read_from_file_here = "SELECT 1 foo, 1 bar"
ast = pglast.parse_sql(sql_read_from_file_here)
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']
sum_sql = "sum(iff({col} is null,1,0))"
sums = [sum_sql.format(col=col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
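If parsing the SQL is not an option, another route is to let the database report the column names. The sketch below assumes a Python DB-API driver and uses the standard-library sqlite3 module as a stand-in for the real database; the helper name count_nulls is hypothetical, and it assumes the query read from the file is a single SELECT without a trailing semicolon. It reads the column names from cursor.description and then builds the summing query, using CASE instead of iff so that it runs on SQLite.

```python
import sqlite3


def count_nulls(conn, query):
    """Count NULLs across every column of an arbitrary SELECT (sketch)."""
    # LIMIT 0 fetches no rows but still populates cursor.description.
    cur = conn.execute(f"SELECT * FROM ({query}) LIMIT 0")
    columns = [d[0] for d in cur.description]
    # Assumes column names contain no double quotes.
    sums = " + ".join(
        f'SUM(CASE WHEN "{c}" IS NULL THEN 1 ELSE 0 END)' for c in columns
    )
    total = conn.execute(f"SELECT {sums} FROM ({query})").fetchone()[0]
    # SUM over zero rows yields NULL, so fall back to 0.
    return total or 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foo INTEGER, bar TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, None), (None, None), (3, "x")],
)
total_nulls = count_nulls(conn, "SELECT foo, bar FROM t")
print(total_nulls)
```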

Writing results of SQL query to Temp View in Databricks

I would like to create a Temporary View from the results of a SQL Query - which sounds like a basic thing to do, but I just couldn't make it work and don't understand what is wrong.
This is my SQL query - which works fine and returns Col1.
%sql
SELECT
Col1
FROM
Table1
WHERE EXISTS (
select *
from TempView1)
I would like to write the results to another table that I can query, so I do this:
df = spark.sql("""
SELECT
Col1
FROM
Table1
WHERE EXISTS (
select *
from TempView1)""")
OK
df
Out[28]: DataFrame[Col1: bigint]
df.createOrReplaceTempView("df_tmp_view")
OK
%sql
select * from df_tmp_view
Error in SQL statement: AnalysisException: Table or view not found: df_tmp_view; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [df_tmp_view], [], false
display(df_tmp_view)
NameError: name 'df_tmp_view' is not defined
What am I doing wrong?
I don't understand the error saying that the name is not defined, although I defined it just one command earlier. The SQL query itself works and returns data, so what am I missing?
Thanks !
You need to register the view as a global temporary view and query it through the global temp database. For example, in your case:
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + 'df_tmp_view'))
See the Spark documentation on global temporary views. A complete example:
import pandas as pd
df_pd = pd.DataFrame(
{
'Name' : [231232,12312321,3213231],
}
)
df = spark.createDataFrame(df_pd)
df.createOrReplaceGlobalTempView('test_tmp_view')
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + 'test_tmp_view'))

sap hana placeholders pass * parameter with arrow notation

I am trying to pass a star (*) to a SAP HANA SQL placeholder using the arrow notation.
The following works OK:
Select * FROM "table_1"
( PLACEHOLDER."$$IP_ShipmentStartDate$$" => '2020-01-01',
PLACEHOLDER."$$IP_ShipmentEndDate$$" => '2030-01-01' )
In the following, when I try to pass a *, I get a syntax error:
Select * FROM "table1"
( PLACEHOLDER."$$IP_ShipmentStartDate$$" => '2020-01-01',
PLACEHOLDER.'$$IP_ItemTypecd$$' => '''*''',
PLACEHOLDER."$$IP_ShipmentEndDate$$" => '2030-01-01' )
I am using the arrow notation because it is the only way I know of that allows passing parameters as in the example below (from the linked post):
do begin
declare lv_param nvarchar(100);
select max('some_date')
into lv_param
from dummy /*your_table*/;
select * from "_SYS_BIC"."path.to.your.view/CV_TEST" (
PLACEHOLDER."$$P_DUMMY$$" => :lv_param
);
end;
There's a typo in your code. You need double quotes around the parameter name, but you have single quotes. It should be: PLACEHOLDER."$$IP_ItemTypecd$$".
When you pass a value to a calculation view's parameter, it is already a string: it is treated as a string and gets quotes around it where needed, so there is no need to add more. If you really need to pass quotes inside the placeholder's value, you must escape them with a backslash in addition to doubling them. (I found this by running a data preview on the calculation view and entering '*' as the input parameter value; the valid SQL statement then appears in the preview log.)
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '''*'''
);
end;
/*
SAP DBTech JDBC: [339]: invalid number: : line 3 col 3 (at pos 13): invalid number:
not a valid number string '' at function __typecast__()
*/
/* The trace has no more information, but the interesting part
is that it fails in the preparation step, not in execution
w SQLScriptExecuto se_eapi_proxy.cc(00145) : Error <exception 71000339:
not a valid number string '' at function __typecast__()
> in preparation of internal statement:
*/
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '\'*\''
);
end;
/*
SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near "\": line 5 col 38 (at pos 121)
*/
But this is ok:
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '\''*\'''
);
end;
LOG_ID | DATUM | INPUT_PARAM | CUR_DATE
--------------------------+----------+-------------+---------
8IPYSJ23JLVZATTQYYBUYMZ9V | 20201224 | '*' | 20201224
3APKAAC9OGGM2T78TO3WUUBYR | 20201224 | '*' | 20201224
F0QVK7BVUU5IQJRI2Q9QLY0WJ | 20201224 | '*' | 20201224
CW8ISV4YIAS8CEIY8SNMYMSYB | 20201224 | '*' | 20201224
What about the star itself?
As @LarsBr already said, in SQL you need to use LIKE '%pattern%' to search for strings that contain the pattern in the middle; % is the equivalent of ABAP's * (although, as far as I know, * is the more widely used wildcard outside SQL). So there is no out-of-the-box conversion of FIELD = '*' to FIELD LIKE '%' or anything similar.
There is also no LIKE predicate in the Column Engine (in a filter or in a calculated column).
If you really need LIKE functionality in a filter or calculated column, you can:
Switch the execution engine to SQL
Or use the match(arg, pattern) function of the Column Engine, which has now disappeared from the palette and is hidden quite well in the documentation (at the very end of the page, after digging into the description field of the last row in the table, you'll find the actual syntax for it. Damn!).
But here you'll meet another surprise: because the Column Engine has different operators than SQL (it is more internal and closer to the DB core), it uses the star (*) as the wildcard character. So for match(string, pattern) you need to use a star again: match('pat string tern', 'pat*tern').
All that said, there are cases where you really do want to search for data with wildcards and pass them as a parameter. In that case, use match and pass the parameter as plain text, without any tricks around the star (*), if you want to stay with officially supported features rather than exploit internals.
After adding this filter to the RSPCLOGCHAIN projection node of my CV from the previous thread, it works this way:
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => 'CW*'
);
end;
LOG_ID | DATUM | INPUT_PARAM | CUR_DATE
--------------------------+----------+-------------+---------
CW8ISV4YIAS8CEIY8SNMYMSYB | 20201224 | CW* | 20201224
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => 'CW'
);
end;
/*
Fetched 0 row(s) in 0 ms 0 µs (server processing time: 0 ms 0 µs)
*/
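To make the star-as-wildcard behaviour above concrete, here is a small Python sketch that mimics it with the standard-library fnmatch module. This is only an analogy: it does not call HANA, the helper name star_match is made up, and fnmatch additionally treats ? and [...] as special, which match() in the Column Engine may or may not do.

```python
from fnmatch import fnmatchcase

# LOG_ID values taken from the result set shown earlier.
log_ids = [
    "8IPYSJ23JLVZATTQYYBUYMZ9V",
    "3APKAAC9OGGM2T78TO3WUUBYR",
    "F0QVK7BVUU5IQJRI2Q9QLY0WJ",
    "CW8ISV4YIAS8CEIY8SNMYMSYB",
]


def star_match(values, pattern):
    """Keep the values that match a pattern where * means 'anything'."""
    return [v for v in values if fnmatchcase(v, pattern)]


with_star = star_match(log_ids, "CW*")    # prefix search
without_star = star_match(log_ids, "CW")  # exact comparison, no wildcard
print(with_star, without_star)
```

As in the two queries above, the pattern with the star finds the one row starting with CW, while the plain value matches nothing.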
The notation with triple quotation marks '''*''' is likely what yields the syntax error here.
Instead, use single quotation marks to provide the '*' string.
But that is just half of the challenge here.
In SQL, the placeholder search is done via LIKE and the placeholder character is %, not *.
To mimic the ABAP behaviour when using calculation views, the input parameters must be used in filter expressions in the calculation view. And these filter expressions have to check for whether the input parameter value is * or not. If it is * then the filter condition needs to be a LIKE, otherwise an = (equal) condition.
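The branching described above (LIKE when the value carries a star, = otherwise) can be sketched in Python as a small query-fragment builder. This is an illustration only, not calculation view filter syntax: the function name build_filter and the column name are hypothetical, it treats any value containing * as a wildcard search, and it assumes the target database accepts an ESCAPE clause on LIKE.

```python
def build_filter(column, value):
    """Return (sql_fragment, params) for an ABAP-style input value."""
    if "*" in value:
        # Escape LIKE's own metacharacters first, then map * to %.
        escaped = (
            value.replace("\\", "\\\\")
            .replace("%", "\\%")
            .replace("_", "\\_")
            .replace("*", "%")
        )
        return f"{column} LIKE ? ESCAPE '\\'", [escaped]
    # No wildcard: a plain equality comparison is enough.
    return f"{column} = ?", [value]


all_rows = build_filter("ITEM_TYPE", "*")
prefix = build_filter("ITEM_TYPE", "CW*")
exact = build_filter("ITEM_TYPE", "AB")
print(all_rows, prefix, exact)
```

Keeping the value as a bound parameter avoids the quoting problems discussed earlier in the thread.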
A final comment: the PLACEHOLDER-syntax really only works with calculation views and not with tables.

Rails Active Record complex query with UPPER and OR in sql clause

I want to query my PostgreSQL database from my Rails app using the example query string:
select * from venues where upper(regexp_replace(postcode, ' ', '')) = '#{postcode}' or name = '#{name}'
There are two aspects to this query:
The first is to compare against a manipulated value in the database (upper and regexp_replace), which I can do within an Active Record where method.
The second is to provide the OR condition, which appears to require the use of Arel.
I would appreciate some help in joining these together.
See squeel, which handles OR queries and functions in a fairly human-friendly way: https://github.com/activerecord-hackery/squeel and http://railscasts.com/episodes/354-squeel
For example:
[6] pry(main)> Page.where{(upper(regexp_replace(postcode, ' ', '')) == 'foo') | (name == 'bar')}.to_sql
=> "SELECT \"pages\".* FROM \"pages\" WHERE upper(regexp_replace(\"pages\".\"postcode\", ' ', '')) = 'foo'"
The alternative is to write the query directly:
scope :funny_postcode_raw, lambda{ |postcode, name| where("upper(regexp_replace(postcode, ' ', '')) = ? or name = ?", postcode, name) }
I don't suggest going down the Arel path; it is not worth it 99.9% of the time.
NOTE: the OR operator in squeel is |
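The raw-query approach (a SQL fragment with ? placeholders and bound values) is not specific to Rails. As a language-neutral illustration, here is the same shape of query in Python against an in-memory SQLite database. The table contents and the helper name find_venues are made up, and SQLite's replace() stands in for PostgreSQL's regexp_replace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venues (name TEXT, postcode TEXT)")
conn.executemany(
    "INSERT INTO venues VALUES (?, ?)",
    [("Arena", "ab1 2cd"), ("Hall", "ZZ9 9ZZ"), ("Club", "xy3 4zz")],
)


def find_venues(conn, postcode, name):
    """Match on a normalised postcode OR an exact name (sketch)."""
    # Normalise the input the same way the SQL normalises the column.
    wanted = postcode.replace(" ", "").upper()
    sql = (
        "SELECT name FROM venues "
        "WHERE upper(replace(postcode, ' ', '')) = ? OR name = ? "
        "ORDER BY name"
    )
    return [row[0] for row in conn.execute(sql, (wanted, name))]


matches = find_venues(conn, "AB12CD", "Club")
print(matches)
```

Binding the values instead of interpolating them, as the scope above does with ?, is what keeps user input out of the SQL text.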

How to use parameters with RPostgreSQL (to insert data)

I'm trying to insert data into a pre-existing PostgreSQL table using RPostgreSQL and I can't figure out the syntax for SQL parameters (prepared statements).
E.g. suppose I want to do the following
insert into mytable (a,b,c) values ($1,$2,$3)
How do I specify the parameters? dbSendQuery doesn't seem to understand if you just put the parameters in the ....
I've found dbWriteTable can be used to dump an entire table, but won't let you specify the columns (so no good for defaults etc.). And anyway, I'll need to know this for other queries once I get the data in there (so I suppose this isn't really insert specific)!
I'm sure I'm just missing something obvious...
I was looking for the same thing, for the same reason: security.
Apparently the dplyr package has the capability you are interested in. It's barely documented, but it's there. Scroll down to "Postgresql" in this vignette: http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
To summarize, dplyr offers the functions sql() and escape(), which can be combined to build a safely escaped query string (the values are quoted and inlined rather than bound as prepared-statement parameters). The SQL() function from the DBI package seems to work in exactly the same way.
> sql(paste0('SELECT * FROM blaah WHERE id = ', escape('random "\'stuff')))
<SQL> SELECT * FROM blaah WHERE id = 'random "''stuff'
It returns an object of classes "sql" and "character", so you can either pass it on to tbl() or possibly dbSendQuery() as well.
The escape() function correctly handles vectors as well, which I find most useful:
> sql(paste0('SELECT * FROM blaah WHERE id in ', escape(1:5)))
<SQL> SELECT * FROM blaah WHERE id in (1, 2, 3, 4, 5)
Same naturally works with variables as well:
> tmp <- c("asd", 2, date())
> sql(paste0('SELECT * FROM blaah WHERE id in ', escape(tmp)))
<SQL> SELECT * FROM blaah WHERE id in ('asd', '2', 'Tue Nov 18 15:19:08 2014')
I feel much safer now putting together queries.
As of the latest RPostgreSQL, passing the parameters to dbSendQuery should work:
db_connection <- dbConnect(dbDriver("PostgreSQL"), dbname = database_name,
host = "localhost", port = database_port, password=database_user_password,
user = database_user)
qry = "insert into mytable (a,b,c) values ($1,$2,$3)"
dbSendQuery(db_connection, qry, c(1, "some string", "some string with | ' "))
Here's a version using the DBI and RPostgres packages, and inserting multiple rows at once, since all these years later it's still very difficult to figure out from the documentation.
x <- data.frame(
a = c(1:10),
b = letters[1:10],
c = letters[11:20]
)
# insert your own connection info
con <- DBI::dbConnect(
RPostgres::Postgres(),
dbname = '',
host = '',
port = 5432,
user = '',
password = ''
)
RPostgres::dbSendQuery(
con,
"INSERT INTO mytable (a,b,c) VALUES ($1,$2,$3);",
list(
x$a,
x$b,
x$c
)
)
The help for dbBind() in the DBI package is the only place that explains how to format parameters:
The placeholder format is currently not specified by DBI; in the
future, a uniform placeholder syntax may be supported. Consult the
backend documentation for the supported formats.... Known examples are:
? (positional matching in order of appearance) in RMySQL and RSQLite
$1 (positional matching by index) in RPostgres and RSQLite
:name and $name (named matching) in RSQLite
? is also the placeholder for R package RJDBC.
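For comparison, the same placeholder ideas can be tried without a database server. The sketch below uses Python's standard-library sqlite3 module rather than R, so it only illustrates the positional (?) and named (:name) styles listed above for SQLite; the table definition and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (a INTEGER, b TEXT, c TEXT DEFAULT 'n/a')"
)

# Positional "?" placeholders, several rows in one call.
rows = [(1, "some string", "some string with | ' \""), (2, "b", "c")]
conn.executemany("INSERT INTO mytable (a, b, c) VALUES (?, ?, ?)", rows)

# Named ":name" placeholders; column c is omitted so its default applies.
conn.execute(
    "INSERT INTO mytable (a, b) VALUES (:a, :b)", {"a": 3, "b": "x"}
)

stored = conn.execute("SELECT a, b, c FROM mytable ORDER BY a").fetchall()
print(stored)
```

Because the values are bound rather than pasted into the SQL text, the quote characters in the first row need no escaping, and listing the columns explicitly lets defaults apply, which is what dbWriteTable could not do in the question.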