Regex pattern to identify column names in an SQL WHERE clause

I am looking for resources on how to build a regex pattern to match column names from an SQL WHERE clause.
I have the SQL SELECT statement:
SELECT * FROM test WHERE area = 'testarea' AND description = 'testdescription' AND ...
And I'm trying to extract the terms area and description, and any others following ANDs. I understand this can get more complicated as the WHERE clause gets more complicated, but for now, I'm assuming that it conforms to the structure in the example.
When I try to search the web for examples on how to do this, I only see examples of how to include regex in the WHERE clause, but not actually match against it.
Can someone help me get started here? I'm at a loss.
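As a starting point for the simple structure described above, here is a minimal Python sketch. It assumes the WHERE clause is only a chain of "column = value" comparisons joined by AND, with no "=" inside string literals; the function name where_columns is made up for illustration, and anything more complex would need a real SQL parser.

```python
import re

def where_columns(sql):
    """Return the column names compared with '=' after the first WHERE keyword."""
    # Split once on the WHERE keyword, case-insensitively.
    parts = re.split(r'\bWHERE\b', sql, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) < 2:
        return []  # no WHERE clause at all
    # Assumption: a column is an identifier directly followed by "=".
    # Other operators (>=, LIKE, IN ...) and '=' inside literals are not handled.
    return re.findall(r'\b([A-Za-z_]\w*)\s*=', parts[1])

query = "SELECT * FROM test WHERE area = 'testarea' AND description = 'testdescription'"
print(where_columns(query))
```

The quoted values are skipped only because they are followed by a quote rather than "=", so this is a heuristic, not a parser.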

Related

Does anyone know what this regexp_replace does?

I received this snippet from someone who no longer works at my company, and I can't figure out what this regex does.
His objective was to scan through SQL query strings and rearrange table info:
regexp_replace(query, ' ("*?)(analytics_internal|arr|bizops_analytics|cloud_api|cloud_backend_raw|data_science|eloqua|fb_ads|google_ads|information_schema|intricately|legacy_sfdc|marketing_analytics|ns__analytics_postprocessing|ns__global_write|product_analytics|raw_bing_ads|raw_cloud_api|raw_compass|raw_coveo|raw_eloqua|raw_g_search_console|raw_gainsight|raw_google_ads|raw_intercom|raw_linkedin_ads|raw_mightysignal|raw_realm|raw_sfdc|remodel_cloud|remodel_test|sales_analytics|sales_ops|sampledb|segment|sfdc|ts_analytics|university_platform_analytics|upstream_gainsight|usage|xform_cloud|xform_etl|xform_finance|xform_marketing|xform_reference|xform_sales|xform_tables)("*?)\.("*?)(.+?)("*?)(\s|$)', ' awsdatacatalog.$1$2$3.$4$5$6$7') as queryString

The regex you've provided tries to match any of the listed names in the second capture group and prefixes the matched part of the query with "awsdatacatalog.". This is most likely an attempt to redirect queries to a new database, in particular one named awsdatacatalog. For example, consider the following query:
SELECT * FROM "analytics_internal".foo.table
Your regexp_replace should produce a new query that looks like this:
SELECT * FROM awsdatacatalog."analytics_internal".foo.table
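The same substitution can be tried out in Python to see which capture group ends up where. This is only a sketch: the schema list is shortened to three names from the original, Python's re module writes backreferences as \1 rather than $1, and the helper name prefix_catalog is made up for illustration.

```python
import re

# Shortened subset of the schema names in the original pattern (assumption:
# three names are enough to show the behavior).
SCHEMAS = ["analytics_internal", "sfdc", "usage"]

# Groups: 1 optional quote, 2 schema name, 3 optional quote, then a dot,
# 4 optional quote, 5 the rest of the table reference, 6 optional quote,
# 7 trailing whitespace or end of string.
PATTERN = re.compile(
    r' ("*?)(' + "|".join(SCHEMAS) + r')("*?)\.("*?)(.+?)("*?)(\s|$)'
)

def prefix_catalog(query):
    """Prefix every matching schema reference with 'awsdatacatalog.'."""
    return PATTERN.sub(r' awsdatacatalog.\1\2\3.\4\5\6\7', query)

print(prefix_catalog('SELECT * FROM "analytics_internal".foo.table'))
```

Note that the pattern requires a leading space, so a schema name at the very start of the string or directly after a parenthesis would not be rewritten.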

Regexp search SQL query fields

I have a repository of SQL queries and I want to understand which queries use certain tables or fields.
Let's say I want to understand what queries use the email field, how can I write it?
Example SQL query:
select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users

So to state the problem more accurately, you are sorting through a list of SQL queries [as text], and you now need to find the queries that use certain fields using SQL & RegEx (Regular Expressions) in PostgreSQL. (Please tag the question so that StackOverflow indexes it correctly and, more importantly, readers have more context.)
PostgreSQL supports Regular Expressions OOTB (Out Of The Box), so we can skip exploring other ways to do this. (If you are reading this as a Microsoft SQL Server person, I strongly suggest reading this brilliant article on Microsoft's website on defining a Table-Valued UDF (User Defined Function).)
The simplest way I could think of to approach your problem, is to throw away what we don't want out of the query text first, and then filter out what's left.
This way, after throwing away the stuff you don't need, you will be left with a set of "tokens" that you can easily filter, and I'm putting token in quotes since we are not really parsing the SQL language, but if we did that would be the first step: to extract tokens.. (:
Take this query for example:
With Queries (
Id
, QueryText
) As (
values (1, 'select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users')
)
Select QueryText
, found
From (
Select Id
, QueryText
, regexp_split_to_table (QueryText, '(--[\s\w]+|select|from|as|where|[ \s\n,])') As found
From Queries
) As Result
Where found != ''
And found = 'back_email'
I have simulated the "query repository" with a WITH statement to keep the example self-contained.
I have also selected a few words/characters to split QueryText on, like select, where etc. We don't need these in our 'found' set.
In the end, as you can see above, I simply treated found as what's left and filtered it by the field name you are looking for. (This assumes that you know the field you are looking for.)
You could improve upon the RegEx I did, or change the method as you wish to make it better. But I think the general concept addresses what you need to achieve. One problem I can see with my solution right off the bat is the fact that you can search for anything really, not just names of the selected fields - which begs the question, why use RegEx, and not Like statements? But again, as I mentioned, you can improve upon the RegEx and address specific requirements you may have. Using Like might limit you in that direction. (In other words, only you know what's good for you. I can't say that from here.)
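To experiment with the split-and-filter idea outside the database, here is a rough Python sketch of it. It is not a translation of the PostgreSQL query: it adds word boundaries around the keywords, stops comments at the end of the line, and treats "table.field" as a use of "field". The names tokens and uses_field are made up for illustration.

```python
import re

# Separators to throw away: line comments, a few keywords, whitespace, commas.
# Assumption: \b word boundaries are added so that "as" inside a longer
# identifier is not treated as a separator.
SPLIT = re.compile(
    r'(?:--[^\n]*|\bselect\b|\bfrom\b|\bas\b|\bwhere\b|[\s,])',
    re.IGNORECASE,
)

def tokens(query_text):
    """Split the query text on the separators and drop empty pieces."""
    return [t for t in SPLIT.split(query_text) if t]

def uses_field(query_text, field):
    """True if a token is exactly the field or ends with '.field'."""
    return any(t == field or t.endswith('.' + field) for t in tokens(query_text))

QUERY = """select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users"""

print(tokens(QUERY))
print(uses_field(QUERY, "email"))
```

Because aliases and table names survive as tokens too, a query that merely aliases something as email would also be reported, so this remains a heuristic.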
You can play with the query online here: db-fiddle query and use https://regex101.com/ for testing your RegEx.
Disclaimer: I'm not a PostgreSQL developer. There must be other, perhaps better ways of doing this. (:

DB2 complex like

I have to write a select statement following the following pattern:
[A-Z][0-9][0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9]
The only thing I'm sure of is that the first A-Z WILL be there. All the rest is optional and the optional part is the problem. I don't really know how I could do that.
Some example data:
B/0765/E 3
B/0765/E3
B/0764/A /02
B/0749/K
B/0768/
B/0784//02
B/0807/
My guess is that I should first remove all the white space and the / characters from the data and then execute the select statement. But I'm having trouble actually writing the LIKE pattern. Can anyone help me out?
The underlying reason for this is that I'm migrating a database. In the old database the values are stored in a single field, but in the new one they are split into several fields. I first have to write a "control script" to find out which records in the old database are not correct.
Even the following isn't working:
where someColumn LIKE '[a-zA-Z]%';

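LIKE in DB2 only understands the % and _ wildcards, so a bracket expression such as [a-zA-Z] is treated as literal text. You can use regular expressions via XQuery to define this pattern instead. There are many questions on Stack Overflow about patterns in DB2, and they have been solved with regular expressions: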
DB2: find field value where first character is a lower case letter
Emulate REGEXP like behaviour in SQL
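Before porting the check to DB2, the pattern itself can be prototyped in Python. The sketch below assumes that "optional" means each group after the first letter may be shorter or absent (up to four digits, an optional letter, up to five digits) and that whitespace and slashes are stripped first, as suggested in the question. The names normalize and is_valid are made up for illustration.

```python
import re

# Assumed reading of the pattern: one mandatory letter, then 0-4 digits,
# an optional letter, and 0-5 digits.
CODE = re.compile(r'[A-Z][0-9]{0,4}[A-Z]?[0-9]{0,5}')

def normalize(value):
    """Remove all whitespace and slashes from the raw value."""
    return re.sub(r'[\s/]', '', value)

def is_valid(value):
    """True if the normalized value matches the assumed pattern in full."""
    return CODE.fullmatch(normalize(value)) is not None

samples = ["B/0765/E 3", "B/0765/E3", "B/0764/A /02", "B/0749/K",
           "B/0768/", "B/0784//02", "B/0807/"]
for s in samples:
    print(s, is_valid(s))
```

A control script would then report the rows for which is_valid returns False.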

SQL exclusion regexp does not work, why?

I have some sentences in db. I want to select the ones that don't have urls in them.
So what I do is
select ID, SENTENCE
from SENTENCE_TABLE
where regexp_like(SENTENCE, '[^http]');
However after the query is executed the sentences that appear in the results pane still have urls. I tried a lot of other combinations without any success.
Can somebody explain or give a good link where it is explained how regexps actually work in SQL.
How can I filter(exclude) actual words in db with SQL query?

You're over-complicating this. Just use a standard LIKE.
select ID, SENTENCE
from SENTENCE_TABLE
where SENTENCE not like '%http%';
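regexp_like(SENTENCE, '[^http]') is true whenever the sentence contains at least one character other than h, t or p, because [^http] is a character class that matches a single character, not the word http. I like the PSOUG page on regular expressions in Oracle, but I would also recommend reading the documentation.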
To respond to your comment you can use REGEXP_LIKE, there's just no point.
select ID, SENTENCE
from SENTENCE_TABLE
where not regexp_like(SENTENCE, 'http');
This looks for the string http rather than the letters individually.
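The difference between the character class and the literal string is easy to check in Python, whose re module treats [^http] the same way. This is only an illustration with made-up sample sentences, not Oracle code.

```python
import re

sentences = ["see http://example.com now", "no links here", "http"]

# [^http] is a negated character class: it matches any single character
# that is not h, t or p, so almost every sentence matches.
char_class = [s for s in sentences if re.search(r'[^http]', s)]

# Searching for the literal string and negating the result keeps only
# sentences without "http" anywhere in them.
no_http = [s for s in sentences if not re.search(r'http', s)]

print(char_class)
print(no_http)
```

Only the sentence made up entirely of the letters h, t and p fails the character-class test, which is why the original query appeared to filter nothing.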

[^http] matches any single character other than h, t or p, so regexp_like(SENTENCE, '[^http]') is true for every string that contains at least one such character, which is practically every sentence.
If you only want to exclude sentences that start with http, use where not regexp_like(SENTENCE, '^http');. Note that a URL in the middle of a sentence would still get through.

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns, having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters are in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?

Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.

Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
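The logical order described in the steps above, and the "case when" guard mentioned in the question, can be modelled in a few lines of Python. This is only a toy model of the semantics using the NAMES and LONG_NAMES data from the question; it says nothing about what a real optimizer does, and the function names are made up for illustration.

```python
names = [{"id": 1, "name": "Peter"}, {"id": 2, "name": "X"}]
long_names = [{"id": 1, "name_length": 5}]

def drop_last_three(name):
    """Mimic SUBSTRING(name, 1, LEN(name) - 3), which errors on a negative length."""
    length = len(name) - 3
    if length < 0:
        raise ValueError("negative length passed to substring")
    return name[:length]

def logical_order():
    # Step 1: build the working table from the FROM clause (the inner join).
    working = [n for n in names for ln in long_names if n["id"] == ln["id"]]
    # Step 2: apply the WHERE clause (none in this query).
    # Step 3: evaluate the SELECT expressions against the working table only.
    return [drop_last_three(row["name"]) for row in working]

def guarded():
    # CASE WHEN-style guard: safe even if the expression is evaluated
    # for every row before any filtering takes place.
    return [drop_last_three(n["name"]) if len(n["name"]) >= 3 else None
            for n in names]

print(logical_order())
print(guarded())
```

Evaluating drop_last_three on every row of names before the join is applied raises the error for "X", which is the failure described in the question.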
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.

You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can look at the estimated execution plan, which shows how your query will be changed to gain some speed. So if you have just used your NAMES table and it is in the buffer, the engine might first try to process that data, and then join it with the other table.
