Match count of a regular expression for every row - sql

I use below query to get content rows which has my_regex_pattern. But I don't know how many times the pattern hit for every row. What is the best way to get match count for every row in Postgres?
For example if a row's content is 'abcdefabcgh' and my regular expression is 'abc', I want 2 since 'abcdefabcgh' has two 'abc'.
SELECT content
FROM table1
WHERE content ~ 'my_regex_pattern'
Or how can I get rows which has matches more than a specific number. For example just give me records which has abc more than 4 times.

Of course you can make it work with regexp_matches(). Or better yet, regexp_split_to_table(). To apply to a whole table, use a LATERAL join (requires Postgres 9.3+):
SELECT content, ct
FROM table1 t, LATERAL (
SELECT count(*) - 1 AS ct
FROM regexp_split_to_table(t.content, 'abc')
) c
WHERE t.content ~ 'abc'; -- eliminate rows without match
For simple patterns like in the example in your question, you could also:
SELECT content, (length(content) - length(replace(content, 'abc', ''))) / length('abc')
FROM table1
WHERE content LIKE '%abc%';
Typically faster, since regular expression functions are costly. Also works for older versions.

Related

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

How to rotate a two-column table?

This might be a novice question – I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales) FROM (
SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1])
AS locales FROM users)
AS _ GROUP BY locales
My query returns the following dynamic rows:
locales
count
en
10
fr
7
de
3
n additional locales (~300)...
n-count
I'm trying to rotate it so that locale values end up as columns with a single row, like this:
en
fr
de
n additional locales (~300)...
10
7
3
n-count
I'm having to do this to play nice with a time-series db/app
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns.
I've looked at examples using join, but I can't figure out how to do it dynamically.
Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}' -- eliminate NULL, allow index support
GROUP BY 1
ORDER BY 2 DESC, 1
Simpler and faster than your original base query.
About those ordinal numbers in GROUP BY and ORDER BY:
Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns null. I added a WHERE clause to eliminate non-matches a-priori - and allow index support if applicable, but I don't expect indexes to help here.
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See;
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics:
PostgreSQL Crosstab Query
Dynamic query:
PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM (
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}'
GROUP BY 1
ORDER BY 2 DESC, 1
) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See:
My function returned a string. How to execute it?

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line, when I try to self join the transaction ids - it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1 (the position on the array) which is useless if I can't constrain it to a same id.
Athena doesn't support the ngrams function by presto, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transacdtion_id in the subquery.

Redshift LISTAGG frame clause

I am trying to aggregate strings, but limited to only the preceding rows, not the whole partition. Does anyone know how to do this in Redshift?
What I am trying to achieve is the appended_event_namespace column below.
This is what I've tried so far.
LISTAGG(event_namespace, '/')
WITHIN GROUP (ORDER BY tstamp_true)
OVER (PARTITION BY acct_id) AS appended_event_namespace
This results in the full ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile aggregation on every single row instead of what is in the desired screenshot.
The difficulty is in getting it to only append up to the current row since there doesn't seem to be a frame-clause for Redshift's LISTAGG(). Thanks for any ideas that may help.
You can hack this together with another query. Start with your appended_event_namespace as the result of your original LISTAGG
SELECT event_namespace,
SUBSTRING(appended_event_namespace,
1,
POSITION(event_namespace,appended_event_namespace) + LEN(event_namespace) - 1
) as appended_event_namespace_cum
FROM your_table;
Basically, you take your aggregated, ordered string, and then take the first N characters where N is ([where it appears in the aggregated string ]+[its length]), which will cut out everything after that item. This gives you a cumulative namespace.
LISTAGG with frame clause is not supported in RS yet. If you have some columns that you can use for partitioning and ordering you can make a self join (not so performant but would accomplish what you want):
SELECT
t1.id
,t2.tstamp_true
,t1.event_namespace
,LISTAGG(t2.event_namespace,'/') WITHIN GROUP (ORDER BY t2.tstamp_true)
FROM your_table t1
JOIN your_table t2
ON t1.id=t2.id
AND t1.tstamp_true>=t2.tstamp_true
GROUP BY 1,2,3
Alternatively, if you want to avoid self join you can build a JSON with the following structure using LISTAGG:
[{tstamp_true_1,event_namespace_1},{tstamp_true_N,event_namespace_N},...]
and write a Python UDF that takes such JSON for the given group of rows and tstamp_true of the given row and returns the path (the function would need to filter the tstamp_true_N values earlier than the second parameter and concatenate filtered event_namespace_N values for the output)

Eliminating Entries Based On Revision

I need to figure out how to eliminate older revisions from my query's results, my database stores orders as 'Q000000' and revisions have an appended '-number'. My query currently is as follows:
SELECT DISTINCT Estimate.EstimateNo
FROM Estimate
INNER JOIN EstimateDetails ON EstimateDetails.EstimateID = Estimate.EstimateID
INNER JOIN EstimateDoorList ON EstimateDoorList.ItemSpecID = EstimateDetails.ItemSpecID
WHERE (Estimate.SalesRepID = '67' OR Estimate.SalesRepID = '61') AND Estimate.EntryDate >= '2017-01-01 00:00:00.000' AND EstimateDoorList.SlabSpecies LIKE '%MDF%'
ORDER BY Estimate.EstimateNo
So for instance, the results would include:
Q120455-10
Q120445-11
Q121675-2
Q122361-1
Q123456
Q123456-1
From this, I need to eliminate 'Q120455-10' because of the presence of '-11' for that order, and 'Q123456' because of the presence of the '-1' revision. I'm struggling greatly with figuring out how to do this, my immediate thought was to use case statements but I'm not sure what is the best way to implement them and how to filter. Thank you in advance, let me know if any more information is needed.
First you have to parse your EstimateNo column into sequence number and revision number using CHARINDEX and SUBSTRING (or STRING_SPLIT in newer versions) and CAST/CONVERT the revision to a numeric type
SELECT
SUBSTRING(Estimate.EstimateNo,0,CHARINDEX('-',Estimate.EstimateNo)) as [EstimateNo],
CAST(SUBSTRING(Estimate.EstimateNo,CHARINDEX('-',Estimate.EstimateNo)+1, LEN(Estimate.EstimateNo)-CHARINDEX('-',Estimate.EstimateNo)+1) as INT) as [EstimateRevision]
FROM
...
You can then use
APPLY - to select TOP 1 row that matches the EstimateNo or
Window function such as ROW_NUMBER to select only records with row number of 1
For example, using a ROW_NUMBER would look something like below:
SELECT
ROW_NUMBER() OVER(PARTITION BY EstimateNo ORDER BY EstimateRevision DESC) AS "LastRevisionForEstimate",
-- rest of the needed columns
FROM
(
-- query above goes here
)
You can then wrap the query above in a simple select with a where predicate filtering out a specific value of LastRevisionForEstimate, for instance
SELECT --needed columns
FROM -- result set above
WHERE LastRevisionForEstimate = 1
Please note that this is to a certain extent, pseudocode, as I do not have your schema and cannot test the query
If you dislike the nested selects, check out the Common Table Expressions