count number of times a regex pattern occurs in hive

count number of times a regex pattern occurs in hive - sql

I have a string variable stored in hive as follows
stringvar
AA1,BB3,CD4
AA12,XJ5
I would like to count (and filter on) how many times the regex pattern \w\w\d occurs. In the example, in the first row there are obviously three such examples. How can I do that without resorting to lateral views and explosions of stringvar (too expensive)?
Thanks!

You can split string by pattern and calculate size of result array - 1.
Demo:
select size(split('AA1,BB3,CD4','\\w\\w\\d'))-1 --returns 3
select size(split('AA12,XJ5','\\w\\w\\d'))-1 --returns 2
select size(split('AAxx,XJx','\\w\\w\\d'))-1 --returns 0
select size(split('','\\w\\w\\d'))-1 --returns 0
If column is null-able than special care should be taken. For example like this (depends on what you need to be returned in case of NULL):
select case when col is null then 0
else size(split(col,'\\w\\w\\d'))-1
end
Or simply convert NULL to empty string using NVL function:
select size(split(NVL(col,''),'\\w\\w\\d'))-1
The solution above is the most flexible one, you can count the number of occurrences and use it for complex filtering/join/etc.
In case you just need to filter records with fixed number of pattern occurrences or at least fixed number and do not need to know exact count then simple RLIKE without splitting is the cheapest method.
For example check for at least 2 repeats:
select 'AA1,BB3,CD4' rlike('\\w\\w\\d+,\\w\\w\\d+') --returns true, can be used in WHERE

Related

Count all elements in an array

I have a table that I save some data include list of numbers.
like this:
numbers
(null)
،42593
،42593،42594،36725،42592،36725،42592
،42593،42594،36725،42592
،31046،36725،42592
I would like to count the number elements in every row in SQL Server
count
0
1
6
4
3

You could use a replacement trick here:
SELECT numbers,
COALESCE(LEN(numbers) - LEN(REPLACE(numbers, ',', '')), 0) AS num_elements
FROM yourTable;
The above trick works by counting the number of commas (assuming your data really has commas as separators). For example, your last sample data point was:
,31046,36725,42592 => length is 18
310463672542592 => length is 15
Hence the difference in lengths correctly yields the right number of elements.

Another idea is to useSTRING_SPLIT:
SELECT y.numbers,
(SELECT COUNT(Value) - 1
FROM string_split(COALESCE(y.numbers,''),',')) AS num_elements
FROM yourtable AS y;
I know this looks a bit unhandy on first glance due to this strange -1 in the second line and the COALESCE in the third line. So why do I talk about this option?
Well, the strange thing in your case which causes these difficulties in my query is that your rows always start with a comma.
This is quite weird and it would be much easier without this first comma in every row.
Let's assume you remove this comma in future. Then this will become really easy and good readable:
SELECT y.numbers,
(SELECT COUNT(Value)
FROM string_split(y.numbers,',')) AS num_elements
FROM yourtable AS y;
Try out: db<>fiddle

your data
CREATE TABLE yourtable(
numbers VARCHAR(max)
);
INSERT INTO yourtable
(numbers) VALUES
(null),
('،42593'),
('،42593،42594،36725،42592،36725،42592'),
('،42593،42594،36725،42592'),
('،31046،36725،42592');
you need ISNULL and len
select
ISNULL(len(numbers) - len(replace(numbers,'،','')) ,0) count
from yourtable
the other way is by using IIF and string_split as follows
SELECT IIF(count < 0, 0, count) count
FROM   (SELECT (SELECT Count(*) - 1
                FROM   STRING_SPLIT (Replace(Replace(numbers, 'R', ''), '،',
                                     'R'), 'R'
                       )) AS
               'count'
        FROM   yourtable) A
dbfiddle

Need to divide a date part in SQL Server

I have a column in my table with these values:
PING_TO_ME_20100828_Any87
TO_THESE_D_COLUMN_ENTRY_20200825
TO_THESE_D_20100829_COLUMN_ENTRY
201901_ARE_YOU_TRYING_TO_REACH47
ASK_TO_UOU_201008
I need to separate date values in a separate column.
My output should be:
20100828
20200825
20100829
201901
201008
Any help is very much appreciated.

You will (and already have) likely get comments about this telling you to fix your design. And while that is likely true...I won't try to pick apart why you are doing this, and I'll just give you the answer you came here for.
Your goal is to pick out either an 8 digit string of integers, or a 6 digit string of integers.
Here is one way you could do it:
SELECT x.y
, COALESCE(SUBSTRING(x.y, NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', x.y), 0), 8)
, SUBSTRING(x.y, NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9]%', x.y), 0), 6))
FROM (
VALUES ('PING_TO_ME_20100828_Any87'),
('TO_THESE_D_COLUMN_ENTRY_20200825'),
('TO_THESE_D_20100829_COLUMN_ENTRY'),
('201901_ARE_YOU_TRYING_TO_REACH47'),
('ASK_TO_UOU_201008')
) x(y)
Explanation:
Since you are looking for both 8 and 6 digit values, you need to check for the longer of the two first. So first I search for the occurrence of a string of 8 integers using:
NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', x.y), 0)
This returns the first position of a string of 8 integers. The reason I wrap it in a NULLIF() is because if the value is not found, then PATINDEX will return 0.
I use NULLIF() to return NULL in that case, essentially indicating nothing was found. If you pass a NULL value to SUBSTRING() then it also returns NULL.
This is all just a nice way of "failing over" to the 6 character string check.
So there I do the same thing again:
NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9]%', x.y), 0)
Except this time, I only repeat [0-9] six times. And again, I use the NULLIF() trick, so that it returns NULL if no string is found.
Throw that all into SUBSTRING() and COALESCE() and you've got a function that returns the results you're looking for.
Potential downsides
There are a couple down sides to this method.
It is not checking for a valid date, it's simply looking for a string of either 8 integers, or 6 integers. It could be 12345678 and it would still detect and return that.
If there are strings of integers longer than 8 digits, it will grab only the first 8 characters.
If there are multiple occurrences of 6 or 8 character integer strings...it will only return the first one.
There are much more robust ways you could write this, but it all depends on your data and what you need to do.
Other methods
Another way it could be done depending on which version of SQL Server you are using, is using STRING_SPLIT().
SELECT x.y, s.[value]
FROM (
VALUES ('PING_TO_ME_20100828_Any87'),('TO_THESE_D_COLUMN_ENTRY_20200825'),('TO_THESE_D_20100829_COLUMN_ENTRY'),('201901_ARE_YOU_TRYING_TO_REACH47'),('ASK_TO_UOU_201008')
) x(y)
CROSS APPLY (
SELECT [value]
FROM STRING_SPLIT(x.y, '_')
WHERE [value] LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
OR [value] LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]'
) s
This method handles a couple of the downsides mentioned earlier. For example, it will ONLY return integer strings of length 6 or 8. It will also return ALL integer strings of length 6 or 8 and not just the first one.
And there's other ways to identify the strings as well, like using ISNUMERIC(x.[value]) or TRY_CONVERT(int, s.[value]).
It all depends on how you are using this code...if it's runs fast enough, and it's a one off script, then it really doesn't matter. If it's running for millions of records at a time, then yeah you should play around with other methods.

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?

Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized
subexpressions, then the result is a single-element text array
containing the substring matching the whole pattern. If a match is
found, and the *pattern* contains parenthesized subexpressions, then the
result is a text array whose n'th element is the substring matching
the n'th parenthesized subexpression of the pattern
And for regexp_matches():
This function returns no rows if there is no match, one row if there
is a match and the g flag is not given, or N rows if there are N
matches and the g flag is given. Each returned row is a text array
containing the whole matched substring or the substrings matching
parenthesized subexpressions of the pattern, just as described above
for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:
When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
db<>fiddle here (with extended test case)

You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
select array_to_string(array_agg(x), ',')
from (
select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
from t t2
where t2.id = t.id
) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.

Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

Absolute maxvalue comparison of columns in Firebird SQL

I want to perform comparison for the specified columns in database, the comparison logic should compare the numbers regardless of their signs and will retrieve the result original with its sign.
For example, below code works well but as can be seen in the select block it returns the absolute value of columns. Is there any trick, cheat in Firebird 2.1 to overcome that?
SELECT a.ELM_NUM,a.COMBO, maxvalue(abs(a.N_1),abs(a.N_2)) as maxN from ntm a order by a.ELM_NUM

You can use a CASE condition:
SELECT a.ELM_NUM,a.COMBO,
CASE WHEN abs(a.N_1) > abs(a.N_2) THEN a.N_1 ELSE a.N_2 END as maxN
from ntm a
order by a.ELM_NUM

Problem with MySQL Select query with "IN" condition

I found a weird problem with MySQL select statement having "IN" in where clause:
I am trying this query:
SELECT ads.*
FROM advertisement_urls ads
WHERE ad_pool_id = 5
AND status = 1
AND ads.id = 23
AND 3 NOT IN (hide_from_publishers)
ORDER BY rank desc
In above SQL hide_from_publishers is a column of advertisement_urls table, with values as comma separated integers, e.g. 4,2 or 2,7,3 etc.
As a result, if hide_from_publishers contains same above two values, it should return only record for "4,2" but it returns both records
Now, if I change the value of hide_for_columns for second set to 3,2,7 and run the query again, it will return single record which is correct output.
Instead of hide_from_publishers if I use direct values there, i.e. (2,7,3) it does recognize and returns single record.
Any thoughts about this strange problem or am I doing something wrong?

There is a difference between the tuple (1, 2, 3) and the string "1, 2, 3". The former is three values, the latter is a single string value that just happens to look like three values to human eyes. As far as the DBMS is concerned, it's still a single value.
If you want more than one value associated with a record, you shouldn't be storing it as a comma-separated value within a single field, you should store it in another table and join it. That way the data remains structured and you can use it as part of a query.

You need to treat the comma-delimited hide_from_publishers column as a string. You can use the LOCATE function to determine if your value exists in the string.
Note that I've added leading and trailing commas to both strings so that a search for "3" doesn't accidentally match "13".
select ads.*
from advertisement_urls ads
where ad_pool_id = 5
and status = 1
and ads.id = 23
and locate(',3,', ','+hide_from_publishers+',') = 0
order by rank desc

You need to split the string of values into separate values. See this SO question...
Can Mysql Split a column?
As well as the supplied example...
http://blog.fedecarg.com/2009/02/22/mysql-split-string-function/

Here is another SO question:
MySQL query finding values in a comma separated string
And the suggested solution:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

count number of times a regex pattern occurs in hive - sql

Related

Count all elements in an array

Need to divide a date part in SQL Server

How to use regexp_matches() in an UPDATE statement?

Absolute maxvalue comparison of columns in Firebird SQL

Problem with MySQL Select query with "IN" condition

Categories

Resources