Count the number of times the portion of a link beginning with certain string appears in the text of a column - sql

I need to count for each row the number of times the portion of a link beginning with 'https://t.co/' appears in the text of a column named "Tweet_text".
I've done:
SELECT COUNT(REGEXP_CONTAINS('https://t.co/', Tweet_text)) As Cnt
FROM `MyTable`
But this returns the overall count over the whole table, not the count row by row.

You can try this query:
SELECT ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Tweet_text, 'https://t.co/'))
FROM MyTable
REGEXP_CONTAINS only tells you whether the regular expression matched at all (and it takes the value first, then the pattern):
Returns TRUE if value is a partial match for the regular expression, regexp.
To count how many times the substring occurs in your column, combine REGEXP_EXTRACT_ALL with ARRAY_LENGTH.
You get a count for each row (not a total) because the query no longer uses an aggregate function such as COUNT.
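As a quick check, here is a minimal sketch in BigQuery standard SQL that runs the same expression against two made-up rows. The `sample` CTE and its contents are assumptions for illustration only, and the dot is escaped so it matches a literal dot.

```sql
-- Hypothetical inline data; replace `sample` with your own table.
WITH sample AS (
  SELECT 'see https://t.co/abc and https://t.co/def' AS Tweet_text
  UNION ALL
  SELECT 'no links here'
)
SELECT
  Tweet_text,
  -- REGEXP_EXTRACT_ALL returns an array of all matches (empty if none),
  -- so ARRAY_LENGTH gives the per-row count.
  ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Tweet_text, r'https://t\.co/')) AS Cnt
FROM sample
```

The first row should yield 2 and the second 0, since an empty array has length 0.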

Related

How can I use an SQL aggregate function on data I directly input at the command line (e.g. AVG(1, 2, 3))?

How can I enter multiple values into an aggregate function using just data I enter at the command line? Say, in Postgres, I run the following.
SELECT AVG(2);
I'll get the correct answer, but I can't find a way to enter multiple values, such as below, without getting an error.
SELECT AVG(1,NULL,2,3);
I've tried wrapping the numbers in various brackets but to no effect. What's the syntax I'm missing?
EDIT: Additionally, is there a way to include NULLs in the input?
AVG() is an aggregate that operates over multiple rows, so you need to turn your comma-separated list into one row per value before you can apply an aggregate like avg(). One way to do this is string_to_table (available since PostgreSQL 14):
select avg(num::numeric)
from string_to_table('1,2,3', ',') as x(num)
If you want to include a NULL value, you can add it to the list and convert it to NULL before casting to numeric (note that avg() ignores NULLs, so they do not change the result):
select avg(nullif(num, 'null')::numeric)
from string_to_table('1,2,3,4,null', ',') as x(num)
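If the values are typed directly rather than arriving as one string, another option (not part of the answer above, shown only as a sketch) is to build the rows with VALUES or unnest. The aliases `t` and `v` are arbitrary names chosen for illustration.

```sql
-- One row per literal; the NULL here is a real NULL, not the string 'null'.
SELECT avg(v)
FROM (VALUES (1), (NULL), (2), (3)) AS t(v);

-- The same idea using an array literal expanded into rows.
SELECT avg(v)
FROM unnest(ARRAY[1, NULL, 2, 3]) AS t(v);
```

In both cases avg() skips the NULL and averages the three remaining values.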

Why SUBSTRING_INDEX in Hive is not working for negative count?

When I execute the following two queries, SUBSTRING_INDEX with a positive count gives me the correct result, but SUBSTRING_INDEX with a negative count gives me the wrong result.
SELECT SUBSTRING_INDEX('wwwbig.data.nsqlcom',"data", 1)
Output: wwwbig.
SELECT SUBSTRING_INDEX("wwwbig.data.nsqlcom",'data', -1)
Output: ata.nsqlcom
As per the function definition, the second query should return ".nsqlcom" value. Note: This issue is only seen in the case of Hive and not any other tool.
It's working perfectly.
The definition is:
substring_index(STRING a, STRING delim, INT count)
Returns the substring from string A before count occurrences of the delimiter delim (as of Hive 1.3.0).
If count is positive, everything to the left of the final delimiter (counting from the left) is returned.
If count is negative, everything to the right of the final delimiter (counting from the right) is returned.
In your second case, where count is -1 and the delimiter is 'data', the function returns everything to the right of the delimiter, based on the position of its final occurrence.
In your data "wwwbig.data.nsqlcom", the final occurrence of 'data' starts at zero-based index 7, so everything from index 8 onward is returned, which is ata.nsqlcom. Note that only the first character of the delimiter is skipped, not the whole delimiter.
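If what you need is the text after the last full occurrence of a multi-character delimiter, one possible workaround (an assumption about the goal, not something the answer above proposes) is regexp_extract with a greedy prefix:

```sql
-- The greedy .* consumes everything up to the last 'data';
-- capture group 1 is whatever follows it.
SELECT regexp_extract('wwwbig.data.nsqlcom', '.*data(.*)', 1);
```

For the sample string this yields .nsqlcom. If the delimiter contains regex metacharacters, they would need to be escaped.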

ORACLE sql Substr / Instr

I have a column within a table that has PO-RAILCAR. I need to split this column into two. I write the following query and it does exactly what I want. However, the results come back with the dash. How do I write it to return the values as they are without the dashes?
SELECT INVT_LEV3, SUBSTR(INVT_LEV3,1,INSTR(INVT_LEV3,'-')) AS PO,
SUBSTR(INVT_LEV3,INSTR(INVT_LEV3,'-')) AS Railcar
FROM C_MVT_H
WHERE INVT_LEV4 = 'G07K02129/G07K02133'
This is what I get: First column is the column I need to split. The second and third look perfect but I need the dash removed
Column 1: 110799P-FBOX50553 Column2: 110799P- Column3:-FBOX505536
The problem occurs because INSTR gives you the position of the '-' itself. To exclude the dash, subtract 1 from the length in the first SUBSTR and add 1 to the start position in the second.
Your current query:
SELECT INVT_LEV3, SUBSTR(INVT_LEV3,1,INSTR(INVT_LEV3,'-')) AS PO, SUBSTR(INVT_LEV3,INSTR(INVT_LEV3,'-')) AS Railcar FROM C_MVT_H WHERE INVT_LEV4 = 'G07K02129/G07K02133'
Proposed new query:
SELECT INVT_LEV3, SUBSTR(INVT_LEV3,1,INSTR(INVT_LEV3,'-')-1) AS PO, SUBSTR(INVT_LEV3,INSTR(INVT_LEV3,'-')+1) AS Railcar FROM C_MVT_H WHERE INVT_LEV4 = 'G07K02129/G07K02133'
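The off-by-one adjustment (subtract 1 from the length, add 1 to the start) can be checked against the sample value from the question without touching the real table. This sketch uses a literal and Oracle's DUAL table:

```sql
-- INSTR returns 8, the position of the dash in the sample value.
SELECT SUBSTR('110799P-FBOX50553', 1, INSTR('110799P-FBOX50553', '-') - 1) AS po,      -- characters 1..7
       SUBSTR('110799P-FBOX50553', INSTR('110799P-FBOX50553', '-') + 1)    AS railcar  -- from character 9 on
FROM dual;
```

This should return 110799P and FBOX50553, with no dash in either column.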

count number of times a regex pattern occurs in hive

I have a string variable stored in hive as follows
stringvar
AA1,BB3,CD4
AA12,XJ5
I would like to count (and filter on) how many times the regex pattern \w\w\d occurs. In the example, in the first row there are obviously three such examples. How can I do that without resorting to lateral views and explosions of stringvar (too expensive)?
Thanks!
You can split string by pattern and calculate size of result array - 1.
Demo:
select size(split('AA1,BB3,CD4','\\w\\w\\d'))-1 --returns 3
select size(split('AA12,XJ5','\\w\\w\\d'))-1 --returns 2
select size(split('AAxx,XJx','\\w\\w\\d'))-1 --returns 0
select size(split('','\\w\\w\\d'))-1 --returns 0
If the column is nullable, then special care should be taken. For example, like this (depending on what you need returned for NULL):
select case when col is null then 0
else size(split(col,'\\w\\w\\d'))-1
end
Or simply convert NULL to empty string using NVL function:
select size(split(NVL(col,''),'\\w\\w\\d'))-1
The solution above is the most flexible one: you can count the number of occurrences and use the count for complex filtering, joins, etc.
If you only need to filter records with a fixed (or minimum) number of pattern occurrences and do not need the exact count, a simple RLIKE without splitting is the cheapest method.
For example, to check for at least 2 occurrences:
select 'AA1,BB3,CD4' rlike('\\w\\w\\d+,\\w\\w\\d+') --returns true, can be used in WHERE
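To filter on the exact count rather than just test for a match, the split-based expression can go in a subquery. This is only a sketch: `my_table` is a hypothetical table name, while `stringvar` is the column from the question.

```sql
-- my_table is a placeholder; cnt is the number of \w\w\d matches per row.
select stringvar, cnt
from (
  select stringvar,
         size(split(nvl(stringvar, ''), '\\w\\w\\d')) - 1 as cnt
  from my_table
) t
where cnt >= 3
```

With the two sample rows from the question, only AA1,BB3,CD4 would pass this filter.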

Problem with MySQL Select query with "IN" condition

I found a weird problem with MySQL select statement having "IN" in where clause:
I am trying this query:
SELECT ads.*
FROM advertisement_urls ads
WHERE ad_pool_id = 5
AND status = 1
AND ads.id = 23
AND 3 NOT IN (hide_from_publishers)
ORDER BY rank desc
In the SQL above, hide_from_publishers is a column of the advertisement_urls table whose values are comma-separated integers, e.g. 4,2 or 2,7,3.
So if two records have the values above in hide_from_publishers, the query should return only the record with "4,2", but it returns both records.
Now, if I change the value of hide_from_publishers for the second record to 3,2,7 and run the query again, it returns a single record, which is the correct output.
If I use literal values instead of hide_from_publishers, i.e. (2,7,3), the query recognizes them and returns a single record.
Any thoughts about this strange problem or am I doing something wrong?
There is a difference between the tuple (1, 2, 3) and the string "1, 2, 3". The former is three values, the latter is a single string value that just happens to look like three values to human eyes. As far as the DBMS is concerned, it's still a single value.
If you want more than one value associated with a record, you shouldn't be storing it as a comma-separated value within a single field, you should store it in another table and join it. That way the data remains structured and you can use it as part of a query.
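A minimal sketch of that normalized layout follows. The table name `ad_hidden_publishers` and its columns are hypothetical, chosen only for illustration; `rank` is quoted with backticks because it is a reserved word in MySQL 8.0.

```sql
-- Hypothetical link table: one row per (ad, publisher) pair to hide.
CREATE TABLE ad_hidden_publishers (
  ad_id        INT NOT NULL,
  publisher_id INT NOT NULL,
  PRIMARY KEY (ad_id, publisher_id)
);

-- Ads that are NOT hidden from publisher 3.
SELECT ads.*
FROM advertisement_urls ads
WHERE ads.ad_pool_id = 5
  AND ads.status = 1
  AND NOT EXISTS (
    SELECT 1
    FROM ad_hidden_publishers h
    WHERE h.ad_id = ads.id
      AND h.publisher_id = 3
  )
ORDER BY ads.`rank` DESC;
```

With this layout the exclusion is an ordinary indexed lookup instead of a string search.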
You need to treat the comma-delimited hide_from_publishers column as a string. You can use the LOCATE function to determine if your value exists in the string.
Note that I've added leading and trailing commas to both strings so that a search for "3" doesn't accidentally match "13".
select ads.*
from advertisement_urls ads
where ad_pool_id = 5
and status = 1
and ads.id = 23
and locate(',3,', concat(',', hide_from_publishers, ',')) = 0
order by rank desc
You need to split the string of values into separate values. See this SO question...
Can Mysql Split a column?
As well as the supplied example...
http://blog.fedecarg.com/2009/02/22/mysql-split-string-function/
Here is another SO question:
MySQL query finding values in a comma separated string
And the suggested solution:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set
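Applied to the query from the question, the FIND_IN_SET approach might look like the sketch below. It assumes the stored list has no spaces after the commas, because FIND_IN_SET compares whole comma-separated elements.

```sql
SELECT ads.*
FROM advertisement_urls ads
WHERE ad_pool_id = 5
  AND status = 1
  AND ads.id = 23
  -- FIND_IN_SET returns 0 when '3' is not an element of the list
  AND FIND_IN_SET('3', hide_from_publishers) = 0
ORDER BY `rank` DESC;
```

FIND_IN_SET returns NULL when hide_from_publishers is NULL, so such rows would be filtered out unless the column is wrapped in something like COALESCE(hide_from_publishers, '').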