How to count collocations in a repeating field - google-bigquery

I have a repeated field A which contains a list of strings. What would be a good way to find the top strings that co-occur with a given string? For example, if A holds hashtags, then for a given hashtag #T1, find the tags that appear together with #T1 in the highest number of records.

You can use WITHIN and SUM(IF(...)) to find the matches. For example:
SELECT hashtag, COUNT(*) AS cnt
FROM
(SELECT tweet.hashtag AS hashtag,
SUM(IF(tweet.hashtag == '#T1', 1, 0)) WITHIN RECORD AS tagz
FROM [tweets])
WHERE tagz > 0
GROUP BY hashtag
ORDER BY cnt DESC
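The counting logic can also be sketched outside BigQuery. The Python snippet below is only an illustration, assuming each record is a plain list of hashtags; the sample `records` and the function name `cooccurring_tags` are made up for the example. Unlike the query, it leaves out the target tag itself and counts a tag at most once per record.

```python
from collections import Counter

def cooccurring_tags(records, target):
    """Count, per tag, the number of records that also contain `target`."""
    counts = Counter()
    for tags in records:
        unique = set(tags)  # count each tag once per record
        if target in unique:
            counts.update(unique - {target})
    # Most frequent co-occurring tags first
    return counts.most_common()

# Hypothetical sample data: one list of hashtags per record
records = [
    ["#T1", "#T2", "#T3"],
    ["#T1", "#T2"],
    ["#T2", "#T3"],
    ["#T1", "#T4", "#T4"],
]

print(cooccurring_tags(records, "#T1"))
```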

Related

Count the most popular occurrences of a hashtag in a string column postgreSQL

I have a column in my dataset with the following format:
hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]
I have managed to count the hashtags like so:
SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST
But I'm not sure if it's possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:
hashtags count
christmas 3
newyears 2
The datatype is just varchar, but I'm a bit confused on how I should approach this. Any help would be appreciated!
Storing the data this way is a bad idea. It's risky because we don't know whether the text will always be stored in exactly this form. It would be better to store the individual strings separately.
Anyway, if you can't change that and must deal with this structure, you can use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.
So the general idea is something like this:
WITH unnested AS
(SELECT
UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
Due to the brackets, spaces and # characters within your column, this will not produce the correct result.
So we can additionally use TRIM and TRANSLATE to get rid of everything except the hashtag names.
With your sample data, the following construct produces the intended outcome:
WITH unnested AS
(SELECT
TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')),'#,[,]','')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag)
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
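For comparison, the same split, clean and count steps can be sketched in Python. This is just an illustration of the idea, assuming the strings look exactly like the sample data in the question; the function name `count_hashtags` is made up for the example.

```python
from collections import Counter

# Sample values copied from the question
rows = [
    "[#newyears, #christmas, #christmas]",
    "[#easter, #newyears, #fourthofjuly]",
    "[#valentines, #christmas, #easter]",
]

def count_hashtags(rows):
    """Split each bracketed string on commas and count the cleaned tags."""
    counts = Counter()
    for row in rows:
        for part in row.split(","):
            # Drop surrounding spaces, brackets and the leading '#'
            tag = part.strip().strip("[]").strip().lstrip("#")
            if tag:
                counts[tag] += 1
    return counts

counts = count_hashtags(rows)
print(counts.most_common())
```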
But as already said, this is unpleasant and risky.
So if possible, find out which hashtags can occur (they all seem to be special days) and then create columns or a mapping table for them.
With separate columns, you would store 0 or 1 in each column to indicate whether the hashtag appears, and then sum the values per column.
I think you should expand the array into one row per element and then count the rows with GROUP BY. Note that this only works if hashtags is an actual array column; a varchar column has to be converted first, for example with string_to_array. Something like this query:
SELECT hashtag, count(*) as hashtag_count
FROM full_data, unnest(hashtags) s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC
Hopefully, it will match your request!
You can do it as follows:
select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^\w,]+','','g'), ',')) as tags, count(1)
from full_data
group by tags
order by count(1) desc
Result :
tags count
christmas 3
newyears 2
easter 2
fourthofjuly 1
valentines 1
REGEXP_REPLACE to remove any special characters.
string_to_array to generate an array
unnest to expand an array to a set of rows

How to get first row of 3 specific values of a column using Oracle SQL?

I have a table which has ID, FAMILY, ENV_XML_PATH and CREATED_DATE columns.
ID        FAMILY  ENV_XML_PATH  CREATED_DATE
15826841  CRM     path1.xml     03-09-22 6:50:34AM
15826856  SCM     path3.xml     03-10-22 7:12:20AM
15826786  IC      path4.xml     02-10-22 12:50:52AM
15825965  CRM     path5.xml     02-10-22 1:50:52AM
15653951  null    path6.xml     04-10-22 12:50:52AM
15826840  FIN     path7.xml     03-10-22 2:34:09AM
15826841  SCM     path8.xml     02-10-22 8:40:52AM
15223450  IC      path9.xml     03-09-22 5:34:09AM
15026853  SCM     path10.xml    05-10-22 4:40:59AM
There are 18 distinct values in the FAMILY column, and each value has multiple rows associated with it (as you can see from the table above).
What I want is to get the first row of 3 specific values (CRM, SCM and IC) in FAMILY column.
Something like this:
ID        FAMILY  ENV_XML_PATH  CREATED_DATE
15826841  CRM     path1.xml     date1
15826856  SCM     path3.xml     date2
15826786  IC      path4.xml     date3
I am new to this. I understand the logic, but I am not sure how to implement it. Kindly help. Thanks.
You can use RANK for that. Something like this:
WITH groupedData AS
(SELECT id, family, env_xml_path, created_date,
RANK () OVER (PARTITION BY family ORDER BY id) AS r_num
FROM yourtable
GROUP BY id, family, env_xml_path, created_date)
SELECT id, family, env_xml_path, created_date
FROM groupedData
WHERE r_num = 1
ORDER BY id;
In the first query (the CTE), the rows are partitioned by family and ranked within each family by the column you choose (id in my example).
After that, the second query keeps only the first row of each family.
Add a WHERE clause to the first query, for example family IN ('CRM', 'SCM', 'IC'), if you need to restrict the result set further.
You could use a window function to number the rows within each family, ordered by created_date, and then keep only the first row of each of the three families you are interested in:
with row_window as (
select
id,
family,
env_xml_path,
created_date,
row_number() over (partition by family order by created_date asc) as rn
from <your_table>
where family in ('CRM', 'SCM', 'IC')
)
select
id,
family,
env_xml_path,
created_date
from row_window
where rn = 1
Output:
ID        FAMILY  ENV_XML_PATH  CREATED_DATE
15826841  CRM     path1.xml     03-09-22 6:50:34
15826856  SCM     path3.xml     03-10-22 7:12:20
15826786  IC      path4.xml     02-10-22 12:50:52
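The "number the rows per family, keep number 1" idea can be sketched in Python as follows. This is only an illustration under assumptions: the rows below are invented (they are not the table from the question), "first" is taken to mean the earliest created_date, and the function name `first_per_family` is made up.

```python
from datetime import datetime

# Hypothetical rows: (id, family, env_xml_path, created_date)
rows = [
    (1, "CRM", "a.xml", datetime(2022, 10, 2)),
    (2, "SCM", "b.xml", datetime(2022, 10, 3)),
    (3, "CRM", "c.xml", datetime(2022, 9, 3)),
    (4, "FIN", "d.xml", datetime(2022, 9, 1)),
    (5, "IC", "e.xml", datetime(2022, 10, 5)),
    (6, "SCM", "f.xml", datetime(2022, 10, 1)),
]

def first_per_family(rows, families):
    """Return the row with the earliest created_date for each wanted family."""
    earliest = {}
    for row in rows:
        family, created = row[1], row[3]
        if family not in families:
            continue  # same role as the WHERE family IN (...) filter
        if family not in earliest or created < earliest[family][3]:
            earliest[family] = row
    return earliest

result = first_per_family(rows, {"CRM", "SCM", "IC"})
for family, row in sorted(result.items()):
    print(family, row[0], row[2])
```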
The question doesn't really specify what 'first' means, but I assume it means the first row added to the table, i.e. the row with the oldest date. Try this code:
SELECT DISTINCT * FROM (yourTable) WHERE Family = 'CRM' OR
Family = 'SCM' OR Family = 'IC' ORDER BY Created_Date ASC FETCH FIRST (number) ROWS ONLY;
What it does:
DISTINCT - removes duplicate rows. Note that it only removes rows that are identical in every selected column; it does not limit the result to one row per family.
WHERE - checks whether a certain condition is true.
OR - the select returns rows that match any of the listed conditions, here rows whose family is 'CRM', 'SCM' or 'IC'.
ORDER BY - sorts the rows by the given column. If 'first' means the oldest, then ordering by date with ASC puts the oldest (i.e. smallest) date at the top.
FETCH FIRST (number) ROWS ONLY - returns only the first rows of the sorted result. For example, FETCH FIRST 3 ROWS ONLY returns the three oldest rows. These can belong to the same family, so to get exactly one row per family you need a per-family ranking such as ROW_NUMBER() with PARTITION BY family.

Data Studio obtain a transposed table from BigQuery and hidden conditional formatting in text/strings

Several shops should be monitored about their status in a Data Studio dashboard.
There are fewer than 20 shops; I show only two in this example. The BigQuery table has a shop column and the following columns: status, info, sold_today and update_time. The shop and update_time columns are always filled, but the other ones are filled only if there is a change.
Task: For each shop the last entries of all columns should be shown.
Here is the BigQuery code for the sample table:
create or replace table dsadds.eu.dummy as (
select "shop A" as shop, 1000 as sold_today, "sale ABC" as info, 0 as status, timestamp("2022-09-05") as update_time
union all select "shop A", null, null, 1, "2022-09-06"
union all select "shop A", 500, "open", 3, "2022-09-01"
union all select "shop B", 700, "open", 3, current_timestamp()
)
In Data Studio, with conditional formatting that marks status = 1 in red, this table looks like this:
As you can see, "shop A" is shown several times and with null values.
With the following custom BigQuery query in Data Studio I can obtain the last entry of each shop:
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.eu.dummy
group by 1
)
select * from tbl
resulting in following table, showing all needed information:
However, the users would like to have this table to be transposed and look like this:
On the right hand side it is shown with the final textbox for the labeling of the rows.
Of course, it is possible to build a Scorecard for each entry, but with 10 shops and three fields each, the limit of charts per page is reached.
Question
Is there a way to transpose a table and also do the conditional formatting?
The task is to return one column for each shop plus an id column to sort the results. A column can only have one data type, so we cannot return a string in one row and an integer in another. Thus all integer values have to be cast to strings in BigQuery.
For the transpose, we build a CTE named tlb_array. For each shop it creates an array of structs: the first entry holds the shop name, the second the info column, and the third the sold_today column, cast from integer to string. Each entry also carries an id, which is its row number. Unnesting this array yields one row per shop and entry. The final select groups these rows by id and creates one column per shop, where the if condition only considers the data of that shop. We end up with a table of three rows, with the row number in id and the needed data in the shop columns.
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.eu.dummy
group by 1
),
tlb_array as (
Select shop,X.* from tbl,
unnest([
struct(1 as id,shop as value),
struct(2,info),
struct(3,cast(sold_today as string))]) X
)
select id,
any_value(if(shop="shop A",value,null)) as ShopA,
any_value(if(shop="shop B",value,null)) as ShopB,
from tlb_array
group by 1
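The reshaping done by the query can be pictured with a small Python sketch. It is only an illustration, assuming the latest values per shop are already available as a dictionary; the `latest` data and the function name `transpose` are invented for the example.

```python
# Hypothetical "latest entry per shop" data, as the first CTE would return it
latest = {
    "shop A": {"info": "sale ABC", "sold_today": 1000},
    "shop B": {"info": "open", "sold_today": 700},
}

def transpose(latest):
    """Return one row per field (with an id) and one column per shop."""
    shops = sorted(latest)
    # Row 1: shop name, row 2: info, row 3: sold_today cast to text,
    # so that every shop column holds strings only
    getters = [
        lambda shop: shop,
        lambda shop: latest[shop]["info"],
        lambda shop: str(latest[shop]["sold_today"]),
    ]
    rows = []
    for row_id, get in enumerate(getters, start=1):
        rows.append({"id": row_id, **{shop: get(shop) for shop in shops}})
    return rows

for row in transpose(latest):
    print(row)
```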
So far we return only strings. But we want to apply conditional formatting without adding further columns. The trick is to include special characters in the returned string. A visible character such as ´ or ' would work, but it would distract the user. Space characters are a better choice: Unicode has several space characters of different widths, so a number can be encoded as a sequence of them. The following UDF has to be hosted in your own project. It encodes each decimal digit of a number as a different Unicode space character.
CREATE OR REPLACE FUNCTION `dsadds.us.number_to_space`(x INT64) AS (
(
SELECT
CONCAT(" ",
string_agg(SUBSTRING(
CODE_POINTS_TO_STRING([0x2007, 0x2002, 0x2004, 0x2005, 0x2006, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F]),
y-47,1),"")
,"- ")
FROM
UNNEST(TO_CODE_POINTS(CAST(x AS string))) y )
);
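To make the digit mapping easier to follow, here is a rough Python equivalent. It is a sketch, not a port of the UDF: it uses the same ten code points, but leaves out the prefix and suffix that the UDF adds with CONCAT, assumes a non-negative integer, and adds a hypothetical decoder `space_to_number` just to show that the encoding is reversible.

```python
# The same ten Unicode space code points as in the UDF, indexed by digit 0-9
SPACE_CODE_POINTS = [0x2007, 0x2002, 0x2004, 0x2005, 0x2006,
                     0x2008, 0x2009, 0x200A, 0x202F, 0x205F]

def number_to_space(x: int) -> str:
    """Encode each decimal digit of x as a distinct space character."""
    if x < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(chr(SPACE_CODE_POINTS[int(digit)]) for digit in str(x))

def space_to_number(encoded: str) -> int:
    """Decode a string produced by number_to_space (helper for illustration)."""
    digits = (str(SPACE_CODE_POINTS.index(ord(ch))) for ch in encoded)
    return int("".join(digits))

encoded = number_to_space(13)
print([hex(ord(ch)) for ch in encoded])
print(space_to_number(encoded))
```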
Then you can use this function in your Custom BigQuery in Data Studio:
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.us.dummy
group by 1
),
tlb_array as (
Select shop,X.* from tbl,
unnest([
struct(1 as id,concat(shop,dsadds.us.number_to_space(status)) as value),
struct(2,concat(info,dsadds.us.number_to_space(status))),
struct(3,cast(sold_today as string))]) X
)
select id,
any_value(if(shop="shop A",value,null)) as ShopA,
any_value(if(shop="shop B",value,null)) as ShopB,
from tlb_array
group by 1
This produces the needed table. The (hidden) space characters have to be copied from the table (possible only in view mode of Data Studio, not in edit mode) and used in the conditional formatting rules (text contains: ). Also add a textbox over the first column to hide it and enter the labels for each row.

Counting distinct values output from a grouped SQL Count function

I've got a database that holds information about volunteers and their participation in a range of events.
The following query gives me a list of their names and total attendances
SELECT
volunteers.last_name,
volunteers.first_name,
count (bookings.id)
FROM
volunteers,
bookings
WHERE
volunteers.id = bookings.volunteer_id
GROUP BY
volunteers.last_name,
volunteers.first_name
I want the result table to show each distinct number of attendances and how many volunteers have that number. So if five people each did one event, it would display 1 in the first column and 5 in the second, and so on.
Thanks
If I understand correctly, you want what I call a "histogram of histograms" query:
select numvolunteers, count(*) as numevents, min(eventid), max(eventid)
from (select b.eventid, count(*) as numvolunteers
from bookings b
group by b.eventid
) b
group by numvolunteers
order by numvolunteers;
The first column is the number of volunteers booked for an "event". The second is the number of events where this occurs. The last two columns are just examples of events that have the given number of volunteers.
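The two-level counting can be sketched in Python as well. The example below is an illustration with invented bookings; it counts per volunteer, as the question describes, and swapping the key to the event id would give the per-event variant of the query above.

```python
from collections import Counter

# Hypothetical bookings: (volunteer_id, event_id)
bookings = [
    (1, "e1"), (2, "e1"),
    (3, "e2"), (1, "e2"),
    (1, "e3"), (4, "e3"),
]

# Inner count: number of attendances per volunteer
per_volunteer = Counter(volunteer for volunteer, _ in bookings)

# Outer count: how many volunteers share each attendance count
histogram = Counter(per_volunteer.values())

for attendances, volunteers in sorted(histogram.items()):
    print(attendances, volunteers)
```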

How to Order only first 20 records in a resultset using SQL?

My requirement is to get the list of diagnoses ordered by how often each one is used. To achieve that, I added a column named DiagnosisCounter to the tblDiagnosisMst table, which is increased by 1 each time a user selects that diagnosis. My query looks like this:
select DiagnosisID,DiagnosisCode,Name from tblDiagnosisMst
where GroupName = 'Common' and RecStatus = 'A' order by DiagnosisCounter desc,
Name asc
This query returns the list of diagnoses in descending order of DiagnosisCounter and then alphabetically by name. But now my client wants to show only the 20 most used diagnoses at the top, followed by all the remaining names in alphabetical order. Unfortunately I am stuck at this point. I would appreciate any advice on this problem.
This should do the trick:
;With Ordered as (
select DiagnosisID,DiagnosisCode,Name,
ROW_NUMBER() OVER (ORDER BY DiagnosisCounter desc) as rn
from tblDiagnosisMst
where GroupName = 'Common' and RecStatus = 'A'
)
select * from Ordered
order by CASE WHEN rn <= 20 THEN rn ELSE 21 END,
Name asc
We use ROW_NUMBER to assign the numbers 1-x to the rows, based on DiagnosisCounter. We then use that value as the first ORDER BY condition if it's in 1-20, while all other rows sort equally in position 21. The second condition is then used as a tie-breaker to sort those remaining rows by name.
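The same two-part sort key can be sketched in Python. This is only an illustration with invented data, assuming diagnosis names are unique; the function name `order_diagnoses` and its `top_n` parameter are made up, and the example uses top_n=2 to keep the data short.

```python
def order_diagnoses(rows, top_n=20):
    """Sort rows of (name, counter): top_n most used first, the rest by name."""
    # Rank 1..x by descending counter, like ROW_NUMBER() OVER (ORDER BY ... DESC)
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    rank = {row[0]: position for position, row in enumerate(ranked, start=1)}
    # Ranks above top_n collapse to top_n + 1, so the name breaks the tie
    return sorted(rows, key=lambda row: (min(rank[row[0]], top_n + 1), row[0]))

# Hypothetical diagnoses: (name, usage counter)
rows = [("Flu", 5), ("Asthma", 1), ("Cold", 9), ("Bronchitis", 3), ("Zoster", 2)]

print([name for name, _ in order_diagnoses(rows, top_n=2)])
```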
Try this
SELECT TOP 20
* FROM tblDiagnosisMst ORDER BY DiagnosisCounter DESC;