Count values if text contains in BigQuery


I have a datasource that looks like this:
combinations
A
A
A,B
B
A,C
A,B,C
What I want to do is create a table that counts every time a combination occurs OR is contained in another combination. There are two steps:
Output all the unique combinations.
Count every time each combination occurs or is contained in another combination.
In this example, the desired output is this one:
combinations frequency
A 5
B 3
A,B 2
A,C 2
A,B,C 1
Any ideas on how I can achieve this with BigQuery or SQL? I have tried with Count(), but the results are not correct.

You can extract the unique combinations then use join to count:
select c.combination, count(*)
from (select distinct combination from t) c join
t
on concat(',', t.combination, ',') like concat('%,', c.combination, ',%')
group by c.combination;
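To see what the LIKE-based join actually counts, here is a minimal Python sketch of the same delimiter-wrapping idea, run on the sample data from the question. The function name `count_substring_matches` is made up for illustration. Because each combination is matched as one contiguous string, 'A,C' is not found inside 'A,B,C', so its count comes out as 1 rather than the desired 2.

```python
def count_substring_matches(rows):
    """For each distinct combination, count the rows whose comma-wrapped
    text contains the comma-wrapped combination (mirrors the LIKE join)."""
    counts = {}
    for combo in sorted(set(rows)):
        needle = "," + combo + ","
        counts[combo] = sum(1 for row in rows if needle in "," + row + ",")
    return counts


# Sample data from the question
rows = ["A", "A", "A,B", "B", "A,C", "A,B,C"]
result = count_substring_matches(rows)
print(result)
```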
EDIT:
The above treats combinations as strings. If you want to treat the combinations as individual values, then storing them in strings is the wrong data structure. However, you can still do what you want by using arrays:
select t.combination, countif(num_combos = num_matches)
from (select t.combination, t.num_combos, t2.seqnum, count(*) as num_matches
-- one row per distinct combination, with its values as an array
from (select d.combination,
split(d.combination, ',') as ar_combos,
array_length(split(d.combination, ',')) as num_combos
from (select distinct combination from t) d
) t cross join
unnest(t.ar_combos) ar join
-- number the source rows before unnesting, so seqnum identifies a row and not a single value
(select s.combination, s.seqnum, ar2
from (select combination, row_number() over (order by combination) as seqnum
from t
) s cross join
unnest(split(s.combination, ',')) ar2
) t2
on t2.ar2 = ar
group by t.combination, t.num_combos, t2.seqnum
) t
group by combination;
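The array-based logic can be sketched outside SQL as a set-containment check. This is only an illustration under the assumption that values inside one combination are not repeated; the function name `count_subset_matches` is hypothetical. On the sample data from the question it reproduces the desired frequencies.

```python
def count_subset_matches(rows):
    """For each distinct combination, count the rows whose set of values
    contains every value of that combination."""
    row_sets = [set(row.split(",")) for row in rows]
    counts = {}
    for combo in set(rows):
        wanted = set(combo.split(","))
        # "wanted <= values" is True when wanted is a subset of values
        counts[combo] = sum(1 for values in row_sets if wanted <= values)
    return counts


# Sample data from the question
rows = ["A", "A", "A,B", "B", "A,C", "A,B,C"]
frequency = count_subset_matches(rows)
print(frequency)
```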

Consider the approach below:
select a.combination, countif((
select count(1) = array_length(split(a.combination))
from unnest(split(a.combination)) item
join unnest(split(d.combination)) item
using(item)
)) frequency
from (select distinct combination from data) a, data d
group by a.combination
# order by array_length(split(a.combination)), a.combination
If applied to the sample data in your question, the output matches the desired result shown there.

Related

Is it possible to apply "Select Distinct" to any column of the query that isn’t in the first place?

For example, like the query below:
WITH T1 AS
(
SELECT DISTINCT
song_name,
year_rank AS rank,
group_name
FROM
billboard_top_100_year_end
WHERE
year = 2010
ORDER BY
rank
LIMIT 10
)
SELECT
rank,
group_name,
song_name
FROM
T1
LIMIT 10
I had to put the column song_name first because I didn't know how to use DISTINCT if song_name was in third place.
So afterwards I needed to query again just to obtain exactly the same result with the columns displayed in a different order.

DISTINCT does not apply to a certain column of the result set, but to all. It just eliminates duplicate result rows.
SELECT DISTINCT a, b, c FROM tab;
is the same as
SELECT a, b, c FROM tab GROUP BY a, b, c;
Perhaps you are looking for the (non-standard!) PostgreSQL extension DISTINCT ON:
SELECT DISTINCT ON (song_name)
song_name, col2, col3, ...
FROM tab
ORDER BY song_name, col2;
With the ORDER BY, this will give you for each song_name the result with the smallest col2. If you omit the ORDER BY, you will get a random result row for each song_name.
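The behaviour of DISTINCT ON with ORDER BY can be imitated in a few lines of Python: sort by the key and the ordering column, then keep the first row of each key. This is a sketch of the idea only, not of how PostgreSQL implements it; the helper name `distinct_on_first` and the sample rows are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter


def distinct_on_first(rows, key_index, order_index):
    """Keep, for each key value, the row with the smallest order value,
    similar to DISTINCT ON (key) ... ORDER BY key, order_column."""
    ordered = sorted(rows, key=itemgetter(key_index, order_index))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key_index))]


# Hypothetical rows: (song_name, year_rank, group_name)
songs = [
    ("Song B", 7, "Group Y"),
    ("Song A", 3, "Group X"),
    ("Song A", 1, "Group X"),
    ("Song B", 2, "Group Z"),
]
picked = distinct_on_first(songs, key_index=0, order_index=1)
print(picked)
```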

SQL Server, include columns that are not in group by statement

I have a recurring problem.
Let's assume that I have the following columns:
T:A(PK), B, C, D, E
Now,
select A, MAX(B) from T group BY A
But I can't do:
select A, C, MAX(B) from T group BY A
I don't understand why. When it comes to AVG or SUM I get it. However, MAX or MIN comes from exactly one row.
How do I deal with it?

You can use ROW_NUMBER() for that like this:
select A, C, B
from (
select *
, row_number() over (partition by A order by B desc) seq
-- partition by A plays the role of group by A; order by B desc puts max(B) first
from yourTable ) t
where seq = 1;
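The ROW_NUMBER() pattern above can be tried out with Python's built-in sqlite3 module. This is a sketch under a few assumptions: the table contents are invented, A is deliberately not unique so that grouping is meaningful, and the bundled SQLite must be version 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B INTEGER, C TEXT)")
# Hypothetical sample rows: two groups (A = 1 and A = 2)
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, 10, "x"), (1, 30, "y"), (2, 5, "z"), (2, 4, "w")],
)
top_rows = conn.execute(
    """
    SELECT A, C, B
    FROM (
        SELECT A, B, C,
               ROW_NUMBER() OVER (PARTITION BY A ORDER BY B DESC) AS seq
        FROM T
    ) AS ranked
    WHERE seq = 1
    ORDER BY A
    """
).fetchall()
conn.close()
print(top_rows)
```

Each group keeps the whole row that holds its largest B, so C comes from the same row as the maximum.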

That's because every non-aggregated column in the select list must also be part of the GROUP BY clause. You may have columns that are part of the GROUP BY but not present in the select list, but the reverse is not possible.
Generally, you put in the select clause only those columns on which you want the grouping to happen.

Try this. It finds the MAX of one column (f3) per f1 and also returns the extra column you wanted (f2) without affecting the MAX operation.
SELECT m.f1,s.f2,m.maxf3 FROM
(SELECT f1,max(f3) maxf3 FROM t1 GROUP BY f1) m
CROSS APPLY (SELECT TOP(1) f2,f1 FROM t1 WHERE m.f1 = f1) s

Your question isn't very clear in that we aren't sure what you are trying to do.
Assuming you don't actually want to do a group by in your main query but want to return the max of B based on column A you can do it like so.
select A, C,(Select Max(B) from T as T2 WHERE T.A = T2.A) as MaxB from T

Select random rows from multiple tables in one query

I'm trying to insert some dummy data into a table (A), for which I need the IDs from two other tables (B and C). How can I get n rows, each with a random B.Id and a random C.Id?
I've got:
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id'
which gives me random Ids from each table, but what's the best way to get n of these? I've tried joining on a large table and doing top n, but the IDs from B and C are the same random Ids repeated for each row.
So looking to end up with something like this, but able to specify N rows.
INSERT INTO A (B-Id,C-Id,Note)
select
(Select top 1 ID from B order by newid()) as 'B.Id',
(select top 1 ID from C order by newid()) as 'C.Id',
'Rar'
So if B had Ids 1,2,3,4 and C had Ids 11,12,13,14, I'm after the equivalent of:
INSERT INTO A (B-Id,C-Id,Note)
Values
(3,11,'rar'), (1,14,'rar'),(4,11,'rar')
Where the Ids from each table are combined at random

If you want to avoid duplicates, you can use row_number() to enumerate the values in each table (randomly) and then join them:
select b.id as b_id, c.id as c_id
from (select b.*, row_number() over (order by newid()) as seqnum
from b
) b join
(select c.*, row_number() over (order by newid()) as seqnum
from c
) c
on b.seqnum = c.seqnum;
You can just add top N or where seqnum <= N to limit the number.
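The idea of numbering each table in random order and joining on that number is the same as shuffling two lists and zipping them. A minimal Python sketch, assuming the ids fit in memory; the function name `random_pairs` and the `seed` parameter are illustrative additions.

```python
import random


def random_pairs(b_ids, c_ids, n, seed=None):
    """Shuffle both id lists independently, pair them position by position
    (like joining on seqnum), and keep the first n pairs."""
    rng = random.Random(seed)
    b_shuffled = rng.sample(b_ids, len(b_ids))
    c_shuffled = rng.sample(c_ids, len(c_ids))
    return list(zip(b_shuffled, c_shuffled))[:n]


pairs = random_pairs([1, 2, 3, 4], [11, 12, 13, 14], n=3, seed=0)
print(pairs)
```

As with the SQL version, no id is used twice, and the number of pairs cannot exceed the size of the smaller table.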

If I'm reading your question correctly, I think you want N random rows from the union of the two tables - so on any given execution you will get X rows from table B and N-X rows from table C. To accomplish this, you first UNION tables B and C together, then ORDER BY the random value generated by NEWID() while pulling your overall TOP N.
SELECT TOP 50 --or however many you like
DerivedUnionOfTwoTables.[ID],
DerivedUnionOfTwoTables.[Source]
FROM
(
(SELECT NEWID() AS [Random ID], [ID], 'Table B' AS [Source] FROM B)
UNION ALL
(SELECT NEWID() AS [Random ID], [ID], 'Table C' AS [Source] FROM C)
) DerivedUnionOfTwoTables
ORDER BY
[Random ID] DESC
I included a column showing which source table any given record comes from so you could see the distribution of the two table sources changing each time it is executed. If you don't need it and/or don't care to verify, simply comment it out from the topmost select.

You shouldn't need to join to a large table - Select top N ID from B order by newid() should work as newid() works per-row (unlike RAND()). Your join is probably doing a cross-join which will give you multiple results for each newid value.

Get row count including column values in sql server

I need to get the row count of a query, and also get the query's columns in one single query. The count should be a part of the result's columns (It should be the same for all rows, since it's the total).
for example, if I do this:
select count(1) from table
I can have the total number of rows.
If I do this:
select a,b,c from table
I'll get the column's values for the query.
What I need is to get the count and the column values in one query, in an efficient way.
For example:
select Count(1), a,b,c from table
with no group by, since I want the total.
The only way I've found is to do a temp table (using variables), insert the query's result, then count, then returning the join of both. But if the result gets thousands of records, that wouldn't be very efficient.
Any ideas?

@Jim H is almost right, but chooses the wrong ranking function:
create table #T (ID int)
insert into #T (ID)
select 1 union all
select 2 union all
select 3
select ID,COUNT(*) OVER (PARTITION BY 1) as RowCnt from #T
drop table #T
Results:
ID RowCnt
1 3
2 3
3 3
Partitioning by a constant makes it count over the whole resultset.
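The same windowed count can be checked quickly with Python's sqlite3 module. This is a sketch that assumes SQLite 3.25 or later (for window functions) and uses an ordinary in-memory table in place of the #T temp table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER)")
conn.executemany("INSERT INTO T (ID) VALUES (?)", [(1,), (2,), (3,)])
# Every row carries the total row count next to its own columns
rows_with_count = conn.execute(
    "SELECT ID, COUNT(*) OVER (PARTITION BY 1) AS RowCnt FROM T ORDER BY ID"
).fetchall()
conn.close()
print(rows_with_count)
```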

Using CROSS JOIN:
SELECT a.*, b.numRows
FROM YOUR_TABLE a
CROSS JOIN (SELECT COUNT(*) AS numRows
FROM YOUR_TABLE) b

Look at the Ranking functions of SQL Server.
SELECT ROW_NUMBER() OVER (ORDER BY a) AS 'RowNumber', a, b, c
FROM table;

You could do it like this:
SELECT x.total, a, b, c
FROM
table
JOIN (SELECT total = COUNT(*) FROM table) AS x ON 1=1
which will return the total number of records in the first column, followed by fields a,b & c

ORDER BY in GROUP BY clause

I have a query
Select
(SELECT id FROM xyz M WHERE M.ID=G.ID AND ROWNUM=1 ) TOTAL_X,
count(*) from mno G where col1='M' group by col2
Now I have to fetch a random id from the subquery. For this I am doing
Select
(SELECT id FROM xyz M WHERE M.ID=G.ID AND ROWNUM=1 order by dbms_random.value ) TOTAL_X,
count(*) from mno G where col1='M' group by col2
But Oracle is showing the error
"Missing right parenthesis".
What is wrong with the query, and how can I write it to get a random id?
Please help.

Even if what you did was legal, it would not give you the result you want. The ROWNUM filter would be applied before the ORDER BY, so you would just be sorting one row.
You need something like this. I am not sure if this exact code will work given the correlated subquery, but the basic point is that you need to have a subquery that contains the ORDER BY without the ROWNUM filter, then apply the ROWNUM filter one level up.
WITH subq AS (
SELECT id FROM xyz M WHERE M.ID=G.ID order by dbms_random.value
)
SELECT (SELECT id FROM subq WHERE rownum = 1) total_x,
count(*)
from mno g where col1='M' group by col2
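The ordering point made above (filter first and the sort has only one row to work on) can be illustrated with a small Python sketch. It only mimics the order of operations and does not model Oracle itself; the function names and the id values are invented for illustration.

```python
import random


def first_then_sort(ids, rng):
    """Mimics ROWNUM = 1 applied before ORDER BY: one row is kept first,
    so the random sort has nothing left to reorder."""
    kept = ids[:1]
    return sorted(kept, key=lambda _: rng.random())[0]


def sort_then_first(ids, rng):
    """Mimics sorting randomly in an inner query and taking the first row
    one level up."""
    shuffled = sorted(ids, key=lambda _: rng.random())
    return shuffled[0]


rng = random.Random(42)
ids = [101, 102, 103, 104]  # hypothetical ids
always_same = {first_then_sort(ids, rng) for _ in range(100)}
varied = {sort_then_first(ids, rng) for _ in range(100)}
print(always_same, varied)
```

Only the second variant ever returns anything other than the first id.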

You can't use order by in a subselect. It wouldn't matter anyway, because the row numbering is applied first, so you cannot influence it by using order by.
[edit]
Tried a solution. I don't have Oracle here, so you'll have to read between the typos.
In this case, I generate a single random value, get the count of records in xyz per mno.id, and generate a sequence for those records per mno.id.
Then, a level higher, I keep only those records whose index matches the random value.
This should give you a random id from xyz that matches the id in mno.
select
x.mnoId,
x.TOTAL_X
from
(SELECT
g.id as mnoId,
m.id as TOTAL_X,
count(*) over (partition by g.id) as MCOUNT,
row_number() over (partition by g.id order by m.id) as MINDEX,
r.RandomValue
from
mno g
inner join xyz m on m.id = g.id
cross join (select dbms_random.value as RandomValue from dual) r
where
g.col1 = 'M'
) x
where
x.MINDEX = 1 + trunc(x.MCOUNT * x.RandomValue)
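The filter expression maps a random value in [0, 1) to a 1-based position between 1 and the group's row count. A small Python sketch of that arithmetic, with the hypothetical helper name `pick_random_index`:

```python
import math


def pick_random_index(count, random_value):
    """Map random_value in [0, 1) to a 1-based index in 1..count,
    the same arithmetic as 1 + trunc(MCOUNT * RandomValue)."""
    return 1 + math.trunc(count * random_value)


low = pick_random_index(4, 0.0)
middle = pick_random_index(4, 0.5)
high = pick_random_index(4, 0.999)
print(low, middle, high)
```

Because the random value never reaches 1, the index never exceeds the count.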

The only difference between your two queries is that you use ORDER BY in the one that fails, right?
It so happens that ORDER BY doesn't fit inside a nested select.
You could do an ORDER BY inside a where clause that contains a select, though.
Edit: @StevenV is right.

If you're trying to do what I suspect, this should work
Select A.Id, Count(*)
From MNO A
Join (Select ID From XYZ M Where M.ID=G.ID And Rownum=1 Order By Dbms_Random.Value ) B On (B.ID = A.ID)
GROUP BY A.ID
