Use Google BigQuery to build a histogram graph - SQL

How can I write a query that makes rendering a histogram easier?
For example, say we have 100 million people with ages, and we want to draw a histogram with buckets for ages 0-10, 11-20, 21-30, etc. What would the query look like?
Has anyone done this? Did you connect the query result to a Google spreadsheet to draw the histogram?

You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.
SELECT
quantiles(age, 10)
FROM mytable
Each row of this query corresponds to the age at that point in the sorted list of ages. The first result is the age 1/10th of the way through the sorted list of ages, the second is the age 2/10ths of the way through, and so on.

See the 2019 update, with #standardSQL --Fh
The subquery idea works, as does "CASE WHEN" and then doing a group by:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, CASE WHEN age >= 0 AND age < 10 THEN 1
WHEN age >= 10 AND age < 20 THEN 2
WHEN age >= 20 AND age < 30 THEN 3
...
ELSE -1 END as bucket
FROM table1)
GROUP BY bucket
Alternatively, if the buckets are regular, you can just divide and cast to an integer:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
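In today's #standardSQL the same division idea can be written with DIV; this is just a sketch, assuming age is an INT64 column:
SELECT bucket, COUNT(field1) count
FROM (
  SELECT field1, DIV(age, 10) bucket  -- integer division: ages 0-9 -> 0, 10-19 -> 1, ...
  FROM table1)
GROUP BY bucket
ORDER BY bucket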

With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.
Here, for the time to fly between SFO and JFK, with 10 buckets:
WITH data AS (
SELECT *, ActualElapsedTime datapoint
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = "2018-01-01"
AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
SELECT min+step*i min, min+step*(i+1) max
FROM (
SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
FROM (
SELECT MIN(datapoint) min, MAX(datapoint) max
FROM data
)
), UNNEST(i) i
)
SELECT COUNT(*) count, (min+max)/2 avg
FROM data
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg
If you need round numbers, see: https://stackoverflow.com/a/60159876/132438

Using a cross join to get your min and max values (not that expensive on a single tuple) you can get a normalized bucket list of any given bucket count:
select
min(data.VAL) as min,
max(data.VAL) as max,
count(data.VAL) as num,
integer((data.VAL-value.min)/(value.max-value.min)*8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min from [table]) value
GROUP BY group
ORDER BY group
In this example we're getting 8 buckets (pretty self-explanatory) plus one for null VAL.
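For reference, a #standardSQL sketch of the same normalization idea (same table and VAL column as above; it assumes max > min, and uses grp instead of the reserved word group):
SELECT
  CAST(FLOOR((data.VAL - value.min) / (value.max - value.min) * 8) AS INT64) grp,
  MIN(data.VAL) min,
  MAX(data.VAL) max,
  COUNT(data.VAL) num
FROM `table` data
CROSS JOIN (SELECT MIN(VAL) min, MAX(VAL) max FROM `table`) value
GROUP BY grp
ORDER BY grp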

Write a subquery like this:
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)
Then you can do something like this:
SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE >= 11),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE >= 21)
The result will look like this:
Row agegroup count
1 1 somenumber
2 2 somenumber
3 3 another number
I hope this helps. Of course, for the age group label you can write anything you like, such as '0 to 10'.

There is now the APPROX_QUANTILES aggregation function in standard SQL.
SELECT
APPROX_QUANTILES(column, number_of_bins)
...
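If you want one row per decile instead of a single array, the result can be unnested; a small sketch reusing the mytable/age example from the question (offsets run from 0 to number_of_bins):
SELECT decile, age_boundary
FROM UNNEST((
  SELECT APPROX_QUANTILES(age, 10)
  FROM mytable
)) AS age_boundary WITH OFFSET AS decile
ORDER BY decile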

I found gamars' approach quite intriguing and expanded on it a little, using scripting instead of the cross join. Notably, this approach also makes it possible to vary the group sizes in a consistent way, as here with group sizes that increase exponentially.
declare stats default
(select as struct min(new_confirmed) as min, max(new_confirmed) as max
from `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
);
declare group_amount default 10; -- change group amount here
SELECT
CAST(floor(
(ln(new_confirmed-stats.min+1)/ln(stats.max-stats.min+1)) * (group_amount-1))
AS INT64) group_flag,
concat('[',min(new_confirmed),',',max(new_confirmed),']') as group_value_range,
count(1) as quantity
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
GROUP BY group_flag
ORDER BY group_flag ASC
The basic approach is to label each value with its group_flag and then group by it. The flag is calculated by scaling the value down to a number between 0 and 1 and then scaling it back up to the range 0 to group_amount - 1.
I take the log of the corrected value and of the range before dividing them, to get the desired bias in group sizes, and I add 1 to make sure it never tries to take the log of 0.
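As a quick sanity check of the formula with made-up numbers (min = 1, max = 1001, group_amount = 10, value = 101):
SELECT CAST(FLOOR((LN(101 - 1 + 1) / LN(1001 - 1 + 1)) * (10 - 1)) AS INT64) AS group_flag
-- ln(101)/ln(1001) is roughly 0.668; times 9 gives about 6.01; FLOOR yields group_flag = 6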

You're looking for a single vector of information. I would normally query it like this:
select
count(*) as num,
integer( age / 10 ) as age_group
from mytable
group by age_group
For arbitrary groups you would need a big CASE statement; that would be simple but much longer. My example is fine as long as every bucket spans the same number of years.

Take a look at custom SQL functions. This one works like this:
to_bin(10, [0, 100, 500]) => '... - 100'
to_bin(1000, [0, 100, 500, 0]) => '500 - ...'
to_bin(1000, [0, 100, 500]) => NULL
Read more here:
https://github.com/AdamovichAleksey/BigQueryTips/blob/main/sql/functions/to_bins.sql
Any ideas and commits are welcome.

Related

Check if 20% of the players of a tournament have played 8 or more rounds in the previous year - Query Optimization - BigQuery

So I have a table "tournaments" with the following attributes/columns:
player_name
tournament_name
round_1, round_2, ......, round_10
round_1_date, round_2_date, ......, round_10_date
Other columns
I want to check whether 20% of the players of a particular tournament have played 8 or more rounds in the previous year. If a person has played a round, there will be a score in that round's column; otherwise it is null.
E.g.: if a player has played rounds 1 and 2, then the round_1 and round_2 columns will be populated with a score, along with round_1_date and round_2_date. All the other round columns will be null.
(The starting date can be round_1_date.)
I have written the following query, which gives accurate results, but I believe there is a better/more optimized approach that will take less time, since this query will run multiple times in a SQL loop over a large dataset. The query returns true or false.
SELECT
(CAST((total_players_meeting_threshold_criteria/COUNT(t4.player_name)*100>=20) AS string)) AS result
FROM (
SELECT
COUNT(*) total_players_meeting_threshold_criteria,
FROM (
SELECT
(player_name),
COUNT(round_1)+COUNT(round_2)+COUNT(round_3)+COUNT(round_4)+COUNT(round_5)+COUNT(round_6)+COUNT(round_7)+COUNT(round_8)+COUNT(round_9)+COUNT(round_10) AS total_rounds_played,
FROM
`Golf_DB_Women_Dataset.tournaments` t2
WHERE
player_name IN (
SELECT
DISTINCT player_name,
FROM
`tournaments` t1
WHERE
tournament_name = 'Tournament of Champions')
AND t2.round_1_date BETWEEN DATE_SUB((
SELECT
round_1_date
FROM
`tournaments`
WHERE
tournament_name = 'Tournament of Champions'
LIMIT
1), INTERVAL 1 YEAR)
AND DATE_SUB((
SELECT
round_1_date
FROM
`tournaments`
WHERE
tournament_name = 'Tournament of Champions'
LIMIT
1), INTERVAL 1 DAY)
GROUP BY
player_name
HAVING
total_rounds_played >= 8
ORDER BY
player_name) ) AS threshold_result,
`tournaments` t4
WHERE
t4.tournament_name = 'Tournament of Champions'
GROUP BY
total_players_meeting_threshold_criteria
Any help will be highly appreciated.
Thank you
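One way the repeated correlated subqueries might be avoided is to compute the reference date once in a CTE and then do a single grouped pass. This is only a sketch against the column names in the question; it assumes one row per player per tournament, and uses ANY_VALUE where the original used LIMIT 1:
WITH ref AS (
  SELECT ANY_VALUE(round_1_date) AS ref_date
  FROM `tournaments`
  WHERE tournament_name = 'Tournament of Champions'
),
tournament_players AS (
  SELECT DISTINCT player_name
  FROM `tournaments`
  WHERE tournament_name = 'Tournament of Champions'
),
qualified AS (
  SELECT t.player_name
  FROM `tournaments` t
  JOIN tournament_players USING (player_name)
  CROSS JOIN ref
  WHERE t.round_1_date BETWEEN DATE_SUB(ref.ref_date, INTERVAL 1 YEAR)
                           AND DATE_SUB(ref.ref_date, INTERVAL 1 DAY)
  GROUP BY t.player_name
  HAVING COUNT(round_1) + COUNT(round_2) + COUNT(round_3) + COUNT(round_4) + COUNT(round_5)
       + COUNT(round_6) + COUNT(round_7) + COUNT(round_8) + COUNT(round_9) + COUNT(round_10) >= 8
)
SELECT CAST((SELECT COUNT(*) FROM qualified)
            >= 0.2 * (SELECT COUNT(*) FROM tournament_players) AS STRING) AS result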

Any way to use IN and COUNT when rewriting this query?

select count(patientNUM) as totalpatients
from [dbo] (nolock)
where patientId in (
'97210219',
'97210221',
'97210222'
)
Each of these patientIds has a capacity: 50, 100, and 20 respectively.
I want to go through the rows for each patient and list the patientId as partial or full. For example, if there are 40 out of 50 rows for a patientId, it should be listed as partial; if all 50 are there, it should be listed as full. Is there a way to use COUNT and IN at the same time?
So basically I want to create two columns: patientId, and fullorpartial in the second column.
Is there a way to go through each row, count the rows, and then return and compare the result in a second column?
You need to know the "capacity" as well as the patientId. I would suggest a derived table:
select t.patientId,
(case when count(*) < v.capacity then 'partial'
when count(*) = v.capacity then 'full'
end) as full_or_partial
from t join
(values ('97210219', 50),
('97210221', 100),
('97210222', 20)
) v(patientId, capacity)
on v.patientId = t.patientId
group by t.patientId;
I don't know exactly what your data looks like or what you want, but try this.
Using OVER is a good choice:
select patientId,count(patientNUM) over(partition by patientId) as totalpatients
from [dbo] (nolock)
where patientId in (
'97210219',
'97210221',
'97210222'
)
This counts, for each patientId, how many times patientNUM exists.
As for the 'partialorfull' column, I think you can achieve it using CASE:
case
when patientId = '97210219' and totalpatients < 50 then 'partial'
when ...... --condition keep going on
else 'full'
end as partialorfull
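Putting the window count and the CASE together, one possible sketch (the capacities are hard-coded from the question, and the table/column names are reused as-is):
select distinct patientId,
       case when totalpatients < capacity then 'partial'
            when totalpatients = capacity then 'full'
       end as fullorpartial
from (
    select patientId,
           count(patientNUM) over (partition by patientId) as totalpatients,
           case patientId
                when '97210219' then 50
                when '97210221' then 100
                when '97210222' then 20
           end as capacity
    from [dbo] (nolock)
    where patientId in ('97210219', '97210221', '97210222')
) t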

Adding missing date rows to a BigQuery Table

I have a table where one of the columns is an integer that represents the row's time. The problem is that the table isn't full; there are missing timestamps.
I would like to fill in the missing values so that there is a row every 10 seconds. I want the rest of the columns to be nulls (later I'll forward-fill these nulls).
In this column's units, 10 seconds is 10,000.
If this were Python, the range would be:
range(
min(table[column]),
max(table[column]),
10000
)
If your values are strictly spaced 10 seconds apart and only some multiples of the 10-second interval are missing, you can use this approach to fill the holes in your data:
WITH minsmax AS (
SELECT
MIN(time) AS minval,
MAX(time) AS maxval
FROM `dataset.table`
)
SELECT
IF (d.time <= i.time, d.time, i.time) as time,
MAX(IF(d.time <= i.time, d.value, NULL)) as value
FROM (
SELECT time FROM minsmax m, UNNEST(GENERATE_ARRAY(m.minval, m.maxval+100, 100)) AS time
) AS i
LEFT JOIN `dataset.table` d ON 1=1
WHERE ABS(d.time - i.time) >= 100
GROUP BY 1
ORDER BY 1
Hope this helps.
You can use arrays. For numbers, you can do:
select n
from unnest(generate_array(1, 1000, 1)) n;
There are similar functions, generate_timestamp_array() and generate_date_array(), if you need those types.
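For example, a 10-second timestamp spine might look like this (a sketch with made-up start and end values):
SELECT ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(TIMESTAMP '2020-01-01 00:00:00',
                                     TIMESTAMP '2020-01-01 01:00:00',
                                     INTERVAL 10 SECOND)) AS ts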
I ended up using the following query through the Python API:
"""
SELECT
i.time,
Sensor_Reading,
Sensor_Name
FROM (
SELECT time FROM UNNEST(GENERATE_ARRAY({min_time}, {max_time}+{sampling_period}+1, {sampling_period})) AS time
) AS i
LEFT JOIN
`{input_table}` AS input
ON
i.time =input.Time
ORDER BY i.time
""".format(sampling_period=sampling_period, min_time=min_time,
max_time=max_time,
input_table=input_table)
Thanks to both answers.

PostgreSQL and matching row on multiple

I'm making a car statistics solution where I need to charge per kilometer driven.
I have the following tables:
table: cars
columns: car_id, km_driven
table: pricing
columns: from, to, price
Content in my cars table can be:
car_id, km_driven
2, 430
3, 112
4, 90
Content on my pricing table can be:
from, to, price
0, 100, 2
101, 200, 1
201, null, 0.5
Meaning that the first 100 km cost 2 USD per km, the next 100 km cost 1 USD per km, and everything above costs 0.5 USD per km.
Is there a logical and simple way to calculate the cost for my cars via PostgreSQL?
So if a car has driven e.g. 201 km, then the price would be 100×2 + 100×1 + 1×0.5, not simply 201×0.5.
I would write the query as:
select c.car_id, c.km_driven,
sum(( least(p.to_km, c.km_driven) - p.from_km + 1) * p.price) as dist_price
from cars c join
pricing p
on c.km_driven >= p.from_km
group by c.car_id, c.km_driven;
Here is a db<>fiddle.
Modified from @sean-johnston's answer:
select
car_id, km_driven,
sum(case
when km_driven>=start then (least(finish,km_driven)-start+1)*price
else 0
end) as dist_price
from cars,pricing
group by car_id,km_driven
The original ranges are kept, and where km_driven >= start is omitted (it's optional but might improve performance).
Fiddling a bit more, the case can be omitted when the where clause is in place:
select
car_id, km_driven,
sum((least(finish,km_driven)-start+1)*price) as dist_price
from cars,pricing
where km_driven >= start
group by car_id,km_driven
dbfiddle
Judicious use of case/sum combinations. However, you first need to make your ranges consistent; I'll choose to change the first range to 1,100. Given that, the following should give you what you're after. (I've also used start/finish, as from/to are reserved words.)
select
car_id, km_driven,
sum (case
when finish is null and km_driven >= start
then (km_driven-start+1) * price
when km_driven >= start
then (case
when km_driven > finish
then (finish - start + 1)
else (km_driven - start + 1)
end) * price
else 0
end) as dist_price
from cars, pricing
where km_driven >= start
group by 1, 2;
Explanation:
We join against any range where the journey is at least as far as the start of the range.
The open ended range is handled in the first case clause and is fairly simple.
We need an inner case clause for the closed ranges as we only want the part of the journey in that range.
Then sum the results of that for the total journey price.
If you don't want to (or can't) make your ranges consistent then you'd need to add a third outer case for the start range.
I would definitely do this using a procedure, as it can be implemented in a very straightforward manner using loops. However, you should be able to do something similar to this:
select car_id, sum(segment_price)
from (
select
car_id,
km_driven,
f,
t,
price,
driven_in_segment,
segment_price
from (
select
car_id,
km_driven,
f,
t,
price,
(coalesce(least(t, km_driven), km_driven) - f) driven_in_segment,
price * (coalesce(least(t, km_driven), km_driven) - f) segment_price
from
-- NOTE: cartesian product here
cars,
pricing
where f < km_driven
)
) data
group by car_id
order by car_id
I find that rather less readable, though.
UPDATE:
That query is a bit more complex than necessary; I was trying out some things with window functions that were not needed in the end. Here is a simplified version that should be equivalent:
select car_id, sum(segment_price)
from (
select
car_id,
km_driven,
f,
t,
price,
(coalesce(least(t, km_driven), km_driven) - f) driven_in_segment,
price * (coalesce(least(t, km_driven), km_driven) - f) segment_price
from
-- NOTE: cartesian product here
cars,
pricing
where f < km_driven
) data
group by car_id
order by car_id
You can use a join and calculate your cost by using case when:
select c.car_id, case when p.price=0.5
then 100*2+100*1+(c.km_driven-200)*0.5
when p.price=1 then 100*2+(c.km_driven-100)*1
else c.km_driven*p.price end as cost
from cars c join pricing p
on c.km_driven>=p."from" and (p."to" is null or c.km_driven<=p."to")

writing a query using advanced group by

I have a single-table database that consists of the following fields:
ID, Seniority (years), outcome and some other less important fields.
Table row example:
ID:36 Seniority(years):1.79 outcome:9627
I need to write a query (SQL Server), in relatively simple code, that returns the average outcome grouped by the Seniority field in steps of five years (0-5 years, 6-10, etc.), with the condition that the average is shown only if the group has more than 3 rows.
Result row example:
range:0-5 average:xxxx
Thank you very much
Use a CASE statement to create the different groups. Try this:
select case when Seniority between 0 and 5 then '0-5'
when Seniority between 6 and 10 then '6-10'
..
End,
Avg(outcome)
From yourtable
Group by case when Seniority between 0 and 5 then '0-5'
when Seniority between 6 and 10 then '6-10'
..
End
Having count(1)>=3
Since you have decimal places: if you want 5.4 to count toward the 0-5 group and 5.6 toward the 6-10 group, use Round(Seniority, 0) instead of Seniority in the CASE statement.
P.S. 0-5 contains 6 values while 6-10 contains 5.
select 'range:'
+ cast (isnull(nullif(floor((abs(seniority-1))/5)*5+1,1),0) as varchar)
+ '-'
+ cast ((floor((abs(seniority-1))/5)+1)*5 as varchar) as seniority_group
,avg(outcome)
from t
group by floor((abs(seniority-1))/5)
having count(*) >= 3
;
This would be something like:
select floor(seniority / 5), avg(outcome)
from t
group by floor(seniority / 5)
having count(*) >= 3;
Note: this breaks the seniority into equal-sized groups, 0-4, 5-9, and so on, which seems more reasonable than having unequal groups.
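If you also want a readable range label in the output, a sketch along the same lines (SQL Server syntax; grp and seniority_range are illustrative names):
select cast(grp * 5 as varchar(10)) + '-' + cast(grp * 5 + 4 as varchar(10)) as seniority_range,
       avg(outcome) as avg_outcome
from (select cast(floor(seniority / 5) as int) as grp, outcome from t) s
group by grp
having count(*) >= 3;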
You can follow Gordon's answer (though you would need to edit it a little), but I would do this with an additional table containing all possible intervals. You can then add an appropriate index to speed it up.
create table intervals
(
id int identity(1, 1),
start int,
[end] int
)
insert into intervals values
(0, 5),
(6, 10)
...
select i.id, avg(t.outcome) as outcome
from intervals i
join tablename t on t.seniority between i.start and i.[end]
group by i.id
having count(*) >=3
If creating new tables is not an option you can always use a CTE:
;with intervals as(
select * from
(values
(1, 0, 5),
(2, 6, 10)
--...
) t(id, start, [end])
)
select i.id, avg(t.outcome) as outcome
from intervals i
join tablename t on t.seniority between i.start and i.[end]
group by i.id
having count(*) >=3