I'm making a car statistics solution where I need to charge per kilometer driven.
I have the following tables:
table: cars
columns: car_id, km_driven
table: pricing
columns: from, to, price
Content in my cars table can be:
car_id, km_driven
2, 430
3, 112
4, 90
Content in my pricing table can be:
from, to, price
0, 100, 2
101, 200, 1
201, null, 0.5
Meaning that the first 100 km cost 2 USD per km, the next 100 km cost 1 USD per km, and everything above costs 0.5 USD per km.
Is there a logical and simple way to calculate the cost for my cars via PostgreSQL?
So if a car has driven e.g. 201 km, then the price would be 100x2 + 100x1 + 1x0.5, not simply 201x0.5.
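For reference, a minimal PostgreSQL setup sketch for these sample tables (my assumption, not part of the original post). The from/to columns are renamed to from_km/to_km because from and to are reserved words, which is also how the first query below refers to them:
create table cars (car_id int, km_driven int);
create table pricing (from_km int, to_km int, price numeric);

insert into cars values (2, 430), (3, 112), (4, 90);
insert into pricing values (0, 100, 2), (101, 200, 1), (201, null, 0.5);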
I would write the query as:
select c.car_id, c.km_driven,
       sum((least(p.to_km, c.km_driven) - p.from_km + 1) * p.price) as dist_price
from cars c
join pricing p
  on c.km_driven >= p.from_km
group by c.car_id, c.km_driven;
Modified from Sean Johnston's answer:
select
  car_id, km_driven,
  sum(case
        when km_driven >= start then (least(finish, km_driven) - start + 1) * price
        else 0
      end) as dist_price
from cars, pricing
group by car_id, km_driven
Original ranges kept.
The where km_driven >= start clause was omitted (it's optional but might improve performance).
Fiddling a bit more: the case can be omitted when the where clause is in place:
select
  car_id, km_driven,
  sum((least(finish, km_driven) - start + 1) * price) as dist_price
from cars, pricing
where km_driven >= start
group by car_id, km_driven
Judicious use of case/sum combinations. However, you first need to make your ranges consistent; I'll choose to change the first range to 1,100. Given that, the following should give you what you're after. (I've also used start/finish, as from/to are reserved words.)
select
  car_id, km_driven,
  sum(case
        when finish is null and km_driven >= start
          then (km_driven - start + 1) * price
        when km_driven >= start
          then (case
                  when km_driven > finish  -- journey extends past this range
                    then (finish - start + 1)
                  else (km_driven - start + 1)
                end) * price
        else 0
      end) as dist_price
from cars, pricing
where km_driven >= start
group by 1, 2;
Explanation:
We join against any range where the journey is at least as far as the start of the range.
The open ended range is handled in the first case clause and is fairly simple.
We need an inner case clause for the closed ranges as we only want the part of the journey in that range.
Then sum the results of that for the total journey price.
If you don't want to (or can't) make your ranges consistent then you'd need to add a third outer case for the start range.
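As a quick sanity check of that logic (a sketch using the adjusted ranges 1-100, 101-200 and 201-null), car 2 with 430 km works out as follows:
select (100 - 1 + 1) * 2        -- km 1-100   at 2 USD   = 200
     + (200 - 101 + 1) * 1      -- km 101-200 at 1 USD   = 100
     + (430 - 201 + 1) * 0.5    -- km 201-430 at 0.5 USD = 115
       as dist_price_car_2;     -- total: 415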
I would definitely do this using a procedure, as it can be implemented in a very straightforward manner using loops. However, you should be able to do something similar to this:
select car_id, sum(segment_price)
from (
    select
        car_id,
        km_driven,
        f,
        t,
        price,
        driven_in_segment,
        segment_price
    from (
        select
            car_id,
            km_driven,
            f,
            t,
            price,
            (coalesce(least(t, km_driven), km_driven) - f) driven_in_segment,
            price * (coalesce(least(t, km_driven), km_driven) - f) segment_price
        from
            -- NOTE: cartesian product here
            cars,
            pricing
        where f < km_driven
    ) raw   -- the inner subquery also needs an alias in PostgreSQL
) data
group by car_id
order by car_id
I find that rather less readable, though.
UPDATE:
That query is a bit more complex than necessary, I was trying out some things with window functions that were not needed in the end. A simplified version here that should be equivalent:
select car_id, sum(segment_price)
from (
    select
        car_id,
        km_driven,
        f,
        t,
        price,
        (coalesce(least(t, km_driven), km_driven) - f) driven_in_segment,
        price * (coalesce(least(t, km_driven), km_driven) - f) segment_price
    from
        -- NOTE: cartesian product here
        cars,
        pricing
    where f < km_driven
) data
group by car_id
order by car_id
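Since this answer opens by saying a procedure with loops would be straightforward, here is a minimal PL/pgSQL sketch of that idea for comparison. It assumes the pricing columns are named f, t and price as in the queries above; the function name car_cost is just a placeholder.
create or replace function car_cost(p_km numeric)
returns numeric
language plpgsql
as $$
declare
  r record;
  total numeric := 0;
begin
  -- walk the applicable price bands in order and add up each segment
  for r in select f, t, price from pricing where f < p_km order by f loop
    total := total + (coalesce(least(r.t, p_km), p_km) - r.f) * r.price;
  end loop;
  return total;
end;
$$;

-- usage: select car_id, km_driven, car_cost(km_driven) as dist_price from cars;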
You can use a join and calculate the cost with a case expression (note that from and to are reserved words, so they need quoting, and the open-ended last range needs a coalesce):
select c.car_id,
       case when p.price = 0.5
              then 100*2 + 100*1 + (c.km_driven - 200) * 0.5
            when p.price = 1
              then 100*2 + (c.km_driven - 100) * 1
            else c.km_driven * p.price
       end as cost
from cars c
join pricing p
  on c.km_driven >= p."from"
 and c.km_driven <= coalesce(p."to", c.km_driven)
I am working on some car accident data and am stuck on how to get the data in the form I want.
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
This is my code, which counts the accidents for each sex and each severity. I know I can do this with group by, but I wanted to use partition by in order to work out percentages too.
However, I get a very large table (I assume one row for every underlying joined row, each repeating its sex/severity count). When I do the following:
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
group by
sex_of_driver,
accident_severity
I get this:
sex_of_driver   accident_severity   (No column name)
1               1                   1
1               2                   1
-1              2                   1
-1              1                   1
1               3                   1
I won't give you the whole table, but basically, the group by has caused the count to just be 1.
I can't figure out why group by isn't working. Is this an MS SQL Server thing?
I want to get the same result as below (obv without the CASE etc)
select
accident.accident_severity,
count(accident.accident_severity) as num_accidents,
vehicle.sex_of_driver,
CASE vehicle.sex_of_driver WHEN '1' THEN 'Male' WHEN '2' THEN 'Female' end as sex_col,
CASE accident.accident_severity WHEN '1' THEN 'Fatal' WHEN '2' THEN 'Serious' WHEN '3' THEN 'Slight' end as serious_col
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
where
sex_of_driver != 3
and
sex_of_driver != -1
group by
accident.accident_severity,
vehicle.sex_of_driver
order by
accident.accident_severity
You seem to have a misunderstanding here.
GROUP BY will reduce your rows to a single row per grouping (i.e. per pair of sex_of_driver, accident_severity values). Any normal aggregates you use with this, such as COUNT(*), will return the aggregate value within that group.
OVER, on the other hand, gives you a windowed aggregate, which is calculated after your rows have been reduced. Therefore, when you write count(accident_severity) over (partition by sex_of_driver, accident_severity), the aggregate only receives a single row in each partition, because the rows have already been reduced.
You say "I know I can do this with group by but I wanted to use a partition by in order to work out % too", but you are misunderstanding how to do that. You don't need PARTITION BY to work out a percentage. All you need to calculate a percentage over the whole resultset is COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (), in other words a windowed aggregate over a normal aggregate.
Note also that count(accident_severity) does not give you the number of distinct accident_severity values; it gives you the number of non-null values, which is probably not what you intend. You also have a very strange join predicate; you probably want something like a.vehicle_id = v.vehicle_id.
So you want something like this:
select
    sex_of_driver,
    accident_severity,
    count(*) as Count,
    count(*) * 1.0 /
        sum(count(*)) over (partition by sex_of_driver) as PercentOfSex,
    count(*) * 1.0 /
        sum(count(*)) over () as PercentOfTotal
from
    dbo.accident as a
    inner join dbo.vehicle as v on
        a.vehicle_id = v.vehicle_id
group by
    sex_of_driver,
    accident_severity;
Say we have a dataset of 500,000 flights from Los Angeles to 80 cities in Europe and back, and from Saint Petersburg to the same 80 cities in Europe and back. We want to find 4 flights such that:
from LA to city X, from city X back to LA, from St P to city X and from city X back to St P
all 4 flights fall within a time window of 4 days
we are looking for the cheapest combined price of the 4 flights
city X can be any of the 80 cities; we want to find the cheapest such combination for each of them and get the list of these 80 combinations
The data is stored in BigQuery and I've created an SQL query, but it has 3 joins and I assume that under the hood it can have complexity of O(n^4), because the query didn't finish in 30 minutes and I had to abort it.
See the query below:
select * from (
select in_led.`from` as city,
in_led.price + out_led.price + in_lax.price + out_lax.price as total_price,
out_led.carrier as out_led_carrier,
out_led.departure as out_led_departure,
in_led.departure as in_led_date,
in_led.carrier as in_led_carrier,
out_lax.carrier as out_lax_carrier,
out_lax.departure as out_lax_departure,
in_lax.departure as in_lax_date,
in_lax.carrier as in_lax_carrier,
row_number() over(partition by in_led.`from` order by in_led.price + out_led.price + in_lax.price + out_lax.price) as rn
from skyscanner.quotes as in_led
join skyscanner.quotes as out_led on out_led.`to` = in_led.`from`
join skyscanner.quotes as out_lax on out_lax.`to` = in_led.`from`
join skyscanner.quotes as in_lax on in_lax.`from` = in_led.`from`
where in_led.`to` = "LED"
and out_led.`from` = "LED"
and in_lax.`to` in ("LAX", "LAXA")
and out_lax.`from` in ("LAX", "LAXA")
and DATE_DIFF(DATE(in_led.departure), DATE(out_led.departure), DAY) < 4
and DATE_DIFF(DATE(in_led.departure), DATE(out_led.departure), DAY) > 0
and DATE_DIFF(DATE(in_lax.departure), DATE(out_lax.departure), DAY) < 4
and DATE_DIFF(DATE(in_lax.departure), DATE(out_lax.departure), DAY) > 0
order by total_price
)
where rn=1
Additional details:
all flights' departure dates fall within a 120-day window
Questions:
Is there a way to optimize this query for better performance?
How to properly classify this problem? The brute force solution is way too slow, but I'm failing to see what type of problem this is. It certainly doesn't look like a graph problem; it feels like sorting the table a couple of times by different fields with a stable sort might help, but that still seems sub-optimal.
Below is for BigQuery Standard SQL
The brute force solution is way too slow, but I'm failing to see what type of problem this is.
so I would like to see solutions other than brute force if anyone here has ideas
#standardSQL
WITH temp AS (
SELECT DISTINCT *, UNIX_DATE(DATE(departure)) AS dep FROM `skyscanner.quotes`
), round_trips AS (
SELECT t1.from, t1.to, t2.to AS back, t1.price, t1.departure, t1.dep first_day, t1.carrier, t2.departure AS departure2, t2.dep AS last_day, t2.price AS price2, t2.carrier AS carrier2,
FROM temp t1
JOIN temp t2
ON t1.to = t2.from
AND t1.from = t2.to
AND t2.dep BETWEEN t1.dep + 1 AND t1.dep + 3
WHERE t1.from IN ('LAX', 'LED')
)
SELECT cityX, total_price,
( SELECT COUNT(1)
FROM UNNEST(GENERATE_ARRAY(t1.first_day, t1.last_day)) day
JOIN UNNEST(GENERATE_ARRAY(t2.first_day, t2.last_day)) day
USING(day)
) overlap_days_in_cityX,
(SELECT AS STRUCT departure, price, carrier, departure2, price2, carrier2
FROM UNNEST([t1])) AS LAX_CityX_LAX,
(SELECT AS STRUCT departure, price, carrier, departure2, price2, carrier2
FROM UNNEST([t2])) AS LED_CityX_LED
FROM (
SELECT AS VALUE ARRAY_AGG(t ORDER BY total_price LIMIT 1)[OFFSET(0)]
FROM (
SELECT t1.to cityX, t1.price + t1.price2 + t2.price + t2.price2 AS total_price, t1, t2
FROM round_trips t1
JOIN round_trips t2
ON t1.to = t2.to
AND t1.from < t2.from
AND t1.departure2 > t2.departure
AND t1.departure < t2.departure2
) t
GROUP BY cityX
)
ORDER BY overlap_days_in_cityX DESC, total_price
with output (just the top 10 out of 60 total rows; omitted here)
Brief explanation:
temp CTE: dedupes the data and introduces a dep field - the number of days since the epoch - to avoid costly TIMESTAMP functions
round_trips CTE: identifies all round-trip candidates at most 4 days apart
the main query identifies those LAX and LED round trips which overlap
for each cityX it takes the cheapest combination
the final output adds an extra calculation of the overlapping days in cityX and includes info about all the flights involved
Note: in your data the duration field is all zeros, so it is not used - but if you had it, it would be easy to add to the logic
the query didn't finish in 30 minutes and I had to abort it.
Is there a way to optimize this query for better performance?
My "generic recommendation" is to always learn the data, profile it, clean it - before actual coding! In your example - the data you shared has 469352 rows full of duplicates. After you remove duplicates - you got ONLY 14867 rows. So then I run your original query against that cleaned data and it took ONLY 97 sec to get result. Obviously, it does not mean we cannot optimize code itself - but at least this addresses your issue with "query didn't finish in 30 minutes and I had to abort it"
I'm trying to get a cumulative running total using a LAG function and SUM. The column I want to sum adds row 1 + 2 together, but it doesn't continue on by adding rows 1, 2, 3, 4, etc. Once a "reset" amount is hit, the running total needs to go back to the reset amount times the Coinin amount on the same row.
Ultimately, I want to know at any given point in history how much a slot machine's progression is for, say, levels 1 & 2 before and after a jackpot payout. (This query is just looking at level 1.)
Select Distinct A.AID
,A.BID
,B.Level
,A.Date
,B.Reset
,B.Cap
,Description
,B.RateofProg
,A.Coinin
,LAG(B.RateofProg/100.00 * A.Coinin/100.00) OVER (order by AID, BID, Level) + SUM(B.RateofProg/100 * A.Coinin/100) as RunningTotal
,CASE When C.Eventcode = 10004500 THEN ProgressivePdAmt/100.00 Else 0 end as ProgressivePdAmt
From Payout A
Join Slot_Progression B
on A.Mnum = B.Mnum
Join Events C
on A.Date = C.Date
Where A.Mnum = '102026'
and level = '1'
and A.Coinin > '0'
Group by A.AID, A.BID, B.Level, A.Date, B.Reset, B.Cap, Description, C.ProgressivePdAmt, B.RateofProg, A.Coinin, C.Eventcode
Order by AID, BID, Level
The cumulative sum is calculated using sum() not lag(). Presumably you want something like this:
sum(B.RateofProg/100.00 * A.Coinin/100.00) OVER (order by AID, BID, Level) as RunningTotal
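For context, a minimal sketch of how that window sum might sit in the shape of the original query (table, column names and filters taken from the question; the grouping and the LAG are dropped, an explicit ROWS frame is added, and the "reset" behaviour would still need extra logic that is out of scope here):
select A.AID, A.BID, B.Level, A.Date, B.RateofProg, A.Coinin,
       sum(B.RateofProg / 100.00 * A.Coinin / 100.00)
           over (order by A.AID, A.BID, B.Level
                 rows unbounded preceding) as RunningTotal
from Payout A
join Slot_Progression B
  on A.Mnum = B.Mnum
where A.Mnum = '102026'
  and B.Level = '1'
  and A.Coinin > 0
order by A.AID, A.BID, B.Level;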
How can I write a query that makes histogram graph rendering easier?
For example, we have 100 million people with ages, we want to draw the histogram/buckets for age 0-10, 11-20, 21-30 etc... What does the query look like?
Has anyone done it? Did you try to connect the query result to google spreadsheet to draw the histogram?
You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.
SELECT
quantiles(age, 10)
FROM mytable
Each row of this query would correspond to the age at that point in the list of ages. The first result is the age 1/10ths of the way through the sorted list of ages, the second is the age 2/10ths through, 3/10ths, etc.
See the 2019 update, with #standardSQL --Fh
The subquery idea works, as does "CASE WHEN" and then doing a group by:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, CASE WHEN age >= 0 AND age < 10 THEN 1
WHEN age >= 10 AND age < 20 THEN 2
WHEN age >= 20 AND age < 30 THEN 3
...
ELSE -1 END as bucket
FROM table1)
GROUP BY bucket
Alternately, if the buckets are regular -- you could just divide and cast to an integer:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.
Here for the time to fly between SFO and JFK - with 10 buckets:
WITH data AS (
SELECT *, ActualElapsedTime datapoint
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = "2018-01-01"
AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
SELECT min + step*i min, min + step*(i+1) max
FROM (
SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
FROM (
SELECT MIN(datapoint) min, MAX(datapoint) max
FROM data
)
), UNNEST(i) i
)
SELECT COUNT(*) count, (min+max)/2 avg
FROM data
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg
If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
Using a cross join to get your min and max values (not that expensive on a single tuple) you can get a normalized bucket list of any given bucket count:
select
min(data.VAL) as min,
max(data.VAL) as max,
count(data.VAL) as num,
integer((data.VAL-value.min)/(value.max-value.min)*8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min FROM [table]) value
GROUP BY group
ORDER BY group
In this example we're getting 8 buckets (pretty self-explanatory), plus one for null VAL.
Write a subquery like this:
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)
Then you can do something like this:
SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE >= 11),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE >= 21)
Result will be like:
Row  agegroup  count
1    1         somenumber
2    2         somenumber
3    3         another number
I hope this helps. Of course, for the age group label you can write anything you like, e.g. '0 to 10'.
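For instance, a labelled variant of the first subquery (same hypothetical people table) could look like this:
-- same bucket as agegroup '1', just with a human-readable label
(SELECT '0 to 10' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)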
There is now the APPROX_QUANTILES aggregation function in standard SQL.
SELECT
APPROX_QUANTILES(column, number_of_bins)
...
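A minimal usage sketch (the dataset, table and age column here are placeholders, not from the original answer):
SELECT APPROX_QUANTILES(age, 10) AS age_deciles  -- array of 11 approximate quantile boundaries
FROM `mydataset.people`;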
I found gamars' approach quite intriguing and expanded on it a little by using scripting instead of the cross join. Notably, this approach also lets you change group sizes consistently, as here with group sizes that increase exponentially.
declare stats default
(select as struct min(new_confirmed) as min, max(new_confirmed) as max
from `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
);
declare group_amount default 10; -- change group amount here
SELECT
CAST(floor(
(ln(new_confirmed-stats.min+1)/ln(stats.max-stats.min+1)) * (group_amount-1))
AS INT64) group_flag,
concat('[',min(new_confirmed),',',max(new_confirmed),']') as group_value_range,
count(1) as quantity
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
GROUP BY group_flag
ORDER BY group_flag ASC
The basic approach is to label each value with its group_flag and then group by it. The flag is calculated by scaling the value down to a value between 0 and 1 and then scaling it back up to the range 0 to group_amount - 1.
I just took the log of the corrected value and of the range before their division to get the desired bias in group sizes. I also add 1 to make sure it doesn't try to take the log of 0.
You're looking for a single vector of information. I would normally query it like this:
select
count(*) as num,
integer( age / 10 ) as age_group
from mytable
group by age_group
A big case statement will be needed for arbitrary groups. It would be simple but much longer. My example should be fine if every bucket contains N years.
Take a look at custom SQL functions. The to_bin function works like this:
to_bin(10, [0, 100, 500]) => '... - 100'
to_bin(1000, [0, 100, 500, 0]) => '500 - ...'
to_bin(1000, [0, 100, 500]) => NULL
Read more here
https://github.com/AdamovichAleksey/BigQueryTips/blob/main/sql/functions/to_bins.sql
Any ideas and commits are welcome.
I am looking for a way to derive a weighted average from two rows of data with the same number of columns, where the average is as follows (borrowing Excel notation):
((A1*B1) + (A2*B2) + ... + (An*Bn)) / SUM(A1:An)
The first part reflects the same functionality as Excel's SUMPRODUCT() function.
My catch is that I need to dynamically specify which row gets averaged with weights, and which row the weights come from, and a date range.
EDIT: This is easier than I thought, because Excel was making me think I required some kind of pivot. My solution so far is thus:
select sum(baseSeries.Actual * weightSeries.Actual) / sum(weightSeries.Actual)
from (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Weighty'
) baseSeries inner join (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Tons Milled'
) weightSeries on baseSeries.RecordDate = weightSeries.RecordDate
Quassnoi's answer shows how to do the SumProduct, and using a WHERE clause would allow you to restrict by a Date field...
SELECT
SUM([tbl].data * [tbl].weight) / SUM([tbl].weight)
FROM
[tbl]
WHERE
[tbl].date >= '2009 Jan 01'
AND [tbl].date < '2010 Jan 01'
The more complex part is where you want to "dynamically specify" which field is [data] and which field is [weight]. The short answer is that realistically you'd have to make use of dynamic SQL. Something along the lines of:
- Create a string template
- Replace all instances of [tbl].data with the appropriate data field
- Replace all instances of [tbl].weight with the appropriate weight field
- Execute the string
Dynamic SQL, however, carries its own overhead. If the queries are relatively infrequent, or the execution time of the query itself is relatively long, this may not matter. If they are common and short, however, you may notice that using dynamic SQL introduces a noticeable overhead. (Not to mention being careful of SQL injection attacks, etc.)
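A minimal T-SQL sketch of that dynamic SQL idea (the column names here are placeholders that would be whitelisted before use, and [tbl] is the same stand-in table as above):
DECLARE @dataCol sysname = N'Actual';    -- placeholder: the field to average
DECLARE @weightCol sysname = N'Weight';  -- placeholder: the field holding the weights

DECLARE @sql nvarchar(max) =
      N'SELECT SUM(t.' + QUOTENAME(@dataCol) + N' * t.' + QUOTENAME(@weightCol) + N')'
    + N' / SUM(t.' + QUOTENAME(@weightCol) + N')'
    + N' FROM [tbl] AS t'
    + N' WHERE t.[date] >= @from AND t.[date] < @to;';

EXEC sp_executesql @sql, N'@from date, @to date', @from = '20090101', @to = '20100101';
Passing the dates as parameters to sp_executesql rather than concatenating them keeps the injection surface limited to the column names, which QUOTENAME plus a whitelist then cover.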
EDIT:
In your latest example you highlight three fields:
RecordDate
KPI
Actual
When the [KPI] is "Weight Y", then [Actual] the Weighting Factor to use.
When the [KPI] is "Tons Milled", then [Actual] is the Data you want to aggregate.
Some questions I have are:
Are there any other fields?
Is there only ever ONE actual per date per KPI?
The reason I ask is that you want to ensure the JOIN you do is only ever 1:1. (You don't want 5 Actuals joining with 5 Weights, giving 25 resulting records.)
Regardless, a slight simplification of your query is certainly possible...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
CalcProductionRecords AS [baseSeries]
INNER JOIN
CalcProductionRecords AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
-- AND [weightSeries].someOtherID = [baseSeries].someOtherID
WHERE
[baseSeries].KPI = 'Tons Milled'
AND [weightSeries].KPI = 'Weighty'
The commented-out line is only needed if you require additional predicates to ensure a 1:1 relationship between your data and the weights.
If you can't guarantee just one value per date, and don't have any other fields to join on, you can modify your subquery-based version slightly...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
(
SELECT
RecordDate,
SUM(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Tons Milled'
GROUP BY
RecordDate
)
AS [baseSeries]
INNER JOIN
(
SELECT
RecordDate,
AVG(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Weighty'
GROUP BY
RecordDate
)
AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
This assumes the AVG of the weight is valid if there are multiple weights for the same day.
EDIT : Someone just voted for this so I thought I'd improve the final answer :)
SELECT
SUM(Actual * Weight) / SUM(Weight)
FROM
(
SELECT
RecordDate,
SUM(CASE WHEN KPI = 'Tons Milled' THEN Actual ELSE NULL END) AS Actual,
AVG(CASE WHEN KPI = 'Weighty' THEN Actual ELSE NULL END) AS Weight
FROM
CalcProductionRecords
WHERE
KPI IN ('Tons Milled', 'Weighty')
GROUP BY
RecordDate
)
AS pivotAggregate
This avoids the JOIN and also only scans the table once.
It relies on the fact that NULL values are ignored when calculating the AVG().
SELECT SUM(A * B) / SUM(A)
FROM mytable
If I have understood the problem, then try this:
SET DATEFORMAT dmy
declare @tbl table(A int, B int, recorddate datetime, KPI varchar(50))
insert into @tbl
select 1, 10, '21/01/2009', 'Weighty' union all
select 2, 20, '10/01/2009', 'Tons Milled' union all
select 3, 30, '03/02/2009', 'xyz' union all
select 4, 40, '10/01/2009', 'Weighty' union all
select 5, 50, '05/01/2009', 'Tons Milled' union all
select 6, 60, '04/01/2009', 'abc' union all
select 7, 70, '05/01/2009', 'Weighty' union all
select 8, 80, '09/01/2009', 'xyz' union all
select 9, 90, '05/01/2009', 'kws' union all
select 10, 100, '05/01/2009', 'Tons Milled'

select SUM(t1.A * t2.A) / SUM(t2.A) Result
from (select RecordDate, A, B, KPI from @tbl) t1
inner join (select RecordDate, A, B, KPI from @tbl) t2
  on t1.RecordDate = t2.RecordDate
 and t1.KPI = t2.KPI