Max and Min group by 2 columns - SQL

I have the following table that shows me every time a car has its tank filled. It returns the date, the car ID, the mileage at that time, and the liters filled:
| Date | Vehicle_ID | Mileage | Liters |
| ---- | ---------- | ------- | ------ |
| 2016-10-20 | 234 | 123456 | 100 |
| 2016-10-20 | 345 | 458456 | 215 |
| 2016-10-20 | 323 | 756456 | 265 |
| 2016-10-25 | 234 | 123800 | 32 |
| 2016-10-26 | 345 | 459000 | 15 |
| 2016-10-26 | 323 | 756796 | 46 |
The idea is to calculate the average consumption by month (I can't do it by day because not every car fills the tank every day).
To get that, I tried sum(liters) / (max(mileage) - min(mileage)), grouped by month. But this only works for one specific car and one specific month.
If I try one specific car over several months, the max and min are not returned properly. With all the cars it is even worse, as the max and min are taken as if every car were the same.
select convert(char(7), Date, 127) as year_month,
       sum("Liters tanked") / (max("Mileage") - min("Mileage")) * 100 as Litres_per_100KM
from Tanking
where convert(varchar(10), "Date", 23) >= DATEADD(mm, -5, GETDATE())
group by convert(char(7), Date, 127)
This will not work, as it takes the max and min across all the cars.
The "workflow" should be this:
- For each month, get the max and min mileage for each car. Calculate max-min to get the mileage it rode that month. Sum the mileage for each car to get a total mileage driven by all the cars. Sum the liters tanked. Divide the total liters by the total mileage.
How can I get the result:
| YearMonth | Average |
| --------- | ------- |
| 2016-06 | 30 |
| 2016-07 | 32 |
| 2016-08 | 46 |
| 2016-09 | 34 |
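Expressed procedurally, the workflow above looks roughly like this minimal Python sketch (hypothetical sample rows, with the month pre-extracted as a string):

```python
from collections import defaultdict

# Hypothetical fill-up records: (year_month, vehicle_id, mileage, liters)
rows = [
    ("2016-10", 234, 123456, 100), ("2016-10", 234, 123800, 32),
    ("2016-10", 345, 458456, 215), ("2016-10", 345, 459000, 15),
]

# Step 1: per month and per car, track min/max mileage and total liters.
stats = defaultdict(lambda: [float("inf"), float("-inf"), 0])
for ym, car, mileage, liters in rows:
    s = stats[(ym, car)]
    s[0] = min(s[0], mileage)
    s[1] = max(s[1], mileage)
    s[2] += liters

# Step 2: per month, sum each car's (max - min) mileage and its liters.
totals = {}
for (ym, _car), (lo, hi, liters) in stats.items():
    km, l = totals.get(ym, (0, 0))
    totals[ym] = (km + (hi - lo), l + liters)

# Step 3: divide total liters by total mileage, scaled to 100 km.
per_100km = {ym: liters / km * 100 for ym, (km, liters) in totals.items()}
```

Note that, like the min/max query itself, this counts the liters of a month's first fill-up against only the mileage inside that month.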

This is a more complicated problem than it seems. The problem is that you don't want to lose miles between months. It is tempting to do something like this:
select year(date), month(date),
sum(liters) / (max(mileage) - min(mileage))
from Tanking
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
However, this misses miles and liters that span month boundaries. In addition, the liters on the first record of a month correspond to the previous mileage interval. Oops! That is not correct.
One way to fix this is to look up the next values. The query looks something like this:
select year(date), month(date),
sum(next_liters) / (max(next_mileage) - min(mileage))
from (select t.*,
lead(date) over (partition by vehicle_id order by date) as next_date,
lead(mileage) over (partition by vehicle_id order by date) as next_mileage,
lead(liters) over (partition by vehicle_id order by date) as next_liters
from Tanking t
) t
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
These queries use simplified column names, so escape characters don't interfere with the logic.
EDIT:
Oh, you have multiple cars (that is probably what Vehicle_ID is there for). You want two levels of aggregation. The first query would look like:
select yyyy, mm, sum(liters) as liters, sum(mileage_diff) as mileage_diff,
sum(mileage_diff) / sum(liters) as mileage_per_liter
from (select vehicle_id, year(date) as yyyy, month(date) as mm,
sum(liters) as liters,
(max(mileage) - min(mileage)) as mileage_diff
from Tanking
where Date >= dateadd(month, -5, getdate())
group by vehicle_id, year(date), month(date)
) t
group by yyyy, mm;
Similar changes to the second query (with vehicle_id in the partition by clauses) would work for the second version.
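The two-level aggregation can be checked on any engine; here is a small sqlite3 sketch with hypothetical data, using strftime('%Y-%m', ...) in place of year()/month() and dropping the date filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Tanking (Date TEXT, Vehicle_ID INT, Mileage INT, Liters INT);
    INSERT INTO Tanking VALUES
        ('2016-10-20', 234, 123456, 100), ('2016-10-25', 234, 123800, 32),
        ('2016-10-20', 345, 458456, 215), ('2016-10-26', 345, 459000, 15);
""")
# Inner query: per car per month; outer query: totals per month.
rows = con.execute("""
    SELECT ym, SUM(liters) AS liters, SUM(mileage_diff) AS mileage_diff
    FROM (SELECT Vehicle_ID, strftime('%Y-%m', Date) AS ym,
                 SUM(Liters) AS liters,
                 MAX(Mileage) - MIN(Mileage) AS mileage_diff
          FROM Tanking
          GROUP BY Vehicle_ID, ym) t
    GROUP BY ym
""").fetchall()
```

With these rows the result is a single ('2016-10', 362, 888) row: 344 km + 544 km of per-car mileage and 362 liters in total.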

Try to get the sums per car per month in a subquery. Then calculate the average per month in an outer query using the values of the subquery:
select year_month,
(1.0*sum(liters_per_car)/sum(mileage_per_car))*100.0 as Litres_per_100KM
from (
select convert(char(7), [Date], 127) as year_month,
sum(Liters) as liters_per_car,
max(Mileage)-min(Mileage) as mileage_per_car
from Tanking
group by convert(char(7), [Date], 127), Vehicle_ID) as t
group by year_month

You can use a CTE to compute the mileage difference per fill-up with lag(), and then calculate the consumption.
You can check it here: http://rextester.com/OKZO55169
with cte (car, datec, difm, liters)
as
(
select
car,
datec,
mileage - lag(mileage,1,mileage) over(partition by car order by car, mileage) as difm,
liters
from #consum
)
select
car,
year(datec) as [year],
month(datec) as [month],
((cast(sum(liters) as float)/cast(sum(difm) as float)) * 100.0) as [l_100km]
from
cte
group by
car, year(datec), month(datec)
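In case the rextester link goes away, the same lag() idea can be reproduced with sqlite3 (3.25+ for window functions; hypothetical data, with the #consum temp table renamed to consum):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE consum (car INT, datec TEXT, mileage INT, liters INT);
    INSERT INTO consum VALUES
        (234, '2016-10-20', 123456, 100), (234, '2016-10-25', 123800, 32),
        (345, '2016-10-20', 458456, 215), (345, '2016-10-26', 459000, 15);
""")
rows = con.execute("""
    WITH cte AS (
        SELECT car, datec, liters,
               -- mileage driven since this car's previous fill-up (0 for the first)
               mileage - lag(mileage, 1, mileage)
                   OVER (PARTITION BY car ORDER BY mileage) AS difm
        FROM consum
    )
    SELECT car, strftime('%Y-%m', datec) AS ym,
           100.0 * SUM(liters) / SUM(difm) AS l_100km
    FROM cte
    GROUP BY car, ym
    HAVING SUM(difm) > 0
    ORDER BY car
""").fetchall()
```

As with the min/max approach, the first fill-up's liters are summed while its mileage difference is 0, so the per-month figure skews slightly high.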

Related

BigQuery for running count of distinct values with a dynamic date-range

We are trying to write a query that returns the count of unique customers for a specific year-month, plus the count of unique customers over the 364 days before that date.
For example:
Our customer-table looks like this:
| order_date | customer_unique_id |
| -------- | -------------- |
| 2020-01-01 | tom@email.com |
| 2020-01-01 | daisy@email.com |
| 2019-05-02 | tom@email.com |
In this example we have two customers who ordered on 2020-01-01 and one of them already ordered within the 364-days timeframe.
The desired table should look like this:
| year_month | unique_customers |
| -------- | -------------- |
| 2020-01 | 2 |
We tried multiple solutions, such as partitioning and window functions, but nothing seems to work correctly. The tricky part is the uniqueness: we want to look 364 days back but do a count distinct on the customers over that whole period, not per date/year/month, because then we would get duplicates. For example, if you partition by date, year or month, tom@email.com would be counted twice instead of once.
The goal of this query is to get insight into the order frequency (orders divided by customers) over a 12-month period.
We work with Google BigQuery.
Hope someone can help us out! :)
Here is a way to achieve your desired results. Note that this query computes the year-month distinct counts in a separate query, and joins them with the rolling 364-day-interval query.
with year_month_distincts as (
select
concat(
cast(extract(year from order_date) as string),
'-',
cast(extract(month from order_date) as string)
) as year_month,
count(distinct customer_unique_id) as ym_distincts
from customer_table
group by 1
)
select x.order_date, x.ytd_distincts, y.ym_distincts from (
select
a.order_date,
(select
count(distinct customer_unique_id)
from customer_table b
where b.order_date between date_sub(a.order_date, interval 364 day) and a.order_date
) as ytd_distincts
from customer_table a
group by 1
) x
join year_month_distincts y on concat(
cast(extract(year from x.order_date) as string),
'-',
cast(extract(month from x.order_date) as string)
) = y.year_month
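The rolling 364-day distinct count itself is easy to sanity-check in plain Python (hypothetical data taken from the question's example):

```python
from datetime import date, timedelta

# Hypothetical orders: (order_date, customer_unique_id)
orders = [
    (date(2020, 1, 1), "tom@email.com"),
    (date(2020, 1, 1), "daisy@email.com"),
    (date(2019, 5, 2), "tom@email.com"),
]

def distinct_in_window(end, days=364):
    """Distinct customers with an order in [end - days, end]."""
    start = end - timedelta(days=days)
    return len({cust for d, cust in orders if start <= d <= end})
```

For 2020-01-01 this counts tom once and daisy once, matching the desired value of 2 for 2020-01 even though tom also ordered inside the 364-day window.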
Two options using arrays that may help:
- Look back 364 days, as requested.
- Look back 11 months (option 2, in the commented line), in case reporting is monthly.
with month_array AS (
SELECT
DATE_TRUNC(order_date,month) AS order_month,
STRING_AGG(DISTINCT customer_unique_id) AS cust_mth
FROM customer_table
GROUP BY 1
),
year_array AS (
SELECT
order_month,
STRING_AGG(cust_mth) OVER(ORDER by UNIX_DATE(order_month) RANGE BETWEEN 364 PRECEDING AND CURRENT ROW) cust_12m
-- (option 2) STRING_AGG(cust_mth) OVER (ORDER by cast(format_date('%Y%m', order_month) as int64) RANGE BETWEEN 99 PRECEDING AND CURRENT ROW) AS cust_12m
FROM month_array
)
SELECT format_date('%Y-%m',order_month) year_month,
(SELECT COUNT(DISTINCT cust_unique_id) FROM UNNEST(SPLIT(cust_12m)) AS cust_unique_id) as unique_12m
FROM year_array

Counting number of orders per customer

I have a table with the following columns: date, customers_id, and orders_id (unique).
I want to add a column in which, for each order_id, I can see how many times the given customer has already placed an order during the previous year.
e.g. this is what it would look like:
| customers_id | orders_id | date | order_rank |
| ------------ | --------- | ---------- | ---------- |
| 2083 | 4725 | 2018-08-31 | 1 |
| 2573 | 4773 | 2018-09-03 | 1 |
| 3393 | 3776 | 2017-09-11 | 1 |
| 3393 | 4172 | 2018-01-09 | 2 |
| 3393 | 4655 | 2018-08-17 | 3 |
I'm doing this in BigQuery, thank you!
Use count(*) with a window frame. Ideally, you would use an interval. But BigQuery doesn't (yet) support that syntax. So convert to a number:
select t.*,
count(*) over (partition by customer_id
order by unix_date(date)
range between 364 preceding and current row
) as order_rank
from t;
This treats a year as 365 days, which seems suitable for most purposes.
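The frame semantics (364 preceding days plus the current day) can be double-checked in plain Python against the question's example rows:

```python
from datetime import date, timedelta

# Hypothetical orders: (customers_id, orders_id, date)
orders = [
    (3393, 3776, date(2017, 9, 11)),
    (3393, 4172, date(2018, 1, 9)),
    (3393, 4655, date(2018, 8, 17)),
]

# For each order, count this customer's orders in the 365-day window
# ending on the order date (364 preceding days + the current day).
ranks = {}
for cust, oid, d in orders:
    start = d - timedelta(days=364)
    ranks[oid] = sum(1 for c, _o, d2 in orders
                     if c == cust and start <= d2 <= d)
```

This reproduces the order_rank column of 1, 2, 3 for customer 3393.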
I suggest that you use the over clause and restrict the data in your where clause. You don't really need a window frame for your case. If you consider a year to be the period from 365 days ago until now, this is gonna work:
select t.*,
count(*) over (partition by customer_id
order by date
) as c
from `your-table` t
where date > DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
order by customer_id, c
If you need some specific year, for example 2019, you can do something like:
select t.*,
count(*) over (partition by customer_id
order by date
) as c
from `your-table` t
where date between cast("2019-01-01" as date) and cast("2019-12-31" as date)
order by customer_id, c

Calculating cumulative sum over dates but excluding data that is removed on later dates

Let me see if I can explain it well.
I have data that looks like this, for fixed deposits:
| Placement Date | Maturity Date | Amount |
| -------------- | ------------- | ------ |
| 2020-01-30 | 2020-03-30 | 50000 |
| 2020-02-05 | 2020-05-28 | 20000 |
| 2020-03-31 | 2020-05-30 | 7000 |
| 2020-04-13 | 2020-07-30 | 60000 |
My desired output would be the cumulative amount on a monthly basis, excluding deposits that have already matured:
| Month | Amount |
| ----- | ------ |
| 2020-01-31 | 50000 |
| 2020-02-29 | 70000 |
| 2020-03-31 | 27000 (the 50000 matured on 03-30) |
| 2020-04-30 | 87000 |
| 2020-05-31 | 60000 (the 20000 and 7000 matured on 05-28 and 05-30) |
So far I have used the over clause to get the cumulative sum, but I haven't a clue how to remove the deposits that have matured in the following months.
Thanks in advance
The easiest solution is to use a calendar table (you can easily find examples on how to build one online, it can be reused for other reporting purposes).
Then you can do something like this.
SELECT d.theDate, SUM(dp.Amount)
FROM dbo.AllDates d
JOIN dbo.Deposits dp
ON d.theDate BETWEEN dp.placementDate AND dp.MaturityDate
WHERE d.LastDayOfTheMonth = 1
GROUP BY d.theDate
I am thinking of a recursive cte to generate the dates, then a join on the original table, and aggregation:
with cte as (
select
datefromparts(year(min(placement_date)), month(min(placement_date)), 1) dt,
max(maturity_date) max_dt
from mytable
union all
select
dateadd(month, 1, dt),
max_dt
from cte
where dateadd(month, 1, dt) < max_dt
)
select eomonth(c.dt) month, coalesce(sum(t.amount), 0) amount
from cte c
left join mytable t
on eomonth(c.dt) between t.placement_date and t.maturity_date
group by c.dt
order by c.dt
Demo on DB Fiddle:
month | amount
:--------- | -----:
2020-01-31 | 50000
2020-02-29 | 70000
2020-03-31 | 27000
2020-04-30 | 87000
2020-05-31 | 60000
2020-06-30 | 60000
2020-07-31 | 0
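The calendar/recursive-CTE logic boils down to: for each month-end, sum the deposits whose placement-to-maturity span covers that date. A minimal Python sketch with the question's figures:

```python
from datetime import date, timedelta

# (placement_date, maturity_date, amount)
deposits = [
    (date(2020, 1, 30), date(2020, 3, 30), 50000),
    (date(2020, 2, 5),  date(2020, 5, 28), 20000),
    (date(2020, 3, 31), date(2020, 5, 30), 7000),
    (date(2020, 4, 13), date(2020, 7, 30), 60000),
]

def eomonth(y, m):
    """Last day of month m in year y."""
    return date(y, m + 1, 1) - timedelta(days=1) if m < 12 else date(y, 12, 31)

# A deposit counts on a month-end iff placement <= month-end <= maturity.
outstanding = {}
for m in range(1, 8):
    eom = eomonth(2020, m)
    outstanding[eom] = sum(amt for p, mat, amt in deposits if p <= eom <= mat)
```

This reproduces the fiddle output above, including the 27000 for March (the 50000 matured on 03-30) and 0 for July.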
You can use a recursive CTE directly on each row. That is, expand each row to determine what the value is at the end of the month. And then aggregate:
with cte as (
select eomonth(placement_date) as eom, maturity_date, amount
from t
where eomonth(placement_date) < maturity_date
union all
select eomonth(dateadd(month, 1, eom)), maturity_date,
(case when eomonth(dateadd(month, 1, eom)) < maturity_date then amount else 0 end)
from cte
where eomonth(dateadd(month, 1, eom)) <= eomonth(maturity_date)
)
select eom, sum(amount)
from cte
group by eom
order by eom;
There are two subtleties in this query:
Adding one month to an end-of-month date may not land on the last day of the next month. So, 2020-02-29 + 1 month is 2020-03-29 rather than 2020-03-31. Hence the use of eomonth() in the recursive part.
You want the final month for the maturity date to be in the result set. Hence the <= in the recursive part.
Here is a db<>fiddle.

BigQuery: Repeat the same calculated value in multiple rows

I'm trying to get several simple queries into one new table using Google BigQuery. The final table contains existing revenue data per day (which I can simply draw from another table). I then want to calculate the average revenue per day of the current month and carry that value forward until the end of the month. So the final table is updated every day and includes actual data and forecasted data.
So far, I came up with the following, which generates an error message when combined: Scalar subquery produced more than one element
#This gives me the date, the revenue per day and the info that it's actual data
SELECT
date, sum(revenue), 'ACTUAL' as type from `project.dataset.table` where date >"2020-01-01" and date < current_date() group by date
union distinct
# This shall provide the remaining dates of the current month
SELECT
(select calendar_date FROM `project.dataset.calendar_table` where calendar_date >= current_date() and calendar_date <=DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY)),
#This shall provide the average revenue per day so far and write this value for each day of the remaining month
(SELECT avg(revenue_daily) FROM
(select sum(revenue) as revenue_daily from `project.dataset.table` WHERE date > "2020-01-01" and extract(month from date) = extract (month from current_date()) group by date) as average_daily_revenue where calendar >= current_date()),
'FORECAST'
How I wish the final data looks like:
+------------+------------+----------+
| date | revenue | type |
+------------+------------+----------+
| 01.04.2020 | 100 € | ACTUAL |
| … | 5.000 € | ACTUAL |
| 23.04.2020 | 200 € | ACTUAL |
| 24.04.2020 | 230,43 € | FORECAST |
| 25.04.2020 | 230,43 € | FORECAST |
| 26.04.2020 | 230,43 € | FORECAST |
| 27.04.2020 | 230,43 € | FORECAST |
| 28.04.2020 | 230,43 € | FORECAST |
| 29.04.2020 | 230,43 € | FORECAST |
| 30.04.2020 | 230,43 € | FORECAST |
+------------+------------+----------+
The forecast value is simply the sum of the actual revenue of the month divided by the number of days the month had so far.
Thanks for any hint on how to approach this.
I just figured something out which creates the data I need. I still have to make it update automatically every day, but this is what I got so far:
select
date, 'actual' as type, sum(revenue) as revenue from `project.dataset.revenue` where date >="2020-01-01" and date < current_date() group by date
union distinct
select calendar_date, 'forecast',
    (SELECT avg(revenue_daily)
     FROM (select sum(revenue) as revenue_daily
           from `project.dataset.revenue`
           WHERE extract(year from date) = extract(year from current_date())
             and extract(month from date) = extract(month from current_date())
           group by date) as average_daily_revenue)
FROM `project.dataset.calendar`
where calendar_date >= current_date()
  and calendar_date <= DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY)
order by date
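The actual-plus-forecast union reduces to a simple computation. Here is a rough Python sketch with "today" pinned and hypothetical flat revenue so the numbers are deterministic:

```python
from datetime import date
from calendar import monthrange

today = date(2020, 4, 24)  # pinned stand-in for current_date()
# Hypothetical actual revenue per day for the current month so far.
actual = {date(2020, 4, d): 100.0 for d in range(1, today.day)}

# Forecast value: month-to-date revenue divided by days elapsed.
avg_per_day = sum(actual.values()) / len(actual)
days_in_month = monthrange(today.year, today.month)[1]

rows = [(d, rev, "ACTUAL") for d, rev in sorted(actual.items())]
rows += [(date(today.year, today.month, d), avg_per_day, "FORECAST")
         for d in range(today.day, days_in_month + 1)]
```

With 23 actual days of 100 each, the forecast rows for the 24th through the 30th all carry the average of 100, giving 30 rows for April in total.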

Aggregating over multiple groups in Postgres

I have a table with the following data:
+------------+-------------+---------------+
| shop_id    | visit_date  | visit_reason  |
+------------+-------------+---------------+
| A          | 2010-06-14  | shopping      |
| A          | 2010-06-15  | browsing      |
| B          | 2010-06-16  | shopping      |
| B          | 2010-06-14  | stealing      |
+------------+-------------+---------------+
I need to build up an aggregate table that is grouped by shop, year, month, activity as well as total values for year and month. For example, if Shop A has 10 sales a month and 2 thefts a month and no other types of visit then the return would look like:
shop_id, year, month, reason, reason_count, month_count, year_count
A, 2010, 06, shopping, 10, 12, 144
A, 2010, 06, stealing, 2, 12, 144
Where month_count is the total number of visits, of any type, to the shop for 2010-06. Year-count is the same except for 2010.
I can get everything except the month and year counts with:
SELECT
    shop_id,
    extract(year from visit_date) as year,
    extract(month from visit_date) as month,
    visit_reason as reason,
    count(visit_reason) as reason_count
FROM shop_visits
GROUP BY shop_id, year, month, reason
Should I be using some kind of CTE to double group by?
You can use window functions to add up the counts. The following is phrased using date_trunc(), which I find more convenient for aggregating by month:
select shop_id, date_trunc('month', visit_date) as yyyymm, reason,
count(*) as month_count,
sum(count(*)) over (partition by shop_id, date_trunc('year', min(visit_date))) as year_count
from t
group by shop_id, date_trunc('month', visit_date), reason;
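The same idea can be phrased with a subquery feeding the window functions, which also makes it easy to verify on sqlite3 (hypothetical visits; strftime() replaces date_trunc()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE shop_visits (shop_id TEXT, visit_date TEXT, visit_reason TEXT);
    INSERT INTO shop_visits VALUES
        ('A', '2010-06-14', 'shopping'), ('A', '2010-06-15', 'browsing'),
        ('A', '2010-06-20', 'shopping'), ('A', '2010-07-01', 'shopping');
""")
# Inner query: per shop/month/reason counts. Outer query: window sums
# over shop+month and over shop+year give the month and year totals.
rows = con.execute("""
    SELECT shop_id, yyyymm, visit_reason, reason_count,
           SUM(reason_count) OVER (PARTITION BY shop_id, yyyymm) AS month_count,
           SUM(reason_count) OVER (PARTITION BY shop_id,
                                   substr(yyyymm, 1, 4)) AS year_count
    FROM (SELECT shop_id, strftime('%Y-%m', visit_date) AS yyyymm,
                 visit_reason, COUNT(*) AS reason_count
          FROM shop_visits
          GROUP BY shop_id, yyyymm, visit_reason) t
    ORDER BY yyyymm, visit_reason
""").fetchall()
```

Each row keeps its own reason_count while month_count and year_count repeat the shared totals, exactly the shape the question asks for.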