Optimize Average of Averages SQL Query - sql

I have a table where each row is a vendor with a sale made on some date.
I'm trying to compute average daily sales per vendor for the year 2019 as a single number, which I think means I want to compute an average of averages.
This is the query I'm considering, but it takes a very long time on this large table. Is there a smarter way to compute this average without this much nesting? I have a feeling I'm scanning rows more times than I need to.
-- Average of all vendors' average daily sale counts
SELECT AVG(vendor_avgs.avg_daily_sales) avg_of_avgs
FROM (
-- Get average number of daily sales for each vendor
SELECT vendor_daily_totals.vendorid, AVG(vendor_daily_totals.cnt) avg_daily_sales
FROM (
-- Get total number of sales for each vendor on each day
SELECT vendorid, COUNT(*) cnt
FROM vendor_sales
WHERE year = 2019
GROUP BY vendorid, month, day
) vendor_daily_totals
GROUP BY vendor_daily_totals.vendorid
) vendor_avgs;
I'm curious if there is in general a way to compute an average of averages more efficiently.
This is running in Impala, by the way.

I think you can just do the calculation in one shot:
SELECT AVG(t.avgs)
FROM (
SELECT vendorid,
COUNT(*) * 1.0 / COUNT(DISTINCT month, day) as avgs
FROM vendor_sales
WHERE year = 2019
GROUP BY vendorid
) t
This gets the total and divides by the number of days. However, COUNT(DISTINCT) might be even slower than nested GROUP BYs in Impala, so you need to test this.
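As a rough check that the one-shot form matches the nested form, here is a small sketch that runs both against an in-memory SQLite database. SQLite stands in for Impala purely for illustration, the sample rows are made up, and because SQLite does not accept COUNT(DISTINCT month, day) with two columns, the sketch combines them into one expression (month * 100 + day), which is an assumption of this example and not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendor_sales (vendorid INT, year INT, month INT, day INT)")
# Made-up sample rows: one row per sale.
rows = [
    (1, 2019, 1, 1), (1, 2019, 1, 1), (1, 2019, 1, 2),                   # vendor 1: 3 sales over 2 days
    (2, 2019, 3, 5), (2, 2019, 3, 5), (2, 2019, 3, 5), (2, 2019, 3, 5),  # vendor 2: 4 sales on 1 day
    (2, 2018, 1, 1),                                                     # filtered out by year = 2019
]
conn.executemany("INSERT INTO vendor_sales VALUES (?, ?, ?, ?)", rows)

# Nested form: daily counts -> per-vendor average -> average of averages.
nested_sql = """
SELECT AVG(avg_daily_sales) FROM (
  SELECT vendorid, AVG(cnt) AS avg_daily_sales FROM (
    SELECT vendorid, COUNT(*) AS cnt
    FROM vendor_sales
    WHERE year = 2019
    GROUP BY vendorid, month, day
  ) GROUP BY vendorid
)
"""

# One-shot form: total sales divided by number of distinct sale days per vendor.
one_pass_sql = """
SELECT AVG(avgs) FROM (
  SELECT vendorid,
         COUNT(*) * 1.0 / COUNT(DISTINCT month * 100 + day) AS avgs
  FROM vendor_sales
  WHERE year = 2019
  GROUP BY vendorid
)
"""

nested = conn.execute(nested_sql).fetchone()[0]
one_pass = conn.execute(one_pass_sql).fetchone()[0]
print(nested, one_pass)
```

Both forms only count days on which a vendor had at least one sale, so they agree; neither treats days without sales as zero.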

Related

Finding the initial sampled time window after using SAMPLE BY again

I can't seem to find a perhaps easy solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest gallon price.
Output
Using the following query, I can get the average gallon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                    avg_per_month
2000-06-05T00:00:00.000000Z  1.6724
2000-07-05T00:00:00.000000Z  1.69275
2000-08-05T00:00:00.000000Z  1.635
...                          ...
Then I take all these monthly averages, group them by year, and return the maximum gallon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                    max_per_year
2000-01-05T00:00:00.000000Z  1.69275
2001-01-05T00:00:00.000000Z  1.767399999999
2002-01-05T00:00:00.000000Z  1.52075
...                          ...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum gallon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as its average price? I'd like to display this month in an additional column.
For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.
timestamp                    max_per_year    which_month_is_responsible
2000-01-05T00:00:00.000000Z  1.69275         2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z  1.767399999999  2001-05-05T00:00:00.000000Z
...                          ...
What did I try?
I tried by adding a subquery to the SELECT to have a "duplicate" of some sort for the timestamp column but that's apparently never valid in QuestDB (?), so probably the solution is by adding even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
SELECT timestamp_floor('d', ts) as day,
ts,
avg_in_15_min as max_per_day,
row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day
FROM
(
SELECT ts, avg(consumption) as avg_in_15_min
FROM electricity
SAMPLE BY 15m
)
) WHERE rn_per_day = 1
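The same row_number() pattern can be sketched for the month-within-year case from the question. The example below is only an illustration under several assumptions: it runs on SQLite (3.25 or newer, for window functions) instead of QuestDB, it starts from a hypothetical `monthly` table that already holds the per-month averages, months are stored as 'YYYY-MM' text, and the 2001 rows are made-up filler values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the already-computed monthly averages.
conn.execute("CREATE TABLE monthly (month TEXT, avg_per_month REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?)", [
    ("2000-06", 1.6724), ("2000-07", 1.69275), ("2000-08", 1.635),  # values quoted in the question
    ("2001-04", 1.70), ("2001-05", 1.7674),                          # illustrative filler
])

# Rank the months inside each year by average price, then keep the top one.
sql = """
SELECT year, avg_per_month AS max_per_year, month AS which_month_is_responsible
FROM (
  SELECT substr(month, 1, 4) AS year,
         month,
         avg_per_month,
         row_number() OVER (PARTITION BY substr(month, 1, 4)
                            ORDER BY avg_per_month DESC) AS rn
  FROM monthly
)
WHERE rn = 1
ORDER BY year
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

Because the winning row is kept whole, the month that produced the maximum comes along with the maximum itself, which a plain max() aggregate cannot do.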

SQL - calculate average frequency along with other aggregations

I'm trying to make a query for a table where each row is an order. This query should get the following numbers for every day of the week:
[x] Total count of orders
[x] Sum of total_amount
[x] Total count of unique customers
[ ] Average order frequency of customers
I can't find a way to get the last one (average order frequency). I have tried the query below, but I get an ERROR: aggregate function calls cannot be nested:
SELECT weekday,
COUNT(*) AS orders,
SUM(total_amount) AS revenue,
COUNT(DISTINCT customer_id) AS customers,
AVG(COUNT(DISTINCT customer_id)) AS avg_order_freq -- Average order frequency
FROM orders
GROUP BY weekday
I was hoping there was a single aggregate function for this. It would've been easier if I only had to get the average order frequency, but I also have to aggregate other columns.
I'm using PostgreSQL.
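One possible reading of "average order frequency" is the mean number of orders per customer within each weekday. Under that assumption, the mean of the per-customer order counts equals total orders divided by distinct customers, so no nested aggregate is needed. The sketch below runs on SQLite with made-up rows for illustration; the expression uses only COUNT and division, which PostgreSQL also supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (weekday TEXT, customer_id INT, total_amount REAL)")
# Made-up sample rows: one row per order.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("Mon", 1, 10.0), ("Mon", 1, 20.0), ("Mon", 2, 5.0),
    ("Tue", 3, 7.5),
])

# Orders per distinct customer; "* 1.0" avoids integer division.
sql = """
SELECT weekday,
       COUNT(*) AS orders,
       SUM(total_amount) AS revenue,
       COUNT(DISTINCT customer_id) AS customers,
       COUNT(*) * 1.0 / COUNT(DISTINCT customer_id) AS avg_order_freq
FROM orders
GROUP BY weekday
ORDER BY weekday
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

If "frequency" is meant differently (for example, average time between a customer's orders), this ratio does not apply and a subquery per customer would be needed instead.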

How to apply aggregate functions to results of other aggregate functions in single query?

I have a table BIKE_TABLE containing columns Rented_Bike_Count, Hour, and Season. My goal is to determine average hourly rental count per season, as well as the MIN, MAX, and STDDEV of the average hourly rental count per season. I need to do this in a single query.
I used:
SELECT
SEASONS,
HOUR,
ROUND(AVG(RENTED_BIKE_COUNT),2) AS AVG_RENTALS_PER_HR
FROM TABLE
GROUP BY HOUR, SEASONS
ORDER BY SEASONS
and this gets me close, returning 96 rows (4 seasons x 24 hours per) like:
SEASON  HOUR  AVG_RENTALS_PER_HR
Autumn  0     709.44
Autumn  1     552.5
Autumn  2     377.48
Autumn  3     256.55
But I cannot figure out how to return the following results that use ROUND(AVG(RENTED_BIKE_COUNT)) as their basis:
What is the average hourly rental count per season? The answer should be four lines, like: Autumn, [avg. number of bikes rented per hour]
What is the MIN of the average hourly rental count per season?
Same for MAX
Same for STDDEV.
I tried running
MIN(AVG(RENTED_BIKE_COUNT)) AS MIN_AVG_HRLY_RENTALS_BY_SEASON,
MAX(AVG(RENTED_BIKE_COUNT)) AS MAX_AVG_HRLY_RENTALS_BY_SEASON,
STDDEV(AVG(RENTED_BIKE_COUNT)) AS STNDRD_DEV_AVG_HRLY_RENTALS_BY_SEASON
as nested SELECT and then as nested FROM clauses, but I cannot seem to get it right. Am I close? Any assistance greatly appreciated.
I think that you are overcomplicating the task. Does this give you your answers? If not, please tell me the difference between its output and your desired output.
Of course, you can add ROUND() to each column as you see fit.
SELECT
SEASONS,
MIN(RENTED_BIKE_COUNT) minimum,
MAX(RENTED_BIKE_COUNT) maximum,
STDDEV(RENTED_BIKE_COUNT) sDev,
AVG(RENTED_BIKE_COUNT) average
FROM TABLE
GROUP BY SEASONS
ORDER BY SEASONS;
According to your comment, it seems that you may want the following query.
WITH seasons AS(
SELECT
Season,
AVG(RENTED_BIKE_COUNT) seasonAverage
FROM TABLE
GROUP BY season)
SELECT
AVG(seasonAverage) average,
MIN(seasonAverage) minimum,
MAX(seasonAverage) maximum,
STDDEV(seasonAverage) sDev
FROM
seasons;
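If the goal is instead the MIN, MAX, and average of the hourly averages within each season, a two-level aggregation along the same lines might look like the sketch below. This is an illustration only: it runs on SQLite with made-up rows, the table is called bike_table with lowercase column names as an assumption, and STDDEV is left out because SQLite has no built-in for it (in a database that provides STDDEV it would be added to the outer SELECT the same way).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bike_table (seasons TEXT, hour INT, rented_bike_count INT)")
# Made-up sample rows.
conn.executemany("INSERT INTO bike_table VALUES (?, ?, ?)", [
    ("Autumn", 0, 700), ("Autumn", 0, 720),  # hour 0 averages 710
    ("Autumn", 1, 550),                      # hour 1 averages 550
    ("Winter", 0, 100), ("Winter", 1, 300),
])

# Inner step: average per (season, hour). Outer step: aggregate those averages per season.
sql = """
WITH hourly AS (
  SELECT seasons, hour, AVG(rented_bike_count) AS avg_rentals_per_hr
  FROM bike_table
  GROUP BY seasons, hour
)
SELECT seasons,
       AVG(avg_rentals_per_hr) AS avg_of_hourly_avgs,
       MIN(avg_rentals_per_hr) AS min_hourly_avg,
       MAX(avg_rentals_per_hr) AS max_hourly_avg
FROM hourly
GROUP BY seasons
ORDER BY seasons
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

The CTE plays the role of the asker's 96-row result, and the outer query collapses it to one row per season.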

BigQuery: Calculating averages in daily partitioned tables

I have a problem with getting averages out of several partitioned daily tables. We have partitioned tables for every day. I want to have an SQL query that calculates requests average for N days grouped by country.
So this is the schema:
date (string)
country (string)
req (integer)
What I have until now:
SELECT country, avg(req) as AvgReq
FROM TABLE_DATE_RANGE([thePartitionedTable_],
DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY country
This works for 1 day of course, but the data is skewed when I try it for 2 or more days. What is the problem in my logic? How does the AVG() function work in this case? Do I need to group by date as well?
So I want the daily average of thePartitionedTable_today and the daily average of thePartitionedTable_yesterday, and then the average of those averages, if that makes sense. So if thePartitionedTable_today has a daily average of 2 for Nigeria and thePartitionedTable_yesterday had a daily average of 3 for Nigeria, then the average for Nigeria over those two days should be 2.5. I really appreciate your time!
Using standard SQL:
with avg_byday AS (
SELECT
country,
AVG(req) AS req_avg
FROM
`thePartitionedTable_*`
GROUP BY
_TABLE_SUFFIX,
country)
SELECT
country,
AVG(req_avg)
FROM
avg_byday
GROUP BY
country
The subquery will also give you average requests per country for each day.
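To see why a single AVG(req) over several days looks skewed, compare it with the average of daily averages on a tiny data set. The sketch below uses SQLite in place of BigQuery, a hypothetical `requests` table whose `day` column stands in for _TABLE_SUFFIX, and made-up rows chosen to match the Nigeria example from the question (daily averages of 2 and 3).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "day" plays the role of _TABLE_SUFFIX in the wildcard-table query.
conn.execute("CREATE TABLE requests (day TEXT, country TEXT, req INT)")
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", [
    ("today", "Nigeria", 2), ("today", "Nigeria", 2), ("today", "Nigeria", 2),  # daily average 2
    ("yesterday", "Nigeria", 3),                                                 # daily average 3
])

# Pooled average: every row has equal weight, so the day with more rows dominates.
pooled = conn.execute(
    "SELECT AVG(req) FROM requests WHERE country = 'Nigeria'"
).fetchone()[0]

# Average of daily averages: every day has equal weight.
avg_of_daily = conn.execute("""
    SELECT AVG(req_avg) FROM (
      SELECT day, country, AVG(req) AS req_avg
      FROM requests
      GROUP BY day, country
    )
    WHERE country = 'Nigeria'
""").fetchone()[0]

print(pooled, avg_of_daily)
```

The two numbers only coincide when every day contributes the same number of rows, which is why grouping by the day first matters.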

SQL daily sum report

I have two tables. One is income, which consists of ID, income_amount and date;
the other is expenses, which consists of ID, amount_spent and date.
I'm trying to display a table with three columns: the daily total of income, the daily total of amount spent, and the date for that day. A given day may have no income or no expenses (not necessarily both).
I was able to display the table in Visual C# by gathering them in individual tables and deriving the results programmatically, but is there a way to achieve that table with just a single SQL query?
trunc_to_day here is a hypothetical function which truncates a date to its day (you didn't specify what RDBMS you are using):
select sum(incomes), sum(spent), day from (
(select income_amount incomes, 0 spent, trunc_to_day(datecol) day from income_table)
union all
(select 0 incomes, amount_spent spent, trunc_to_day(datecol) day from spent_table)
) group by day;
Finally, if you want to limit the report to certain days, add a WHERE clause on day.
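For a concrete version of the UNION ALL approach, here is a sketch on SQLite, where the built-in date() function plays the role of the hypothetical trunc_to_day. The table and column names follow the question (income, expenses, income_amount, amount_spent, date); the rows and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE income (id INT, income_amount REAL, date TEXT)")
conn.execute("CREATE TABLE expenses (id INT, amount_spent REAL, date TEXT)")
# Made-up sample rows.
conn.executemany("INSERT INTO income VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01 09:00:00"),
    (2, 50.0, "2024-01-01 17:30:00"),
    (3, 20.0, "2024-01-03 10:00:00"),
])
conn.executemany("INSERT INTO expenses VALUES (?, ?, ?)", [
    (1, 30.0, "2024-01-01 12:00:00"),
    (2, 10.0, "2024-01-02 08:00:00"),
])

# Stack both tables into one (incomes, spent, day) shape, then total per day.
sql = """
SELECT SUM(incomes) AS total_income, SUM(spent) AS total_spent, day
FROM (
  SELECT i.income_amount AS incomes, 0 AS spent, date(i.date) AS day FROM income i
  UNION ALL
  SELECT 0 AS incomes, e.amount_spent AS spent, date(e.date) AS day FROM expenses e
) AS combined
GROUP BY day
ORDER BY day
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

Padding the missing side with 0 is what lets days with only income or only expenses still appear, with a zero total in the other column.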