SQL Aggregate Function over partitions

I'm relatively new to SQL but have learned some cool stuff. I'm getting results that don't make sense. I've got a query with several subqueries and what-not but I have a windowed function that isn't working like I'm expecting.
The part that isn't working is this (simplified from the 300 line query):
SELECT AVG(table.sales_amount)
OVER (PARTITION BY table.month, table.sales_rep, table.department)
FROM table
The problem is that when I pull the data non-aggregated, I get a different value (107) than the query above returns (95).
I've used windowed functions for COUNT and SUM and they work fine, but AVG is acting strangely. Am I missing something about how this works with AVG?
The subquery that table is a standin for looks like:
sales_rep, month, department, sales_amount
1, 2017-1, abc, 125.20
1, 2017-2, abc, 120.00
2, 2017-1, def, 100.00
...etc
Working out of SQL Server Management Studio.
SOLVED: I did finally figure it out. The results I was joining this subquery to had the sales rep multiple times in a month (selling objects A & B), which caused whoever sold both to be counted twice. Whoops, my bad.

The results that you get should be the same values as in:
SELECT AVG(table.sales_amount)
FROM table
GROUP BY table.month, table.sales_rep, table.department;
Of course, the rows will be different. You need to match up the three key columns.
Based on your sample data, it looks like the partitioning keys uniquely define each row. Perhaps you really intend:
SELECT AVG(table.sales_amount) OVER () as overall_average
FROM table;
EDIT:
For the departmental average:
SELECT AVG(table.sales_amount) OVER (partition by table.department) as department_average
FROM table;

After some brute-forcing of potential errors I finally figured out the issue. I was joining that subquery to another one that had multiple rows for a sales_rep in a given month (selling objects a & b), which caused the average of anyone who sold both objects to be counted twice instead of once.
So sales rep 1 sold objects a & b, which made his average count as 66% of the department average instead of 50%, while sales rep 2's counted as only 33%.
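For anyone hitting the same thing, the effect is easy to reproduce. The sketch below is only an illustration, not the original 300-line query: it uses Python's built-in sqlite3 module instead of SQL Server, and the table names rep_sales and rep_objects and all values are made up. Joining to a table with two rows for rep 1 duplicates his sales row, so his amount gets two of the three votes in the average; joining to a de-duplicated list of reps restores the expected result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: one sales row per rep, both in department 'abc'.
    CREATE TABLE rep_sales (sales_rep INTEGER, department TEXT, sales_amount REAL);
    INSERT INTO rep_sales VALUES (1, 'abc', 120.0), (2, 'abc', 90.0);

    -- Hypothetical second table: rep 1 sold two objects, rep 2 sold one.
    CREATE TABLE rep_objects (sales_rep INTEGER, object TEXT);
    INSERT INTO rep_objects VALUES (1, 'a'), (1, 'b'), (2, 'a');
""")

# Average over the un-joined rows: (120 + 90) / 2
true_avg = conn.execute(
    "SELECT AVG(sales_amount) FROM rep_sales WHERE department = 'abc'"
).fetchone()[0]

# After the join, rep 1's row appears twice: (120 + 120 + 90) / 3
inflated_avg = conn.execute("""
    SELECT AVG(s.sales_amount)
    FROM rep_sales AS s
    JOIN rep_objects AS o ON o.sales_rep = s.sales_rep
    WHERE s.department = 'abc'
""").fetchone()[0]

# Joining to one row per rep removes the duplication again.
deduped_avg = conn.execute("""
    SELECT AVG(s.sales_amount)
    FROM rep_sales AS s
    JOIN (SELECT DISTINCT sales_rep FROM rep_objects) AS o
      ON o.sales_rep = s.sales_rep
    WHERE s.department = 'abc'
""").fetchone()[0]

print(true_avg, inflated_avg, deduped_avg)  # 105.0 110.0 105.0
```

COUNT and SUM are inflated by the same duplication; it just tends to be easier to spot there than in an average.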

Related

SQL finding Average for transaction

So I have a question about an averaging problem. Suppose I have 5 transactions, each with multiple items, and each item has its own quantity. I want to find the average quantity per transaction. Note that in my ERD design there are two separate tables, HeaderTransaction and TransactionDetail.
If I use the AVG() function, the result is very weird, e.g.:
first transaction:
5 eggs
2 sausages
Second transaction :
3 eggs
10 sausages.
AVG will work like (5+2+3+10)/4; what I want is ((5+2)+(3+10))/2.
My current solution is
SELECT SUM(ItemQuantity)/COUNT(DISTINCT SalesTransactionId) AS [aveg]
I find it a bit rough.
"If i use AVG() function, then it will be very weird"
Not if you AVG what you say you want to average, which is the number of items per transaction:
SELECT AVG(num_of_items_in_transaction)
FROM
(SELECT SUM(amount_of_item) AS num_of_items_in_transaction FROM detail GROUP BY tran_id) AS per_transaction
The inner query groups per transaction, and counts the total number of items. The outer query then averages these totals. The point is that because you need to first do an operation group by transaction, then another operation group by something else (the whole dataset) you can't combine the grouping operations into a single step - it has to be multi stage, because you're feeding the output from one stage into the input of another. SELECT AVG(SUM(amount)) .. GROUP BY ??? - what would you put into the ??? to let MySQL know you wanted the SUM grouping by one thing but the AVG grouping by another? (You can't)
You generally need to do it as a two-step reduction if you're not windowing, but that's conceivably an implicit multi-step operation anyway.
I don't think there's any need to change what you have (sum of amounts divided by count of transactions), I just wanted to point out why it probably wasn't working as you expected
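To make the difference concrete, here is a small sketch using the eggs and sausages example from the question. It runs the SQL through Python's built-in sqlite3 module rather than MySQL, and reuses the placeholder names detail, tran_id and amount_of_item from the query above. The `* 1.0` in the last query is an assumption for SQLite, which would otherwise do integer division on integer columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE detail (tran_id INTEGER, item TEXT, amount_of_item INTEGER);
    INSERT INTO detail VALUES
        (1, 'eggs', 5), (1, 'sausages', 2),
        (2, 'eggs', 3), (2, 'sausages', 10);
""")

# Plain AVG averages over detail rows: (5 + 2 + 3 + 10) / 4
row_avg = conn.execute("SELECT AVG(amount_of_item) FROM detail").fetchone()[0]

# Two-step reduction: total per transaction first, then the average of those totals.
two_step_avg = conn.execute("""
    SELECT AVG(num_of_items_in_transaction)
    FROM (SELECT SUM(amount_of_item) AS num_of_items_in_transaction
          FROM detail
          GROUP BY tran_id) AS per_transaction
""").fetchone()[0]

# The asker's one-step form: grand total divided by the number of transactions.
one_step_avg = conn.execute("""
    SELECT SUM(amount_of_item) * 1.0 / COUNT(DISTINCT tran_id) FROM detail
""").fetchone()[0]

print(row_avg, two_step_avg, one_step_avg)  # 5.0 10.0 10.0
```

The two-step and one-step forms agree here because every transaction gets equal weight in both.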

Grouping a percentage calculation in postgres/redshift

I keep running in to the same problem over and over again, hoping someone can help...
I have a large table with a category column that has 28 entries for donkey breed, then I'm counting two specific values grouped by each of those categories in subqueries like this:
WITH totaldonkeys AS (
SELECT donkeybreed,
COUNT(*) AS total
FROM donkeytable1
GROUP BY donkeybreed
)
,
sickdonkeys AS (
SELECT donkeybreed,
COUNT(*) AS totalsick
FROM donkeytable1
JOIN donkeyhealth on donkeytable1.donkeyid = donkeyhealth.donkeyid
WHERE donkeyhealth.sick IS TRUE
GROUP BY donkeybreed
)
,
It's my goal to end up with a table that has primarily the percentage of sick donkeys for each breed but I always end up struggling like hell with the problem of not being able to group by without using an aggregate function which I cannot do here:
SELECT (CAST(sickdonkeys.totalsick AS float) / totaldonkeys.total) * 100 AS percentsick,
totaldonkeys.donkeybreed
FROM totaldonkeys, sickdonkeys
GROUP BY totaldonkeys.donkeybreed
When I run this I end up with 28 results for each breed of donkey, one correct I believe but obviously hundreds of useless datapoints.
I know I'm probably being really dumb here but I keep hitting in to this same problem again and again with new donkeydata, I should obviously be structuring the whole thing a new way because you just can't do this final query without an aggregate function, I think I must be missing something significant.
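The 28 rows per breed come from FROM totaldonkeys, sickdonkeys, which has no join condition and therefore pairs every row of one CTE with every row of the other. You don't need the two CTEs at all, though: you can compute the proportion that are sick directly from the donkeyhealth table (multiply by 100 if you want a percentage):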
SELECT d.donkeybreed,
AVG( (dh.sick)::int ) AS proportion_sick
FROM donkeytable1 d JOIN
donkeyhealth dh
ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed
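A quick way to check that the average of a 0/1 flag really is the proportion: the sketch below runs the same shape of query through Python's built-in sqlite3 module, as a stand-in for Postgres/Redshift. SQLite stores booleans as integers, so the ::int cast is dropped; the breed names and rows are made up. Because this is an inner join, a donkey with no row in donkeyhealth would not be counted at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donkeytable1 (donkeyid INTEGER, donkeybreed TEXT);
    CREATE TABLE donkeyhealth (donkeyid INTEGER, sick INTEGER);  -- 1 = sick, 0 = healthy

    -- Hypothetical data: 1 of 4 'poitou' donkeys is sick, both 'mammoth' donkeys are.
    INSERT INTO donkeytable1 VALUES
        (1, 'poitou'), (2, 'poitou'), (3, 'poitou'), (4, 'poitou'),
        (5, 'mammoth'), (6, 'mammoth');
    INSERT INTO donkeyhealth VALUES
        (1, 1), (2, 0), (3, 0), (4, 0),
        (5, 1), (6, 1);
""")

rows = conn.execute("""
    SELECT d.donkeybreed,
           AVG(dh.sick) * 100.0 AS percentsick
    FROM donkeytable1 AS d
    JOIN donkeyhealth AS dh ON d.donkeyid = dh.donkeyid
    GROUP BY d.donkeybreed
    ORDER BY d.donkeybreed
""").fetchall()

percent_sick = dict(rows)
print(percent_sick)  # {'mammoth': 100.0, 'poitou': 25.0}
```

One row per breed comes back, with no cross join to clean up afterwards.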

How to fix this column doesn't exist error in SQL?

In the sales table, three columns are btl_price, bottle_qty, and total. The total for a transaction should be the product of btl_price and bottle_qty. How many transactions have a value of total that is not equal to btl_price times bottle_qty?
Here is the table:
Here is my code:
sql = """
Select (btl_price*bottle_qty) As total_sale, CAST(total AS money)
From sales
Where total != total_sale
"""
It keeps telling me "column "total_sale" does not exist".
Please help me to identify my mistakes.
PS: I code this in Jupyter Notebook. This is a practice of mine not in any DBMS.
You cannot use columns computed in the SELECT clause in the WHERE clause (in SQL, the latter is evaluated before the former).
Also, you need proper type casting to compare money and numbers.
Finally, you need to turn on aggregation to compute the number of sales that satisfy the condition.
Assuming that you are using Postgres, that would be:
select count(*)
from sales
where total::numeric <> btl_price::numeric * bottle_qty
Try this:
SELECT *
FROM sales
WHERE total !=(btl_price * bottle_qty)
Good luck
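Since the question mentions running this from a Jupyter Notebook, here is a minimal sketch of the counting query driven from Python. It uses the built-in sqlite3 module as a stand-in for Postgres, so the ::numeric casts are dropped and the columns are plain REAL/INTEGER; the four sample rows are made up. The expression is repeated in the WHERE clause rather than referenced through the total_sale alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (btl_price REAL, bottle_qty INTEGER, total REAL);
    INSERT INTO sales VALUES
        (2.5, 4, 10.0),   -- consistent: 2.5 * 4 = 10.0
        (3.0, 2, 6.0),    -- consistent: 3.0 * 2 = 6.0
        (1.5, 6, 10.0),   -- mismatch:   1.5 * 6 = 9.0
        (4.0, 3, 11.0);   -- mismatch:   4.0 * 3 = 12.0
""")

sql = """
SELECT COUNT(*)
FROM sales
WHERE total <> btl_price * bottle_qty
"""
mismatches = conn.execute(sql).fetchone()[0]
print(mismatches)  # 2
```

With real prices stored as floating point, an exact <> comparison can flag rows that differ only by rounding error, which is one reason the Postgres answer casts to numeric first.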

PowerPivot Ranking Groups using DAX's Rankx - Ranking Using Sum of a Field

I am trying to rank groups by summing a field (not a calculated column) for each group so I get a static answer for each row in my table.
For example, I may have a table with state, agent, and sales. Sales is a field, not a measure. There can be many agents within a state, so there are many rows for each individual state. I am trying to rank the states by total sales within each state.
I have tried many things, but the ones that make the most sense to me are:
rankx(CALCULATETABLE(Table,allexcept(Table,Table[AGENT]),sum([Sales]),,DESC)
and
=rankx(SUMMARIZE(State,Table[State],"Sales",sum(Table[Sales])),[Sales])
The first one is creating a table where it sums sales without grouping by Agent. and then tries to rank based on that. I get #ERROR on this one.
The second one creates a table using SUMMARIZE with only sum of Sales grouped by state, then tries to take that table and rank the states based on Sales. For this one I get a rank of 1 for every row.
I think, but am not sure, that my problem is coming from the sales being a static field and not a calculated measure. I can't figure out where to go from here. Any help?
Assuming your data looks something like this...
...have you tried this:
Ranking Measure = RANKX(ALL('Table'[STATE]),CALCULATE(SUM('Table'[Sales])))
The ALL('Table'[STATE]) says to rank all states. The CALCULATE(SUM('Table'[Sales])) says to rank by the sum of their sales. The CALCULATE wrapper is important; a plain SUM('Table'[Sales]) will be filtered to the current row context, resulting in every state being ranked #1. (Alternatively, you can spin off SUM('Table'[Sales]) into a separate Sales measure - which I'd recommend.)
Note: the ranks will change based on slicers/filters (e.g. a filter by agent will re-rank the states by that agent). If you're looking for a static rank of states by their total sales (i.e. not affected by filters on agent and always looking at the entire table), then try this:
Static Ranking Measure = CALCULATE([Ranking Measure], ALLEXCEPT('Table', 'Table'[State]))
This takes the same ranking measure, but removes all filters except the state filter (which you need to leave, as that's the column you're ranking by).
I did figure out a solution that's pretty simple, but it's messier than I'd like. If it's the only thing that works though, that's okay.
I created a new table with each distinct state along with a sum of sales then just do a basic RANKX on that table.
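The logic behind that workaround (total per state first, then rank the totals, then attach the rank to every row) can be sketched outside of DAX. The Python below is only an illustration of the idea, not Power Pivot code, and the states, agents and sales figures are made up. Ties share a rank and the next rank is skipped.

```python
from collections import defaultdict

# Hypothetical rows: (state, agent, sales)
rows = [
    ("TX", "Ann", 100), ("TX", "Bob", 50),
    ("CA", "Cy", 200), ("CA", "Dee", 75),
    ("NY", "Eve", 120),
]

# Step 1: one total per state (the "new table" from the workaround).
totals = defaultdict(int)
for state, _agent, sales in rows:
    totals[state] += sales

# Step 2: rank = 1 + number of states with a strictly larger total.
state_rank = {
    state: 1 + sum(other > total for other in totals.values())
    for state, total in totals.items()
}

# Step 3: attach the state's rank to every agent row, so it is static per row.
ranked_rows = [(state, agent, sales, state_rank[state]) for state, agent, sales in rows]

print(state_rank)  # {'TX': 2, 'CA': 1, 'NY': 3}
```

Every row for the same state gets the same rank, which is the static per-row answer the question asks for.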

Access Query MAX() Slows Query

I have the below Access Query and it works fine. However, it takes about 8-10 seconds to finish on a table that is about 700 records right now. The FROM is another query that has very little query time. I have narrowed it down to the MAX() function, because when I remove that function it runs with very little query time. What can I do to speed this up? I am going to assume as more data comes into the database the longer it will take to query.
SELECT FirstName, LastName, TeamID, MAX(total) AS totalMax
FROM attendanceViewAll
WHERE TeamID IN(5,9,13)
GROUP BY FirstName, LastName, TeamID
Here is the Sub Query, basically it selects a bunch of data from a table. This happens in less than a second. The result of this query is everything ordered by date and agentID. I then use the above query to find the MAX(total) so I can group the agents for a summary. I use the below query for other reports as well.
SELECT
a1.TeamID,
a1.FirstName,
a1.LastName,
a1.incurredDate,
a1.points,
a1.OneFallOff,
a1.TwoFallOff,
(select sum(a2.actualPoints)
from attendanceView as a2 where a2.agentID = a1.agentID and a2.incurredDate <= a1.incurredDate) as total,
a1.comment, a1.linked, a1.FallOffDate
FROM attendanceView as a1;
Your [attendanceViewAll] query is using a correlated subquery to produce a running total (ref: your previous question here). Now you are asking for the MAX() of that running total, which is the same thing as the SUM() of the [TwoFallOff] values. That is, for
incurredDate  TwoFallOff  total
------------  ----------  -----
2014-01-10             2      2
2014-01-11             3      5
2014-01-12             1      6
MAX(total) is the same value as SUM(TwoFallOff). The big difference is that to get each value for [total] you need to run the correlated subquery, whereas to get each value for [TwoFallOff] you don't.
In other words, I suspect that your current query is slow because the MAX() is forcing the correlated subquery in [attendanceViewAll] to be executed many times. You may get faster response if you have your current query refer directly back to [attendanceView] and SUM() the [TwoFallOff] values from there.
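Here is a small sketch of that equivalence, run through Python's built-in sqlite3 module rather than Access. The table is reduced to three columns and the second agent's rows are made up; the first agent uses the values from the table above. Note the assumption that the summed values are never negative: with negative values the running total could peak before the last row, and MAX(total) would no longer equal the SUM().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attendanceView (agentID INTEGER, incurredDate TEXT, TwoFallOff INTEGER);
    INSERT INTO attendanceView VALUES
        (1, '2014-01-10', 2), (1, '2014-01-11', 3), (1, '2014-01-12', 1),
        (2, '2014-01-10', 4), (2, '2014-01-15', 1);
""")

# Slow shape: a correlated subquery builds a running total per row, then MAX() picks the last one.
max_of_running = conn.execute("""
    SELECT agentID, MAX(total)
    FROM (SELECT a1.agentID,
                 (SELECT SUM(a2.TwoFallOff)
                  FROM attendanceView AS a2
                  WHERE a2.agentID = a1.agentID
                    AND a2.incurredDate <= a1.incurredDate) AS total
          FROM attendanceView AS a1) AS running
    GROUP BY agentID
    ORDER BY agentID
""").fetchall()

# Fast shape: a single grouped SUM() over the base rows, no subquery per row.
plain_sum = conn.execute("""
    SELECT agentID, SUM(TwoFallOff)
    FROM attendanceView
    GROUP BY agentID
    ORDER BY agentID
""").fetchall()

print(max_of_running, plain_sum)  # [(1, 6), (2, 5)] [(1, 6), (2, 5)]
```

The first query evaluates the inner SUM() once per row before grouping; the second touches each row once.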
What you need is a multiple-column index and it should be almost instantaneous.
Use the interface as this link describes if you need help on that. However, your index should be first on the criteria, secondary on the fields used in group by, so I would have an index on
TeamID, FirstName, LastName