Double "group by" without join? - sql

I have user data:
user store item cost
1 10 100 5
1 10 101 3
1 11 102 7
2 10 101 3
2 12 103 4
2 12 104 5
I want a table which will tell me for each user how much he bought from each store and how much he bought in total:
user store cost_this_store cost_total
1 10 8 15
1 11 7 15
2 10 3 12
2 12 9 12
I can do this with two group by and a join:
select s.user, s.store, s.cost_this_store, u.cost_total
from (select user, store, sum(cost) as cost_this_store
from my_data
group by user, store) s
join (select user, sum(cost) as cost_total
from my_data
group by user) u
on s.user = u.user
However, this is definitely not how I would do this if I were writing this in any other language (join is clearly avoidable, and the two group by are not independent).
Is it possible to avoid the join in sql?
PS. I need the solution to work in hive.

You can do this with a windowing function... which Hive added support for last year:
select distinct
user,
store,
sum(cost) over (partition by user, store) as cost_this_store,
sum(cost) over (partition by user) as cost_total
from my_data
However, I'd argue that there wasn't anything glaringly wrong with your original implementation. You've essentially got two different sets of data, which you're combining through a JOIN.
The duplication might look like a code smell in a different language, but this isn't necessarily the wrong approach in SQL, and often you'll have to take approaches such as this that duplicate a portion of a query between two intermediate result sets for performance reasons.
SQL Fiddle (SQL Server)
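As a quick sanity check, the two approaches can be compared on the sample data. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is self-contained; window functions need SQLite 3.25 or newer), and the names window_sql and join_sql are illustrative.

```python
import sqlite3

# In-memory copy of the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (user INTEGER, store INTEGER, item INTEGER, cost INTEGER)")
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?, ?)",
    [(1, 10, 100, 5), (1, 10, 101, 3), (1, 11, 102, 7),
     (2, 10, 101, 3), (2, 12, 103, 4), (2, 12, 104, 5)],
)

# Window-function version (requires SQLite >= 3.25 in this stand-in).
window_sql = """
select distinct
       user,
       store,
       sum(cost) over (partition by user, store) as cost_this_store,
       sum(cost) over (partition by user)        as cost_total
from my_data
order by user, store
"""

# Original version: two group by subqueries combined with a join.
join_sql = """
select s.user, s.store, s.cost_this_store, u.cost_total
from (select user, store, sum(cost) as cost_this_store
      from my_data
      group by user, store) s
join (select user, sum(cost) as cost_total
      from my_data
      group by user) u
  on s.user = u.user
order by s.user, s.store
"""

window_rows = conn.execute(window_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(window_rows)
```

Both queries should return the same four rows as the desired table in the question; which one is faster depends on the engine and the data.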

Related

SQL MAX() Function Not Working As Expected

I am attempting to select the max requester_req for each customer group, but after trying numerous different approaches, my result set continues to display every row instead of the max for the customer group.
The query:
SELECT
x2.customer,
x.customer_req,
x2.requester_name,
MAX(x2.requester_req) AS requester_req
FROM x, x2
WHERE x.customer = x2.customer
GROUP BY x2.customer, x2.requester_name, x.customer_req
ORDER BY x2.customer
A sample result set:
customer customer_req requester_name requester_req
Bob's Burgers 7 Bob 9
Bob's Burgers 7 Jon 12
Hello Kitty 9 Jane 3
Hello Kitty 9 Luke 7
Expected result set:
customer customer_req requester_name requester_req
Bob's Burgers 7 Jon 12
Hello Kitty 9 Luke 7
Have I screwed up something in my group by clause? I can't count how many times I've switched things up and get the same result set.
Thank you very much for your help!
"select the max requester_req for each customer group"
Don't aggregate. Instead, you can filter with a correlated subquery:
select
x2.customer,
x.customer_req,
x2.requester_name,
x2.requester_req
from x
inner join x2 on x.customer = x2.customer
where x2.requester_req = (
select max(x20.requester_req) from x2 x20 where x20.customer = x2.customer
)
order by x2.customer
Side note: always use explicit, standard joins (with the on keyword) instead of old-school implicit joins (with commas in the from clause). The implicit syntax has been discouraged for more than 20 years, mostly because it is harder to follow.
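Here is a minimal sketch of the correlated-subquery filter, run through Python's built-in sqlite3 module. The layouts of x and x2 are assumptions reconstructed from the sample result set; the question does not show the real schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table layouts, inferred from the sample result set.
conn.execute("CREATE TABLE x (customer TEXT, customer_req INTEGER)")
conn.execute("CREATE TABLE x2 (customer TEXT, requester_name TEXT, requester_req INTEGER)")
conn.executemany("INSERT INTO x VALUES (?, ?)",
                 [("Bob's Burgers", 7), ("Hello Kitty", 9)])
conn.executemany("INSERT INTO x2 VALUES (?, ?, ?)",
                 [("Bob's Burgers", "Bob", 9), ("Bob's Burgers", "Jon", 12),
                  ("Hello Kitty", "Jane", 3), ("Hello Kitty", "Luke", 7)])

# Keep only the x2 row whose requester_req equals the maximum for its customer.
max_rows = conn.execute("""
    select x2.customer, x.customer_req, x2.requester_name, x2.requester_req
    from x
    inner join x2 on x.customer = x2.customer
    where x2.requester_req = (
        select max(x20.requester_req) from x2 x20 where x20.customer = x2.customer
    )
    order by x2.customer
""").fetchall()
print(max_rows)
```

Note that if two requesters tie for the maximum within a customer, both rows are returned.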

Total Sum SQL Server

I have a query that collects many different columns, and I want to include a column that sums the price of every component in an order. Right now, I already have a column that simply shows the price of every component of an order, but I am not sure how to create this new column.
I would think that the code would go something like this, but I am not really clear on what an aggregate function is or why I get an error regarding the aggregate function when I try to run this code.
SELECT ID, Location, Price, (SUM(PriceDescription) FROM table GROUP BY ID WHERE PriceDescription LIKE 'Cost.%' AS Summary)
FROM table
When I say each component, I mean that every ID has many different items that make up the overall price. I only want to find out how much money I spend on the supplies I need for my pressure washers, which is why I wrote WHERE PriceDescription LIKE 'Cost.%'.
To explain further: I have receipts for every customer I've worked with, and in these receipts I write down my cost for the soap that I use and the tools for the pressure washer that I rent. I label all of these with 'Cost.', so they look like (Cost.Water), (Cost.Soap), (Cost.Gas), (Cost.Tools). I would like a column that sums all the Cost._ prices for Order 1, and likewise sums all the Cost._ prices for Order 2. I should also mention that each order does not have the same number of costs (sometimes when I use my power washer I might not have to buy gas, and occasionally soap).
I hope this makes sense, if not please let me know how I can explain further.
ID Location Price PriceDescription
1 Park 10 Cost.Water
1 Park 8 Cost.Gas
1 Park 11 Cost.Soap
2 Tom 20 Cost.Water
2 Tom 6 Cost.Soap
3 Matt 15 Cost.Tools
3 Matt 15 Cost.Gas
3 Matt 21 Cost.Tools
4 College 32 Cost.Gas
4 College 22 Cost.Water
4 College 11 Cost.Tools
I would like for my query to create a column like such
ID Location Price Summary
1 Park 10 29
1 Park 8
1 Park 11
2 Tom 20 26
2 Tom 6
3 Matt 15 51
3 Matt 15
3 Matt 21
4 College 32 65
4 College 22
4 College 11
But if the 'Summary' was printed on every line instead of just at the top one, that would be okay too.
You just need SUM(Price) OVER (PARTITION BY Location), which gives the total for each location, as below:
SELECT ID, Location, Price, SUM(Price) over(Partition by Location) AS Summed_Price
FROM yourtable
WHERE PriceDescription LIKE 'Cost.%'
First, if your Price column really contains values that match 'Cost.%', then you cannot apply SUM() over it. SUM() expects a number (e.g. INT, FLOAT, REAL or DECIMAL). If it is text then you need to explicitly convert it to a number by adding a CAST or CONVERT clause inside the SUM() call.
Second, your query syntax is wrong: you need GROUP BY, and the SELECT fields are not specified correctly. Also, you want to SUM() the Price field, not the PriceDescription field (which you can't sum, as explained above).
Assuming that Price is numeric (see my first remark), then this is how it can be done:
SELECT ID
, Location
, Price
, (SELECT SUM(Price)
FROM table
WHERE ID = T1.ID AND Location = T1.Location
) AS Summed_Price
FROM table AS T1
To get exactly the result posted in the question:
Select
T.ID,
T.Location,
T.Price,
CASE WHEN T.R = 1 THEN T.RN ELSE NULL END AS Summary
from (
select
ID,
Location,
Price,
SUM(Price) OVER (PARTITION BY Location) AS RN,
ROW_NUMBER() OVER (PARTITION BY Location ORDER BY ID) AS R
from Table
) T
order by T.ID
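To see the per-order totals on the sample data, here is a small sketch using Python's built-in sqlite3 module in place of SQL Server (an assumption made for self-containment; window functions need SQLite 3.25 or newer). The table name receipts is hypothetical, chosen because table itself is a reserved word, and the sketch partitions by ID, since the question asks for one total per order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "receipts" is a hypothetical table name standing in for the question's "table".
conn.execute("CREATE TABLE receipts (ID INTEGER, Location TEXT, Price INTEGER, PriceDescription TEXT)")
conn.executemany("INSERT INTO receipts VALUES (?, ?, ?, ?)", [
    (1, "Park", 10, "Cost.Water"), (1, "Park", 8, "Cost.Gas"), (1, "Park", 11, "Cost.Soap"),
    (2, "Tom", 20, "Cost.Water"), (2, "Tom", 6, "Cost.Soap"),
    (3, "Matt", 15, "Cost.Tools"), (3, "Matt", 15, "Cost.Gas"), (3, "Matt", 21, "Cost.Tools"),
    (4, "College", 32, "Cost.Gas"), (4, "College", 22, "Cost.Water"), (4, "College", 11, "Cost.Tools"),
])

# Window-function version: the per-order total is repeated on every row.
window_rows = conn.execute("""
    SELECT ID, Location, Price, SUM(Price) OVER (PARTITION BY ID) AS Summary
    FROM receipts
    WHERE PriceDescription LIKE 'Cost.%'
    ORDER BY ID, Price
""").fetchall()

# Correlated-subquery version: same result without window functions.
subquery_rows = conn.execute("""
    SELECT ID, Location, Price,
           (SELECT SUM(Price) FROM receipts
            WHERE ID = T1.ID AND PriceDescription LIKE 'Cost.%') AS Summary
    FROM receipts AS T1
    WHERE PriceDescription LIKE 'Cost.%'
    ORDER BY ID, Price
""").fetchall()

# One total per order ID, taken from the repeated Summary column.
totals = {row[0]: row[3] for row in window_rows}
print(totals)
```

The totals match the Summary column the question asks for (29, 26, 51 and 65), printed on every row of the order and not only on the first.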

Restricting a SQL query so that any particular value in a certain column can only appear 3 times in the results, with respect to a given ordering

Suppose that I have a table in a SQL database with columns like the ones shown below. The table records various performance metrics of the employees in my company each month.
I can easily query the table so that I can see the best monthly sales figures that my employees have ever obtained, along with which employee was responsible and which month the figure was obtained in:
SELECT * FROM EmployeePerformance ORDER BY Sales DESC;
NAME MONTH SALES COMMENDATIONS ABSENCES
Karen Jul 16 36,319.13 2 0
David Feb 16 35,398.03 2 1
Martin Nov 16 33,774.38 1 1
Sandra Nov 15 33,012.55 4 0
Sandra Mar 16 31,404.45 1 0
Karen Sep 16 30,645.78 2 2
David Feb 16 29,584.81 1 1
Karen Jun 16 29,030.00 3 0
Stuart Mar 16 28,877.34 0 1
Karen Nov 15 28,214.42 1 2
Martin May 16 28,091.99 3 0
This query is very simple, but it's not quite what I want. How would I need to change it if I wanted to see only the top 3 monthly figures achieved by each employee in the result set?
To put it another way, I want to write a query that is the same as the one above, but if any employee would appear in the result set more than 3 times, then only their top 3 results should be included, and any further results of theirs should be ignored. In my sample query, Karen's figure from Nov 15 would no longer be included, because she already has three other figures higher than that according to the ordering "ORDER BY Sales DESC".
The specific SQL database I am using is either SQLite or, if what I need is not possible with SQLite, then MySQL.
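In MySQL 8.0 or later you can use a window function. A window function cannot appear in the WHERE clause, so number the rows per employee in a derived table and filter on that number:
SELECT Name, Month, Sales, Commendations, Absences
FROM (
SELECT e.*, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Sales DESC) AS rn
FROM EmployeePerformance e
) ranked
WHERE rn <= 3
ORDER BY Sales DESC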
In SQLite versions before 3.25 window functions aren't available, but you can still count the rows of the same employee that rank higher:
SELECT *
FROM EmployeePerformance e
WHERE
(SELECT COUNT(*)
FROM EmployeePerformance ee
WHERE ee.Name=e.Name and ee.Sales>e.Sales)<3
ORDER BY e.Sales DESC
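The counting query can be tried directly against the sample rows from the question. This is a sketch using Python's built-in sqlite3 module; the column types are assumptions, since the question only shows the data.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
# Column types are assumed; the question only shows the data.
conn.execute("""CREATE TABLE EmployeePerformance
                (Name TEXT, Month TEXT, Sales REAL, Commendations INTEGER, Absences INTEGER)""")
conn.executemany("INSERT INTO EmployeePerformance VALUES (?, ?, ?, ?, ?)", [
    ("Karen", "Jul 16", 36319.13, 2, 0),
    ("David", "Feb 16", 35398.03, 2, 1),
    ("Martin", "Nov 16", 33774.38, 1, 1),
    ("Sandra", "Nov 15", 33012.55, 4, 0),
    ("Sandra", "Mar 16", 31404.45, 1, 0),
    ("Karen", "Sep 16", 30645.78, 2, 2),
    ("David", "Feb 16", 29584.81, 1, 1),
    ("Karen", "Jun 16", 29030.00, 3, 0),
    ("Stuart", "Mar 16", 28877.34, 0, 1),
    ("Karen", "Nov 15", 28214.42, 1, 2),
    ("Martin", "May 16", 28091.99, 3, 0),
])

# A row is kept when fewer than 3 rows of the same employee have higher sales.
top3 = conn.execute("""
    SELECT *
    FROM EmployeePerformance e
    WHERE (SELECT COUNT(*)
           FROM EmployeePerformance ee
           WHERE ee.Name = e.Name AND ee.Sales > e.Sales) < 3
    ORDER BY e.Sales DESC
""").fetchall()

per_employee = Counter(row[0] for row in top3)
print(per_employee)
```

On this data only Karen's Nov 15 figure is dropped. If an employee has tied sales figures at the cut-off, the strict comparison keeps all of the tied rows, so more than three can come back.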
I have managed to find an answer myself. It seems to work by pairing each record up with all of the records from the same person that were equal or greater, and then choosing only the (left) records that had no more than 3 greater-or-equal pairings.
SELECT P.Name, P.Month, P.Sales, P.Commendations, P.Absences
FROM Performance P
LEFT JOIN Performance P2 ON (P.Name = P2.Name AND P.Sales <= P2.Sales)
GROUP BY P.Name, P.Month, P.Sales, P.Commendations, P.Absences
HAVING COUNT(*) <= 3
ORDER BY P.Sales DESC;
I will give the credit to a_horse_with_no_name for adding the tag "greatest-n-per-group", as I would have had no idea what to search for otherwise, and by looking through other questions with this tag I managed to find what I wanted.
I found this question that was similar to mine... Using LIMIT within GROUP BY to get N results per group?
And I followed this link that somebody had included in a comment... https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/
...and the answer I wanted was in the first comment on that article. It's perfect as it uses only a LEFT JOIN, so it will work in SQLite.
Here is my SQL Fiddle: http://sqlfiddle.com/#!7/580f0/5/0

pls tell me how to make this query into a single query

select ac1.ACCT_CODE,
ac1.PERIOD,
ac1.MONTH,
ac1.YEAR,
ac1.PRD_BDGT,
ac2.ACCT_CODE,
ac2.PERIOD,
ac2.MONTH,
ac2.YEAR,
ac2.PRD_BDGT
from account ac1, account ac2
where ac1.acct_code='075200'
and ac1.year=1994
and ac1.period between 1 and 6
and ac2.acct_code=ac1.acct_code
and ac2.year=1995
and ac2.period =ac1.period
union
select ac3.ACCT_CODE,
ac3.PERIOD,
ac3.MONTH,
ac3.YEAR,
ac3.PRD_BDGT,
ac4.ACCT_CODE,
ac4.PERIOD,
ac4.MONTH,
ac4.YEAR,
ac4.PRD_BDGT
from account ac3, account ac4
where ac3.acct_code='075200'
and ac3.year=1995
and ac3.period between 7 and 12
and ac4.acct_code=ac3.acct_code
and ac4.year=1996
and ac4.period=ac3.period
Use an OR:
select ac1.ACCT_CODE,
ac1.PERIOD,
ac1.MONTH,
ac1.YEAR,
ac1.PRD_BDGT,
ac2.ACCT_CODE,
ac2.PERIOD,
ac2.MONTH,
ac2.YEAR,
ac2.PRD_BDGT
from account ac1, account ac2
where ac1.acct_code='075200'
and ac2.acct_code=ac1.acct_code
and ac2.period =ac1.period
and ((ac1.year=1994
and ac1.period between 1 and 6
and ac2.year=1995
) OR
(ac1.year=1995
and ac1.period between 7 and 12
and ac2.year=1996))
Your query is taking the union of two very similar queries, where the only difference is certain conditions in the where clause. You can combine them pretty easily by using or in the where clause.
The following query also fixes the join syntax:
select ac1.ACCT_CODE, ac1.PERIOD, ac1.MONTH, ac1.YEAR, ac1.PRD_BDGT, ac2.ACCT_CODE,
ac2.PERIOD, ac2.MONTH, ac2.YEAR, ac2.PRD_BDGT
from account ac1 join
account ac2
on ac2.period = ac1.period and
ac2.acct_code = ac1.acct_code
where ac1.acct_code='075200' and
((ac1.year = 1994 and
ac1.period between 1 and 6 and
ac2.year=1995
) or
(ac1.year=1995 and
ac1.period between 7 and 12 and
ac2.year=1996
)
);
I would be surprised if this query actually solves your business problem. Doing a self-join on the account table is suspicious. Often, an aggregation is what one needs, but I cannot tell the purpose of the query.
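To check that the combined where clause returns the same rows as the union, here is a sketch using Python's built-in sqlite3 module. The account rows and budget figures are made up for illustration, and COLUMNS and JOIN are helper names introduced here to avoid repeating the select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account
                (ACCT_CODE TEXT, PERIOD INTEGER, MONTH INTEGER, YEAR INTEGER, PRD_BDGT INTEGER)""")
# Made-up budget figures: one row per account, year and period.
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?, ?, ?)",
    [(code, period, period, year, year * 100 + period)
     for code in ("075200", "099999")
     for year in (1994, 1995, 1996)
     for period in range(1, 13)],
)

COLUMNS = """ac1.ACCT_CODE, ac1.PERIOD, ac1.MONTH, ac1.YEAR, ac1.PRD_BDGT,
             ac2.ACCT_CODE, ac2.PERIOD, ac2.MONTH, ac2.YEAR, ac2.PRD_BDGT"""
JOIN = """account ac1 join account ac2
          on ac2.acct_code = ac1.acct_code and ac2.period = ac1.period"""

# Original shape: two nearly identical queries glued together with union.
union_sql = f"""
    select {COLUMNS} from {JOIN}
    where ac1.acct_code = '075200' and ac1.year = 1994
      and ac1.period between 1 and 6 and ac2.year = 1995
    union
    select {COLUMNS} from {JOIN}
    where ac1.acct_code = '075200' and ac1.year = 1995
      and ac1.period between 7 and 12 and ac2.year = 1996
"""

# Single query: the differing conditions are combined with or.
or_sql = f"""
    select {COLUMNS} from {JOIN}
    where ac1.acct_code = '075200'
      and ((ac1.year = 1994 and ac1.period between 1 and 6 and ac2.year = 1995)
        or (ac1.year = 1995 and ac1.period between 7 and 12 and ac2.year = 1996))
"""

union_rows = sorted(conn.execute(union_sql).fetchall())
or_rows = sorted(conn.execute(or_sql).fetchall())
print(len(union_rows), len(or_rows))
```

One caveat: union removes duplicate rows while the or version does not, so the two forms can differ if the account table itself contains exact duplicates.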

Progressive count using a query?

I use this query to obtain the total number of submissions per user per date:
SELECT userId, submDate, COUNT(submId) AS nSubms
FROM submissions
GROUP BY userId, submDate
ORDER BY userId, submDate
However I need to have the progressive count for every user so I can see how their submissions accumulate over time.
Is it possible to implement this in a query?
EDIT: The obtained table looks like this :
userId submDate nSubms
1 2-Feb 1
1 4-Feb 7
2 1-Jan 4
2 2-Jan 2
2 18-Jan 1
I want to produce this :
userId submDate nSubms progressive
1 2-Feb 1 1
1 4-Feb 7 8
2 1-Jan 4 4
2 2-Jan 2 6
2 18-Jan 1 7
EDIT 2 : Sorry for not mentioning it earlier, I am not allowed to use :
Stored procedure calls
Update/Delete/Insert/Create queries
Unions
DISTINCT keyword
as I am using a tool that doesn't allow those.
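You can use a self-join to pair each day with all the days of the same user up to and including it. Aggregate per day first, because joining the raw rows directly would multiply the counts:
SELECT s0.userId, s0.submDate, s0.nSubms, SUM(s1.nSubms) AS progressive
FROM (SELECT userId, submDate, COUNT(submId) AS nSubms
FROM submissions GROUP BY userId, submDate) AS s0
JOIN (SELECT userId, submDate, COUNT(submId) AS nSubms
FROM submissions GROUP BY userId, submDate) AS s1
ON s1.userId=s0.userId AND s1.submDate<=s0.submDate
GROUP BY s0.userId, s0.submDate, s0.nSubms
ORDER BY s0.userId, s0.submDate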
This is going to force the database to do a load of pointless work counting all the same rows again and again though. It would be better to just add up the nSubms as you go down in whatever script is calling the query, or in an SQL variable, if that's available in your environment.
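Adding up nSubms in the calling script could look something like the sketch below. It assumes the rows arrive already sorted by userId and submDate, as the ORDER BY in the question's query guarantees; the function name with_progressive is illustrative.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Rows as returned by the question's query: (userId, submDate, nSubms),
# already ordered by userId and submDate.
rows = [
    (1, "2-Feb", 1),
    (1, "4-Feb", 7),
    (2, "1-Jan", 4),
    (2, "2-Jan", 2),
    (2, "18-Jan", 1),
]

def with_progressive(rows):
    """Append a per-user running total of nSubms to each row."""
    result = []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        # accumulate yields the running sum; it restarts for every user.
        totals = accumulate(n for _, _, n in group)
        result.extend((u, d, n, t) for (u, d, n), t in zip(group, totals))
    return result

progressive_rows = with_progressive(rows)
for row in progressive_rows:
    print(row)
```

This is a single pass over the result set, so each submission count is read exactly once.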
The best solution for this is to do it at the client.
It's the right tool for the job; databases are not well suited to this kind of task.
Select S.userId, S.submDate, Count(*) As nSubms
, (Select Count(*)
From submissions As S1
Where S1.userid = S.userId
And S1.submDate <= S.submDate) As TotalSubms
From submissions As S
Group By S.userid, S.submDate
Order By S.userid, S.submDate
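The correlated subquery can be checked against the example from the question. This sketch uses Python's built-in sqlite3 module; the submissions schema and the ISO-formatted dates (with an arbitrary year) are assumptions, chosen so that string comparison orders the dates correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per submission, dates stored as ISO strings.
conn.execute("CREATE TABLE submissions (submId INTEGER PRIMARY KEY, userId INTEGER, submDate TEXT)")

# (userId, submDate, number of submissions that day), matching the question's table.
daily = [(1, "2015-02-02", 1), (1, "2015-02-04", 7),
         (2, "2015-01-01", 4), (2, "2015-01-02", 2), (2, "2015-01-18", 1)]
for user_id, subm_date, n in daily:
    conn.executemany("INSERT INTO submissions (userId, submDate) VALUES (?, ?)",
                     [(user_id, subm_date)] * n)

# For each (user, date) group, count all of that user's submissions up to that date.
running = conn.execute("""
    Select S.userId, S.submDate, Count(*) As nSubms
         , (Select Count(*)
            From submissions As S1
            Where S1.userId = S.userId
              And S1.submDate <= S.submDate) As TotalSubms
    From submissions As S
    Group By S.userId, S.submDate
    Order By S.userId, S.submDate
""").fetchall()
print(running)
```

The query uses none of the constructs ruled out in the question (no stored procedures, data-modifying statements, unions or DISTINCT).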