SQL query to get average sum of other rows and store in current rows


I have a table like this, and I want a query that stores, for each row, the average of the other rows' points.
USER_ID POINTS
------------- --------
a14e43e4f851 134
1e86e5adedbf 40
3c66730edf69 149
32e24082f97b 67
b33e3100a7be 124
274ee414ad8f 85
bdeef25fc797 172
For example - for user_id = a14e43e4f851, the average sum of points should be
avg(40+149+67+124+85+172) .
PS - the user's own points (134) are not included in the calculation for user a14e43e4f851.
The output should look like this:
USER_ID POINTS AVG
------------- ------- ------
a14e43e4f851 134 106 which is avg(40+149+67+124+85+172)
1e86e5adedbf 40 avg(134+149+67+124+85+172)
3c66730edf69 149 avg(134+40+67+124+85+172)
32e24082f97b 67 avg(134+40+149+124+85+172)
b33e3100a7be 124 ...
274ee414ad8f 85 ...
bdeef25fc797 172 ...

You could use a correlated subquery:
select t.*,
(select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id) as average
from mytable t
An alternative uses window functions:
select t.*,
(sum(points) over() - points) / nullif(count(*) - 1, 0) as average
from mytable t
Note: avg conflicts with a language keyword, so I use average as the column name instead.
If you wanted an update statement:
update mytable t
set t.average = (
select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id
)
However, I would not recommend actually storing this value; this is derived information, that can easily be computed on the fly whenever needed, using the first statement. If you are going to run the query often, you could create a view:
create view myview as
select t.*,
(sum(points) over() - points) / nullif(count(*) - 1, 0) as average
from mytable t

I'm assuming user_id is the PK.
WITH q AS (SELECT sum(points) AS s, count(*) AS n FROM mytable)
UPDATE mytable SET average = (q.s-points)/(q.n-1) FROM q;
The idea is that
the average of all users' scores is sum(score)/count(*)
the sum of all users' scores except this one equals the sum of all scores minus this user's score
the average of all users' scores except this one is (sum(score)-score_for_this_user)/(count(*)-1)
The nice thing is that sum() and count() only have to be calculated once.
To handle the case where there is only one row in the table:
WITH q AS (SELECT sum(points) AS s, count(*) AS n FROM mytable)
UPDATE mytable SET average = (q.s-points)/NULLIF(q.n-1,0) FROM q;
This makes the divisor NULL instead of 0, so the updated average is NULL too.
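The "total minus this row" idea above can be tried out quickly. This is only a rough sketch: it runs the query through Python's sqlite3 module against an in-memory copy of the sample data, and the helper name averages_of_others is made up for illustration. It selects the value instead of updating, and multiplies by 1.0 so that integer division does not truncate the result.

```python
import sqlite3

# Sample data from the question; table and column names follow the answers.
rows = [
    ("a14e43e4f851", 134), ("1e86e5adedbf", 40), ("3c66730edf69", 149),
    ("32e24082f97b", 67), ("b33e3100a7be", 124), ("274ee414ad8f", 85),
    ("bdeef25fc797", 172),
]

def averages_of_others(data):
    """Return {user_id: average of all other users' points}, None if there are no others."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (user_id TEXT PRIMARY KEY, points INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", data)
    # q holds the total and the row count, computed once and reused for every row.
    # NULLIF turns a zero divisor (single-row table) into NULL instead of an error.
    query = """
        SELECT t.user_id, (q.s - t.points) * 1.0 / NULLIF(q.n - 1, 0)
        FROM mytable t
        CROSS JOIN (SELECT sum(points) AS s, count(*) AS n FROM mytable) q
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result

averages = averages_of_others(rows)
print(round(averages["a14e43e4f851"], 2))  # (771 - 134) / 6
```

With a single row, the divisor becomes NULL and the average comes back as None rather than raising a division error.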

Related

GROUP BY one column, then GROUP BY another column

I have a database with a sales table t:
ID TYPE AGE
-- ---- ---
1 B 20
1 BP 20
1 BP 20
1 P 20
2 B 30
2 BP 30
2 BP 30
3 P 40
If a person buys a bundle, the sale appears as one bundle row (TYPE B) plus one row per bundle product (TYPE BP), all with the same ID. So a bundle with 2 products appears 3 times (1x TYPE B and 2x TYPE BP) under the same ID.
A person can also buy any other product in that same sale (TYPE P), which also has the same ID.
I need to calculate the average/min/max age of the customers, but the multiple entries per sale distort the calculation.
The real average age is
(20 + 30 + 40) / 3 = 30
and not
(20+20+20+20 + 30+30+30 + 40) / 8 = 26.25
But I don't know how to reduce each sale to a single row AND get the 4 needed values.
Do I need to GROUP BY twice (first by ID, then by AGE?) and if yes, how can I do it?
My code so far:
SELECT
AVERAGE(AGE)
, MIN(AGE)
, MAX(AGE)
, MEDIAN(AGE)
FROM t
but that counts every row.

Assuming the age is the same for all rows with the same ID (which in itself indicates a normalisation problem), you can use nested aggregation:
select avg(min(age)) from sales
group by id
AVG(MIN(AGE))
-------------
30
The example in the documentation is very similar; and is explained as:
This calculation evaluates the inner aggregate (MAX(salary)) for each group defined by the GROUP BY clause (department_id), and aggregates the results again.
So for your version:
This calculation evaluates the inner aggregate (MIN(age)) for each group defined by the GROUP BY clause (id), and aggregates the results again.
It doesn't really matter whether the inner aggregate is min or max - again, assuming they are all the same - it's just to get a single value per ID, which can then be averaged.
You can do the same for the other values in your original query:
select
avg(min(age)) as avg_age,
min(min(age)) as min_age,
max(min(age)) as max_age,
median(min(age)) as med_age
from sales
group by id;
AVG_AGE MIN_AGE MAX_AGE MED_AGE
------- ------- ------- -------
30 20 40 30
Or if you prefer you could get the one-age-per-ID values once in a CTE or subquery and apply the second layer of aggregation to that:
select
avg(age) as avg_age,
min(age) as min_age,
max(age) as max_age,
median(age) as med_age
from (
select min(age) as age
from sales
group by id
);
which gets the same result.
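For anyone who wants to check the two-layer aggregation outside Oracle, here is a small sketch using Python's sqlite3 module with the sample rows from the question. It uses the subquery form, since nested aggregates like avg(min(age)) are not available everywhere. Stock SQLite builds typically have no MEDIAN function, so the median is taken in Python here; that part is an assumption of this sketch, not of the answer.

```python
import sqlite3
from statistics import median

# Sample rows from the question: (id, type, age).
sales = [
    (1, "B", 20), (1, "BP", 20), (1, "BP", 20), (1, "P", 20),
    (2, "B", 30), (2, "BP", 30), (2, "BP", 30),
    (3, "P", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, type TEXT, age INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales)

# Inner query: collapse each sale ID to a single age.
one_age_per_id = "SELECT min(age) AS age FROM sales GROUP BY id"

# Outer query: aggregate over the one-row-per-ID result.
avg_age, min_age, max_age = conn.execute(
    f"SELECT avg(age), min(age), max(age) FROM ({one_age_per_id})"
).fetchone()

# Median computed in Python from the same per-ID ages.
med_age = median(age for (age,) in conn.execute(one_age_per_id))

# For comparison: the naive average that counts every row.
naive_avg = conn.execute("SELECT avg(age) FROM sales").fetchone()[0]
conn.close()

print(avg_age, min_age, max_age, med_age, naive_avg)
```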

Finding average of data within a certain range

How do I find the average of a data set within a certain range? Specifically, I am looking to find the average of all data points that are within one standard deviation of the original average. Here is an example:
Student_ID Test_Scores
1 3
1 20
1 30
1 40
1 50
1 60
1 95
Average = 42.571
Standard Deviation = 29.854
I want to find all data points that are within one standard deviation of this original average, so within the range (42.571-29.854)<=Data<=(42.571+29.854). And from here I want to recalculate a new average.
So my desired data set is:
Student_ID Test_Scores
1 20
1 30
1 40
1 50
1 60
My desired new average is: 40
Here is my SQL code, which didn't yield my desired result:
SELECT
Student_ID,
AVG(Test_Scores)
FROM
Student_Data
WHERE
Test_Scores BETWEEN (AVG(Test_Scores)-STDEV(Test_Scores)) AND (AVG(Test_Scores)+STDEV(Test_Scores))
ORDER BY
Student_ID
Anyone know how I could fix this?

Use either window functions or do the calculation in a subquery:
SELECT sd.Student_ID, sd.Test_Scores
FROM Student_Data sd CROSS JOIN
(SELECT AVG(Test_Scores) as avgts, STDEV(Test_Scores) as stdts
FROM Student_Data
) x
WHERE sd.Test_Scores BETWEEN avgts - stdts AND avgts + stdts
ORDER BY sd.Student_ID;

select avg(test_scores)
from Student_Data
where test_scores between
(select avg(test_scores) - stddev(test_scores) from Student_Data)
and
(select avg(test_scores) + stddev(test_scores) from Student_Data);
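To sanity-check the numbers from the question, the same two-step calculation (overall average and standard deviation first, then filter, then re-average) can be sketched in plain Python. The function name trimmed_mean is made up for this example. It uses the sample standard deviation, which is what SQL Server's STDEV computes and which matches the 29.854 in the question.

```python
from statistics import mean, stdev

# Test scores for Student_ID 1, taken from the question.
test_scores = [3, 20, 30, 40, 50, 60, 95]

def trimmed_mean(values):
    """Return (kept values, their average) for values within one
    sample standard deviation of the overall average."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    kept = [v for v in values if avg - sd <= v <= avg + sd]
    return kept, mean(kept)

kept, new_avg = trimmed_mean(test_scores)
print(kept, new_avg)
```

The key point is the same as in the SQL: the average and standard deviation have to be computed over the whole set before the filter is applied, which is why they cannot sit directly in the WHERE clause.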

MS Access equivalent for using dense_rank in select

In MS Access, I have a table with 2 million account records/rows with various columns of data. I wish to apply a sequence number to every account record (i.e. 1 for the first account record ABC111, 2 for the second account record DEF222, etc.).
Then, I would like to assign a batch number to every 5 distinct account numbers (i.e. record 1 with account number ABC111 is associated with batch number 101, and record 2 with account number DEF222 is also associated with batch number 101).
This is how I would do it with a sql server query:
select distinct(p.accountnumber),FLOOR(((50 + dense_rank() over(order by
p.accountnumber)) - 1)/5) + 100 As BATCH from
db2inst1.account_table p
Raw Data:
AccountNumber
ABC111
DEF222
GHI333
JKL444
MNO555
PQR666
STU777
Resulting Data:
RecordNumber AccountNumber BatchNumber
1 ABC111 101
2 DEF222 101
3 GHI333 101
4 JKL444 101
5 MNO555 101
6 PQR666 102
7 STU777 102
I tried to make a query that uses SELECT as well as DENSE_RANK but I couldn't figure out how to make it work.
Thanks for reading my question

Something like this would probably work.
I'd first create a temporary table to hold the distinct account numbers, then I'd do an update query to assign the ranking.
CREATE TABLE tmpAccountRank
(AccountNumber TEXT(10)
CONSTRAINT PrimaryKey PRIMARY KEY,
AccountRank INTEGER NULL);
Then I'd use this table to generate the account ranking.
DELETE FROM tmpAccountRank;
INSERT INTO tmpAccountRank(AccountNumber)
SELECT DISTINCT AccountNumber FROM db2inst1.account_table;
UPDATE tmpAccountRank
SET AccountRank =
DCOUNT('AccountNumber', 'tmpAccountRank',
'AccountNumber < ''' + AccountNumber + '''') \ 5 + 101
I use DCOUNT and integer division (\ 5) to generate the ranking. This probably will have terrible performance but I think it's the way you would do it in MS Access.
If you want to skip the temp table, you can do it all in a nested subquery, but I don't think it's a great practice to do too much in a single query, especially in MS Access.
SELECT AccountNumber,
(SELECT COUNT(*) FROM
(SELECT DISTINCT AccountNumber
FROM db2inst1.account_table
WHERE AccountNumber < t.AccountNumber) q) \ 5 + 101
FROM db2inst1.account_table t
Actually, this won't work in MS Access; apparently you can't reference tables outside of multiple levels of nesting in a subquery.

You can do dense_rank() with a correlated subquery. The logic is:
select a.*,
(select count(distinct a2.accountnumber)
from db2inst1.account_table as a2
where a2.accountnumber <= a.accountnumber
) as dense_rank
from db2inst1.account_table as a;
Then, you can use this for getting the batch number. Unfortunately, I don't follow the logic in your question (dense_rank() produces a number but your batch number is not numeric). However, this should answer your question.
EDIT:
Oh, that's right. In MS Access you need nested subqueries:
select a.*,
(select count(*)
from (select distinct a2.accountnumber
from db2inst1.account_table as a2
) as a2
where a2.accountnumber <= a.accountnumber
) as dense_rank
from db2inst1.account_table as a;
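The arithmetic that turns a dense rank into a batch number may be easier to follow outside SQL. Below is a minimal Python sketch, assuming a batch size of 5 and a first batch number of 101 as in the question; the function name assign_batches and its parameters are invented for illustration.

```python
# Account numbers from the question's raw data.
accounts = ["ABC111", "DEF222", "GHI333", "JKL444", "MNO555", "PQR666", "STU777"]

def assign_batches(account_numbers, batch_size=5, first_batch=101):
    """Return (record_number, account_number, batch_number) tuples.

    The record number is the dense rank of the account number, so
    duplicate account numbers collapse to a single entry.
    """
    distinct = sorted(set(account_numbers))
    result = []
    for rank, account in enumerate(distinct, start=1):
        # Ranks 1..5 give 0, ranks 6..10 give 1, and so on (integer division).
        batch = (rank - 1) // batch_size + first_batch
        result.append((rank, account, batch))
    return result

for row in assign_batches(accounts):
    print(row)
```

This is the same calculation as counting the smaller distinct account numbers and applying integer division by 5, plus 101.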

In SQL, I need to generate a ranking (1st, 2nd, 3rd) column, getting stuck on "ties"

I have a query that calculates points based on multiple criteria, and then orders the result set based on those points.
SELECT * FROM (
SELECT
dbo.afunctionthatcalculates(Something, Something) AS Points1
,dbo.anotherone(Something, Something) AS Points2
,dbo.anotherone(Something, Something) AS Points3
,[TotalPoints] = dbo.function(something) + dbo.function(something)
) AS MyData
ORDER BY MyData.TotalPoints
So my first stab at adding placement, rankings.. was this:
SELECT ROW_NUMBER() OVER(ORDER BY MyData.TotalPoints) AS Ranking, * FROM (
SELECT same as above
) AS MyData
ORDER BY MyData.TotalPoints
This adds the Rankings column, but doesn't work when the points are tied.
Rank | TotalPoints
--------------------
1 100
2 90
3 90
4 80
Should be:
Rank | TotalPoints
--------------------
1 100
2 90
2 90
3 80
Not really sure about how to resolve this.
Thank you for your help.

You should use the DENSE_RANK() function which takes the ties into account, as described here: http://msdn.microsoft.com/en-us/library/ms173825.aspx

DENSE_RANK() instead of ROW_NUMBER()
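To see why the two functions differ on ties, here is a small Python sketch that mimics both numbering schemes on the sample points, ranked from highest to lowest. The function names are illustrative only; this is not how SQL Server implements them.

```python
def row_number(points):
    """Number rows 1..n by descending points; tied rows still get distinct numbers."""
    order = sorted(range(len(points)), key=lambda i: points[i], reverse=True)
    numbers = [0] * len(points)
    for position, index in enumerate(order, start=1):
        numbers[index] = position
    return numbers

def dense_rank(points):
    """Rank by descending points; tied rows share a rank and no ranks are skipped."""
    distinct_desc = sorted(set(points), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct_desc, start=1)}
    return [rank_of[value] for value in points]

total_points = [100, 90, 90, 80]
print(row_number(total_points))
print(dense_rank(total_points))
```

The dense rank only advances when the points value changes, which is the behaviour asked for in the question.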

Joining onto a table that doesn't have ranges, but requires ranges

Trying to find the best way to write this SQL statement.
I have a customer table that has the internal credit score of each customer. Then I have another table with definitions of that credit score. I would like to join these tables together, but the second table doesn't have any way to link it easily.
The score of the customer is an integer between 1-999, and the definition table has these columns:
Score
Description
And these rows:
60 LOW
99 MED
999 HIGH
So basically if a customer has a score between 1 and 60 they are low, 61-99 they are med, and 100-999 they are high.
I can't really INNER JOIN these, because it would only join them IF the score was 60, 99, or 999, and that would exclude anyone with any other score.
I don't want to do a case statement with the static numbers, because our scores may change in the future and I don't want to have to update my initial query when/if they do. I also cannot create any tables or functions to do this- I need to create a SQL statement to do it for me.
EDIT:
A coworker said this would work, but it's a little crazy. I'm thinking there has to be a better way:
SELECT
cs.internal_credit_score,
(
SELECT
credit_score_short_desc
FROM
cf_internal_credit_score
WHERE
internal_credit_score = (
SELECT
max(credit.internal_credit_score)
FROM
cf_internal_credit_score credit
WHERE
cs.internal_credit_score <= credit.internal_credit_score
AND credit.internal_credit_score <= (
SELECT
min(credit2.internal_credit_score)
FROM
cf_internal_credit_score credit2
WHERE
cs.internal_credit_score <= credit2.internal_credit_score
)
)
)
FROM
customer_statements cs

Try this: change your table to contain the range of the scores:
ScoreTable
-------------
LowScore int
HighScore int
ScoreDescription string
data values
LowScore HighScore ScoreDescription
-------- --------- ----------------
1 60 Low
61 99 Med
100 999 High
query:
Select
.... , Score.ScoreDescription
FROM YourTable
INNER JOIN Score ON YourTable.Score>=Score.LowScore
AND YourTable.Score<=Score.HighScore
WHERE ...

Assuming your table is named CreditTable, this is what you want:
select * from
(
select Description, Score
from CreditTable
where Score > 80 /*client's credit*/
order by Score
)
where rownum = 1
Also, make sure your high score reference value is 1000, even though client's highest score possible is 999.
Update
The above SQL gives you the credit record for a given value. If you want to join with, say, Clients table, you'd do something like this:
select
c.Name,
c.Score,
(select Description from
(select Description from CreditTable where Score > c.Score order by Score)
where rownum = 1)
from clients c
I know this is a sub-select that is executed for each returned row, but then again, CreditTable is ridiculously small and there will be no significant performance loss because of the sub-select usage.

You can use analytic functions to convert the data in your score description table to ranges (I assume that you meant that 100-999 should map to 'HIGH', not 99-999).
with x as (
select 60 score, 'Low' description from dual union all
select 99, 'Med' from dual union all
select 999, 'High' from dual
)
select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from x;
DESC LOW_RANGE HIGH_RANGE
---- ---------- ----------
Low 1 60
Med 61 99
High 100 999
You can then join this to your CUSTOMER table with something like
SELECT c.*,
sd.*
FROM customer c,
(select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from score_description) sd
WHERE c.credit_score BETWEEN sd.low_range AND sd.high_range
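The range-building step (previous upper bound + 1 becomes the lower bound) can also be sketched in Python, which may help when checking the boundaries. The names build_ranges and describe are made up for this example, and the lookup uses bisect instead of a BETWEEN join; it assumes scores start at 1, as in the question.

```python
from bisect import bisect_left

# Upper bounds and descriptions from the question's definition table.
score_description = [(60, "LOW"), (99, "MED"), (999, "HIGH")]

def build_ranges(definitions):
    """Turn (upper_bound, description) rows into (low, high, description) ranges,
    like nvl(lag(score) over (order by score), 0) + 1 does in the query."""
    ranges = []
    previous_upper = 0
    for upper, description in sorted(definitions):
        ranges.append((previous_upper + 1, upper, description))
        previous_upper = upper
    return ranges

def describe(score, definitions):
    """Return the description whose range contains score, or None if out of range."""
    ordered = sorted(definitions)
    uppers = [upper for upper, _ in ordered]
    index = bisect_left(uppers, score)  # first upper bound >= score
    if score < 1 or index == len(ordered):
        return None
    return ordered[index][1]

print(build_ranges(score_description))
print(describe(60, score_description), describe(61, score_description))
```

Because the ranges are derived from the definition rows, changing a threshold in the table changes the result without touching the query logic.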
