Top 10% of sum() Postgres - sql

I'm looking to pull the top 10% of a summed value on a Postgres server.
So I'm summing a value with sum(transaction.value) and I'd like the top 10% of the summed values.

From what I gather in your comments, I assume you want to:
Sum transactions per customer to get a total per customer.
List the top 10% of customers who actually have transactions and spent the most.
WITH cte AS (
   SELECT t.customer_id, sum(t.value) AS sum_value
   FROM   transaction t
   GROUP  BY 1
   )
SELECT *, rank() OVER (ORDER BY sum_value DESC) AS sails_rank
FROM   cte
ORDER  BY sum_value DESC
LIMIT  (SELECT count(*)/10 FROM cte);
Major points
Best to use a CTE here; it makes the count in the LIMIT clause cheaper, since the aggregation runs only once.
Aggregating transaction on its own automatically excludes customers without transactions, so no join to customer is needed. I am assuming relational integrity here (FK constraint on customer_id).
Dividing bigint / int truncates the result (rounds down to the nearest integer for positive counts). If ties at the cutoff matter, you may be interested in this related question and in the sketch after the join example below:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
I added a sails_rank column which you didn't ask for, but seems to fit your requirement.
As you can see, I didn't even include the table customer in the query. Assuming you have a foreign key constraint on customer_id, that would be redundant (and slower). If you want additional columns from customer in the result, join customer to the result of the above query:
WITH cte AS (
   SELECT t.customer_id, sum(t.value) AS sum_value
   FROM   transaction t
   GROUP  BY 1
   )
SELECT c.customer_id, c.name, sub.sum_value, sub.sails_rank
FROM  (
   SELECT *, rank() OVER (ORDER BY sum_value DESC) AS sails_rank
   FROM   cte
   ORDER  BY sum_value DESC
   LIMIT  (SELECT count(*)/10 FROM cte)
   ) sub
JOIN   customer c USING (customer_id);
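If ties at the 10% cutoff should not be dropped arbitrarily, Postgres 13 or later can replace LIMIT with the standard FETCH FIRST ... WITH TIES. A minimal sketch, assuming the same cte as above (Postgres wants the count expression enclosed in parentheses):
WITH cte AS (
   SELECT t.customer_id, sum(t.value) AS sum_value
   FROM   transaction t
   GROUP  BY 1
   )
SELECT *, rank() OVER (ORDER BY sum_value DESC) AS sails_rank
FROM   cte
ORDER  BY sum_value DESC
FETCH  FIRST ((SELECT count(*)/10 FROM cte)) ROWS WITH TIES;  -- requires Postgres 13+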

Related

SQL How to select customers with highest transaction amount by state

I am trying to write a SQL query that returns the name and purchase amount of the five customers in each state who have spent the most money.
Table schemas
customers
|_state
|_customer_id
|_customer_name
transactions
|_customer_id
|_transact_amt
Attempts look something like this
SELECT state, Sum(transact_amt) AS HighestSum
FROM (
SELECT name, transactions.transact_amt, SUM(transactions.transact_amt) AS HighestSum
FROM customers
INNER JOIN customers ON transactions.customer_id = customers.customer_id
GROUP BY state
) Q
GROUP BY transact_amt
ORDER BY HighestSum
I'm lost. Thank you.
Expected results are the names of customers with the top 5 highest transactions in each state.
ERROR: table name "customers" specified more than once
SQL state: 42712
First, you need for your JOIN to be correct. Second, you want to use window functions:
SELECT ct.*
FROM (SELECT c.customer_id, c.customer_name, c.state, SUM(t.transact_amt) AS total,
             ROW_NUMBER() OVER (PARTITION BY c.state ORDER BY SUM(t.transact_amt) DESC) as seqnum
      FROM customers c JOIN
           transactions t
           ON t.customer_id = c.customer_id
      GROUP BY c.customer_id, c.customer_name, c.state
     ) ct
WHERE seqnum <= 5;
You seem to have several issues with SQL. I would start with understanding aggregation functions. You have a SUM() with the alias HighestSum. It is simply the total per customer.
You can get them using aggregation and then by using the RANK() window function. For example:
select
    state,
    rk,
    customer_name
from (
    select
        *,
        rank() over(partition by state order by total desc) as rk
    from (
        select
            c.customer_id,
            c.customer_name,
            c.state,
            sum(t.transact_amt) as total
        from customers c
        join transactions t on t.customer_id = c.customer_id
        group by c.customer_id
    ) x
) y
where rk <= 5
order by state, rk
There are two valid answers already. Here's a third:
SELECT *
FROM  (
   SELECT c.state, c.customer_name, t.*
        , row_number() OVER (PARTITION BY c.state ORDER BY t.transact_sum DESC NULLS LAST, customer_id) AS rn
   FROM  (
      SELECT customer_id, sum(transact_amt) AS transact_sum
      FROM   transactions
      GROUP  BY customer_id
      ) t
   JOIN   customers c USING (customer_id)
   ) sub
WHERE  rn < 6
ORDER  BY state, rn;
Major points
When aggregating all or most rows of a big table, it's typically substantially faster to aggregate before the join. Assuming referential integrity (FK constraints), we won't be aggregating rows that would otherwise be filtered out by the join. This can change from nice-to-have to pure necessity when joining more than one aggregated table. Related:
Why does the following join increase the query time significantly?
Two SQL LEFT JOINS produce incorrect result
Add additional ORDER BY item(s) in the window function to define which rows to pick from ties. In my example, it's simply customer_id. If you have no tiebreaker, results are arbitrary in case of a tie, which may be OK; but every execution might then return different results, which typically is a problem. Alternatively, include all ties in the result, in which case we are back to rank() instead of row_number() (see the sketch after this list). See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
While transact_amt can be NULL (it has not been ruled out), any sum may end up NULL as well. With an unsuspecting ORDER BY t.transact_sum DESC, those customers come out on top, as NULL sorts first in descending order. Use DESC NULLS LAST to avoid this pitfall. (Or define the column transact_amt as NOT NULL.)
PostgreSQL sort by datetime asc, null first?
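A minimal sketch of that rank() variant, keeping all ties within each state (same tables and assumptions as the query above):
SELECT *
FROM  (
   SELECT c.state, c.customer_name, t.transact_sum
        , rank() OVER (PARTITION BY c.state ORDER BY t.transact_sum DESC NULLS LAST) AS rnk
   FROM  (
      SELECT customer_id, sum(transact_amt) AS transact_sum
      FROM   transactions
      GROUP  BY customer_id
      ) t
   JOIN   customers c USING (customer_id)
   ) sub
WHERE  rnk <= 5
ORDER  BY state, rnk;
This can return more than 5 rows per state when customers tie on their summed amount.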

SQL generate ranks of groups and subgroups based on third column

I want to write a SQL query to generate ranks of groups and subgroups based on a third column (Price in this case). While I know we can use dense_rank() to generate ranks based on one column, I have no idea how to generate the two columns of ranks as shown below in a single query.
Both rankings are based on price. So J3 comes first because J3's sum(price) is 1600, J1 comes second because J1's sum(price) is 1500, and so on.
Any inputs are appreciated.
I have provided the sample input and output. The name of the input table is "RENTAL"
First roll up prices to the jet_type level, then rank all jet_types ordered by rolled-up price, and finally use your window function in the outer query, partitioned by jet_type and ordered by price descending, to create rank_service_wthin_jet:
select a.jet_type, b.rownum rank_jet, a.service_type, a.price,
       row_number() over (partition by a.jet_type order by a.price desc) rank_service_wthin_jet
from yourtable a
join (
    select jet_type, row_number() over (order by price desc) rownum
    from (
        select jet_type, sum(price) price
        from yourtable
        group by jet_type
    ) a
) b on a.jet_type = b.jet_type
You can generate two columns as:
select t.*,
dense_rank() over (order by jet_type) as rank_jet,
row_number() over (partition by jet_type order by price desc) as rank_service_within_jet
. . .
This does not exactly return what is in your table. But the results are quite similar and -- even more important -- make sense.

Grouping and updating in SQL Server

What's the most efficient way to group records by a certain criteria in SQL, assign a batch number to each group and then assign a sequential number (transaction number) to each record within the batch/group?
We have tried using temp tables where the transaction number column is an identity column but inserting to the temp table and then updating the records in the main table is not as efficient.
We could have multiple groups and each group could have up to 5000 records. Assigning the batch number to each group is not problematic but assigning the auto increment number within each group is taking long.
Let's say there are 7000 customers in 5 different regions.
We have to group the customers by region (5 batches) and assign a batch number to each region.
Within each region we have to assign a sequential transaction number.
The combination of the batch number and transaction number is used for identifying a record within a region (batch 5, transaction 1).
The grouping criteria are not known at insertion time, and therefore we cannot have an identity column in the main table.
You can create these values using ROW_NUMBER() and DENSE_RANK():
;WITH cte AS (SELECT *
        ,ROW_NUMBER() OVER(PARTITION BY region ORDER BY region) AS UPD_Transaction
        ,DENSE_RANK() OVER(ORDER BY region) AS UPD_Batch
    FROM yourtable)
SELECT *
FROM cte
And you can update the CTE directly to apply them, without temp tables:
;WITH cte AS (SELECT *
        ,ROW_NUMBER() OVER(PARTITION BY region ORDER BY region) AS UPD_Transaction
        ,DENSE_RANK() OVER(ORDER BY region) AS UPD_Batch
    FROM yourtable)
UPDATE cte
SET [Transaction] = UPD_Transaction
   ,Batch = UPD_Batch
Not sure what you'd want to ORDER BY for your Transaction number, so I just left region in there. Note that Transaction is a reserved word in SQL Server, hence the brackets.
If I am not wrong, this is what you need. Use window functions and stacked CTEs:
;WITH cte AS (
    SELECT *,
           DENSE_RANK() OVER(ORDER BY regions) AS batch_no
    FROM yourtable),
cte1 AS (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY regions ORDER BY customers) AS seq_trans_no
    FROM cte)
SELECT batch_no,
       regions,
       seq_trans_no,
       customers
FROM cte1
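Both window functions can be computed in a single pass, so the stacked CTEs are optional here; a minimal single-CTE sketch producing the same columns:
;WITH cte AS (
    SELECT *,
           DENSE_RANK() OVER(ORDER BY regions) AS batch_no,
           ROW_NUMBER() OVER(PARTITION BY regions ORDER BY customers) AS seq_trans_no
    FROM yourtable)
SELECT batch_no, regions, seq_trans_no, customers
FROM cte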

Join to replace sub-query

I am almost a novice in database queries.
However, I do understand why and how correlated subqueries are expensive and best avoided.
Given the following simple example, could someone help replace it with a join, to help me understand how that scores better:
select book_key,
       store_key,
       quantity
from sales s
where quantity < (select max(quantity)
                  from sales
                  where book_key = s.book_key);
Apart from a join, what other option do we have to avoid the subquery?
In this case, it ought to be better to use a window function over a single access to the table, like so:
with s as (
    select book_key,
           store_key,
           quantity,
           max(quantity) over (partition by book_key) mq
    from sales
)
select book_key, store_key, quantity
from s
where quantity < s.mq
Using a Common Table Expression (CTE) will allow you to execute a single primary SELECT statement and store the result in a temporary result set. The data can then be self-referenced and accessed multiple times without requiring the initial SELECT statement to be executed again, and without possibly expensive JOINs. This solution also uses ROW_NUMBER() and the OVER clause to number the matching BOOK_KEYs in descending order of quantity. You will then only include the records that have a quantity less than the max quantity for each BOOK_KEY. One subtlety: if two rows tie for the top quantity of a BOOK_KEY, ROW_NUMBER() gives one of them rn = 2 and keeps it; use RANK() instead if all rows tied at the max should be excluded.
with CTE as
(
select
book_key,
store_key,
quantity,
row_number() over(partition by book_key order by quantity desc) rn
from sales
)
select
book_key,
store_key,
quantity
from CTE where rn > 1;
Working Demo: http://sqlfiddle.com/#!3/f0051/1
Apart from a join, what other option do we have to avoid the subquery?
You can use something like this: build a temp table of the max quantity per book_key, then join against it:
SELECT book_key, max(quantity) AS max_quantity
INTO #max_per_book
FROM sales
GROUP BY book_key

SELECT s.book_key, s.store_key, s.quantity
FROM sales s
JOIN #max_per_book m ON m.book_key = s.book_key
WHERE s.quantity < m.max_quantity

How do I use ROW_NUMBER()?

I want to use the ROW_NUMBER() to get...
To get the max(ROW_NUMBER()) --> or I guess this would also be the count of all rows.
I tried doing:
SELECT max(ROW_NUMBER() OVER(ORDER BY UserId)) FROM Users
but it didn't seem to work...
To get ROW_NUMBER() using a given piece of information, i.e. if I have a name and I want to know what row the name came from.
I assume it would be something similar to what I tried for #1
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
but this didn't work either...
Any Ideas?
For the first question, why not just use?
SELECT COUNT(*) FROM myTable
to get the count.
And for the second question, the primary key of the row is what should be used to identify a particular row. Don't try to use the row number for that.
If you returned Row_Number() in your main query,
SELECT ROW_NUMBER() OVER (Order by Id) AS RowNumber, Field1, Field2, Field3
FROM User
Then, when you want to go 5 rows back, you can take the current row number and use the following query to determine the row with currentrow - 5:
SELECT us.Id
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS Row, Id
FROM User ) us
WHERE Row = CurrentRow - 5
Though I agree with others that you could use count() to get the total number of rows, here is how you can use row_number():
To get the total no of rows:
with temp as (
select row_number() over (order by id) as rownum
from table_name
)
select max(rownum) from temp
To get the row numbers where name is Matt:
with temp as (
select name, row_number() over (order by id) as rownum
from table_name
)
select rownum from temp where name like 'Matt'
You can further use min(rownum) or max(rownum) to get the first or last row for Matt respectively.
These were very simple implementations of row_number(). You can use it for more complex grouping. Check out my response on Advanced grouping without using a sub query
If you need to return the table's total row count, there is an alternative to the SELECT COUNT(*) statement.
Because SELECT COUNT(*) makes a full table scan to return the row count, it can take a very long time for a large table. You can use the sysindexes system table instead in this case. It has a ROWS column that contains the total row count for each table in your database. You can use the following select statement:
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This will drastically reduce the time your query takes.
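Note that sysindexes is a deprecated SQL Server 2000-era view. On current versions, the supported way to read the same metadata is sys.dm_db_partition_stats; a sketch of the equivalent query:
SELECT SUM(row_count) AS total_rows
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('table_name')
AND index_id IN (0, 1)  -- heap or clustered index only
Like sysindexes.rows, this reads catalog metadata rather than scanning the table, so it is fast but may be slightly out of date.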
You can use this to get the first record matching a WHERE clause:
SELECT TOP(1) * , ROW_NUMBER() OVER(ORDER BY UserId) AS rownum
FROM Users
WHERE UserName = 'Joe'
ORDER BY rownum ASC
ROW_NUMBER() returns a unique number for each row, starting with 1. You can easily use it by simply writing:
ROW_NUMBER() OVER (ORDER BY Column_Name DESC) AS row_num
Note that the ORDER BY item must be a real column reference, not a quoted string literal; window functions reject bare constants in ORDER BY.
This may not be related to the question here, but I found it useful when using ROW_NUMBER: ordering by a constant scalar subquery such as (SELECT 100) is allowed (unlike a bare constant) and assigns row numbers without requiring a real sort column.
SELECT *,
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS Any_ID
FROM #Any_Table
select Ml.Hid,
       ml.blockid,
       row_number() over (partition by ml.blockid order by Ml.Hid desc) as rownumber,
       H.HNAME
from MIT_LeadBechmarkHamletwise ML
join [MT.HAMLE] h on ML.Hid = h.HID
SELECT num, UserName FROM
(SELECT UserName, ROW_NUMBER() OVER(ORDER BY UserId) AS num
From Users) AS numbered
WHERE UserName='Joe'
You can use Row_Number to limit the query result.
Example:
SELECT * FROM (
    select row_number() OVER (order by createtime desc) AS ROWINDEX, *
    from TABLENAME ) TB
WHERE TB.ROWINDEX between 1 and 10
With the above query, I get page 1 of results from TABLENAME.
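On SQL Server 2012+ (and Postgres), the standard OFFSET ... FETCH clause gives the same paging without the subquery; a sketch under the same assumptions (a TABLENAME table with a createtime column):
SELECT *
FROM TABLENAME
ORDER BY createtime DESC
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY  -- page 1; OFFSET 10 ROWS for page 2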
If you absolutely want to use ROW_NUMBER for this (instead of count(*)) you can always use:
SELECT TOP 1 ROW_NUMBER() OVER (ORDER BY Id)
FROM USERS
ORDER BY ROW_NUMBER() OVER (ORDER BY Id) DESC
You need to create a virtual table by using WITH ... AS, as in the query below. Using this virtual table, you can perform CRUD operations with respect to row_number.
QUERY:
WITH cte AS
(SELECT row_number() OVER(ORDER BY UserId) rn, * FROM Users)
SELECT * FROM cte WHERE UserName = 'Joe'
You can use INSERT, UPDATE or DELETE in the last statement instead of SELECT.
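For instance, a minimal sketch of the DELETE variant using the classic dedup pattern; the PARTITION BY choice here is hypothetical (keep the lowest UserId per UserName):
WITH cte AS
(SELECT row_number() OVER(PARTITION BY UserName ORDER BY UserId) rn FROM Users)
DELETE FROM cte WHERE rn > 1
In SQL Server, a DELETE against a CTE like this removes the matching rows from the underlying Users table.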
The SQL Row_Number() function sorts and assigns an order number to the data rows in a related record set. It is used to number rows, for example to identify the top 10 rows with the highest order amount, or to identify the highest-amount order of each customer, etc.
If you want to sort the dataset and number each row while separating rows into categories, use Row_Number() with the Partition By clause; for example, sorting the orders of each customer within itself, where the dataset contains all orders, etc.
SELECT
SalesOrderNumber,
CustomerId,
SubTotal,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SubTotal DESC) rn
FROM Sales.SalesOrderHeader
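For the first use case mentioned above, the top 10 rows with the highest order amount, a minimal sketch against the same table:
SELECT * FROM (
    SELECT SalesOrderNumber, CustomerId, SubTotal,
           ROW_NUMBER() OVER (ORDER BY SubTotal DESC) rn
    FROM Sales.SalesOrderHeader ) x
WHERE rn <= 10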
But as I understand it, you want to calculate the number of rows grouped by a column. To visualize the requirement: if you want to see the count of all orders of the related customer as a separate column besides the order info, you can use the COUNT() aggregate function with the Partition By clause.
For example,
SELECT
SalesOrderNumber,
CustomerId,
COUNT(*) OVER (PARTITION BY CustomerId) CustomerOrderCount
FROM Sales.SalesOrderHeader
This query:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
will return all rows where the UserName is 'Joe', unless you have no UserName = 'Joe'.
They will be listed in order of UserId, and the row_number field will start with 1 and increment through however many rows contain UserName = 'Joe'.
If it does not work for you, then your WHERE clause has an issue or there is no UserId in the table. Check the spelling of both fields, UserId and UserName.