How do I group a set of entities (people) into 10 equal portions based on their usage, e.g. using Oracle's Toad 11g (SQL)?

Hi, I have a list of 2+ million people and their usage, ordered from largest to smallest.
I tried ranking with row_number() over (partition by user column order by usage desc) as rnk,
but that didn't work; the results were crazy.
Simply put, I just want 10 equal groups, with the first group containing the highest-usage people, in the order in which I first listed them.
HELP!

You can use ntile():
select t.*, ntile(10) over (order by usage desc) as usage_decile
from t;
The only caveat: this divides the data into exactly 10 equal-sized groups, so if usage values have duplicates, users with the same usage can end up in different deciles.
If you don't want that behavior, use a more manual calculation:
select t.*,
       ceil(rank() over (order by usage desc) * 10 /
            count(*) over ()
           ) as usage_decile
from t;
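For example, with 2,000,000 rows, rank 200,000 maps to ceil(200000 * 10 / 2000000) = ceil(1) = 1, while rank 200,001 maps to ceil(1.000005) = 2. Because rank() assigns tied usage values the same rank, tied users always land in the same decile, at the cost of the deciles no longer being exactly equal in size.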

Related

SQL- calculate ratio and get max ratio with corresponding user and date details

I have a table with user, date and a column each for messages sent and messages received:
I want to get, for each date, the maximum of message_sent / message_received together with the user who has that ratio. So this is the output I expect:
Andrew Lean 10/2/2020 10
Andrew Harp 10/1/2020 6
This is my query:
SELECT
ds.date, ds.user_name, max(ds.ratio) from
(select a.user_name, a.date, a.message_sent/ a.message_received as ratio
from messages a
group by a.user_name, a.date) ds
group by ds.date
But the output I get is:
Andrew Lean 10/2/2020 10
Jalinn Kim 10/1/2020 6
In the above output 6 is the correct max ratio for the date grouped but the user is wrong. What am I doing wrong?
With a recent version of most databases, you could do something like this.
This assumes, as in your data, that there's one row per user per day. If you have more rows per user per day, you'll need to provide a little more detail about how to combine them or which rows to ignore. You might want to SUM them; it's hard to know (a sketch of that variant follows the query below).
WITH cte AS (
select a.user_name, a.date
, a.message_sent / a.message_received AS ratio
, ROW_NUMBER() OVER (PARTITION BY a.date ORDER BY a.message_sent / a.message_received DESC) as rn
from messages a
)
SELECT t.user_name, t.date, t.ratio
FROM cte AS t
WHERE t.rn = 1
;
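If it turns out there can be several rows per user per day and summing is the right way to combine them, a rough sketch of that variant (same table and columns as above, aggregating before ranking):
WITH daily AS (
    -- combine multiple rows per user per day by summing (an assumption)
    SELECT a.user_name, a.date,
           SUM(a.message_sent) / SUM(a.message_received) AS ratio
    FROM messages a
    GROUP BY a.user_name, a.date
),
ranked AS (
    -- rank users within each date by the combined ratio
    SELECT d.*,
           ROW_NUMBER() OVER (PARTITION BY d.date ORDER BY d.ratio DESC) AS rn
    FROM daily d
)
SELECT r.user_name, r.date, r.ratio
FROM ranked r
WHERE r.rn = 1;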
Note: There's no attempt to handle ties, where more than one user has the same ratio. We could use RANK (or other methods) for that, if your database supports it.
Here, I am just calculating the ratio for each row in the first CTE.
In the second part, I am taking the maximum of that ratio at the date level. This assumes each user has one row per date.
The max() at the date level ensures we always get the highest ratio for each date.
There could be ties between the ratios; for that we can use ROW_NUMBER() or RANK() to rank each row by whatever criteria we want to use to break ties, and then filter on the rank generated.
with data as (
    select
        date,
        user_name,
        message_sent / message_received as ratio
    from messages
)
select
    date,
    max(ratio) as highest_ratio_per_date
from data
group by 1
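A sketch of the tie-handling step described above, using the same table and columns as in the question; RANK() gives tied ratios the same rank, so every tied user for a date is kept:
with ranked as (
    select
        date,
        user_name,
        message_sent / message_received as ratio,
        -- rank within each date; ties share the same rank
        rank() over (partition by date order by message_sent / message_received desc) as rnk
    from messages
)
select
    date,
    user_name,
    ratio
from ranked
where rnk = 1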

Replace first and last row having null values or missing values with previous/next available value in Postgresql12

I am a newbie to PostgreSQL.
I want to replace the first and last rows of table T, which have null or missing values, with the next/previous available value. Also, if there are missing values in the middle, they should be replaced with the previous available value. For example:
id   value   EXPECTED
1            1
2    1       1
3    2       2
4            2
5    3       3
6            3
I am aware that there are many similar threads, but none seems to address this problem where the start and end also have missing values (as well as some missing in the middle rows). Also, some of the concepts, such as first_row, partition by, and top 1 (which does not work for Postgres), are very hard to grasp as a newbie.
So far i have referred to the following threads: value from previous row and Previous available value
Could someone kindly direct me in the right direction to address this problem?
Thank you
Unfortunately, Postgres doesn't have the ignore nulls option on lead() and lag(). In your example, there is never more than one missing value in a row, so a single lag()/lead() per row is enough. So:
select t.*,
coalesce(value, lag(value) over (order by id), lead(value) over (order by id)) as expected
from t;
If you had multiple NULLs in a row, then this is trickier. One solution is to define "groups" based on when a value starts or stops. You can do this with a cumulative count of the values -- ascending and descending:
select t.*,
       coalesce(value,
                max(value) over (partition by grp_before),
                max(value) over (partition by grp_after)
               ) as expected
from (select t.*,
             count(value) over (order by id asc) as grp_before,
             count(value) over (order by id desc) as grp_after
      from t
     ) t;
Here is a db<>fiddle.
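To trace the logic on the sample data: id 4 has grp_before = 2 (two non-null values seen so far), which puts it in the same partition as the row with value 2, so the first max() returns 2; id 1 has no preceding values (its grp_before partition holds only nulls), but its grp_after partition also contains id 2, so the second max() returns 1.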

NTILE function performance in hive

Is there any way we can optimize the NTILE function's run time? Currently, we have around 51M records with 17 variables.
We are running the query below to divide the dataset into 100 buckets.
create table secondary_table
stored as orc
as
select a.*,NTILE(100) OVER(ORDER BY score) AS score_rank
from main_table a;
Here, the score variable holds 12-digit decimal values.
As of now all of the load gets dumped onto one reducer, which spends a long time stuck at 99%. Is there any way to optimize this? The current query takes around 35 minutes to execute.
Appreciate any response.
Thanks in advance.
This is not quite an answer, but it might provide some guidance.
The issue is the lack of partition by in the window function. Replacing it with equivalent constructs using, say, row_number() and count(*) won't help.
When I have encountered this, I have been able to work around it in one of two ways.
If there are lots of duplicates, then aggregate and use cumulative sums to define the tiles (a sketch of this appears after the example below).
Otherwise, break the values into groups.
As an example of the second, assume the scores range from 0 to 1000 with a fairly even distribution. Then:
select t.*,
       1 + floor((t.seqnum_within + tt.running_cnt - tt.cnt - 1) * 100 / tt.total_cnt) as score_rank
from (select t.*,
             row_number() over (partition by trunc(score) order by score) as seqnum_within
      from t
     ) t join
     (select trunc(score) as score_trunc, count(*) as cnt,
             sum(count(*)) over (order by min(score)) as running_cnt,
             sum(count(*)) over () as total_cnt
      from t
      group by trunc(score)
     ) tt
     on trunc(t.score) = tt.score_trunc;
The GROUP BY and JOIN should make better use of the parallel hardware.
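For the first approach (lots of duplicate scores), a rough sketch of the cumulative-sum idea, assuming the same main_table and score column as in the question; the tile boundaries won't match NTILE exactly when a single score value straddles a boundary:
create table secondary_table
stored as orc
as
select a.*, s.score_rank
from main_table a
join (select score,
             -- cumulative row count up to this score, scaled to 1..100
             ceil(sum(count(*)) over (order by score) * 100 /
                  sum(count(*)) over ()) as score_rank
      from main_table
      group by score
     ) s
on a.score = s.score;
The aggregation and join can be spread across reducers, instead of pushing every row through the single reducer that the global ORDER BY requires.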

COUNT(*) of a windowed SQL range

In Postgres 9.1 I'm using a windowing function like so:
SELECT a.category_id, (dense_rank() over w) - 1
FROM (
_inner select_
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
What I can't figure out is how to also select the total number of elements in the windowed range. If I just use count(*) over w that tells me how many elements I've seen in the window so far instead of the total number in the window.
My core issue here is that cume_dist() counts from 1, not 0, for the number of rows before or equal to the current row. percent_rank() counts from 0, like I need, but then it also subtracts 1 from the total number of rows when it does its division.
SELECT
a.category_id,
(dense_rank() over w) - 1,
count(*) over (partition by category_id) --without order by
FROM (
_inner select_
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
From the manual on Window Functions:
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition
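So another option, if you want to keep the single named window, is to override the frame for the count only; in Postgres a window definition can reference an existing window name and add its own frame clause when the named window doesn't define one (a sketch):
count(*) over (w rows between unbounded preceding and unbounded following)
This makes the frame the whole partition again, so it returns the same total as the partition-only window above.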

Filtering by window function result in Postgresql

Ok, initially this was just a joke I had with a friend, but it turned into an interesting technical question :)
I have the following stuff table:
CREATE TABLE stuff
(
    id serial PRIMARY KEY,
    volume integer NOT NULL DEFAULT 0,
    priority smallint NOT NULL DEFAULT 0
);
The table contains the records for all of my stuff, with respective volume and priority (how much I need it).
I have a bag with specified volume, say 1000. I want to select from the table all stuff I can put into a bag, packing the most important stuff first.
This seems like the case for using window functions, so here is the query I came up with:
select s.*, sum(volume) OVER previous_rows as total
from stuff s
where total < 1000
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
order by priority desc
The problem with it, however, is that Postgres complains:
ERROR: column "total" does not exist
LINE 3: where total < 1000
If I remove this filter, total column gets properly calculated, results properly sorted but all stuff gets selected, which is not what I want.
So, how do I do this? How do I select only items that can fit into the bag?
I don't know if this qualifies as "more elegant" but it is written in a different manner than Cybernate's solution (although it is essentially the same)
WITH window_table AS
(
SELECT s.*,
sum(volume) OVER previous_rows as total
FROM stuff s
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
)
SELECT *
FROM window_table
WHERE total < 1000
ORDER BY priority DESC
If by "more elegant" you mean something that avoids the sub-select, then the answer is "no"
I haven't worked with PostgreSQL. However, my best guess would be using an inline view.
SELECT a.*
FROM (
SELECT s.*, sum(volume) OVER previous_rows AS total
FROM stuff AS s
WINDOW previous_rows AS (
ORDER BY priority desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY priority DESC
) AS a
WHERE a.total < 1000;