How do I partition by multiple columns without getting duplicated rows? - sql

I have a set of SQL stored procedures that use partitioning to rank rows so I can compute percentiles. With the partitioning below, the percentile data comes out right. However, my problem is that each row is duplicated: e.g. for each DESC there are multiple duplicates when there is supposed to be only 1 row. Why is this so?
row_nums AS
(
SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
FROM ******
)
SELECT .................
This is the output I currently get (duplicate rows are returned; refer to rows 6 to 8):
http://i.stack.imgur.com/foe7g.png
This is the output I want to achieve: http://i.stack.imgur.com/GkrHP.png

You can remove the duplicates by adding one more inner query in the FROM clause that collapses identical rows, like below:
;WITH row_nums AS
(
SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
FROM (
SELECT your required columns
FROM ***
GROUP BY your required columns -- grouping on every selected column keeps one copy of each duplicated row
) AS deduped
)
SELECT .................
Alternatively, you can remove the duplicate rows with DISTINCT in the inner query.
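A minimal sketch of the DISTINCT variant, run against an in-memory SQLite database so it can be tried as-is. The table name readings, the columns day, label and num, and the sample rows are all made up for illustration (they stand in for DATE, DESC and NUM, which are reserved words in some engines). It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical table standing in for the asker's source data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, label TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2016-01-01", "A", 10),
        ("2016-01-01", "A", 10),  # exact duplicate of the row above
        ("2016-01-01", "A", 20),
        ("2016-01-01", "B", 5),
        ("2016-01-01", "B", 5),   # exact duplicate of the row above
    ],
)

# Deduplicate in the inner query first, then number rows per (day, label).
query = """
WITH row_nums AS (
    SELECT day, label, num,
           ROW_NUMBER() OVER (PARTITION BY day, label ORDER BY num ASC) AS row_num
    FROM (SELECT DISTINCT day, label, num FROM readings) AS deduped
)
SELECT day, label, num, row_num
FROM row_nums
ORDER BY day, label, row_num
"""
dedup_rows = conn.execute(query).fetchall()
for row in dedup_rows:
    print(row)
```

Because the duplicates are removed before ROW_NUMBER() runs, the row numbers in each partition are consecutive and each distinct (day, label, num) combination appears once.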

Related

How do I select 1 [oldest] row per group of rows, given multiple groups?

Let's say we have the database table below, called USER_JOBS.
I'd like to write an SQL query that reflects this algorithm:
Divide the whole table in groups of rows defined by a common USER_ID (in the example table, the 2 resulting groups are colored yellow & green)
From each group, select the oldest row (according to SCHEDULE_TIME)
From this example table, the desired SQL query would return these 2 rows:
You can use a ranking function (supported in most RDBMSs):
SELECT *
FROM
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY SCHEDULE_TIME ASC) AS RowID -- ASC so that RowID = 1 is the oldest row
FROM [table]
) AS t
WHERE RowID = 1
Or with a CTE and RANK() (rows tied for the earliest time all get rank 1):
WITH Ranked AS (
SELECT
RANK() OVER (PARTITION BY USER_ID ORDER BY SCHEDULE_TIME ASC) AS Ranking,
*
FROM [table_name]
)
SELECT Status, Sob_Type, User_ID, TimeStamp FROM Ranked WHERE Ranking = 1;
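To make the "oldest row per group" idea concrete, here is a small sketch using Python's sqlite3 module. The user_jobs table, its job_type column, and the sample rows are assumptions for illustration, since the original table image is not available. Ordering ascending by schedule_time makes row number 1 the oldest row in each user_id group. It assumes SQLite 3.25 or later for window function support.

```python
import sqlite3

# Hypothetical stand-in for the USER_JOBS table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_jobs (user_id INTEGER, job_type TEXT, schedule_time TEXT)"
)
conn.executemany(
    "INSERT INTO user_jobs VALUES (?, ?, ?)",
    [
        (1, "export", "2020-01-03"),
        (1, "import", "2020-01-01"),  # oldest row for user 1
        (2, "report", "2020-02-05"),
        (2, "backup", "2020-02-02"),  # oldest row for user 2
    ],
)

# Number rows within each user_id group, oldest first, and keep row 1.
query = """
SELECT user_id, job_type, schedule_time
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY schedule_time ASC) AS rn
    FROM user_jobs
) AS t
WHERE rn = 1
ORDER BY user_id
"""
oldest_rows = conn.execute(query).fetchall()
for row in oldest_rows:
    print(row)
```

ROW_NUMBER() returns exactly one row per group even when two rows share the earliest time; RANK() would return all tied rows instead.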

Select rows based on distinct values of nested field in BigQuery

I have a table in BigQuery which looks like this:
The sequence field is a repeated RECORD. I want to select one row per stepName but if there are multiple rows per step name, I want to choose the one where sequence.step.elapsedSeconds and sequence.step.elapsedMinutes are not null, otherwise select the rows where these columns are null.
As shown in the image above, I want to select row no. 2, 4 and 5. I have calculated ROW_NUMBER like this: ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum.
Here's my query so far for filtering out the unwanted rows:
WITH DistinctRows AS
(
select timestamp,
ARRAY (
SELECT
STRUCT(
STRUCT(
step.elapsedSeconds,
step.elapsedMinutes,
) as step
)
FROM
UNNEST(source_table.sequence) AS sequence
) AS sequence,
ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum
from source_table,
unnest(sequence) as previousCalls
order by timestamp asc
)
SELECT *
FROM DistinctRows,
unnest(sequence) as sequence
where (RowNum = 1 and step.elapsedSeconds is null and step.elapsedMinutes is null)
or (RowNum > 1 and step.elapsedSeconds is not null and step.elapsedMinutes is not null)
order by timestamp asc
I need help figuring out how to filter out rows like no. 1 and 3.
Thanks in advance.
Hmmm . . . Assuming that stepname is not part of the repeated column:
SELECT dr.* EXCEPT (sequence),
(SELECT seq
FROM unnest(dr.sequence) seq
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST,
seq.step.elapsedMinutes DESC NULLS LAST
LIMIT 1 -- a scalar subquery may return at most one row
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc;
If stepname is part of sequence, then the subquery would reaggregate:
SELECT dr.* EXCEPT (sequence),
(SELECT ARRAY_AGG(s.seq ORDER BY s.seq.stepName)
FROM (SELECT seq,
ROW_NUMBER() OVER (PARTITION BY seq.stepName
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST, seq.step.elapsedMinutes DESC NULLS LAST
) as seqnum
FROM unnest(dr.sequence) seq
) s
WHERE seqnum = 1
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc;

Distinct rows in a table in sql

I have a table with multiple rows for the same member id. I need only one distinct row per member, based on 2 columns.
Ex: there are 100 different customers, the table has 1000 rows because every customer has multiple cities and segments assigned to him.
I need 100 distinct rows for these customers depending on a unique segment and city combination. There is no specific requirement for this combination, just the first from the table is fine.
So, currently the table is somewhat like this,
Hope this helps.
use row_number()
select * from (select *,row_number() over(partition by memberid order by sales) rn
from table_name
) a where a.rn=1
SQL Server's TOP (1) WITH TIES syntax is handy for that:
select top(1) with ties t.*
from table_name t
order by row_number() over(partition by memberid order by sales)
Since you have no particular requirement for exactly which row to select, any expression will do in the inner ORDER BY; it can even be NULL:
select top(1) with ties t.*
from table_name t
order by row_number() over(partition by memberid order by (select null))
The simplest way to do this is to use the ROW_NUMBER() OVER (PARTITION BY ...) syntax. The ordering does not matter, since you want an arbitrary row, but only one, for each member. SQL Server still requires an ORDER BY inside OVER for ROW_NUMBER(), so ORDER BY (SELECT NULL) serves as a placeholder.
Since you need only the expected data, and not the Row_Number value, make sure that you list the fields returned, like below:
SELECT
MemberId,
city,
segment,
sales
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY MemberId ORDER BY (SELECT NULL)) as Seq
FROM [Status]
) src
WHERE Seq = 1

Efficient way to combine 2 tables and get the row with max year with preference to one of the table

I am trying to combine 2 tables (key_ratios_cnd and key_ratios_snd). Both tables have identical structure, and the primary key columns for both are symbol and fiscal_year.
In the final result set I want, for each symbol, the row with the maximum year across both tables. If the row with the maximum year is present in both tables, then the row from key_ratios_cnd should be selected.
I came up with the SQL query below to produce the result. I wanted to know if there is any other way to write the query that is more optimized.
select sq2.*
from
(select sq.*,
max(id) over(partition by sq.symbol) as max_id,
max(fiscal_year) over(partition by sq.symbol) as max_year
from
( select *,'2' as id
from test.key_ratios_cnd
union all
select *,'1' as id
from test.key_ratios_snd
) as sq
) as sq2
where id = max_id and fiscal_year = max_year
order by symbol asc
I would select a row from each table first and then combine. Postgres has distinct on which is perfect for this purpose.
select distinct on (symbol) sc.*
from ((select distinct on (cnd.symbol) cnd.*, 1 as ord
from test.key_ratios_cnd cnd
order by cnd.symbol, cnd.fiscal_year desc
) union all
(select distinct on (snd.symbol) snd.*, 2 as ord
from test.key_ratios_snd snd
order by snd.symbol, snd.fiscal_year desc
)
) sc
order by symbol, fiscal_year desc, ord;
To speed this up, add an index on (symbol, fiscal_year desc) to each table.
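DISTINCT ON is specific to Postgres. As a rough sketch of the same rule in a form that can be run locally, the example below swaps in ROW_NUMBER() and runs it on SQLite through Python's sqlite3 module. The ratio column, the src label, and the sample rows are invented for illustration, and the test. schema prefix is dropped. The ord column encodes the preference: 1 for key_ratios_cnd, 2 for key_ratios_snd, so on a tie in fiscal_year the cnd row sorts first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables with identical structure; 'ratio' is a hypothetical payload column.
for name in ("key_ratios_cnd", "key_ratios_snd"):
    conn.execute(
        f"CREATE TABLE {name} (symbol TEXT, fiscal_year INTEGER, ratio REAL, "
        "PRIMARY KEY (symbol, fiscal_year))"
    )
conn.executemany(
    "INSERT INTO key_ratios_cnd VALUES (?, ?, ?)",
    [("AAA", 2019, 1.0), ("AAA", 2020, 1.5), ("BBB", 2018, 2.0)],
)
conn.executemany(
    "INSERT INTO key_ratios_snd VALUES (?, ?, ?)",
    [("AAA", 2020, 9.5), ("BBB", 2019, 2.5), ("CCC", 2017, 3.0)],
)

# Latest year wins; on a tie, the lower ord (cnd) wins.
query = """
SELECT symbol, fiscal_year, ratio, src
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (
               PARTITION BY symbol ORDER BY fiscal_year DESC, ord
           ) AS rn
    FROM (
        SELECT symbol, fiscal_year, ratio, 'cnd' AS src, 1 AS ord FROM key_ratios_cnd
        UNION ALL
        SELECT symbol, fiscal_year, ratio, 'snd' AS src, 2 AS ord FROM key_ratios_snd
    ) AS u
) AS ranked
WHERE rn = 1
ORDER BY symbol
"""
latest_rows = conn.execute(query).fetchall()
for row in latest_rows:
    print(row)
```

Unlike comparing against separate MAX(id) and MAX(fiscal_year) window values, this ordering still returns a row for a symbol whose latest year exists only in key_ratios_snd (BBB in the sample data).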

row_number() over() combined with order by

How can I add a sequential row number to a query that is using order by?
Let's say I have a query of this form:
SELECT row_number() over(), data
FROM myTable
ORDER BY data
This will produce the desired result as rows are ordered by "data", but the row numbers are also ordered by data. I understand this is normal as my row number is generated before the order by, but how can I generate this row number after the order by?
I did try to use a subquery like this :
SELECT row_number() over(ORDER BY data), *
FROM
(
SELECT data
FROM myTable
ORDER BY data
) As t1
but DB2 doesn't seem to support the syntax SELECT ..., * FROM.
Thanks !
You need to qualify '*' with the alias name:
SELECT row_number() over(ORDER BY data), t1.*
FROM
(
SELECT data
FROM myTable
ORDER BY data
) As t1
You don't need a subquery to do this:
SELECT data , row_number() over(ORDER BY data) as rn
FROM myTable
ORDER BY data
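A quick way to check that the numbering follows the sort order is to run the subquery-free form on a tiny table. The sketch below uses SQLite via Python's sqlite3 module rather than DB2, and the sample values are made up; it assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (data TEXT)")
# Inserted deliberately out of order.
conn.executemany(
    "INSERT INTO myTable VALUES (?)", [("pear",), ("apple",), ("fig",)]
)

# The window's ORDER BY drives the numbering; the outer ORDER BY drives
# the order in which rows are returned. Using the same key keeps them aligned.
query = """
SELECT data, ROW_NUMBER() OVER (ORDER BY data) AS rn
FROM myTable
ORDER BY data
"""
numbered_rows = conn.execute(query).fetchall()
for row in numbered_rows:
    print(row)
```

An empty OVER () gives the engine no ordering to number by, which is why the numbers in the original query did not follow the sorted output.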