Remove duplicates from query, while repeating - sql

I have an SQL table with some data like this, it is sorted by date:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:02 | a |
| 12:00:03 | b |
| 12:00:04 | b |
| 12:00:05 | c |
| 12:00:06 | c |
| 12:00:07 | a |
| 12:00:08 | a |
+----------+------+
So, I want my select result to be the following:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:03 | b |
| 12:00:05 | c |
| 12:00:07 | a |
+----------+------+
I have used the distinct clause but it removes the last two rows with Col2 = 'a'

You can use lag (SQL Server 2012+) to get the value in the previous row and then compare it with the current row value. If they are equal assign them to one group (1 here) and a different group (0 here) otherwise. Finally select the required rows.
select dt,col2
from (
select dt,col2,
case when lag(col2,1,0) over(order by dt) = col2 then 1 else 0 end as somecol
from t) x
where somecol=0

If you are using Microsoft SQL Server 2012 or later, you can do this:
select date, col2
from (
select date, col2,
case when isnull(lag(col2) over (order by date, col2), '') = col2 then 1 else 0 end as ignore
from (yourtable)
) x
where ignore = 0
This should work as long as col2 cannot contain nulls and if the empty string ('') is not a valid value for col2. The query will need some work if either assumption is not valid.

same as accepted answer (+1) just moving the conditions
assumes col2 is not null
select dt, col2
from ( select dt, col2,
lag(col2, 1) over(order by dt) as lagCol2
from t
) x
where x.lagCol2 is null or x.lagCol2 <> x.col2
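To try the lag() idea without a SQL Server instance, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. The table name t and column name dt follow the answers above; using SQLite is my own assumption for the demo, and it needs a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical in-memory table t(dt, col2) holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("12:00:01", "a"), ("12:00:02", "a"), ("12:00:03", "b"), ("12:00:04", "b"),
     ("12:00:05", "c"), ("12:00:06", "c"), ("12:00:07", "a"), ("12:00:08", "a")],
)

# Keep a row only when it starts a new run: either there is no previous row,
# or the previous row (in dt order) has a different col2 value.
query = """
SELECT dt, col2
FROM (SELECT dt, col2, LAG(col2) OVER (ORDER BY dt) AS prev_col2 FROM t) x
WHERE prev_col2 IS NULL OR prev_col2 <> col2
ORDER BY dt
"""
dedup_rows = conn.execute(query).fetchall()
for row in dedup_rows:
    print(row)
```

Like the answers above, this assumes col2 itself is never null.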

Related

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1 | 8091 | 0.9 |
| 1 | 8092 | 20 |
| 1 | 8095 | 0.22 |
| 1 | 8096 | 0.23 |
| 1 | 8098 | 0.23 |
| 2 | 8095 | 12 |
| 2 | 8096 | 18 |
| 2 | 8097 | 3 |
| 2 | 8098 | 0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap between date_key column, then the range won't have to take the next value, like in ID 1 between 8096 and 8098 date key.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups, incrementing by 1 every time a row with credit>=1 is found (in date_key order). After that it is just a group by on id and the group. Note that this splits ranges only at rows with credit>=1; gaps in date_key are not detected by this query.
select id,count(*) as range_days_credit_lt_1
from (select t.*
,sum(case when credit<1 then 0 else 1 end) over(partition by id order by date_key) as grp
from tbl t
) t
where credit<1
group by id, grp
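The question also asks that a gap in date_key ends a range. One way to sketch that is the usual "islands" trick: keep only the credit<1 rows and subtract row_number() from date_key, so that consecutive days share the same difference. The demo below runs on SQLite through Python purely so it can be executed anywhere (that choice is my assumption, and it needs SQLite 3.25 or later); the SQL itself only uses row_number() and group by, but I have not run it on Hive.

```python
import sqlite3

# Sample data from the question, loaded into an in-memory table named tbl.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, date_key INTEGER, credit REAL)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, 8091, 0.9), (1, 8092, 20), (1, 8095, 0.22), (1, 8096, 0.23), (1, 8098, 0.23),
    (2, 8095, 12), (2, 8096, 18), (2, 8097, 3), (2, 8098, 0.25)])

# Among the credit<1 rows of one id, consecutive date_keys give a constant
# (date_key - row_number), so that difference identifies each range.
query = """
SELECT id, COUNT(*) AS range_days
FROM (
    SELECT id, date_key,
           date_key - ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_key) AS grp
    FROM tbl
    WHERE credit < 1
) s
GROUP BY id, grp
ORDER BY id, MIN(date_key)
"""
ranges = conn.execute(query).fetchall()
for row in ranges:
    print(row)
```

On the sample data this yields the ranges 1, 2, 1 for ID 1 and 1 for ID 2, which is the output asked for in the question.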
The key is to collapse each consecutive sequence and compute its length. I managed this in a relatively clumsy way:
with t_test as
(
select num,row_number()over(order by num) as rn
from
(
select explode(array(1,3,4,5,6,9,10,15)) as num
)
)
select length(sign)+1 from
(
select explode(continue_sign) as sign
from
(
select split(concat_ws('',collect_list(if(d>1,'v',d))), 'v') as continue_sign
from
(
select t0.num-t1.num as d from t_test t0
join t_test t1 on t0.rn=t1.rn+1
)
)
)
Get the previous number b in the sequence for each original number a;
Check whether a-b is greater than 1, which indicates a "gap", and mark it as 'v';
Merge all the a-b values into one string, split it on 'v', and compute the length of each piece.
To carry the ID column through, a second string that encodes the id would be needed.

How to ignore nulls in PostgreSQL window functions? or return the next non-null value in a column

Let's say I have the following table:
| User_id | COL1 | COL2 |
+---------+----------+------+
| 1 | | 1 |
| 1 | | 2 |
| 1 | 2421 | |
| 1 | | 1 |
| 1 | 3542 | |
| 2 | | 1 |
I need another column indicating the next non-null COL1 value for each row, so the result would look like the below:
| User_id | COL1 | COL2 | COL3 |
+---------+----------+------+------
| 1 | | 1 | 2421 |
| 1 | | 2 | 2421 |
| 1 | 2421 | | |
| 1 | | 1 | 3542 |
| 1 | 3542 | | |
| 2 | | 1 | |
SELECT
first_value(COL1 ignore nulls) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
would work but I'm using PostgreSQL which doesn't support the ignore nulls clause.
Any suggested workarounds?
You can still do it with windowing function if you add a case when criteria in the order by like this:
select
first_value(COL1)
over (
partition by user_id
order by case when COL1 is not null then 0 else 1 end ASC, COL2
rows unbounded following
)
from table
This will use non null values first.
However, performance will probably not be great compared to a native ignore nulls, because the database has to sort on the additional criterion.
I also had the same problem. The other solutions may work, but I would have to build multiple windows for each row I need.
You can try these snippets: https://wiki.postgresql.org/wiki/First/last_(aggregate)
If you create the aggregates you can use them:
SELECT
first(COL1) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
There is always the tried and true approach of using a correlated subquery:
select t.*,
(select t2.col1
from t t2
where t2.id >= t.id and t2.col1 is not null
order by t2.id
fetch first 1 row only
) as nextcol1
from t;
Hope this helps,
SELECT * FROM TABLE ORDER BY COALESCE(colA, colB);
which orders by colA and if colA has NULL value it orders by colB.
You can use COALESCE() function. For your query:
SELECT
first_value(COALESCE(COL1)) over (partition by user_id order by COL2 rows unbounded following)
FROM table;
But I don't understand the reason for sorting by COL2, because these rows have a null value for COL2:
| User_id | COL1 | COL2 |
+---------+----------+------+
| 1 | | 1 |
| 1 | | 2 |
| 1 | 2421 | | <<--- null?
| 1 | | 1 |
| 1 | 3542 | | <<--- null?
| 2 | | 1 |
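Another workaround that avoids ignore nulls altogether is to count the non-null values in reverse order and use that count as a group key. Below is a rough sketch of it, run on SQLite from Python only so that it is easy to execute (my assumption; needs SQLite 3.25 or later). The id column is hypothetical: the question's table has no column that fixes the row order, so I added one that follows the order shown.

```python
import sqlite3

# Question's sample data plus a hypothetical id column that fixes the row order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, user_id INTEGER, col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, 1, None, 1), (2, 1, None, 2), (3, 1, 2421, None),
    (4, 1, None, 1), (5, 1, 3542, None), (6, 2, None, 1)])

# COUNT(col1) ignores nulls, so counting backwards gives every row the same
# number as the nearest non-null row at or after it. MAX over that group
# returns the group's single non-null value. The CASE blanks out rows that
# already have a col1, as in the question's expected table.
query = """
SELECT id, user_id, col1,
       CASE WHEN col1 IS NULL
            THEN MAX(col1) OVER (PARTITION BY user_id, grp) END AS col3
FROM (
    SELECT id, user_id, col1,
           COUNT(col1) OVER (PARTITION BY user_id ORDER BY id DESC) AS grp
    FROM t
) s
ORDER BY id
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

The same two window functions exist in PostgreSQL, so the query should carry over with little change, but I only ran it on SQLite.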

SQL group by under some conditions

I have a big table with tons of duplicated rows (among those columns that I care about). Let me start with the following example:
|field1 | field2| field3| field4| field5|
| aa | 1 | NULL | 1 | 0 |
| aaa | 1 | NULL | 1 | 1 |
| aaa | 1 | NULL | 1 | 2 |
| a | 2 | 0 | 1 | 3 |
| a | 2 | 0 | NULL | 4 |
| a | 2 | NULL | 2 | 5 |
| b | 3 | NULL | 2 | 6 |
| b2 | 3 | NULL | NULL | 7 |
| c | 4 | NULL | NULL | 8 |
I am interested in an efficient query to get the following table:
|field1 | field2| field3| field4|
| aaa | 1 | NULL | 1 |
| a | 2 | 0 | 1 |
| b | 3 | NULL | 2 |
| c | 4 | NULL | NULL |
Basically, it follows the following rules:
for each value of field2, there should be exactly one row present
among all the rows with the same value of field2, select the row that satisfies the following, in order:
select a row where field4 is not NULL (if possible)
among those that have a non-NULL value for field4, select a row that has a non-NULL value for field3
among those that have non-NULL values for field4 and field3, select the row that has the longest string value for field1
among those that satisfy all of the above, select only one row (the value of field5 does not matter).
I could do it with bunch of joins, but it becomes very slow. Any better suggestions?
EDIT
The field2 values may not be in any specific order. I just put 1,2,3,4 in the example, but this is not generally true in my case. I did not change it directly in the table since one of the suggested solutions actually assumes sequential values for field2, so I kept it for future readers who may be interested in that.
This type of prioritization is challenging. I think the simplest method in MySQL uses variables:
select t.*
from (select t.*,
(@rn := if(@f2 = field2, @rn + 1,
if(@f2 := field2, 1, 1)
)
) as seqnum
from t cross join
(select @rn := 0, @f2 := '') params
order by field2,
(field4 is not null) desc,
(field3 is not null) desc,
length(field1) desc
) t
where seqnum = 1;
I'm not 100% sure I have the conditions right (the third seems to conflict with the first two). But whatever the prioritization, the idea is the same: use order by to get the rows in the right order and use variables to get the first one.
EDIT:
In SQL Server -- or any other reasonable database -- you do this with row_number():
select t.*
from (select t.*,
row_number() over (partition by field2
order by (case when field4 is not null then 0 else 1 end),
(case when field3 is not null then 0 else 1 end),
len(field1) desc
) as seqnum
from t
) t
where seqnum = 1;
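To check the row_number() prioritization against the sample data from the question, here is a small sketch run on SQLite from Python (the choice of SQLite is my assumption, and it needs version 3.25 or later; length() stands in for SQL Server's len()). The extra field5 at the end of the order by is my own addition, just to make the pick deterministic when rows tie.

```python
import sqlite3

# Sample rows from the question in a table named t.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (field1 TEXT, field2 INTEGER, field3 INTEGER, field4 INTEGER, field5 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [
    ("aa", 1, None, 1, 0), ("aaa", 1, None, 1, 1), ("aaa", 1, None, 1, 2),
    ("a", 2, 0, 1, 3), ("a", 2, 0, None, 4), ("a", 2, None, 2, 5),
    ("b", 3, None, 2, 6), ("b2", 3, None, None, 7), ("c", 4, None, None, 8)])

# Rank rows within each field2: non-null field4 first, then non-null field3,
# then the longest field1; field5 only breaks remaining ties (an assumption).
query = """
SELECT field1, field2, field3, field4
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY field2
               ORDER BY CASE WHEN field4 IS NOT NULL THEN 0 ELSE 1 END,
                        CASE WHEN field3 IS NOT NULL THEN 0 ELSE 1 END,
                        LENGTH(field1) DESC,
                        field5
           ) AS seqnum
    FROM t
) ranked
WHERE seqnum = 1
ORDER BY field2
"""
picked = conn.execute(query).fetchall()
for row in picked:
    print(row)
```

On this data the result matches the table the question asks for.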

Self-referential CASE WHEN clause in SQL

I'm trying to migrate some poorly formed data into a database. The data comes from a CSV, and is first loaded into a staging table of all varchar columns (as I cannot enforce type safety at this stage).
The data might look like
COL1 | COL2 | COL3
Name 1 | |
2/11/16 | $350 | $230
2/12/16 | $420 | $387
2/13/16 | $435 | $727
Name 2 | |
2/11/16 | $121 | $144
2/12/16 | $243 | $658
2/13/16 | $453 | $214
The first column is a mixture of company names as pseudo-headers, and dates for which the column 2 and 3 data is relevant. I'd like to start transforming the data by creating a 'Brand' column - where 'StoreBrand' is the value of Col1 if Col2 is NULL, or the previous row's StoreBrand otherwise. Something like:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
I wrote this:
SELECT
t.*,
CASE
WHEN t.COL2 IS NULL THEN COL1
ELSE LAG(StoreBrand) OVER ()
END AS StoreBrand
FROM
(
SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
) t;
But the database (postgres in this case, but we're considering alternatives so the most diverse answer is preferred) chokes on LAG(StoreBrand) because that's the derived column I'm creating. Invoking LAG(Col1) only populates the first row's real data:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | 2/11/16
2/13/16 | $435 | $727 | 2/12/16
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | 2/11/16
2/13/16 | $453 | $214 | 2/12/16
My goal would be a StoreBrand column which is the first value of COL1 for all date values before the next brand name:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
The value of StoreBrand when Col2 and Col3 are null is inconsequential - that row will be dropped as part of the conversion process. The important thing is associating the data rows (i.e. those with dates) with their brand.
Is there a way to reference the previous value for the column that I'm missing?
Edit for people who find this question through a search engine:
The trick was to use WITH, which allows a temporary result to be used in several places.
I think this does what you want and discards the null rows at the same time (if you want to). We basically select all the brands before the row we are currently looking at and if no "brand row" exists between it and the current row then we take it.
WITH t AS
(SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
)
SELECT
a.COL1,
a.COL2,
a.COL3,
(SELECT b.COL1 FROM t b WHERE b.COL2 IS NULL AND b.i <= a.i AND NOT EXISTS(
SELECT * FROM t c WHERE c.COL2 IS NULL AND c.i <= a.i AND c.i > b.i)
) StoreBrand
FROM
t a
WHERE -- I don't think you need those rows? Otherwise remove it.
a.COL2 IS NOT NULL
It can be a bit confusing. t is a temporary table we defined with your query. And a, b and c are aliases for t. We could also write FROM t AS a to make it more obvious.
I think I understand what you want. Technically, you want the ignore nulls option on lag(), so it would look like this:
select lag(case when col1 not like '%/%/%' then col1 end ignore nulls) over (order by linenumber) as brandname
The only problem? Postgres doesn't support ignore nulls.
But, you can do pretty much the same thing with a subquery. The idea is to assign a grouping identifier to each group. And this is a cumulative count of the valid brand names. Then a simple max() aggregation works:
select t.*,
max(case when col1 not like '%/%/%' then col1 end) over (partition by grp) as brand
from (select t.*,
sum(case when col1 not like '%/%/%' then 1 end) over
(order by linenumber) as grp
from t
) t;
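Here is a runnable sketch of the grouping idea, using SQLite from Python so it can be tried without a Postgres server (my assumption; needs SQLite 3.25 or later). The staging_data table and its linenumber column are hypothetical: the running count only works if the staging table has some column that preserves the CSV line order.

```python
import sqlite3

# Hypothetical staging table; linenumber preserves the original CSV order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_data (linenumber INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO staging_data VALUES (?, ?, ?, ?)", [
    (1, "Name 1", None, None), (2, "2/11/16", "$350", "$230"),
    (3, "2/12/16", "$420", "$387"), (4, "2/13/16", "$435", "$727"),
    (5, "Name 2", None, None), (6, "2/11/16", "$121", "$144"),
    (7, "2/12/16", "$243", "$658"), (8, "2/13/16", "$453", "$214")])

# Step 1: a running count of brand rows gives each block of lines a group id.
# Step 2: the only non-date col1 in each group is the brand name, so MAX picks it.
# Step 3: drop the header rows after the brand has been attached.
query = """
WITH numbered AS (
    SELECT linenumber, col1, col2, col3,
           SUM(CASE WHEN col1 NOT LIKE '%/%/%' THEN 1 END)
               OVER (ORDER BY linenumber) AS grp
    FROM staging_data
), branded AS (
    SELECT numbered.*,
           MAX(CASE WHEN col1 NOT LIKE '%/%/%' THEN col1 END)
               OVER (PARTITION BY grp) AS store_brand
    FROM numbered
)
SELECT col1, col2, col3, store_brand
FROM branded
WHERE col2 IS NOT NULL
ORDER BY linenumber
"""
branded_rows = conn.execute(query).fetchall()
for row in branded_rows:
    print(row)
```

The filter on col2 has to sit outside the query that computes store_brand. If the header rows were removed first, the window would have no brand name left to pick up.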

display records based on ranks and also delete duplicated data

I have a table like this
+------+------+------+------+
| col1 | col2 | col3 | rank |
+------+------+------+------+
| 1 | A | X | 4 |
| 2 | C | Y | 3 |
| 2 | C | Y | 3 |
| | A | X | 3 |
| 1 | B | Z | 2 |
+------+------+------+------+
(5 rows)
I need output like this
+------+------+------+------+
| col1 | col2 | col3 | rank |
+------+------+------+------+
| 1 | A | X | 4 |
| 2 | C | Y | 3 |
| 1 | B | Z | 2 |
+------+------+------+------+
So I wrote the query below:
select col1,col2,col3,rank,dense_rank() over(order by rank desc) from table1;
But it's not giving the proper output.
Try this:
select a.col1,a.col2,a.col3,max(a.rank) as rank
from [dbo].[5] a join [dbo].[5] b
on a.col1=b.col1 group by a.col1,a.col2,a.col3
It looks like you need aggregation with max():
select
col1,col2,col3,
max(rnk)
from table1
group by col1,col2,col3
If you could have different values of col1 for one combination of col2, col3, then distinct on is what you need:
select distinct on (col2, col3)
col1,col2,col3,
rnk
from table1
order by col2, col3, rnk desc
The following should match what you are looking for:
select col1,col2,col3,rank,dense_rank() over(order by rank desc) from table1
WHERE col1 IS NOT NULL
GROUP BY 1, 2, 3, 4;
You can also use numeric aliases in your order by clause if you want one.