De-duplicating rows in a table with respect to certain columns and retaining the corresponding values in the other columns in HIVE - hive

I need to create a temporary table in Hive from an existing table that has 7 columns. I just want to get rid of duplicates with respect to the first three columns while retaining the corresponding values in the other 4 columns. I don't care which row is actually dropped when de-duplicating on the first three columns alone.

You could use something like the query below if you are not concerned about ordering:
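create table table2 as
select col1, col2, col3
-- split() takes a regular expression, so the pipe has to be escaped
-- note: col4..col7 come back as strings
,split(agg_col,'\\|')[0] as col4
,split(agg_col,'\\|')[1] as col5
,split(agg_col,'\\|')[2] as col6
,split(agg_col,'\\|')[3] as col7
from (Select col1, col2, col3,
-- max() keeps the lexicographically largest concatenated string per group
max(concat(cast(col4 as string),"|",
cast(col5 as string),"|",
cast(col6 as string),"|",
cast(col7 as string))) as agg_col
from table1
group by col1,col2,col3 ) A;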
Below is another approach, which gives much more control over ordering but is slower than the approach above:
create table table2 as
-- aliases keep the original column names in the new table
select col1, col2, col3,
max(col4) as col4, max(col5) as col5, max(col6) as col6, max(col7) as col7
from (Select col1, col2, col3,col4, col5, col6, col7,
rank() over ( partition by col1, col2, col3
order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
from table1 ) A
where A.col_rank = 1
GROUP BY col1, col2, col3;
The rank() over(..) function returns more than one row with rank 1 if the order by columns are all equal. In our case, if there are 2 rows with exactly the same values for all seven columns, there will still be duplicates after filtering on col_rank = 1. These duplicates are eliminated by the max and group by clauses in the query above.
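The tie behaviour described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption; it needs SQLite 3.25 or later for window functions), and the sample rows are made up for illustration. It also runs the same query with row_number(), which is not part of the answer above but is a common way to avoid ties altogether.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1, col2, col3, col4, col5, col6, col7)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a", "b", "c", 1, 2, 3, 4),
        ("a", "b", "c", 1, 2, 3, 4),  # exact duplicate of the row above
        ("a", "b", "c", 0, 9, 9, 9),
        ("x", "y", "z", 5, 6, 7, 8),
    ],
)

# Same shape as the inner query above; {fn} is swapped between rank and row_number.
query = """
SELECT col1, col2, col3, col4, col5, col6, col7
FROM (SELECT t.*,
             {fn}() OVER (PARTITION BY col1, col2, col3
                          ORDER BY col4 DESC, col5 DESC, col6 DESC, col7 DESC) AS rk
      FROM table1 t) a
WHERE rk = 1
ORDER BY col1
"""

rank_rows = conn.execute(query.format(fn="rank")).fetchall()
row_number_rows = conn.execute(query.format(fn="row_number")).fetchall()

print(len(rank_rows))        # rank() keeps both tied rows of the (a, b, c) group
print(len(row_number_rows))  # row_number() keeps exactly one row per group
```

With rank() the tied duplicate survives the filter, which is why the answer wraps the result in max and group by; with row_number() the outer aggregation would not be needed.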

Related

Select multiple columns but distinct only one in SQL?

Let's say I have a table called TABLE with the columns col1, col2, col3 and col4.
I want to select col1, col2 and col3, but with only distinct col2 values, and I can't do it.
I tried something like this:
SELECT DISTINCT "col1", "col2", "col3" FROM [Table] WHERE col1 = Values
But the output brings me more than one record of col 2 with the same value.
I know that is because DISTINCT applies to all the columns that I specified, but I don't know how to get all the columns and filter only on the values of col2.
Is it possible to SELECT more than 1 column but filter only one of them with SELECT DISTINCT ?
As you said, distinct just limits the full set of columns to eliminate duplicates. Instead, I'd just use an aggregate function with a GROUP BY statement.
SELECT MAX(col1) AS col1, col2,
MAX(col3) AS col3
FROM tbl
GROUP BY col2
That will take the top value alphanumerically from each of the supplied columns independently, so col1 and col3 in a result row may come from different source rows. Or, to list all values separated by commas:
SELECT STRING_AGG(col1,',') AS col1, col2,
STRING_AGG(col3,',') AS col3
FROM tbl
GROUP BY col2
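To see what the two queries return, here is a minimal sketch run against SQLite through Python's sqlite3 module. SQLite is an assumption made only to keep the example self-contained; it spells the aggregation group_concat rather than STRING_AGG, and the three sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [("a", "k1", "z"), ("b", "k1", "y"), ("c", "k2", "x")],
)

# One row per col2 value, taking the largest col1 and col3 in each group.
max_rows = conn.execute(
    "SELECT MAX(col1), col2, MAX(col3) FROM tbl GROUP BY col2 ORDER BY col2"
).fetchall()

# One row per col2 value, listing every col1 and col3 value in the group.
# group_concat is SQLite's counterpart to STRING_AGG; the order inside
# each list is not guaranteed.
agg_rows = conn.execute(
    "SELECT group_concat(col1, ','), col2, group_concat(col3, ',') "
    "FROM tbl GROUP BY col2 ORDER BY col2"
).fetchall()

print(max_rows)
print(agg_rows)
```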

Get one row per unique column value combination (`DISTINCT ON` operation without using it)

I have a table with 5 columns. For each unique combination of the first three columns, I want a single sample row. I don't care which values are considered for columns 4 and 5, as long as they come from the same row (they should be aligned).
I realise I can do a DISTINCT on the first three columns to fetch their unique combinations. But the problem is that I then cannot get the 4th and 5th column values.
I considered GROUP BY, I can do a group by on the first three columns, then I can do a MIN(col4), MIN(col5). But there is no guarantee that values of col4 and col5 are aligned (from the same row).
The DB I am using does not support DISTINCT ON operation, which I realise is what I really need.
How do I perform what DISTINCT ON does without actually using that operation ?
I am guessing this is how I would write the SQL if DISTINCT ON was supported:
SELECT
DISTINCT ON (col1, col2, col3)
col1, col2, col3, col4, col5
FROM TABLE_NAME
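-- number the rows within each (col1, col2, col3) group and keep the first one
select
col1, col2, col3, col4, col5
from (
select col1, col2, col3, col4, col5,
-- some databases require an order by here; any column will do
row_number() over (partition by col1, col2, col3 order by col4) as n
from table_name
) t
where n = 1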
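The alignment concern from the question can be checked directly. The following sketch assumes SQLite 3.25 or later via Python's sqlite3 module and uses invented sample rows; it compares the GROUP BY with MIN approach against row_number() and tests whether each returned row actually exists in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (col1, col2, col3, col4, col5)")
original = [(1, 1, 1, 10, "b"), (1, 1, 1, 20, "a"), (2, 2, 2, 30, "c")]
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?, ?, ?)", original)

# MIN is evaluated per column, so col4 and col5 can come from different rows.
grouped = conn.execute("""
    SELECT col1, col2, col3, MIN(col4), MIN(col5)
    FROM table_name GROUP BY col1, col2, col3 ORDER BY col1
""").fetchall()

# row_number() selects whole rows, so col4 and col5 stay aligned.
picked = conn.execute("""
    SELECT col1, col2, col3, col4, col5
    FROM (SELECT col1, col2, col3, col4, col5,
                 row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY col4) AS n
          FROM table_name) t
    WHERE n = 1 ORDER BY col1
""").fetchall()

print([row in original for row in grouped])  # the first group is a mix of two rows
print([row in original for row in picked])   # every picked row exists in the table
```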

Multi-column Rank in SQL Server

There probably is, but I'm slightly new to SQL Server. I need to rank/denserank a dataset, but the ranking is based on 6 columns. What I have at the moment is:
SELECT col1, col2, col3, col4, col5, col6, col7,
RANK() OVER(ORDER BY col2 desc) as APPLICANT_RANK
FROM myTable
So that works fine, but if there is a tie in col2, then I get two records ranked the same. What I want, if there's a tie in col2, is to look at the higher number in col3, then col4, and so on down the line to col6.
Thanks
You can include multiple columns in the order by clause in the rank function, just as you would when ordering the results of a whole query:
RANK() OVER(
ORDER BY col2 desc,col3 desc, col4 desc, col5 desc, col6 desc
) as APPLICANT_RANK
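A small sketch of the difference, using SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption; SQLite 3.25 or later is needed for RANK). The three sample applicants are invented: A and B tie on col2 and differ on col3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (col1, col2, col3, col4, col5, col6)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?, ?, ?)",
    [("A", 10, 5, 0, 0, 0), ("B", 10, 7, 0, 0, 0), ("C", 9, 9, 0, 0, 0)],
)

# Ranking on col2 alone: A and B share a rank.
single = dict(conn.execute(
    "SELECT col1, RANK() OVER (ORDER BY col2 DESC) FROM myTable"
).fetchall())

# Ranking on col2, then col3 ... col6: the tie is broken by col3.
multi = dict(conn.execute("""
    SELECT col1, RANK() OVER (
        ORDER BY col2 DESC, col3 DESC, col4 DESC, col5 DESC, col6 DESC)
    FROM myTable
""").fetchall())

print(single)
print(multi)
```

Rows that are equal on every listed column would still share a rank; ROW_NUMBER() is the usual choice when a strict sequence is required.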

Oracle SQL - Join 2 table columns in 1 row

I have 2 SQL queries and their results come out fine. There is no relation between the 2 queries, but I want to see all the columns in a single row.
e.g.
Select col1,col2,sum(col3) as col3 from table a
select col4,col5 from table b
I would like the result to be
col1 col2 col3 col4 col5
If there is no equivalent row in either table a or table b, replace the missing values with zeroes.
Could someone help me with this? Thanks.
Since you didn't provide any information such as the table structure or the data inside each table, you can cross join both tables.
select t.col1,t.col2,t.col3,t1.col1,t1.col2 from tab1 t,tab2 t1;
In both select statements, add a column based on rownum or row_number() and then full join the results using this column:
select nvl(col1, 0) col1, nvl(col2, 0) col2, nvl(col3, 0) col3,
nvl(col4, 0) col4, nvl(col5, 0) col5
from
(select rownum rn, col1, col2, col3 from (
select col1, col2, sum(col3) col3 from tableA group by col1, col2)) a
full join (select rownum rn, col4, col5 from tableB) b using (rn)
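The idea behind that query (number the rows of each result set, pair them by that number, and fill the gaps with zeroes) can be sketched outside the database as well. The snippet below is plain Python rather than Oracle SQL; the function name pair_by_position and the sample rows are hypothetical and only illustrate the pairing logic.

```python
from itertools import zip_longest

rows_a = [("x", "y", 30), ("p", "q", 12)]  # stand-in for col1, col2, sum(col3)
rows_b = [(1, 2), (3, 4), (5, 6)]          # stand-in for col4, col5


def pair_by_position(left, right, left_width, right_width, fill=0):
    """Pair rows by position, like a full join on a row number."""
    paired = []
    for a, b in zip_longest(left, right):
        # A missing side is padded with the fill value, like nvl(col, 0).
        a = a if a is not None else (fill,) * left_width
        b = b if b is not None else (fill,) * right_width
        paired.append(a + b)
    return paired


combined = pair_by_position(rows_a, rows_b, 3, 2)
for row in combined:
    print(row)
```

As in the SQL version, which rows end up side by side depends purely on the order in which each result set is produced, so the pairing carries no meaning beyond layout.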
I guess a UNION could be a pragmatic solution since the 2 queries are not related. They are just 2 data sets that should be retrieved in one statement:
Select col1,col2,sum(col3) as col3 from table a group by col1,col2
UNION
select col4,col5, to_number(null) col6 from table b
Note col6 in the example. A UNION requires both queries to return the same number of columns, and it is good practice for corresponding columns to have exactly the same datatype. Since sum(col3) yields a number column, col6 should be a number too.
The outcome of col4 and col5 will be shown in col1 and col2.

oracle sql query finding rows with multiple values in 3rd column matching columns 1 and 2

I have a dataset with about a million rows in an Oracle 11 db.
I'd like to find rows where col1 and col2 match but have different values in col3.
I'm not sure how to do this well, though I can certainly write a query that never seems to finish:
select col1,col2,col3
from table tab1
where exists
(select 1
from table tab2
where tab1.col1 = tab2.col1
and tab1.col2 = tab2.col2
and tab1.col3 != tab2.col3);
I ran this and after an hour gave up waiting - I need to analyze the problems and present it to some people for figuring out how to move forward.
Thanks in any case,
Jeff
A query like this will indicate which rows having the same col1, col2 have differing values in col3:
SELECT col1, col2
FROM x
GROUP BY col1, col2
HAVING MIN(col3) <> MAX(col3)
To see how many of these col1, col2 pairs are affected:
SELECT COUNT(*)
FROM (SELECT col1, col2
FROM x
GROUP BY col1, col2
HAVING MIN(col3) <> MAX(col3)
)
You may also wish to know how many exact duplicates there are (i.e. rows having col1, col2 and col3 all the same):
SELECT col1, col2, col3
FROM x
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1
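The three queries above can be tried on a handful of rows. This sketch uses SQLite through Python's sqlite3 module as a stand-in for Oracle (an assumption), with invented data: one pair with differing col3 values, one exact duplicate, and one unique row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (col1, col2, col3)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 1, "b"), (2, 2, "c"), (2, 2, "c"), (3, 3, "d")],
)

# (col1, col2) pairs that have more than one distinct col3 value.
differing = conn.execute("""
    SELECT col1, col2 FROM x
    GROUP BY col1, col2
    HAVING MIN(col3) <> MAX(col3)
""").fetchall()

# Number of affected pairs.
affected = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT col1, col2 FROM x
        GROUP BY col1, col2
        HAVING MIN(col3) <> MAX(col3))
""").fetchone()[0]

# Rows that are exact duplicates on all three columns.
duplicates = conn.execute("""
    SELECT col1, col2, col3 FROM x
    GROUP BY col1, col2, col3
    HAVING COUNT(*) > 1
""").fetchall()

print(differing, affected, duplicates)
```

Each query makes a single grouped pass over the table instead of a correlated lookup per row. One caveat: MIN and MAX skip NULLs, so a pair whose col3 values are one NULL and one non-NULL value would not be reported as differing.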
Did you mean something like this?
select col1,col2,col3
from table tab1
where col1 = col2
and col1 <> col3