How do I select 3 columns and then drop duplicates from only two of the selected columns? - sql

I'm working with a table that contains 3 columns, all of integer type.
I'm trying to replicate the following PySpark code in SQL:
df = my_table.select('column_1', 'column_2', 'column_3')
df = df.drop_duplicates(['column_1', 'column_2'])
In the above code I'm trying to select three columns and then drop duplicates from only the first two.
I tried using
SELECT
MIN(column_1), MIN(column_2), column_3
FROM my_table
GROUP BY column_3
It runs and looks plausible, but the output doesn't match the PySpark output.
Please advise.
Note: I'm writing this query in dbt, so I can't specify a SQL version.

I think you can try this (it assumes my_table has an id column to join on):
SELECT MIN(mt1.col1) AS min1, MIN(mt2.col2) AS min2, mt1.col3
FROM my_table AS mt1
JOIN my_table AS mt2 ON mt1.id = mt2.id
GROUP BY mt1.col3
HAVING MIN(mt1.col1) != MIN(mt2.col2);

I was able to drop rows with a duplicated (col1, col2) pair using ROW_NUMBER() in the following query:
SELECT col1, col2, col3
FROM
(
SELECT
col1, col2, col3,
-- col1 is constant within each partition, so which row survives is arbitrary
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1 DESC) AS row_num
FROM table_name
) AS ranked
WHERE row_num = 1
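To see why the GROUP BY column_3 attempt does not match drop_duplicates(['column_1', 'column_2']), here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whatever warehouse dbt targets, the sample rows are invented, and window functions need SQLite 3.25 or newer. The sketch orders by column_3 inside each partition so the surviving row is predictable, whereas PySpark keeps an arbitrary row per pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (column_1 INTEGER, column_2 INTEGER, column_3 INTEGER)"
)
# Invented sample rows: the pair (1, 1) appears twice with different column_3.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 20), (1, 2, 10), (2, 2, 30)],
)

# One row per (column_1, column_2) pair, like drop_duplicates on those columns.
dedup = conn.execute("""
    SELECT column_1, column_2, column_3
    FROM (
        SELECT column_1, column_2, column_3,
               ROW_NUMBER() OVER (
                   PARTITION BY column_1, column_2
                   ORDER BY column_3
               ) AS row_num
        FROM my_table
    ) AS ranked
    WHERE row_num = 1
    ORDER BY column_1, column_2
""").fetchall()

# The attempt from the question: one row per column_3 value instead.
grouped = conn.execute("""
    SELECT MIN(column_1), MIN(column_2), column_3
    FROM my_table
    GROUP BY column_3
    ORDER BY column_3
""").fetchall()

print(dedup)    # one row per distinct (column_1, column_2) pair
print(grouped)  # one row per distinct column_3 value
```

With this data the grouped query returns the pair (1, 1) twice and loses (1, 2) entirely, because it deduplicates on column_3 rather than on the first two columns.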

Related

SQL with having statement now want complete rows

Here is a mock table
MYTABLE ROWS
PKEY 1,2,3,4,5,6
COL1 a,b,b,c,d,d
COL2 55,44,33,88,22,33
I want to know which rows have duplicated COL1 values:
select col1, count(*)
from MYTABLE
group by col1
having count(*) > 1
This returns :
b,2
d,2
I now want all the rows that contain b and d. Normally I would use a WHERE ... IN statement, but with the count column in the result I'm not certain what type of statement to use.
Maybe you need:
select * from MYTABLE
where col1 in
(
select col1
from MYTABLE
group by col1
having count(*) > 1
)
Use a CTE and a windowed aggregate:
WITH CTE AS(
SELECT Pkey,
Col1,
Col2,
COUNT(1) OVER (PARTITION BY Col1) AS C
FROM dbo.YourTable)
SELECT PKey,
Col1,
Col2
FROM CTE
WHERE C > 1;
There are lots of ways to solve this; here's another:
select m.*
from MYTABLE m
join
(
select col1, count(*) as cnt
from MYTABLE
group by col1
having count(*) > 1
) s on s.col1 = m.col1;
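The approaches above should return the same rows. A small sketch with Python's sqlite3 module (used here only as a convenient stand-in engine; window functions need SQLite 3.25 or newer) loads the mock table from the question and compares the IN-subquery form with the windowed COUNT form. The variable names via_in and via_window are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MYTABLE (PKEY INTEGER PRIMARY KEY, COL1 TEXT, COL2 INTEGER)"
)
# The mock table from the question.
conn.executemany(
    "INSERT INTO MYTABLE VALUES (?, ?, ?)",
    [(1, "a", 55), (2, "b", 44), (3, "b", 33),
     (4, "c", 88), (5, "d", 22), (6, "d", 33)],
)

# Approach 1: filter with the HAVING query as an IN subquery.
via_in = conn.execute("""
    SELECT PKEY, COL1, COL2
    FROM MYTABLE
    WHERE COL1 IN (
        SELECT COL1 FROM MYTABLE GROUP BY COL1 HAVING COUNT(*) > 1
    )
    ORDER BY PKEY
""").fetchall()

# Approach 2: count per COL1 with a window function, then filter.
via_window = conn.execute("""
    WITH counted AS (
        SELECT PKEY, COL1, COL2,
               COUNT(*) OVER (PARTITION BY COL1) AS c
        FROM MYTABLE
    )
    SELECT PKEY, COL1, COL2
    FROM counted
    WHERE c > 1
    ORDER BY PKEY
""").fetchall()

print(via_in)
print(via_window)
```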

SQL query to remove duplicates from a table with 139 columns and load all columns to another table

I need to remove the duplicates from a table with 139 columns based on 2 columns and load the unique rows, with all 139 columns, into another table.
Example:
col1 col2 col3 .....col139
a b .............
b c .............
a b .............
Expected output:
col1 col2 col3 .....col139
a b .............
b c .............
I need a SQL query for DB2.
If the "other table" does not exist yet, you can create it like this:
CREATE TABLE othertable LIKE originaltable
Then insert the requested rows with this statement:
INSERT INTO othertable
SELECT col1,...,coln
FROM (SELECT
t.*,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS num
FROM t) t
WHERE num = 1
There are numerous tools out there that generate queries and column lists - so if you do not want to write it by hand you could generate it with these tools or use another SQL statement to select it from the Db2 catalog table (syscat.columns).
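As a rough illustration of generating the column list instead of typing 139 names, the sketch below uses Python's sqlite3 module. It is not Db2: PRAGMA table_info stands in for a query against syscat.columns, CREATE TABLE ... AS SELECT ... WHERE 0 stands in for CREATE TABLE ... LIKE, and a four-column table with invented rows stands in for the real one. Ordering by col3 inside each partition is an extra assumption made so the kept row is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Four columns stand in for the 139 columns of the real table.
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("a", "b", 1, 10), ("b", "c", 2, 20), ("a", "b", 3, 30)],
)

# SQLite has no CREATE TABLE ... LIKE; copy the structure with an empty SELECT.
conn.execute("CREATE TABLE othertable AS SELECT * FROM t WHERE 0")

# Read the column names from the catalog instead of writing them by hand.
columns = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
column_list = ", ".join(columns)

# Keep the first row of each (col1, col2) group and copy every original column.
conn.execute(f"""
    INSERT INTO othertable
    SELECT {column_list}
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col3) AS num
        FROM t
    ) AS numbered
    WHERE num = 1
""")

result = conn.execute("SELECT * FROM othertable ORDER BY col1, col2").fetchall()
print(column_list)
print(result)
```

Building the statement from catalog names keeps the helper column num out of the target table without maintaining a hand-written list.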
You might be better off just deleting the duplicates in place. This can be done without specifying a column list:
DELETE FROM
( SELECT
ROW_NUMBER() OVER (PARTITION BY col1, col2) AS DUP
FROM t
)
WHERE
DUP > 1
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by a, b order by a) as seqnum
from t
) t
where seqnum = 1;
If you don't want seqnum in the result set, though, you need to list out all the columns.
To find duplicate values in col1 or any column, you can run the following query:
SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1;
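If you want to delete those rows using the value of col1, you can run the following query. Note that it removes every row whose col1 value occurs more than once, including the first occurrence, so no copy is kept: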
DELETE FROM your_table WHERE col1 IN (SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1);
You can use the same approach to delete duplicate rows from the table using col2 values.
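The difference between deleting every duplicated row and keeping one copy can be checked with a small sketch in Python's sqlite3 module. The sample rows are invented, and the second variant relies on SQLite's implicit rowid, which is an SQLite-specific assumption and not something Db2 offers under that name.

```python
import sqlite3

def make_table():
    """Build a fresh in-memory table with one duplicated col1 value ('b')."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (col1 TEXT, col2 INTEGER)")
    conn.executemany(
        "INSERT INTO your_table VALUES (?, ?)",
        [("a", 1), ("b", 2), ("b", 3), ("c", 4)],
    )
    return conn

# Variant 1: DELETE ... IN (...) removes every row with a duplicated col1.
conn = make_table()
conn.execute("""
    DELETE FROM your_table
    WHERE col1 IN (
        SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1
    )
""")
all_removed = conn.execute(
    "SELECT col1, col2 FROM your_table ORDER BY col2"
).fetchall()

# Variant 2: keep the row with the lowest rowid per col1, delete the rest.
conn = make_table()
conn.execute("""
    DELETE FROM your_table
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM your_table GROUP BY col1
    )
""")
one_kept = conn.execute(
    "SELECT col1, col2 FROM your_table ORDER BY col2"
).fetchall()

print(all_removed)
print(one_kept)
```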

Select group by with a max predicate

Quite often I have to do queries like below:
select col1, max(id)
from Table
where col2 = 'value'
and col3 = ( select max(col3)
from Table
where col2 = 'value'
)
group by col1
Are there any other ways to avoid subqueries and temp tables? Basically I need a group by on all the rows with a particular max value. Assuming all proper indices are used.
You can use an OLAP function to achieve this. I would say this solution is marginally better in that your predicates are not duplicated between the main query and subquery, so you don't violate DRY:
SELECT col1, MAX(id) AS max_id
FROM (
-- rank every matching row by col3; rows tied for the largest col3 get rank 1
select col1, id,
RANK() OVER (ORDER BY col3 DESC) AS irow
from [Member]
where col2 = 'value'
) subquery
WHERE subquery.irow = 1
GROUP BY col1
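A minimal sketch, using Python's sqlite3 module and invented rows, that compares the subquery form from the question with a window-function form. The window version here ranks by col3 across all rows matching the filter and aggregates only the rank-1 rows, which is one way to express "group by on all the rows with a particular max value"; the table name Member follows the answer, and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Member (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 INTEGER)"
)
# Invented rows: the largest col3 among col2 = 'value' rows is 9.
conn.executemany(
    "INSERT INTO Member VALUES (?, ?, ?, ?)",
    [(1, "x", "value", 5), (2, "x", "value", 9), (3, "y", "value", 9),
     (4, "y", "value", 9), (5, "z", "value", 7), (6, "x", "other", 99)],
)

# Form from the question: the filter is repeated inside a scalar subquery.
with_subquery = conn.execute("""
    SELECT col1, MAX(id)
    FROM Member
    WHERE col2 = 'value'
      AND col3 = (SELECT MAX(col3) FROM Member WHERE col2 = 'value')
    GROUP BY col1
    ORDER BY col1
""").fetchall()

# Window form: rank once, keep rank 1 (ties included), then aggregate.
with_window = conn.execute("""
    SELECT col1, MAX(id) AS max_id
    FROM (
        SELECT col1, id,
               RANK() OVER (ORDER BY col3 DESC) AS irow
        FROM Member
        WHERE col2 = 'value'
    ) AS ranked
    WHERE irow = 1
    GROUP BY col1
    ORDER BY col1
""").fetchall()

print(with_subquery)
print(with_window)
```

RANK() is used instead of ROW_NUMBER() so that all rows tied for the maximum col3 are kept before grouping.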

SQL script for retrieving 5 unique values in a table ( google big query )

I am looking for a query that returns 5 unique values from a table.
The table has 100+ columns. Is there any way to get the unique values?
I am using Google BigQuery and tried this:
select col1, col2, ... coln
from tablename
where col1 is not null and col2 is not null
group by col1,col2... coln
order by col1, col2... coln
limit 5
The problem is that it returns zero records if all the columns are null.
I think you might be able to do this in Google BigQuery, assuming that the types for the columns are compatible:
select colname, colvalue
from (select 'col1' as colname, col1 as colvalue
from t
where col1 is not null
group by col1
limit 5
),
(select 'col2' as colname, col2 as colvalue
from t
where col2 is not null
group by col2
limit 5
),
. . .
For those not familiar with the syntax, a comma in the FROM clause means UNION ALL, not CROSS JOIN, in this dialect (BigQuery legacy SQL). Why did they have to change this?
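The per-column idea can be tried outside BigQuery as well. The sketch below uses Python's sqlite3 module with an explicit UNION ALL in place of the legacy comma syntax; the two-column table and its rows are invented, and each branch is wrapped in a subquery because SQLite does not allow LIMIT directly on one arm of a compound SELECT. A column that is entirely NULL simply contributes no rows, while the other columns still return their values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
# Invented rows: col2 is NULL everywhere, col1 has three distinct values.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, None), (1, None), (2, None), (None, None), (3, None)],
)

rows = conn.execute("""
    SELECT colname, colvalue FROM (
        SELECT 'col1' AS colname, col1 AS colvalue
        FROM t
        WHERE col1 IS NOT NULL
        GROUP BY col1
        LIMIT 5
    )
    UNION ALL
    SELECT colname, colvalue FROM (
        SELECT 'col2' AS colname, col2 AS colvalue
        FROM t
        WHERE col2 IS NOT NULL
        GROUP BY col2
        LIMIT 5
    )
    ORDER BY colname, colvalue
""").fetchall()

print(rows)
```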
Try this one; I hope it works:
;With CTE as (
select * ,ROW_NUMBER () over (partition by isnull(col1,''),isnull(col2,'')... isnull(coln,'') order by isnull(col1,'')) row_id
from tablename
) select * from CTE where row_id =1

Add Identity column to a view in SQL Server 2008

This is my view:
Create View [MyView] as
(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2
)
I need to add a new column named Id, and it must be unique, so I am thinking of adding it as an identity column. The view returns a large amount of data, so I need an approach that performs well. The view also combines two SELECT queries with UNION ALL, which I think may complicate things. What do you suggest?
Use the ROW_NUMBER() function in SQL Server 2008.
Create View [MyView] as
SELECT ROW_NUMBER() OVER( ORDER BY col1 ) AS id, col1, col2, col3
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) AS MyResults
GO
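The same view shape can be tried quickly with Python's sqlite3 module. This is only a sketch in a different engine (SQLite 3.25 or newer, not SQL Server 2008), with invented rows in Table1 and Table2; col1 is unique across both tables here, so the numbering is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (col1 INTEGER, col2 TEXT, col3 TEXT)")
conn.execute("CREATE TABLE Table2 (col1 INTEGER, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)", [(1, "a", "x"), (3, "c", "z")]
)
conn.executemany(
    "INSERT INTO Table2 VALUES (?, ?, ?)", [(2, "b", "y"), (4, "d", "w")]
)

# The id is computed whenever the view is queried; nothing is stored.
conn.execute("""
    CREATE VIEW MyView AS
    SELECT ROW_NUMBER() OVER (ORDER BY col1) AS id, col1, col2, col3
    FROM (
        SELECT col1, col2, col3 FROM Table1
        UNION ALL
        SELECT col1, col2, col3 FROM Table2
    ) AS MyResults
""")

rows = conn.execute(
    "SELECT id, col1, col2, col3 FROM MyView ORDER BY id"
).fetchall()
print(rows)
```

Because the id is recomputed on every query, inserting a row with a smaller col1 into either table shifts the ids of all later rows.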
The view is just a stored query that does not contain the data itself, so you cannot add a stable ID. If you need an ID for other purposes, such as paging, you can do something like this:
create view MyView as
(
select row_number() over ( order by col1) as ID, col1 from (
Select col1 From Table1
Union All
Select col1 From Table2
) a
)
There is no guarantee that the rows returned by a query using ROW_NUMBER() will be ordered exactly the same with each execution unless the following conditions are true:
Values of the partitioned column are unique. [partitions are parent-child, like a boss has 3 employees][ignore]
Values of the ORDER BY columns are unique. [if column 1 is unique, row_number should be stable]
Combinations of values of the partition column and ORDER BY columns are unique. [if you need 10 columns in your order by to get unique... go for it to make row_number stable]
There is a secondary issue here, with this being a view. Order By's don't always work in views (long-time sql bug). Ignoring the row_number() for a second:
create view MyView as
(
select top 10000000 col1 -- or: top 99.9999999 percent
from (
Select col1 From Table1
Union All
Select col1 From Table2
) a order by col1
)
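Using "row_number() over ( order by col1) as ID" is very expensive because the whole result has to be sorted.
This way is much cheaper (note that newid() generates a new value every time the view is queried, so the IDs are unique but not repeatable):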
Create View [MyView] as
(
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table1
Union All
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table2
)
Use ROW_NUMBER() with "order by (select null)"; this is less expensive and still gives you a unique id:
Create View [MyView] as
SELECT ROW_NUMBER() over (order by (select null)) as id, *
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) R
GO