How to select rows without duplicates when one column is different? - hive

This is my table with 4 columns:
a b e d
a f c d
I want to get all 1st and 4th columns, so that the first two rows will be merged into one row in the example, since they are the same:
a d
a d
When I use the command:
select column1, column4 from my_table;
Would this automatically remove duplicates? If not, how to get distinct rows with only the 1 and 4 columns?

The question is a little confusing.
Do you want to delete the duplicate data, or just select non-duplicate data?
If you want to delete the duplicate data, you can do it like this:
-- keeps one row per (col1, col4); which of the duplicates survives is decided by the order by
insert overwrite table my_table
select col1, col2, col3, col4
from (
    select t.*, row_number() over (partition by col1, col4 order by col2, col3) as rn
    from my_table t
) rs
where rs.rn = 1;
If you want to select the unique col1 and col4 values and don't want to change the underlying data, you can simply run
select distinct column1, column4 from my_table;
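To see the difference between the plain select and select distinct, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs anywhere); the names column2 and column3 are also assumed, since the question only names column1 and column4.

```python
import sqlite3

# In-memory SQLite stands in for Hive; the two rows are the ones from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (column1 TEXT, column2 TEXT, column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [("a", "b", "e", "d"), ("a", "f", "c", "d")],
)

# A plain projection keeps one output row per input row.
plain_rows = conn.execute("SELECT column1, column4 FROM my_table").fetchall()

# DISTINCT collapses output rows that are identical in every selected column.
distinct_rows = conn.execute(
    "SELECT DISTINCT column1, column4 FROM my_table"
).fetchall()

print(plain_rows)
print(distinct_rows)
```

So a plain select does not remove duplicates on its own; the distinct keyword is what merges the two rows.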

Related

SQL query to remove duplicates from a table with 139 columns and load all columns to another table

I need to remove the duplicates from a table with 139 columns based on 2 columns and load the unique rows with 139 columns into another table.
eg :
col1 col2 col3 .....col139
a b .............
b c .............
a b .............
o/p:
col1 col2 col3 .....col139
a b .............
b c .............
I need a SQL query for DB2.
If the "other table" does not exist yet you can create it like this
CREATE TABLE othertable LIKE originaltable
And then insert the requested rows with this statement:
INSERT INTO othertable
SELECT col1,...,coln
FROM (SELECT
        t.*,
        ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS num
      FROM originaltable t) t
WHERE num = 1
There are numerous tools out there that generate queries and column lists - so if you do not want to write it by hand you could generate it with these tools or use another SQL statement to select it from the Db2 catalog table (syscat.columns).
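As a rough illustration of generating the column list instead of typing it, the sketch below uses Python's sqlite3 module rather than Db2, so the catalog lookup is PRAGMA table_info instead of syscat.columns. The three-column table and its sample rows are hypothetical stand-ins for the 139-column table. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the wide table: only three columns instead of 139.
conn.execute("CREATE TABLE originaltable (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO originaltable VALUES (?, ?, ?)",
    [("a", "b", "x"), ("b", "c", "y"), ("a", "b", "z")],
)
# SQLite has no CREATE TABLE ... LIKE, so copy the column layout with an empty select.
conn.execute("CREATE TABLE othertable AS SELECT * FROM originaltable WHERE 0")

# Read the column names from the catalog instead of writing them by hand.
columns = [row[1] for row in conn.execute("PRAGMA table_info(originaltable)")]
column_list = ", ".join(columns)

# Keep the first row of each (col1, col2) group; ORDER BY col3 makes the choice repeatable.
conn.execute(f"""
    INSERT INTO othertable ({column_list})
    SELECT {column_list}
    FROM (SELECT o.*,
                 ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col3) AS num
          FROM originaltable o)
    WHERE num = 1
""")

deduplicated = conn.execute("SELECT * FROM othertable ORDER BY col1").fetchall()
print(deduplicated)
```

Listing the columns explicitly is what keeps the helper column num out of the target table.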
You might be better just deleting the duplicates in place. This can be done without specifying a column list.
DELETE FROM
( SELECT
ROW_NUMBER() OVER (PARTITION BY col1, col2) AS DUP
FROM t
)
WHERE
DUP > 1
You can use row_number():
select t.*
from (select t.*,
             row_number() over (partition by a, b order by a) as seqnum
      from t
     ) t
where seqnum = 1;
If you don't want seqnum in the result set, though, you need to list out all the columns.
To find duplicate values in col1 or any column, you can run the following query:
SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1;
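And if you want to delete those rows using the value of col1, you can run the following query. Note that it deletes every row whose col1 value occurs more than once, including the first occurrence, so no copy is kept: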
DELETE FROM your_table WHERE col1 IN (SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1);
You can use the same approach to delete duplicate rows from the table using col2 values.
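Before running a DELETE of this shape, it is worth checking which rows it actually removes. The sketch below uses Python's sqlite3 module with hypothetical sample data. It shows that the IN (... HAVING COUNT(*) > 1) form removes every copy of a repeated value, and contrasts it with a variant that keeps one row per value. The rowid-based variant is SQLite-specific and is an assumption for illustration, not part of the answer above.

```python
import sqlite3

def fresh_table():
    """Build a small table in which the col1 value 'a' appears twice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (col1 TEXT, col2 TEXT)")
    conn.executemany(
        "INSERT INTO your_table VALUES (?, ?)",
        [("a", "1"), ("a", "2"), ("b", "3")],
    )
    return conn

# Variant 1: the query from the answer. Both 'a' rows are deleted.
conn = fresh_table()
conn.execute(
    "DELETE FROM your_table WHERE col1 IN "
    "(SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1)"
)
after_in_delete = conn.execute("SELECT col1, col2 FROM your_table").fetchall()

# Variant 2 (SQLite-specific): keep the row with the smallest rowid per col1 value.
conn = fresh_table()
conn.execute(
    "DELETE FROM your_table WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM your_table GROUP BY col1)"
)
after_keep_one = conn.execute(
    "SELECT col1, col2 FROM your_table ORDER BY col1"
).fetchall()

print(after_in_delete)
print(after_keep_one)
```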

What is the difference between Postgres DISTINCT vs DISTINCT ON?

I have a Postgres table created with the following statement. This table is filled by a dump of data from another service.
CREATE TABLE data_table (
date date DEFAULT NULL,
dimension1 varchar(64) DEFAULT NULL,
dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;
One of the steps in an ETL I'm building is extracting the unique values of dimension1 and inserting them into another intermediary table.
However, during some tests I found out that the 2 commands below do not return the same results. I would expect both to return the same count.
The first command returns a larger number than the second (1504 vs. 1466).
-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;
-- command 2
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?
EDIT: The following queries both return 1504 (same as the "simple" DISTINCT)
SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;
SELECT count(dimension1)
FROM data_table;
Thank you!
DISTINCT and DISTINCT ON have completely different semantics.
First the theory
DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.
For example, assume a table R with the following contents:
#table r;
a | b
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a
(6 rows)
SELECT distinct * from R will result in:
# select distinct * from r;
a | b
---+---
1 | a
3 | d
2 | e
2 | b
3 | c
(5 rows)
Note that distinct applies to the entire list of projected attributes: thus
select distinct * from R
is semantically equivalent to
select distinct a,b from R
You cannot issue
select a, distinct b From R
DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.
DISTINCT ON is a PostgreSQL addition to the language. It is similar, but not identical, to GROUP BY.
Its syntax is:
SELECT DISTINCT ON (attributeList) <rest as any query>
For example:
SELECT DISTINCT ON (a) * from R
Its semantics can be described as follows. Compute the query as usual, without the DISTINCT ON (a), but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to GROUP BY). Now, do the projection using the first tuple in each group and ignore the other tuples.
Example:
select * from r order by a;
a | b
---+---
1 | a
1 | a
2 | b
2 | e
3 | c
3 | d
(6 rows)
Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:
SELECT DISTINCT on (a) * from r;
a | b
---+---
1 | a
2 | b
3 | c
(3 rows)
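The "sort, group, take the first tuple" description can be sketched in a few lines of Python. This is only an emulation of the semantics on an in-memory list, not how PostgreSQL implements it; the helper name distinct_on and its parameters are made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# The six tuples of table r from the example above, as (a, b) pairs.
r = [(1, "a"), (2, "b"), (3, "c"), (3, "d"), (2, "e"), (1, "a")]

def distinct_on(rows, key, order_by=None):
    """Sort the rows, group them by `key`, and keep the first row of each group.

    `order_by` plays the role of ORDER BY; it has to sort by the `key`
    columns first, just as PostgreSQL requires for DISTINCT ON.
    """
    ordered = sorted(rows, key=order_by if order_by is not None else key)
    return [next(group) for _, group in groupby(ordered, key=key)]

# Without a tie-breaker, the surviving row depends on the order the rows arrive in.
first_seen = distinct_on(r, key=itemgetter(0))
first_seen_reversed = distinct_on(list(reversed(r)), key=itemgetter(0))

# With a tie-breaker on b, the result no longer depends on the input order.
with_tiebreak = distinct_on(r, key=itemgetter(0), order_by=itemgetter(0, 1))
with_tiebreak_reversed = distinct_on(
    list(reversed(r)), key=itemgetter(0), order_by=itemgetter(0, 1)
)

print(first_seen, first_seen_reversed)
print(with_tiebreak, with_tiebreak_reversed)
```

The first pair of results differs for a = 2 and a = 3, which is why DISTINCT ON is normally paired with an ORDER BY that breaks ties.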
Some DBMSs (most notably SQLite) will allow you to run this query:
SELECT a,b from R group by a;
And this gives you a similar result.
PostgreSQL will allow this query if and only if there is a functional dependency from a to b. In other words, this query will be valid if, for any instance of the relation R, there is only one unique tuple for every value of a (thus selecting the first tuple is deterministic: there is only one tuple).
For instance, if the primary key of R is a, then a->b and:
SELECT a,b FROM R group by a
is identical to:
SELECT DISTINCT on (a) a, b from r;
Now, back to your problem:
First query:
SELECT DISTINCT count(dimension1)
FROM data_table;
computes the count of dimension1 (the number of tuples in data_table in which dimension1 is not null). This query returns one tuple, which is always unique (hence DISTINCT is redundant).
Query 2:
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
This is a query within a query. Let me rewrite it for clarity:
WITH tmp_table AS (
SELECT DISTINCT ON (dimension1)
dimension1 FROM data_table
GROUP by dimension1)
SELECT count(*) from tmp_table
Let us first compute tmp_table. As I mentioned above, let us first ignore the DISTINCT ON and do the rest of the query. This is a GROUP BY on dimension1. Hence this part of the query will result in one tuple per different value of dimension1.
Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the GROUP BY). Hence this makes the DISTINCT ON superfluous (it does nothing).
The final count is simply a count of all the tuples in the GROUP BY.
As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):
SELECT DISTINCT ON (a) a
FROM R
and
SELECT a FROM R group by a
and
SELECT DISTINCT a FROM R
Warning
Using DISTINCT ON can make the result of a query non-deterministic for a given instance of the database, unless an ORDER BY makes the first tuple of each group unambiguous.
In other words, the query might return different results for the same tables.
One interesting aspect
Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:
SELECT a, b FROM R group by a
is an illegal statement in standard SQL. Yet it runs on SQLite, which simply takes the value of b from an arbitrary tuple in each group of equal values of a.
In PostgreSQL this statement is rejected unless b is functionally dependent on a (for example, when a is the primary key of R). Instead, you can use DISTINCT ON and write:
SELECT DISTINCT ON (a) a,b from R
Corollary
DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.
The first query gives the number of not null values of dimension1, while the second one returns the number of distinct values of the column. These numbers obviously are not equal if the column contains duplicates or nulls.
The word DISTINCT in
SELECT DISTINCT count(dimension1)
FROM data_table;
makes no sense, as the query returns a single row. Maybe you wanted
SELECT count(DISTINCT dimension1)
FROM data_table;
which returns the number of distinct non-null values of dimension1. Note that it is not the same as
SELECT count(*)
FROM (
SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
-- GROUP BY dimension1 -- redundant
) AS tmp_table;
The last query yields the number of all (null or not) distinct values of the column.
It can help to see what happens with a concrete example.
Here's a bit of SQL to execute on PostgreSQL:
DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table (
id int NOT NULL primary key,
col1 varchar(64) DEFAULT NULL
);
INSERT INTO test_table (id, col1) VALUES
(1,'foo'), (2,'foo'), (3,'bar'), (4,null);
select count(*) as total1 from test_table;
-- returns: 4
-- Because the table has 4 records.
select distinct count(*) as total2 from test_table;
-- returns: 4
-- The count(*) is just one value. Making 1 total unique can only result in 1 total.
-- So the distinct is useless here.
select col1, count(*) as total3 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- Since there are 3 unique col1 values. NULL's are included.
select distinct col1, count(*) as total4 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- The result is already grouped, and therefore already unique.
-- So again, the distinct does nothing extra here.
select count(distinct col1) as total5 from test_table;
-- returns 2
-- NULL's aren't counted in a count by value. So only 'foo' & 'bar' are counted
select distinct on (col1) id, col1 from test_table order by col1 asc, id desc;
-- returns 3 rows: (3,'bar'),(2,'foo'),(4,NULL)
-- So it gets the records with the maximum id per unique col1
-- Note that the "order by" matters here. Changing that DESC to ASC would get the minimum id.
select count(*) as total6 from (select distinct on (col1) id, col1 from test_table order by col1 asc, id desc) as q;
-- returns 3.
-- After seeing the previous query, what else would one expect?
select distinct col1 from test_table order by col1;
-- returns 3 unique values : ('bar'),('foo'),(null)
select distinct id, col1 from test_table order by col1;
-- returns all records.
-- Because id is the primary key and therefore makes each returned row unique
Here's a more direct summary that might be useful for Googlers, answering the title but not the intricacies of the full post:
SELECT DISTINCT
availability: ISO
behaviour:
SELECT DISTINCT col1, col2, col3 FROM mytable
returns col1, col2 and col3 and omits any rows in which all of the tuple (col1, col2, col3) are the same. E.g. you could get a result like:
1 2 3
1 2 4
because those two rows are not identical due to the 4. But you could never get:
1 2 3
1 2 4
1 2 3
because 1 2 3 appears twice, and both rows are exactly the same. That is what DISTINCT prevents.
vs GROUP BY: SELECT DISTINCT is basically a subset of GROUP BY where you can't use aggregate functions: Is there any difference between GROUP BY and DISTINCT
SELECT DISTINCT ON
availability: PostgreSQL extension, WONTFIXED by SQLite
behavior: unlike DISTINCT, DISTINCT ON allows you to separate
what you want to be unique
from what you want to return
E.g.:
SELECT DISTINCT ON(col1) col2, col3 FROM mytable
returns col2 and col3, and does not return any two rows with the same col1. E.g.:
1 2 3
1 4 5
could not happen, because we have 1 twice on col1.
And e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
would prevent any duplicated (col1, col2) tuples, e.g. you could get:
1 2 3
1 4 3
as it has different (1, 2) and (1, 4) tuples, but not:
1 2 3
1 2 4
where (1, 2) happens twice, only one of those two could appear.
We can uniquely determine which one of the possible rows will be selected with ORDER BY which guarantees that the first match is taken, e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
ORDER BY col1 DESC, col2 DESC, col3 DESC
would ensure that among:
1 2 3
1 2 4
only 1 2 4 would be picked as it happens first on our DESC sorting.
vs GROUP BY: DISTINCT ON is not a subset of GROUP BY because it lets you return extra columns that are not in the GROUP BY list, which is generally not allowed with GROUP BY, unless:
you group by primary key in Postgres (unique not null is a TODO for them)
or that is allowed as an extension to ISO SQL, as in SQLite/MySQL
This makes DISTINCT ON extremely useful to fulfill the common use case of "find the full row that reaches the maximum/minimum of some column": Is there any difference between GROUP BY and DISTINCT
E.g. to find the city of each country that has the most sales:
SELECT DISTINCT ON ("country") "country", "city", "amount"
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
or equivalently with * if we want all columns:
SELECT DISTINCT ON ("country") *
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
Here each country appears only once, within each country we then sort by amount DESC and take the first, and therefore highest, amount.
RANK and ROW_NUMBER window functions
These can be used basically as supersets of DISTINCT ON, and were tested on both SQLite 3.34 and PostgreSQL 14.3. I highly recommend also looking into them, see e.g.: How to SELECT DISTINCT of one column and get the others?
This is how the above "city with the highest amount of each country" query would look with ROW_NUMBER:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER (
PARTITION BY "country"
ORDER BY "amount" DESC, "city" ASC
) AS "rnk",
*
FROM "Sales"
) sub
WHERE
"sub"."rnk" = 1
ORDER BY
"sub"."country" ASC
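To try the ROW_NUMBER version without a PostgreSQL server, a minimal sketch with Python's sqlite3 module follows. The rows in "Sales" are invented sample data, and the outer select lists the three columns explicitly so the helper column "rnk" stays out of the result. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Sales" ("country" TEXT, "city" TEXT, "amount" INTEGER)')
# Invented sample data; Osaka and Tokyo tie on amount to exercise the tie-breaker.
conn.executemany(
    'INSERT INTO "Sales" VALUES (?, ?, ?)',
    [
        ("France", "Paris", 30),
        ("France", "Lyon", 10),
        ("Japan", "Osaka", 20),
        ("Japan", "Tokyo", 20),
        ("Japan", "Kyoto", 5),
    ],
)

# Number the rows inside each country by descending amount, then keep row 1.
top_cities = conn.execute("""
    SELECT "country", "city", "amount"
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY "country"
                ORDER BY "amount" DESC, "city" ASC
            ) AS "rnk"
        FROM "Sales"
    ) sub
    WHERE "sub"."rnk" = 1
    ORDER BY "sub"."country" ASC
""").fetchall()

print(top_cities)
```

The secondary sort on "city" decides the tie between Osaka and Tokyo, so the result is the same on every run.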
Try
SELECT count(dimension1)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
ORDER BY dimension1) AS tmp_table;
In this query, DISTINCT ON (dimension1) has the same effect as GROUP BY dimension1.

Oracle: extract data from MDSYS.SDO_GEOMETRY column

I have a table from which I need to extract some information. This table has an Oracle Spatial (MDSYS.SDO_GEOMETRY) column, from which I also need some data.
I started out with a simple query like this:
select id, field1, field2
FROM my_table;
After that, I was able to loop over the result to extract the data that was in the spatial column:
SELECT *
FROM TABLE (SELECT a.POSITIONMAP.sdo_ordinates
FROM my_table a
WHERE ID = 18742084);
The POSITIONMAP.sdo_ordinates seems to usually hold 4 values, like these:
100050,887
407294,948
0,577464740471056
-0,816415625470689
I need the last 2 values. I can achieve that by changing the query into this:
SELECT * FROM
(SELECT rownum AS num,
column_value AS orientatie
FROM TABLE (SELECT a.POSITIONMAP.sdo_ordinates
FROM my_table a
WHERE ID = 18742084))
WHERE num IN (3,4)
Looping over every row from my first query to extract the data from the POSITIONMAP column is of course not very performance friendly, so my query becomes slow very quickly.
I would like to retrieve all information in one query, but there are a few things that prevent me from doing so.
Not every row in the table has data in POSITIONMAP.
Some rows do have data in POSITIONMAP, but they only contain 2 values (so not the 3rd and 4th values that I am looking for).
I need the data in one row for every row in the table (using the previous query would result in duplicate rows).
The closest I got is:
select
id,
field1,
field2,
t.*
FROM my_table v,
table (v.POSITIONMAP.sdo_ordinates) t
This gives me 4 rows for every row in my_table.
As soon as I try to put the rownum condition into this query, I get an error: "invalid user.table.column, table.column, or column specification"
Is there any way to combine what I want to do into 1 query?
You can use sdo_util.getvertices as follows:
select t.x,t.y
from my_table mt
,table(sdo_util.getvertices(mt.positionmap)) t
where t.id = 2
I'm assuming that your geometries are lines (gtype=2002) and points (gtype= 2001). If you want X,Y values for lines and empty values for point you can filter on the sdo_gtype property of the geometry object.
select t.x,t.y
from my_table mt
,table(sdo_util.getvertices(mt.positionmap)) t
where t.id = 2
and mt.positionmap.sdo_gtype=2002
union all
select null as X,
null as Y
from my_table mt
where mt.positionmap.sdo_gtype=2001
One method is to use the ROW_NUMBER() analytic function:
SELECT *
FROM (
select id,
field1,
field2,
t.*,
ROW_NUMBER() OVER ( PARTITION BY v.id ORDER BY ROWNUM ) AS rn
FROM my_table v,
TABLE( v.POSITIONMAP.sdo_ordinates ) t
)
WHERE rn IN ( 3, 4 )

Sort column values to match order of values in another table column

Let's say I have table like this:
Column1 Column2
C 2
B 1
A 3
I need to exchange values in the second column to get this:
Column1 Column2
C 3
B 2
A 1
The goal is only for the numeric column to have its values sorted to follow the alphabetical order of the other column. The actual table has multiple columns; column 1 holds people's names, while column 2 is the rank used for rendering column 1 values in the UI.
What is the best way to do this?
I am doing this from C# code, on SQL Server, and have to use System.Data.SqlClient.SqlCommand because of a transaction. But maybe that's not important if this can all be done in SQL.
Thank you!
So you need to update Column2 with the row number according to Column1?
You can use ROW_NUMBER and a CTE:
WITH CTE AS
(
SELECT Column1, Column2, RN = ROW_NUMBER() OVER (ORDER BY Column1)
FROM MyTable
)
UPDATE CTE SET Column2 = RN;
This updates the table MyTable and works because the CTE selects a single table. If it contains more than one table you have to JOIN the UPDATE with the CTE.
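For readers who want to experiment with the renumbering outside SQL Server, here is a minimal sketch using Python's sqlite3 module. SQLite cannot update through a CTE, so this swaps in a different technique: a correlated subquery that looks up each row's ROW_NUMBER. It assumes the Column1 values are unique and needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Column1 TEXT, Column2 INTEGER)")
# The three rows from the question.
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("C", 2), ("B", 1), ("A", 3)],
)

# For every row, look up its position in the alphabetical order of Column1.
# Assumes Column1 is unique; duplicates would make the lookup ambiguous.
conn.execute("""
    UPDATE MyTable
    SET Column2 = (
        SELECT rn
        FROM (
            SELECT Column1, ROW_NUMBER() OVER (ORDER BY Column1) AS rn
            FROM MyTable
        ) AS ranked
        WHERE ranked.Column1 = MyTable.Column1
    )
""")

reranked = conn.execute(
    "SELECT Column1, Column2 FROM MyTable ORDER BY Column1 DESC"
).fetchall()
print(reranked)
```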

Update Table Beginning At Record One SQL Server

I am trying to update a table with records from another table. Whenever I use the INSERT INTO statement, I find that the records are simply appended. Instead, I want the records to be inserted from the top of the table. What is the easiest way to do this? I am thinking I could use an UPDATE statement, but that means I will have to join the tables. One of the tables (the one I am pulling records from) has only one column. As such, I would have to include another column to do the join. I am trying not to make it so complicated. If there is a simpler way, please let me know.
Sample:
Table One
Col1
1
2
3
4
Table 2
Col1 Col2
a
b
c
d
I want to move column 1 from table 1 to column 2 in table 2 such that table 2 will be:
Table 2
Col1 Col2
a 1
b 2
c 3
d 4
You can do the update using row_number(), but the rows will be assigned in an indeterminate order:
with toupdate as (
select t2.*, row_number() over (order by (select NULL)) as seqnum
from table2 t2
),
t1 as (
select t1.*, row_number() over (order by (select NULL)) as seqnum
from table1 t1
)
update toupdate
set col2 = t1.col1
from toupdate join
t1
on toupdate.seqnum = t1.seqnum;
Note: if you have an ordering in mind, then use the appropriate order by in the over clauses.
Unless you explicitly define an ORDER BY clause in your SELECT statements, the order of your result set is arbitrary. This is in line with how any RDBMS should operate. You should consider including a timestamp at the time of insertion to identify the latest rows.