Find rows with all columns duplicated and no unique field in PostgreSQL - sql

Say I have a table like this, where no column or combination of columns is guaranteed to be unique:
GAME_EVENT | USERNAME   | ITEM      | QUANTITY
-----------+------------+-----------+---------
sell       | poringLUVR | sword     | 1
sell       | poringLUVR | sword     | 1
kill       | daenerys   | civilians | 200000
kill       | daenerys   | civilians | 200000
invoke     | sylvanas   | undead    | 1000000
And I want to retrieve the list of all rows that exist more than once (where the combination of ALL their columns appears more than once).
(In this case I would expect to get a list with the "sell/poringLUVR" and "kill/daenerys" rows)
What would be a good way of approaching this? Would a combined index be of any help? Suggestions for non-Postgres approaches are also welcome.

Assuming all columns NOT NULL, this will do:
SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );
ctid is a system column, the "tuple identifier" / "item pointer" that can serve as poor-man's PK in the absence of an actual PK (which you obviously don't have), and only within the scope of a single query. Related:
Delete duplicate rows from small table
How do I decompose ctid into page and row numbers?
If columns can be NULL, (more expensively) operate with IS NOT DISTINCT FROM instead of =. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
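A rough, untested sketch of that NULL-safe variant, using the same query shape as above:
SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  (t1.*) IS NOT DISTINCT FROM (t2.*)
   AND    t1.ctid <> t2.ctid
   );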
(t1.*) = (t2.*) compares ROW values. The shorter syntax t1 = t2 is equivalent, unless a column of the same name exists in the underlying tables, in which case the second form fails while the first won't. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Index?
If any of the columns has a particularly high cardinality (many distinct values, few duplicates), let's call it hi_cardi_column for the purpose of this answer, a plain btree index on just that column can be efficient for your task. A combination of a few, small columns with a multicolumn index can work, too. The point is to have a small, fast index or the overhead won't pay.
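For example, a minimal sketch (the index name here is made up for illustration):
CREATE INDEX tbl_hi_cardi_idx ON tbl (hi_cardi_column);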
SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  t1.hi_cardi_column = t2.hi_cardi_column  -- logically redundant
   AND    (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );
The added condition t1.hi_cardi_column = t2.hi_cardi_column is logically redundant, but helps to utilize said index.
Other than that I don't see much potential for index support as all rows of the table have to be visited anyway, and all columns have to be checked.

Related

How to return first value meeting condition and stop (sub-)query

I am trying to find a translation for a word, using a subquery for another table, joining by the word that needs to be translated. It works, but it makes the query very slow, because the table containing the translation is quite huge.
select name,
       (select name_en
        from   translations
        where  name = a.name
        and    name is not null
        and    name != ''
        order  by name_en desc
        limit  1) as name_en
from   original_table as a;
In my case it would be sufficient to just return the first value that meets the where clause of the subquery and then return it (without the ordering or limit 1). Is that even possible?
If you want any of the translations, do away with the ORDER BY clause and
CREATE INDEX ON translations (name);
Make sure that you put constraints on the translation table to improve performance and data quality:
define all columns NOT NULL: it is pointless to have NULL values there.
add a check constraint that forbids empty strings
make sure to have a unique constraint on name (it could be the primary key)
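Taken together, that could look roughly like this (a sketch assuming the two columns name and name_en from the question; the constraint names are arbitrary):
ALTER TABLE translations
   ALTER COLUMN name SET NOT NULL,
   ALTER COLUMN name_en SET NOT NULL,
   ADD CONSTRAINT translations_name_not_empty CHECK (name <> ''),
   ADD CONSTRAINT translations_name_key UNIQUE (name);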
Then you can omit those extra checks, you don't need the LIMIT, and you can convert the subquery to an outer join, which may boost performance for larger queries.
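A sketch of that outer-join version, assuming name is unique in translations:
SELECT o.name, t.name_en
FROM   original_table o
LEFT   JOIN translations t ON t.name = o.name;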
I would recommend a slightly different index. First, I strongly encourage you to write the query qualifying all column names:
select ot.name,
       (select t.name_en
        from   translations t
        where  t.name = ot.name
        and    t.name <> ''
        limit  1
       ) as name_en
from   original_table ot;
From your description, no order by is needed. The NULL comparison is redundant, so you can just remove it. I'm also not sure if t.name <> '' is needed. There are probably no matches on empty strings anyway.
The index is translations(name, name_en). Postgres now supports index-only scans (under some circumstances). If all the rows are "visible", then this can be resolved using only the index.
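In DDL terms that would be something like:
CREATE INDEX ON translations (name, name_en);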

Why query optimizer selects completely different query plans?

Let us have the following table in SQL Server 2016
-- generating 1M test table with four attributes
WITH x AS
(
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS
(
    SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n + 100000 * hundredthousands.n as id
    FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
       id % 50 predicate_col,
       row_number() over (partition by id % 50 order by id) join_col,
       LEFT('Value ' + CAST(CHECKSUM(NEWID()) AS VARCHAR) + ' ' + REPLICATE('*', 1000), 1000) as padding
INTO TestTable
FROM t1
GO

-- setting the `id` as a primary key (therefore, creating a clustered index)
ALTER TABLE TestTable ALTER COLUMN id int not null
GO
ALTER TABLE TestTable ADD CONSTRAINT pk_TestTable_id PRIMARY KEY (id)

-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col
ON TestTable (predicate_col, join_col)
GO
OK, and now when I run the following queries, which have just slightly different predicates (b.predicate_col <= 0 vs. b.predicate_col = 0), I get completely different plans.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
-- Q2
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col = 0
option (maxdop 1)
If I look at the query plans, it is clear that in the case of Q1 the optimizer chooses to join the key lookup together with the non-clustered index seek first and then does the final join with the non-clustered index (which is bad). A much better solution is used in the case of Q2: it joins the non-clustered indexes first and then does the final key lookup.
The question is: why is that and can I improve it somehow?
In my intuitive understanding of histograms, it should be easy to estimate the correct result for both variants of the predicate (b.predicate_col <= 0 vs. b.predicate_col = 0), so why the different query plans?
EDIT:
Actually, I do not want to change the indexes or the physical structure of the table. I would like to understand why the optimizer picks such a bad query plan in the case of Q1. Therefore, my question is precisely this:
Why does it pick such a bad query plan in the case of Q1, and can I improve it without altering the physical design?
I have checked the estimates in the query plans, and both plans have exact row-count estimates for every operator! I have checked the memo structure (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615, QUERYTRACEON 8620)) and the rules applied during compilation (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8619, QUERYTRACEON 8620)), and it seems that the optimizer finishes the plan search once it hits the first plan. Is this the reason for such behaviour?
This is caused by SQL Server's inability to use Index Columns to the Right of the Inequality search.
This code produces the same issue:
SELECT * FROM TestTable WHERE predicate_col <= 0 and join_col = 1
SELECT * FROM TestTable WHERE predicate_col = 0 and join_col <= 1
Inequality predicates such as >= or <= put a limitation on SQL Server: the optimiser can't use the columns in the index that follow the inequality column. So when you put an inequality on [predicate_col], you render the rest of the index useless; SQL Server can't make full use of the index and produces an alternate (bad) plan. [join_col] is the last column in the index, so in the second query SQL Server can still make full use of the index.
The reason SQL Server opts for the Hash Match is that it can't guarantee the order of the data coming out of table B. The inequality renders [join_col] in the index useless, so SQL Server has to prepare for unsorted data on the join, even though the row count is the same.
The only way to fix your problem (even though you don't like it) is to alter the Index so that Equality columns come before Inequality columns.
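For the index from the question, that would mean swapping the column order, along these lines (the index name is just an example):
CREATE NONCLUSTERED INDEX ix_TestTable_join_col_predicate_col
ON TestTable (join_col, predicate_col)
GO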
This can be answered from a statistics-and-histogram point of view, and also from an index-structure point of view. I will try to answer it from the index structure.
Although you get the same result for both queries (because there are no records with predicate_col < 0), when there is a range predicate on a composite index, the index is not fully utilised. There can also be many other reasons for an index not being utilised.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
If we want a plan like the one for Q2, we can create another composite index.
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col_1
ON TestTable (join_col,predicate_col)
GO
We then get a query plan exactly like Q2's.
Another way is to define a CHECK constraint on predicate_col:
Alter table TestTable ADD check (predicate_col>=0)
GO
This also gives the same query plan as Q2.
Whether you can create a CHECK constraint or another composite index on your real table and data is another discussion, though.

Compare one value of column A with all the values of column B in Hive HQL

I have two columns in one table, say column A and column B. I need to compare each value of column A with all the values of column B, and return true if the column A value is found in any of the rows of column B. How can I get this?
I have tried using the below command:
select column_A, column_B, if(column_A = column_B, True, False) as test from sample;
If I use the above command, it only checks within that particular row. But I need a true value if a value of column A is found in any of the rows of column B.
How can I check one value of column A against all the values of column B?
Or is there any possibility to iterate and compare each value between the two columns?
Solution
create temporary table t as
select rand() as id, column_A, column_B from sample;                       --> Refer 1

select distinct t3.id, t3.column_A, t3.column_B, t3.match                  --> Refer 3
from (
    select t1.id as id, t1.column_A as column_A, t1.column_B as column_B,  --> Refer 2
           if(t2.column_B is null, False, True) as match
    from t t1
    left outer join t t2 on t1.column_A = t2.column_B
) t3;
Explanation
1. Create an identifier column to keep track of the rows in the original table. I am using rand() here. We will take advantage of this to get back the original rows in step 3. A temporary table t is created here for simplicity in the next steps.
2. Use a LEFT OUTER JOIN with self to do the test that requires matching each column with the other across all rows, yielding the match column. Note that more rows may get created here than in the Sample table, but we have a handle on the duplicates, since the id column for them will be the same.
3. Apply distinct to get back the original rows as in the Sample table. You can then ditch the id column.
Notes
Self joins are costly in terms of performance, but one is unavoidable for a solution to this question.
The distinct used in step 3 is costly too. A more performant approach would be to use window functions, where we can partition by the id and pick the first row in each window. A rough sketch of that follows.
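An untested sketch of that window-function variant, reusing the temporary table t with its rand() id from the solution above:
select column_A, column_B, match
from (
    select t1.id, t1.column_A, t1.column_B,
           if(t2.column_B is null, false, true) as match,
           -- rank matching rows first within each id, so row 1 tells us whether any match exists
           row_number() over (
               partition by t1.id
               order by if(t2.column_B is null, 0, 1) desc
           ) as rn
    from t t1
    left outer join t t2 on t1.column_A = t2.column_B
) ranked
where rn = 1;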
You can do a left join to itself and check if the column key is null. If it is null, then that value is not found in the other table. Use if or "case when" function to check if it is null or not.
Select t1.column_A,
t1.column_B,
IF(t2.column_B is null, 'False', 'True') as test
from Sample t1
Left Join Sample t2
On t1.column_A = t2.column_B;

T-SQL Stored Procedure: Performance of select count(*) vs. select count([uniqueId])

So, I'm looking at a stored procedure here, which has more than one line like the following pseudocode:
if(select count(*) > 0)
...
on tables having a unique id (or identifier, for making it more general).
Now, in terms of performance, is it more performant to change this clause
to
if(select count([uniqueId]) > 0)
...
where uniqueId is, e.g., an Idx containing double values?
An example:
Consider a table like Idx (double) | Name (String) | Address (String)
Now the 'Idx' is a foreign key which I want to join in a stored procedure.
So, in terms of performance: what is better here?
if(select count(*) > 0)
...
or
if(select count(Idx) > 0)
...
? Or does the SQL engine change select count(*) to select count(Idx) internally, so we do not have to bother about this? Because at first sight, I'd say that select count(Idx) would be more performant.
The two are slightly different. count(*) counts rows. count([uniqueid]) counts the number of non-NULL values for uniqueid. Because a unique constraint allows a NULL value, SQL Server actually needs to read the column. This could add microseconds of time to a query, particularly if the page with the id is not already in memory. Not being tied to any particular column also gives SQL Server more opportunities to optimize count(*).
As @lad2025 writes in a comment, the performant solution is to use IF EXISTS (...).
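A minimal sketch of that pattern, with placeholder names (MyTable and @SomeIdx are not from the question):
DECLARE @SomeIdx float = 42;

IF EXISTS (SELECT 1 FROM MyTable WHERE Idx = @SomeIdx)
    PRINT 'at least one matching row exists';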
SELECT t1.*
FROM Table1 t1
JOIN Table2 t2 ON t2.idx = t1.idx
will give you only the rows in t1 that match an idx value in Table2. I'm not sure there is a good reason to do an if(select count...).
If you are really interested in the performance of something like this, just create a temp table with a million rows and give it a go:
CREATE TABLE #TempTable (id int identity, txt varchar(50))
GO
INSERT #TempTable (txt) VALUES (@@IDENTITY)
GO 1000000

sql server-query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
    select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id
    from Profile
    where
        (@a is null or a = @a) and
        (@b is null or b = @b) and
        ...(over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
By default SQL Server uses the clustered index to execute the query, but the total execution time is over 300. We tested another solution, such as a multi-column index on all columns in the WHERE clause, but the total execution time is over 400.
Do you have any solution to make the total execution time lower than 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
One other option that I just remembered implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query against it. This is easiest to do with dynamic SQL, but it could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID | Attribute Name | Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
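A possible DDL sketch for that vertical table (names and types are assumptions):
CREATE TABLE ProfileAttributes (
    ProfileID      int          NOT NULL,
    AttributeName  varchar(50)  NOT NULL,
    AttributeValue varchar(100) NOT NULL,
    CONSTRAINT UQ_ProfileAttributes UNIQUE (ProfileID, AttributeName)
);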
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
       OR (AttributeName = 'state' AND AttributeValue = 'MI')
    GROUP BY ProfileID
    HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    JOIN PassedInAttributeTable
        ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
        AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
    GROUP BY ProfileID
    HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
    SELECT
        [a].*
    FROM
        (SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
        INNER JOIN
        (SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
            ON ([a].id = [b].id) OR ([b].id IS NULL)
        INNER JOIN
        (SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
            ON ([a].id = [c].id) OR ([c].id IS NULL)
        .
        .
        .
        INNER JOIN
        (SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
            ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
    SELECT
        ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
        [filter].*
    FROM
        [filter]
)
SELECT
    *
FROM
    TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've got several issues imho. One is that you're going to end up with a seq scan no matter what you do.
But I think your more crucial issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approach of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add "with recompile" on the statement. The stored proc helps reduce the potential for SQL injection attacks. The recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
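A heavily trimmed sketch of that pattern against the question's Profile table, with just two of the 60-odd parameters, assumed names, and sp_executesql for the parameterised dynamic statement:
CREATE PROCEDURE dbo.SearchProfiles
    @city  varchar(50) = NULL,
    @state varchar(50) = NULL
WITH RECOMPILE
AS
BEGIN
    DECLARE @sql nvarchar(max) = N'SELECT p.* FROM Profile p WHERE 1 = 1';

    -- only append predicates for parameters that were actually supplied,
    -- so each search gets a plan (and indexes) matching the filled-in fields
    IF @city  IS NOT NULL SET @sql += N' AND p.city  = @city';
    IF @state IS NOT NULL SET @sql += N' AND p.state = @state';

    EXEC sp_executesql @sql,
         N'@city varchar(50), @state varchar(50)',
         @city = @city, @state = @state;
END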
I agree you should also look at the points mentioned above, like:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, then it is often faster to get the other columns you need in the select list and save the second table scan.
So it's always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combination of parameters.
I hope this makes sense without long code samples. (it is on my other machine)