Remove case-insensitive duplicates in SQL (Postgres)

I have a PostgreSQL database, and I'm trying to delete (or at least get the ids of) the older of the duplicates in my table, but only those that are duplicates because of case sensitivity, for example helLo and hello.
The table is quite large and my nested query takes a really long time. Is there a better, more efficient way to do this in a single query rather than splitting it into multiple queries? There are a lot of ids in question.
-- aliases renamed: IN is a reserved word in PostgreSQL and cannot be used as a bare alias
SELECT * FROM some_table AS o
WHERE (SELECT count(*) FROM some_table AS i
       WHERE o.text != i.text
       AND LOWER(i.text) = LOWER(o.text)
       AND i.created_at > o.created_at) > 1
Thanks!

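You could try:
SELECT id, LOWER(text), ROW_NUMBER() OVER (PARTITION BY LOWER(text) ORDER BY created_at) AS rn
FROM some_table
You can then filter on the rn column (wrap the query in a subquery or CTE, since rn cannot be referenced in the WHERE clause of the same SELECT). With this ordering, rn = 1 is the oldest row in each group; use ORDER BY created_at DESC instead if you want rn = 1 to be the newest row, so that rn > 1 marks the older duplicates.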
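As a rough illustration of the ROW_NUMBER() approach above, here is a minimal sketch that runs against an in-memory SQLite database through Python's sqlite3 module, as a stand-in for PostgreSQL. The id column, the sample rows, and the DESC ordering are assumptions made for the example. Note that partitioning by LOWER(text) also treats exact duplicates as duplicates, which is broader than the original query's text != text condition.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table contents are made up.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE some_table (id INTEGER PRIMARY KEY, text TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO some_table (id, text, created_at) VALUES (?, ?, ?)",
    [
        (1, "hello", "2020-01-01"),
        (2, "helLo", "2020-02-01"),
        (3, "HELLO", "2020-03-01"),
        (4, "world", "2020-01-15"),
    ],
)

# Number the rows of each case-insensitive group, newest first,
# so rn = 1 is the row to keep and rn > 1 marks the older duplicates.
older_ids_sql = """
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY LOWER(text)
                   ORDER BY created_at DESC
               ) AS rn
        FROM some_table
    ) AS ranked
    WHERE rn > 1
"""
older_ids = sorted(row[0] for row in conn.execute(older_ids_sql))

# Delete the older duplicates in a single statement.
conn.execute(f"DELETE FROM some_table WHERE id IN ({older_ids_sql})")
remaining = conn.execute("SELECT id, text FROM some_table ORDER BY id").fetchall()
```

The same SELECT and DELETE statements should carry over to PostgreSQL largely unchanged, but that has not been verified against real data here.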

To help this query, create an expression index on LOWER(text). Include created_at in the index to help the date comparisons.
CREATE INDEX text_lower ON some_table(LOWER(text), created_at);
It's hard to test this without your data, though.

Related

Oracle query with order by performance issue

I have a really complicated query:
select * from (
select * from tbl_user ...
where ...
and date_created between :date_from and :today
...
order by date_created desc
) where rownum <=50;
Currently the query is fast enough because of the where clause (only the 3 months before today, date_from = today - 90 days).
I have to remove this clause, but doing so degrades performance.
What if I first calculate date_from with
SELECT MIN(date_created) where...
and then insert this value into the main query? The set of data will be the same. Will it improve performance? Does it make sense?
Does anyone have any suggestions for optimization?
Using an order by operation will of course cause the query to take a little longer to return. That being said, it is almost always faster to sort in the DB than it is to sort in your application logic.
It's hard to really optimize without the full query and schema information, but I'll take a stab at what seems like the most obvious to me.
Converting to Rank()
Your query could be a lot more efficient if you use a windowed rank() function. I've also converted it to use a common table expression (aka CTE). This doesn't improve performance, but does make it easier to read.
with cte as (
    select
        tbl_user.*  -- Oracle needs a qualified * next to other select items
        , rank() over (
            partition by
                -- insert what fields differentiate your rows here
                -- unlike a group by clause, this doesn't need to be
                -- every field
            order by
                date_created desc
        ) as rk  -- alias referenced by the outer query
    from
        tbl_user
        ...
    where
        ...
        and date_created between :date_from and :today
)
select
    *
from
    cte
where
    rk <= 50
Indexing
If date_created is not indexed, it probably should be.
Take a look at your autotrace results. Figure out what filters have the highest cost. These are probably unindexed, and maybe should be.
If you post your schema, I'd be happy to make better suggestions.

How to retrieve the last 2 records from table?

I have a table with n records.
How can I retrieve the nth and (n-1)th records from my table in SQL without using a derived table?
I have tried using ROWID:
select * from table where rowid in (select max(rowid) from table);
It gives the nth record, but I want the (n-1)th record as well.
Is there any method other than using max, a derived table, or pseudocolumns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
from t
) t
where seqnum <= 2
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of row ids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on whether Oracle honors order by in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single-processor, single-threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order by clauses in subqueries are generally ignored. My advice would be to not depend on the functionality until Oracle clearly states that it is supported.
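To make the row_number() suggestion above concrete, here is a small sketch using Python's sqlite3 module with an in-memory database instead of Oracle. The table t, its id and name columns, and the four sample rows are all hypothetical; the point is only that ordering by an explicit id column, rather than by rowid, gives a well-defined "last two rows".

```python
import sqlite3

# In-memory SQLite as a stand-in for Oracle; table and data are made up.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO t (id, name) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)

# Number rows from the highest id downwards and keep the first two.
last_two = conn.execute(
    """
    SELECT id, name FROM (
        SELECT t.*, ROW_NUMBER() OVER (ORDER BY id DESC) AS seqnum
        FROM t
    ) AS numbered
    WHERE seqnum <= 2
    ORDER BY seqnum
    """
).fetchall()
```

Here last_two holds the nth row followed by the (n-1)th row, as defined by the id ordering.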
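select * from (select * from my_table order by rowid) where rownum <= 2
and for rows between N and M:
-- the inner rownum is aliased as rn: a bare "rownum >= N" in the
-- outer query never matches for N > 1
select * from (
    select t.*, rownum as rn from (
        select * from my_table order by rowid
    ) t where rownum <= M
) where rn >= N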
Try this (TOP is SQL Server syntax):
select top 2 * from table order by rowid desc
Assuming rowid is a column in your table (LIMIT is MySQL/PostgreSQL/SQLite syntax):
SELECT * FROM table ORDER BY rowid DESC LIMIT 2

How to speed up group-based duplication-count queries on unindexed tables

When I need to know how many values of a certain column c occur more than n times, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1)
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
    SELECT
        -- SQL Server requires an ORDER BY inside OVER() for ROW_NUMBER()
        ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY fieldC) AS sequence_id
    FROM
        yourTable
)
SELECT
    COUNT(*)
FROM
    sequenced_data
WHERE
    sequence_id = (n+1)
Assumes SQL Server 2005+
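The following sketch checks that the GROUP BY / HAVING form and the ROW_NUMBER() form count the same thing: a group has more than n rows exactly when it contains a row numbered n + 1. It uses Python's sqlite3 module with an in-memory database rather than SQL Server, and the table name your_table, the helper function names, and the sample data are hypothetical. It says nothing about relative performance on SQL Server.

```python
import sqlite3

# In-memory SQLite as a stand-in for SQL Server; data is made up:
# 'a' occurs 3 times, 'b' twice, 'c' once.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (c TEXT)")
conn.executemany(
    "INSERT INTO your_table (c) VALUES (?)",
    [("a",)] * 3 + [("b",)] * 2 + [("c",)],
)


def count_with_group_by(n):
    """Count groups with more than n rows via GROUP BY / HAVING."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT 1 FROM your_table GROUP BY c HAVING COUNT(*) > ?
        ) AS g
    """
    return conn.execute(sql, (n,)).fetchone()[0]


def count_with_row_number(n):
    """Count groups with more than n rows by looking for row number n + 1."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT ROW_NUMBER() OVER (PARTITION BY c ORDER BY c) AS sequence_id
            FROM your_table
        ) AS s
        WHERE sequence_id = ? + 1
    """
    return conn.execute(sql, (n,)).fetchone()[0]
```

For example, count_with_group_by(1) and count_with_row_number(1) both count the groups 'a' and 'b'.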
Without an index, the GROUP BY solution is the best: every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine if there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could perform some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Perhaps you could try restructuring your query with PARTITION BY, although, as others here have mentioned, I seriously doubt that it will have performance benefits.
For example:
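WITH duplicateRows AS (
    SELECT a.aFK,
           ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
    FROM Address a
)
-- without this filter the query would simply count every row in Address
SELECT COUNT(DuplicateCount) FROM duplicateRows WHERE DuplicateCount = n + 1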
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.

How to get all results, except one row based on a timestamp?

I have a simple question (?) about SQL. I have come across this problem a few times before and have always solved it, but I'm looking for a more elegant and perhaps faster solution.
The problem is that I would like to select all rows in a table except the one with the max value in a timestamp column (in this case it is a summary row, but it's not marked as such in any way, and it's not relevant to my result).
I could do something like this:
select * from [table] t
where loggedat < (select max(loggedat) from [table] where somecolumn='somevalue')
and somecolumn='somevalue'
But when working with large tables this seems kind of slow. Any suggestions?
If you don't want to change your DB structure, then your query (or one with a slight variation using <> instead of <) is the way to go.
You could add a column IsSummary bit to the table, and always mark the most recent row as true (and all others false). Then your query would change to:
Select * from [table] where IsSummary = 0 and somecolumn = 'somevalue'
This would sacrifice speed on inserts (since an insert would also trigger an update of the IsSummary value) in exchange for faster speed on the select query.
If you don't mind one small extra column in the result (ROW_NUMBER() returns a bigint), you could do it like this:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY loggedat DESC) AS rownum
FROM [table] t
WHERE somecolumn = 'somevalue'
/* and all the other filters you want */
) s
WHERE rownum > 1
In case you do mind the extra column, you'll just have to list the necessary columns explicitly in the outer SELECT.
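As a hedged sketch of how the two approaches above compare, the following uses Python's sqlite3 module with an in-memory database in place of SQL Server. The table name log, its id column, and the sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite as a stand-in for SQL Server; table and data are made up.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY, somecolumn TEXT, loggedat TEXT)"
)
conn.executemany(
    "INSERT INTO log (id, somecolumn, loggedat) VALUES (?, ?, ?)",
    [
        (1, "somevalue", "2021-01-01 10:00"),
        (2, "somevalue", "2021-01-01 11:00"),
        (3, "somevalue", "2021-01-01 12:00"),  # newest row: the summary
        (4, "other", "2021-01-01 13:00"),
    ],
)

# Approach 1: compare against the maximum timestamp from a subquery.
subquery_ids = [row[0] for row in conn.execute(
    """
    SELECT id FROM log
    WHERE loggedat < (SELECT MAX(loggedat) FROM log WHERE somecolumn = 'somevalue')
      AND somecolumn = 'somevalue'
    ORDER BY id
    """
)]

# Approach 2: number the rows newest first and skip row number 1.
window_ids = [row[0] for row in conn.execute(
    """
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (ORDER BY loggedat DESC) AS rn
        FROM log
        WHERE somecolumn = 'somevalue'
    ) AS s
    WHERE rn > 1
    ORDER BY id
    """
)]
```

With distinct timestamps both queries return the same rows. They differ when several rows share the maximum timestamp: the subquery version drops all of them, while the ROW_NUMBER() version drops exactly one.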
It may not be the elegant SQL query you're looking for, but it would be trivial to do it in Java, PHP, etc, after fetching the results. To make it as simple as possible, use ORDER BY timestamp DESC and discard the first row.

SQL query to get single row value from an aggregate

I have an Oracle table with two columns, ID and START_DATE. I want to run a query to get the ID of the record with the most recent date. Initially I wrote this:
select id from (select * from mytable order by start_date desc) where rownum = 1
Is there a cleaner and more efficient way of doing this? I often run into this pattern in SQL and end up creating a nested query.
SELECT id FROM mytable WHERE start_date = (SELECT MAX(start_date) FROM mytable)
Still a nested query, but more straightforward and also, in my experience, more standard.
This looks like a pretty clean and efficient solution to me. I don't think you can do any better than that, assuming of course that you have an index on start_date. If you want all ids for the latest start date, then froadie's solution is better.
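The difference between the two queries shows up when several rows share the latest date. The sketch below illustrates this with Python's sqlite3 module and an in-memory database. SQLite has no ROWNUM, so ORDER BY ... LIMIT 1 is swapped in for the Oracle "order by in a subquery plus rownum = 1" pattern; the sample rows are made up.

```python
import sqlite3

# In-memory SQLite as a stand-in for Oracle; rows 2 and 3 tie on the latest date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, start_date TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, start_date) VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2020-03-01"), (3, "2020-03-01")],
)

# "Top one after sorting": returns a single row, and which of the tied
# rows comes back is not defined.
latest_one = conn.execute(
    "SELECT id FROM mytable ORDER BY start_date DESC LIMIT 1"
).fetchall()

# MAX() subquery: returns every id that has the latest start_date.
all_latest = sorted(
    row[0]
    for row in conn.execute(
        "SELECT id FROM mytable "
        "WHERE start_date = (SELECT MAX(start_date) FROM mytable)"
    )
)
```

So the sorted-and-limited form is a fit when exactly one id is wanted, and the MAX() form when every id with the latest date is wanted.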