Identify duplicate fields in a table - sql

I'm trying to identify specific fields that are duplicated in a table in a mariadb-10.4.20 Joomla database. I would like to identify all rows that have a specific field duplicated, then ultimately be able to remove those duplicates, leaving just the one with the highest ID.
This table contains the IDs, titles and aliases for the articles in a joomla website. The script I'm building (in perl) will use this information to print the primary title alias and create redirects for any others.
I was previously using "group by" but it appears there's been a change recently in how it's used, and now it doesn't work properly. I don't understand the new format, and I'm not even sure it was previously working fully.
Here's a basic query that shows there are two of the same articles with different IDs:
MariaDB [mydb]> select id,alias,title from db1_content where title = "article title";
+--------+--------------+--------------+
| id     | alias        | title        |
+--------+--------------+--------------+
| 299959 | unique-title | Unique Title |
| 300026 | unique-title | Unique Title |
+--------+--------------+--------------+
Here's an attempt at trying to use "group by" but it returns no results.
MariaDB [mydb]> select id,title,count(title) from db1_content group by id,title having count(title) > 1;
Empty set (0.230 sec)
If I run the same query without the id field, then it does return a list of all titles that are duplicated, along with the number of occurrences of each title.
That's not exactly what I want, though. I need it to print the id, alias and title fields so I can reference them in my perl script to subsequently perform another query to ultimately delete the duplicates and create links to be used in RewriteRules.
What am I doing wrong?

Since MariaDB cannot currently delete from a CTE, you could use a derived table to generate row numbers for each title ordered by id descending, JOIN that to your main table and then delete any row which has a row number greater than 1. For example:
DELETE db1 FROM db1_content db1
JOIN (
SELECT id,
ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
FROM db1_content
) dbr ON db1.id = dbr.id
WHERE dbr.rn > 1
If you don't want to actually delete the records using SQL, you can just select the ones that need to be deleted by using a CTE:
WITH rns AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
FROM db1_content
)
SELECT id, alias, title
FROM rns
WHERE rn > 1
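For anyone who wants to try the row-number approach without touching a live database, here is a minimal sketch using Python's built-in sqlite3 module. SQLite (3.25 or later, for window functions) stands in for MariaDB, and the three sample rows are made up for illustration. It selects the duplicates with the CTE and then deletes them by id, which is roughly what a script would do.

```python
import sqlite3

# Hypothetical miniature of db1_content; SQLite stands in for MariaDB here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db1_content (id INTEGER PRIMARY KEY, alias TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO db1_content VALUES (?, ?, ?)",
    [
        (299959, "unique-title", "Unique Title"),
        (300026, "unique-title", "Unique Title"),
        (300100, "other-title", "Other Title"),
    ],
)

# Number the rows per title, highest id first; rn > 1 marks the older duplicates.
find_duplicates = """
    WITH rns AS (
        SELECT id, alias, title,
               ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
        FROM db1_content
    )
    SELECT id, alias, title FROM rns WHERE rn > 1
"""
to_delete = conn.execute(find_duplicates).fetchall()

# Delete exactly those ids, then list what is left.
conn.executemany("DELETE FROM db1_content WHERE id = ?", [(row[0],) for row in to_delete])
remaining = conn.execute("SELECT id FROM db1_content ORDER BY id").fetchall()
```

Here `to_delete` holds the id, alias and title of each row that loses out to a higher id, which is the information needed for building redirects before the rows are removed.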

Related

selecting duplicate columns in psql then sorting by row id

I have tried searching the web and whilst there are plenty of answers for finding duplicates, I am yet to stumble on one that allows me to find all the duplicates within a column (i.e where the same 'name' occurs more than once) and then only select the lowest row id (which would be the first duplicate name entered).
So the table's description (inserted from a file):
create table customer(id int, name varchar);
id| name
1 | Darren
2 | Mark
3 | Julie
4 | Mark
5 | Julie
The query:
CREATE VIEW AS
SELECT COUNT(name), name
FROM customer
GROUP BY name
HAVING COUNT(name) > 1
Result (the order is never guaranteed, I want Mark to always come first as he has the lowest id):
Julie
Mark
Now the issue is, if I select id I have to include it in the GROUP BY. Doing that means no duplicates get selected, since every id is unique. And without selecting id I can't ORDER BY it.
I hope I am clear, if not I can re-word or supply more information.
Try a nested query. The inner SELECT does the grouping and also computes the lowest id for each name; the outer query picks the columns you want and sorts by that lowest id.
CREATE VIEW duplicate_names AS -- the view name is a placeholder, pick your own
SELECT CNT_NAME, NAME
FROM
(
SELECT COUNT(name) CNT_NAME, name, min(id) min_id
FROM customer
GROUP BY name
HAVING COUNT(name) > 1
) AS alias
ORDER BY MIN_ID
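As a quick check of the idea, here is a small sketch that runs the grouping against the sample data from the question. It uses Python's sqlite3 module rather than Postgres (an assumption made only so the example is self-contained) and drops the view wrapper to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Darren"), (2, "Mark"), (3, "Julie"), (4, "Mark"), (5, "Julie")],
)

# Group by name only; MIN(id) carries the first id of each duplicated name,
# so the result can be ordered by it without putting id in the GROUP BY.
rows = conn.execute(
    """
    SELECT name, COUNT(name) AS cnt_name, MIN(id) AS min_id
    FROM customer
    GROUP BY name
    HAVING COUNT(name) > 1
    ORDER BY min_id
    """
).fetchall()
```

Mark comes first because his lowest id (2) is smaller than Julie's (3); Darren is filtered out by the HAVING clause.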

How can I sort table records alphabetically in SQL Server 2014 Management Studio?

I have many records in one table:
1 dog
2 cat
3 lion
I want to recreate the table, or sort the data, in alphabetical order:
1 cat
2 dog
3 lion
Table 1 (column name | data type | allow nulls):
Id | int | Unchecked
name | nvarchar(50) | Checked
To create another table from your table :
CREATE TABLE T1
( ID INT IDENTITY PRIMARY KEY NOT NULL,
NAME NVARCHAR(50) NOT NULL
)
GO
INSERT INTO T1 VALUES ('Dog'),('Cat'),('Lion');
SELECT ROW_NUMBER ()OVER (ORDER BY NAME ASC) ID, NAME INTO T2 FROM T1 ORDER BY NAME ASC;
If you just want to sort the table data, use Order by
Select * from table_1 order by Name
If you want to change the Id's as well according to alphabetical order, create a new table and move the records to the new table by order.
SELECT RANK() OVER (ORDER BY name ) AS Id, name
INTO newTable
FROM table_1
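One thing to watch when renumbering this way: RANK() gives equal names the same number and then skips ahead, while ROW_NUMBER() always produces 1, 2, 3 and so on. The sketch below illustrates the difference with Python's sqlite3 module standing in for SQL Server; the repeated "cat" row is added on purpose and is not part of the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?)",
    [(1, "dog"), (2, "cat"), (3, "lion"), (4, "cat")],  # "cat" repeated on purpose
)

# RANK: tied names share a number, and the next number is skipped.
ranked = conn.execute(
    "SELECT RANK() OVER (ORDER BY name) AS new_id, name "
    "FROM table_1 ORDER BY new_id, name"
).fetchall()

# ROW_NUMBER: always consecutive; the old id breaks ties deterministically.
numbered = conn.execute(
    "SELECT ROW_NUMBER() OVER (ORDER BY name, id) AS new_id, name "
    "FROM table_1 ORDER BY new_id"
).fetchall()
```

If the new Id column has to be unique, ROW_NUMBER() is probably the safer choice whenever names can repeat.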
In your database, the order of the records as they were inserted into the table does not necessarily dictate the order in which they're returned when queried. Nor does the ordering of a clustered key. There may be situations in which you appear to always get the same ordering of your results, but that is not guaranteed and may change at any time.
If the results of a query must be a specific order, then you must specify that ordering with an ORDER BY clause in your query (ORDER BY [Name] ASC in this particular case).
I understand, based upon your comments above, that you don't want this to be the answer. But this is how SQL Server (and any other relational database) works. If order matters, you specify that upon querying data from the system, not when inserting data into it.

How to get an incremental "RowId" column in SELECT using ROW_NUMBER()

I've been trying to update a query on the DataExplorer that we use on our Gaming SE Site for keeping track of tags without excerpts to include an incremental row number in the results to make reading the returned values easier. There are a number of questions on here that discuss how to do this, such as this one and this one which appear to have worked for those users, but I can't seem to get it to work for my situation.
To be clear, I would like something like this:
RowId | TagName | Count | Easy List Formatting
----------------------------------------------
1 | Tag1 | 6 | 1. [tag:tag1] (6)
2 | Tag2 | 6 | 1. [tag:tag2] (6)
3 | Tag3 | 5 | 1. [tag:tag3] (5)
4 | Tag4 | 5 | 1. [tag:tag4] (5)
What I've come up with so far is this:
SELECT ROW_NUMBER() OVER(PARTITION BY TagInfo.[Count] ORDER BY TagInfo.TagName ASC) AS RowId, *
FROM
(
SELECT
TagName,
[Count],
concat('1. [tag:',concat(TagName,concat('] (', concat([Count],')')))) AS [Easy List Formatting]
FROM Tags
LEFT JOIN Posts pe on pe.Id = Tags.ExcerptPostId
LEFT JOIN TagSynonyms on SourceTagName = Tags.TagName
WHERE coalesce(len(pe.Body),0) = 0 and ApprovalDate is null
) AS TagInfo
ORDER BY TagInfo.[Count] DESC, TagInfo.TagName
This yields something close to what I want, but not quite. The RowId column increments, but once the Count column changes, it resets (presumably because of the PARTITION BY). But, if I remove the PARTITION BY, the RowId column becomes what appear to be random numbers.
Is what I want to do achievable given the way the tables are structured? If so, what should the SQL be?
To access the forked query, you can use this link. The original query (before my changes) can be found here if it helps in any way.
Removing the PARTITION BY is exactly what is needed. The reason that your numbers look random is that the ORDER BY of the outer query is different from the ORDER BY of your ROW_NUMBER(). All you have to do is make those the same, and the output of the sequence project will have the monotonically increasing value you expect.
Specifically:
SELECT ROW_NUMBER() OVER (ORDER BY TagInfo.[Count] DESC, TagInfo.TagName) AS RowId, *
FROM
(
...
) AS TagInfo
ORDER BY TagInfo.[Count] DESC, TagInfo.TagName
Now you aren't partitioning, and the two ORDER BY clauses match, so you'll get your expected output.
For what it's worth, you don't really care about having an ORDER BY in the ROW_NUMBER(); you just want the same order as the final result set. In that case, you can try giving ROW_NUMBER() a meaningless ORDER BY clause. Be aware that nothing then guarantees the numbering follows the outer ORDER BY, so check the output before relying on it:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS RowId
Boom, done!
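To see the effect of PARTITION BY versus a matching ORDER BY side by side, here is a small sketch. It uses Python's sqlite3 module instead of the Data Explorer, and the `tag_info` table with its four rows is a made-up stand-in for the TagInfo subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag_info (tag_name TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO tag_info VALUES (?, ?)",
    [("tag3", 5), ("tag1", 6), ("tag4", 5), ("tag2", 6)],
)

# With PARTITION BY cnt the numbering restarts whenever the count changes.
partitioned = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (PARTITION BY cnt ORDER BY tag_name) AS row_id,
           tag_name, cnt
    FROM tag_info
    ORDER BY cnt DESC, tag_name
    """
).fetchall()

# Without it, and with the same ORDER BY inside and outside, it runs 1..n.
matched = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (ORDER BY cnt DESC, tag_name) AS row_id,
           tag_name, cnt
    FROM tag_info
    ORDER BY cnt DESC, tag_name
    """
).fetchall()
```

The first query reproduces the resetting RowId from the question; the second gives the continuous numbering that was asked for.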
A workaround I have used is to add a column to the original table that increments by 1, for example:
ALTER TABLE ur_table ADD id INT IDENTITY(1,1)
GO
After that, run your query ordered by the id column:
select * from ur_table (query) order by id

Delete multiple occurrences of the same ID # and code in a junction table

My problem is this: in this database the junction table contains some rows where the kha_id and the icd_fk are the same. While it's OK that kha_id appears in icd_junction more than once, it has to be with a separate icd_fk. I can run a query and get all of the ID#s and the codes which are listed more than once, but is there an industry-standard way of going about deleting all but one occurrence of each?
Example of what I have:
KHA_ID | ICD_FK
123456 | V23
123456 | V23
123456 | V24
I need one of the rows kha_id=123456 and ICD_FK=V23 taken out.
This:
DELETE j1
FROM ICD_Junction AS j1
WHERE EXISTS
( SELECT 1
FROM ICD_Junction AS j2
WHERE j2.KHA_ID = j1.KHA_ID
AND j2.ICD_FK = j1.ICD_FK
AND j2.ID < j1.ID
)
;
will delete, for each KHA_ID and ICD_FK, all but one relevant row of ICD_Junction. (Specifically, it will keep the one with the least ID, and delete the rest.)
Once you've run the above, you should fix whatever code caused the duplication, and add a unique constraint to prevent this from happening again.
(Disclaimer: Not tested, and it's been a while since I last used SQL Server.)
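Since the statement above is untested, here is a small sketch that exercises the same idea. It runs on SQLite through Python's sqlite3 module, not SQL Server, so the DELETE is written in SQLite's form (no `DELETE j1 FROM` alias syntax), and the index name `ux_kha_icd` is made up. It also adds the unique constraint and confirms that a new duplicate is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icd_junction (id INTEGER PRIMARY KEY, kha_id INTEGER, icd_fk TEXT)"
)
conn.executemany(
    "INSERT INTO icd_junction (kha_id, icd_fk) VALUES (?, ?)",
    [(123456, "V23"), (123456, "V23"), (123456, "V24")],
)

# Delete every row for which an identical pair with a smaller id exists,
# which keeps the row with the least id in each (kha_id, icd_fk) group.
conn.execute(
    """
    DELETE FROM icd_junction
    WHERE EXISTS (
        SELECT 1 FROM icd_junction AS j2
        WHERE j2.kha_id = icd_junction.kha_id
          AND j2.icd_fk = icd_junction.icd_fk
          AND j2.id < icd_junction.id
    )
    """
)
remaining = conn.execute(
    "SELECT id, kha_id, icd_fk FROM icd_junction ORDER BY id"
).fetchall()

# A unique index stops the duplication from coming back.
conn.execute("CREATE UNIQUE INDEX ux_kha_icd ON icd_junction (kha_id, icd_fk)")
try:
    conn.execute("INSERT INTO icd_junction (kha_id, icd_fk) VALUES (123456, 'V23')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Creating the unique index only succeeds after the cleanup, so it doubles as a check that no duplicates were missed.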
Edited to add: If I'm understanding your comment correctly, you also need help with the query to find duplicates? For that, you can write:
SELECT KHA_ID,
ICD_FK,
COUNT(1) -- the number of duplicates
FROM ICD_Junction
GROUP
BY KHA_ID,
ICD_FK
HAVING COUNT(1) > 1
;
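The original question asked how to delete the duplicates, but the comment asked how to find them. This lists every row that has an earlier duplicate:
Select jDup.*
FROM ICD_Junction AS j
JOIN ICD_Junction AS jDup
On j.KHA_ID = jDup.KHA_ID
AND j.ICD_FK = jDup.ICD_FK
AND j.ID < jDup.ID;
And this summarizes each duplicated pair with its highest ID, lowest ID and count:
Select max(jDup.ID), min(jDup.ID), count(*), jDup.KHA_ID, jDup.ICD_FK
FROM ICD_Junction AS jDup
Group By jDup.KHA_ID, jDup.ICD_FK
Having Count(*) > 1;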
You want something that uses ROW_NUMBER() and partition by. The reason is that it will let you pick one row to keep from a table that doesn't have a unique id. Like if this was a pure intersection table with no identity, you could use a variation on this to delete all rows where RowID > 1, leaving you just the unique rows. And it works just as well when you do have a unique id, where you can choose to preserve the earliest id.
select * from (select KHA_ID, ICD_FK, ROW_NUMBER()
OVER(PARTITION BY KHA_ID, ICD_FK
ORDER BY ID ASC) AS RowID
from ICD_Junction ) ordered where RowID > 1

Get last record of a table in Postgres

I'm using Postgres and cannot manage to get the last record of my table:
my_query = client.query("SELECT timestamp,value,card from my_table");
How can I do that, knowing that timestamp is a unique identifier of the record?
If by "last record" you mean the record which has the latest timestamp value, then try this:
my_query = client.query("
SELECT TIMESTAMP,
value,
card
FROM my_table
ORDER BY TIMESTAMP DESC
LIMIT 1
");
You can use
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
assuming you also want to sort by timestamp.
Easy way: ORDER BY in conjunction with LIMIT
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1;
However, LIMIT is not standard SQL and, as Wikipedia notes, the SQL standard's core functionality does not explicitly define a default sort order for NULLs (PostgreSQL puts NULLs first under DESC unless you add NULLS LAST). Finally, only one row is returned even when several records share the maximum timestamp.
Relational way:
The typical way of doing this is to check that no row has a higher timestamp than any row we retrieve.
SELECT timestamp, value, card
FROM my_table t1
WHERE NOT EXISTS (
SELECT *
FROM my_table t2
WHERE t2.timestamp > t1.timestamp
);
It is my favorite solution, and the one I tend to use. The drawback is that the intent is not immediately clear at a glance.
Instructive way: MAX
To circumvent this, one can use MAX in the subquery instead of the correlation.
SELECT timestamp, value, card
FROM my_table
WHERE timestamp = (
SELECT MAX(timestamp)
FROM my_table
);
But without an index, two passes over the data will be necessary, whereas the previous query can find the solution with only one scan. That said, we should not take performance into consideration when designing queries unless necessary, as we can expect optimizers to improve over time. However, this particular kind of query is quite common.
Show off way: Windowing functions
I don't recommend doing this, but maybe you can make a good impression on your boss or something ;-)
SELECT DISTINCT
first_value(timestamp) OVER w,
first_value(value) OVER w,
first_value(card) OVER w
FROM my_table
WINDOW w AS (ORDER BY timestamp DESC);
Actually this has the virtue of showing that a simple query can be expressed in a wide variety of ways (there are several others I can think of), and that picking one or the other form should be done according to several criteria such as:
portability (Relational/Instructive ways)
efficiency (Relational way)
expressiveness (Easy/Instructive way)
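To compare the Easy, Relational and Instructive forms side by side, here is a minimal sketch using Python's sqlite3 module in place of Postgres. The four sample rows are invented, and two of them deliberately share the highest timestamp (unlike the question, where it is unique) to show where the forms differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (timestamp INTEGER, value TEXT, card TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(100, "a", "c1"), (300, "b", "c2"), (300, "c", "c3"), (200, "d", "c4")],
)

# Easy way: always exactly one row, even when the maximum is shared.
limit_rows = conn.execute(
    "SELECT timestamp, value, card FROM my_table ORDER BY timestamp DESC LIMIT 1"
).fetchall()

# Relational way: keep the rows for which no later row exists.
not_exists_rows = conn.execute(
    """
    SELECT timestamp, value, card FROM my_table t1
    WHERE NOT EXISTS (
        SELECT * FROM my_table t2 WHERE t2.timestamp > t1.timestamp
    )
    ORDER BY value
    """
).fetchall()

# Instructive way: compare against MAX(timestamp).
max_rows = conn.execute(
    """
    SELECT timestamp, value, card FROM my_table
    WHERE timestamp = (SELECT MAX(timestamp) FROM my_table)
    ORDER BY value
    """
).fetchall()
```

With a unique timestamp all three return the same single row; with ties, only the LIMIT form silently drops the extra rows.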
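If your table has no id such as an auto-incrementing integer, and no timestamp, you can still get the last row that a scan returns with the following query. Note that without an ORDER BY the row order is not guaranteed, so treat the result as a best guess.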
select * from <tablename> offset ((select count(*) from <tablename>)-1)
For example, that could allow you to search through an updated flat file, find/confirm where the previous version ended, and copy the remaining lines to your table.
The last inserted record can be queried using this assuming you have the "id" as the primary key:
SELECT timestamp,value,card FROM my_table WHERE id=(select max(id) from my_table)
This assumes every new row inserted will use the highest integer value for the table's id.
If you'll accept a tip, add an id column of type serial to this table. The default of this field will be:
nextval('table_name_field_seq'::regclass).
You can then ask the sequence for the last value it handed out. Note that currval only works in a session that has already called nextval on that sequence. Using your example:
pg_query($connection, "SELECT currval('table_name_field_seq') AS id");
I hope this tip helps you.
To get the last row:
Last row in sorted order (the table has a column holding a time or a primary key):
1. Using the LIMIT clause
SELECT * FROM USERS ORDER BY CREATED_TIME DESC LIMIT 1;
2. Using the FETCH clause
SELECT * FROM USERS ORDER BY CREATED_TIME DESC FETCH FIRST ROW ONLY;
Last row in insertion order (the table has no time column or unique identifier):
3. Using the CTID system column, where ctid represents the physical location of the row in the table
SELECT * FROM USERS WHERE CTID = (SELECT MAX(CTID) FROM USERS);
Consider the following table,
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 | //as per created time, this is the last row
3 | C | 1535012279443 |
4 | D | 1535012212311 |
5 | E | 1535012254634 | //as per insertion order, this is the last row
Queries 1 and 2 return:
userid |username | createdtime |
2 | B | 1535042279423 |
while query 3 returns:
userid |username | createdtime |
5 | E | 1535012254634 |
Note: when an old row is updated, PostgreSQL removes the old row version and inserts the updated data as a new row version, so query 3 then typically returns the most recently modified tuple.
Now updating a row, using
UPDATE USERS SET USERNAME = 'Z' WHERE USERID='3'
the table becomes:
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 |
4 | D | 1535012212311 |
5 | E | 1535012254634 |
3 | Z | 1535012279443 |
Now query 3 returns:
userid |username | createdtime |
3 | Z | 1535012279443 |
Use the following
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
These are all good answers but if you want an aggregate function to do this to grab the last row in the result set generated by an arbitrary query, there's a standard way to do this (taken from the Postgres wiki, but should work in anything conforming reasonably to the SQL standard as of a decade or more ago):
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
It's usually preferable to do select ... limit 1 if you have a reasonable ordering, but this is useful if you need to do this within an aggregate and would prefer to avoid a subquery.
See also this question for a case where this is the natural answer.
Order by the column that records the time, in descending order:
select <COLUMN_NAME1, COLUMN_NAME2> from <TABLENAME> ORDER BY <COLUMN_NAME THAT MENTIONS TIME> DESC LIMIT 1;
For example, the user_details table below has a created_at column that holds the timestamp of each row.
SELECT userid, username FROM user_details ORDER BY created_at DESC LIMIT 1;
In Oracle SQL,
select * from (select row_number() over (order by rowid desc) rn, emp.* from emp) where rn=1;
select * from table_name LIMIT 1; -- without an ORDER BY ... DESC this returns an arbitrary row, not necessarily the last one