How to delete duplicate rows from a table without a unique key, with only "plain" SQL and no temporary tables?

Similar questions have been asked and answered here multiple times. From what I could find they were either specific to particular SQL implementation (Oracle, SQL Server, etc) or relied on a temporary table (where result would be initially copied).
I wonder if there is a platform-independent, pure-DML solution (just a single DELETE statement).
Sample data: Table A with a single field.
| account |
|---------|
| A22     |
| A33     |
| A44     |
| A22     |
| A55     |
| A44     |
The following SQL Fiddle shows an Oracle-specific solution based on the ROWID pseudo-column. It wouldn't work for any other database and is shown here just as an example.
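For reference, a typical ROWID-based DELETE looks something like this (Oracle only; assuming the table is called TableA and the column is account, as in the sample above):

DELETE FROM TableA
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)   -- keep one arbitrary "survivor" per account value
    FROM TableA
    GROUP BY account
);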

The only platform-independent way I can think of is to store the data in a secondary table, truncate the first, and load it back in:
create table _tableA (
AccountId varchar(255)
);
insert into _TableA
select distinct AccountId from TableA;
truncate table TableA;
insert into TableA
select AccountId from _TableA;
drop table _TableA;
If you have a column that is unique for each account, or if you relax the restriction to a single SQL dialect, then you can possibly find a single-query solution; a sketch follows.
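For example, assuming a hypothetical unique id column alongside AccountId, a single DELETE could keep the lowest id per account (some engines, e.g. MySQL, require wrapping the subquery in a derived table, but the idea is the same):

DELETE FROM TableA
WHERE id NOT IN (
    SELECT MIN(id)    -- keep one row per AccountId
    FROM TableA
    GROUP BY AccountId
);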

Related

Understanding the precise difference in how SQL treats temp tables vs inline views

I know similar questions have been asked, but I will try to explain why they haven't answered my exact confusion.
To clarify, I am a complete beginner to SQL so bear with me if this is an obvious question.
Despite being a beginner, I have been fortunate enough to be given a role doing some data science, and I was recently doing some work where I wrote a query that self-joined a table, then used an inline view on the result, which I then selected from. I can include the code if necessary, but I feel it is not needed for the question.
After running this, the admin emailed me and asked to please stop since it was creating very large temp tables. That was all sorted and he helped me write it more efficiently, but it made me very confused.
My understanding was that temp tables are specifically created by a statement like
SELECT INTO #temp1
I was simply using a nested select statement. Other questions on here seem to confirm that temp tables are different. For example the question here along with many others.
In fact I don't even have privileges to create new tables, so what am I misunderstanding? Was he using "temp tables" differently from the standard use, or do inline views create the same temp tables?
From what I can gather, the only explanation I can think of is that genuine temp tables are physical tables in the database, while inline views just store an array in RAM rather than in the actual database. Is my understanding correct?
There are two kinds of temporary tables in MariaDB/MySQL:
Temporary tables created via SQL
CREATE TEMPORARY TABLE t1 (a int)
Creates a temporary table t1 that is only available for the current session and is automatically removed when the current session ends. A typical use case is tests in which you don't want to clean everything up at the end.
Temporary tables/files created by server
If memory is too low (or the data set too large), if suitable indexes are not used, and so on, the database server needs to create temporary files for sorting, for collecting results from subqueries, etc. Temporary files are an indicator that your database design and/or your statements should be optimized: disk access is much slower than memory access and needlessly wastes resources.
A typical example of temporary file usage is a simple GROUP BY on a column which is not indexed (see the information displayed in the "Extra" column):
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | test | ALL | NULL | NULL | NULL | NULL | 4785970 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
1 row in set (0.000 sec)
The same statement with an index doesn't need to create a temporary table:
MariaDB [test]> alter table test add index(first_name);
Query OK, 0 rows affected (7.571 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | NULL | first_name | 58 | NULL | 2553 | Using index for group-by |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+

Normalization of multiple similar tables

I'm quite new to all this tech stuff, so excuse me in advance for making mistakes.
My question is regarding data normalization. I'm using PGadmin4 for this task.
I have multiple tables one for each year containing multiple columns. I wish to normalize these data in order to make further inquiries. The data is in this form:
Table 1
| id | name1 | code1| code2 | year|
| 1 | Peter | 111 | 222 | 2007|
Table 2
| id | name1 | code1| code2 | year|
| 2 | Peter | 111 | 223 | 2008|
So my tables are similar, but with some different data each year.
I have broken it down so I have multiple tables containing only one column of information:
name1_table
| id | name1 |
And I have done this for every column. Now I need to link it all together. Am I heading in the right direction, or have I gone off in a bad one?
What is the next step, and if possible, what is the code I need to use?
The easiest way to combine two tables with identical schemas is to create a new third table with the same schema and copy all the records into it.
Something like this:
INSERT INTO Table3 SELECT * FROM Table1;
INSERT INTO Table3 SELECT * FROM Table2;
Or if you simply need a combined query result you can use UNION:
SELECT * FROM Table1
UNION
SELECT * FROM Table2;
You are not headed in the right direction. The best approach is simply to store all the data in one table and to use indexes and/or partitions to access particular rows (a sketch of this is shown after the list below).
Sometimes this is not possible, notably because the tables have different formats. Possible solutions:
Break the existing tables into similarity sets based on columns, and create one table for each similarity set.
Create a table based on the most recent definition of the table, NULLing out columns that don't exist in historical tables.
Use a facility such as JSON for columns that have changed over time.
Use a facility such as inheritance for columns that have changed over time.
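A minimal sketch of the recommended single-table approach, in PostgreSQL syntax since pgAdmin is mentioned (table name, column types and index name are assumptions based on the question):

CREATE TABLE person_year (
    id    integer,
    name1 text,
    code1 integer,
    code2 integer,
    year  integer
);

-- load the existing per-year tables into it
INSERT INTO person_year SELECT * FROM table1;
INSERT INTO person_year SELECT * FROM table2;

-- an index on year then gives quick access to any single year
CREATE INDEX person_year_year_idx ON person_year (year);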

Update a MS Access field with the column count of another, not JOINed table

What I'm trying to do is create an update query in MS Access 2013 for a table separate from the actual data tables (meaning that there is no connection between the data table and the statistics table) to store some statistics (e.g. Count of records) that need to be stored for further calculations and later use.
I've looked up a bunch of tutorials over the past few days, with no luck finding a solution to my problem, as all of them involved joining the tables, which in my case is irrelevant: the table to be calculated on is temporary, with constantly changing data, so I always want to count every record, find the max over the whole temp table, etc. on a given date (like logging).
The structure of statisticsTable:
| statDate (Date/time) | itemCount (integer) | ... |
----------------------------------------------------
| 01/01/2017 | 50 | ... |
| 02/01/2017 | 47 | ... |
| 03/01/2017 | 43 | ... |
| ... | ... | ... |
What I want to do, in semi-gibberish code:
UPDATE statisticsTable
SET itemCount = (SELECT Count(*) FROM tempTable)
WHERE statDate = 01/01/2017;
This should update the itemCount field of 01/01/2017 in the statisticsTable with the current row count of the temp table.
I know that this might not be the standard OR the correct use of MS Access or any DBMS in general, however, my assignment is rather limited, meaning I can't (shouldn't) modify any table structures, connections or the database structure in general, only create the update query that works as described above.
Is it possible to update a table's field value with the output of a query calculating on another table, WITHOUT joining the two tables in MS Access?
EDIT 1:
After further research, the function DCount() might be able to give the results I'm looking for; I will test it.
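For reference, a rough (untested) sketch of that DCount() route; note that Access date literals use # delimiters:

UPDATE statisticsTable
SET itemCount = DCount("*", "tempTable")
WHERE statDate = #01/01/2017#;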
EDIT: I wrote a way more complicated answer that might not have even worked in Access (it would work in MS SQL-Server). Anyway.
What you need is a join criterion that is always true on which to base your update. You can just use IS NOT NULL:
SELECT s.*, a.itemCount
FROM statisticsTable AS s
INNER JOIN
(
    SELECT COUNT(*) AS itemCount
    FROM tempTable
) AS a
    ON s.[some field that is always populated] IS NOT NULL
   AND a.itemCount IS NOT NULL

Most efficient way of getting the next unused id

(related to Finding the lowest unused unique id in a list and Getting unused unique values on a SQL table)
Suppose I have a table containing an id column and some others (they don't make any difference here):
+-----+-----+
| id |other|
+-----+-----+
The id is an increasing numerical value. My goal is to get the lowest unused id and create that row. So of course the first time I run it, it will return 0 and that row will be created. After a few executions it will look like this:
+-----+-----+
| id |other|
+-----+-----+
| 0 | ... |
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
+-----+-----+
Fairly often some of these rows might get deleted. Let's assume the rows with the ids 1 and 3 are removed. Now the table will look like this:
+-----+-----+
| id |other|
+-----+-----+
| 0 | ... |
| 2 | ... |
| 4 | ... |
+-----+-----+
If I now run the query again, it should return the id 1, and that row should be created:
| id |other|
+-----+-----+
| 0 | ... |
| 1 | ... |
| 2 | ... |
| 4 | ... |
+-----+-----+
The next times the query runs, it should return the ids 3, 5, 6, etc.
What's the most efficient way to run this kind of query, given that I need to execute it several times per second (it is fair to assume that the ids are the only purpose of the table)? Is it possible to get the next unused row with one query? Or is it easier and faster to introduce another table which keeps track of the unused ids?
If it is significantly faster, it would also be acceptable to reuse the holes in the table in a different order, provided that all numbers get reused at some point.
Bonus question: I plan to use SQLite for this kind of storing information as I don't need a database except for storing these id's. Is there any other free (as in speech) server which can do this job significantly faster?
I think I'd create a trigger on delete, and insert the old.id in a separate table.
Then you can select min(id) from that table to get the lowest id.
Disclaimer: I don't know what database engine you use, so I don't know if triggers are available to you.
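A minimal sketch of that idea in SQLite (the engine mentioned in the question); the table names are assumptions:

CREATE TABLE id_pool_table (id INTEGER PRIMARY KEY);

CREATE TRIGGER collect_freed_id
AFTER DELETE ON main_table
BEGIN
    INSERT INTO id_pool_table (id) VALUES (old.id);   -- remember the freed id
END;

-- lowest reusable id:
SELECT MIN(id) FROM id_pool_table;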
Like Dennis Haarbrink said: a trigger on delete and another on insert:
The on-delete trigger would take the deleted id and insert it into an id pool table (a single id column).
The before-insert trigger would check whether an id value is provided; otherwise it queries the id pool table (e.g. SELECT MIN(id) FROM id_pool_table), assigns that value, and deletes it from the id_pool_table.
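A sketch of the before-insert half in MariaDB/MySQL syntax, with assumed table names (SQLite triggers cannot modify the row being inserted, so there the id would have to be selected in application code instead):

DELIMITER //
CREATE TRIGGER assign_reused_id
BEFORE INSERT ON main_table
FOR EACH ROW
BEGIN
    IF NEW.id IS NULL THEN
        -- take the lowest id from the pool and remove it
        SET NEW.id = (SELECT MIN(id) FROM id_pool_table);
        DELETE FROM id_pool_table WHERE id = NEW.id;
    END IF;
END//
DELIMITER ;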
Normally you'd let the database handle assigning the ids. Is there a particular reason you need to have the id's sequential rather than unique? Can you, instead, timestamp them, and just number them when you display them? Or make a separate column for the sequential id, and renumber them?
Alternatively, you could not delete the rows themselves, but rather, mark them as deleted with a flag in a column, and then re-use the id's of the marked rows by finding the lowest numbered 'deleted' row, and reusing that id.
The database doesn't care if the values are sequential, only that they are unique. The desire to have your id values sequential is purely cosmetic, and if you are exposing this value to users, it should not be your primary key, nor should there be any referential integrity based on the value, because a client could change the format if desired.
The fastest and safest way to deal with id value generation is to rely on native functionality that gives you a unique integer value (i.e. SQLite's AUTOINCREMENT). Using triggers only adds overhead, and using MAX(id) + 1 is extremely risky...
Summary
Ideally, use the native unique integer generator (SQLite/MySQL auto_increment, Oracle/PostgreSQL sequences, SQL Server IDENTITY) for the primary key. If you want a value that is always sequential, add an additional column to store that sequential value and maintain it as necessary. MySQL/SQLite/SQL Server unique integer generation only allows one such column per table; sequences are more flexible.
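A sketch of that native approach in SQLite (table and column names are assumptions):

CREATE TABLE items (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique; SQLite never reuses these values
    other TEXT
);

-- each insert receives the next unused id automatically
INSERT INTO items (other) VALUES ('...');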

Best way to encapsulate complex Oracle PL/SQL cursor logic as a view?

I've written PL/SQL code to denormalize a table into a much-easier-to-query form. The code uses a temporary table to do some of its work, merging some rows from the original table together.
The logic is written as a pipelined table function, following the pattern from the linked article. The table function uses a PRAGMA AUTONOMOUS_TRANSACTION declaration to permit the temporary table manipulation, and also accepts a cursor input parameter to restrict the denormalization to certain ID values.
I then created a view to query the table function, passing in all possible ID values as a cursor (other uses of the function will be more restrictive).
My question: is this all really necessary? Have I completely missed a much more simple way of accomplishing the same thing?
Every time I touch PL/SQL I get the impression that I'm typing way too much.
Update: I'll add a sketch of the table I'm dealing with to give everyone an idea of the denormalization that I'm talking about. The table stores a history of employee jobs, each with an activation row, and (possibly) a termination row. It's possible for an employee to have multiple simultaneous jobs, as well as the same job over and over again in non-contiguous date ranges. For example:
| EMP_ID | JOB_ID | STATUS | EFF_DATE    | other columns...
|      1 |     10 | A      | 10-JAN-2008 |
|      2 |     11 | A      | 13-JAN-2008 |
|      1 |     12 | A      | 20-JAN-2008 |
|      2 |     11 | T      | 01-FEB-2008 |
|      1 |     10 | T      | 02-FEB-2008 |
|      2 |     11 | A      | 20-FEB-2008 |
Querying that to figure out who is working when in what job is non-trivial. So, my denormalization function populates the temporary table with just the date ranges for each job, for any EMP_IDs passed in through the cursor. Passing in EMP_IDs 1 and 2 would produce the following:
| EMP_ID | JOB_ID | START_DATE  | END_DATE    |
|      1 |     10 | 10-JAN-2008 | 02-FEB-2008 |
|      2 |     11 | 13-JAN-2008 | 01-FEB-2008 |
|      1 |     12 | 20-JAN-2008 |             |
|      2 |     11 | 20-FEB-2008 |             |
(END_DATE allows NULLs for jobs that don't have a predetermined termination date.)
As you can imagine, this denormalized form is much, much easier to query, but creating it--so far as I can tell--requires a temporary table to store the intermediate results (e.g., job records for which the activation row has been found, but not the termination...yet). Using the pipelined table function to populate the temporary table and then return its rows is the only way I've figured out how to do it.
I think a way to approach this is to use analytic functions...
I set up your test case using:
create table employee_job (
    emp_id   integer,
    job_id   integer,
    status   varchar2(1 char),
    eff_date date
);
insert into employee_job values (1,10,'A',to_date('10-JAN-2008','DD-MON-YYYY'));
insert into employee_job values (2,11,'A',to_date('13-JAN-2008','DD-MON-YYYY'));
insert into employee_job values (1,12,'A',to_date('20-JAN-2008','DD-MON-YYYY'));
insert into employee_job values (2,11,'T',to_date('01-FEB-2008','DD-MON-YYYY'));
insert into employee_job values (1,10,'T',to_date('02-FEB-2008','DD-MON-YYYY'));
insert into employee_job values (2,11,'A',to_date('20-FEB-2008','DD-MON-YYYY'));
commit;
I've used the lead function to get the next date and then wrapped it all as a sub-query just to get the "A" records and add the end date if there is one.
select
    emp_id,
    job_id,
    eff_date start_date,
    decode(next_status, 'T', next_eff_date, null) end_date
from
    (
        select
            emp_id,
            job_id,
            eff_date,
            status,
            lead(eff_date, 1, null) over (partition by emp_id, job_id order by eff_date, status) next_eff_date,
            lead(status, 1, null) over (partition by emp_id, job_id order by eff_date, status) next_status
        from
            employee_job
    )
where
    status = 'A'
order by
    start_date,
    emp_id,
    job_id
I'm sure there are some use cases I've missed, but you get the idea. Analytic functions are your friend :)
EMP_ID  JOB_ID  START_DATE   END_DATE
     1      10  10-JAN-2008  02-FEB-2008
     2      11  13-JAN-2008  01-FEB-2008
     2      11  20-FEB-2008
     1      12  20-JAN-2008
Rather than having the input parameter as a cursor, I would have a table variable (I don't know if Oracle has such a thing; I'm a T-SQL guy) or populate another temp table with the ID values and join on it in the view/function or wherever you need to.
The only time for cursors, in my honest opinion, is when you have to loop. And when you have to loop, I always recommend doing that outside the database, in the application logic.
It sounds like you are giving away some read consistency here, i.e. it will be possible for the contents of your temporary table to be out of sync with the source data if you have concurrent data modification.
Without knowing the requirements or the complexity of what you want to achieve, I would attempt:
1. to define a view, containing the (possibly complex) logic in SQL, or else
2. to add some PL/SQL to the mix with a pipelined table function, but using a SQL collection type (instead of the temporary table). A simple example is here: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:4447489221109
Number 2 would give you fewer moving parts and solve your consistency issue; a rough sketch of it follows.
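A sketch of option 2, with all type, function and column names assumed:

CREATE OR REPLACE TYPE job_period_t AS OBJECT (
    emp_id     NUMBER,
    job_id     NUMBER,
    start_date DATE,
    end_date   DATE
);
/
CREATE OR REPLACE TYPE job_period_tab AS TABLE OF job_period_t;
/
CREATE OR REPLACE FUNCTION job_periods (p_src IN SYS_REFCURSOR)
    RETURN job_period_tab PIPELINED
IS
    v_emp   NUMBER;
    v_job   NUMBER;
    v_start DATE;
    v_end   DATE;
BEGIN
    LOOP
        FETCH p_src INTO v_emp, v_job, v_start, v_end;
        EXIT WHEN p_src%NOTFOUND;
        PIPE ROW (job_period_t(v_emp, v_job, v_start, v_end));  -- no temporary table involved
    END LOOP;
    CLOSE p_src;
    RETURN;
END;
/
-- Usage: pass the denormalizing SELECT (for example the analytic query above)
-- as a cursor expression, and optionally wrap the result in a view:
--   SELECT * FROM TABLE(job_periods(CURSOR(<denormalizing query>)));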
Mathew Butler
The real problem here is the "write-only" table design - by which I mean, it's easy to insert data into it, but tricky and inefficient to get useful information out of it! Your "temporary" table has the structure the "permanent" table should have had in the first place.
Could you perhaps do this:
Create a permanent table with the better structure
Populate it to match the data in the first table
Define a database trigger on the original table to keep the new table in sync from now on (a sketch is shown after this list)
Then you can just select from the new table to perform your reporting.
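A rough sketch of such a trigger; the new table's name and columns are assumptions based on the denormalized layout shown above:

CREATE OR REPLACE TRIGGER employee_job_sync
AFTER INSERT ON employee_job
FOR EACH ROW
BEGIN
    IF :new.status = 'A' THEN
        -- a job activation opens a new period
        INSERT INTO employee_job_period (emp_id, job_id, start_date, end_date)
        VALUES (:new.emp_id, :new.job_id, :new.eff_date, NULL);
    ELSIF :new.status = 'T' THEN
        -- a termination closes the currently open period for that job
        UPDATE employee_job_period
        SET end_date = :new.eff_date
        WHERE emp_id = :new.emp_id
          AND job_id = :new.job_id
          AND end_date IS NULL;
    END IF;
END;
/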
I couldn't agree with you more, HollyStyles. I also used to be a TSQL guy, and find some of Oracle's idiosyncrasies more than a little perplexing. Unfortunately, temp tables aren't as convenient in Oracle, and in this case, other existing SQL logic is expecting to directly query a table, so I give it this view instead. There's really no application logic that exists outside of the database in this system.
Oracle developers do seem to use cursors much more eagerly than I would have thought. Given the bondage & discipline nature of PL/SQL, that's all the more surprising.
The simplest solution is:
Create a global temporary table containing just the IDs you need:
CREATE GLOBAL TEMPORARY TABLE tab_ids (id INTEGER)
ON COMMIT DELETE ROWS;
Populate the temporary table with the IDs you need.
Use an EXISTS condition in your procedure to select only the rows whose IDs are in the IDs table:
SELECT yt.col1, yt.col2
FROM your_table yt
WHERE EXISTS (
    SELECT 'X' FROM tab_ids ti
    WHERE ti.id = yt.id
)
You can also pass a comma-separated string of IDs as a function parameter and parse it into a table. This is performed by a single SELECT. Want to know more - ask me how :-) But it's got to be a separate question.
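For the record, one common way to parse such a string in Oracle (11g or later, for the regular-expression functions) looks roughly like this; :id_list is a hypothetical bind variable holding e.g. '1,5,42':

SELECT yt.col1, yt.col2
FROM your_table yt
WHERE yt.id IN (
    -- split the comma-separated list into one numeric row per element
    SELECT TO_NUMBER(REGEXP_SUBSTR(:id_list, '[^,]+', 1, LEVEL))
    FROM dual
    CONNECT BY LEVEL <= REGEXP_COUNT(:id_list, ',') + 1
);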