CTAS is slower when a WHERE condition is included

There is a table with 412,499,154 records that I have to recreate based on a WHERE condition. The table has 6 columns, all of VARCHAR2 datatype.
When I recreate it with the entire content, it takes roughly 10 minutes.
CREATE TABLE TEMP_NEW_01 NOLOGGING
AS
SELECT COL1,COL2,COL3,COL4,COL5,COL6
FROM TEMP_DATA ;
When a WHERE clause is included, it runs indefinitely:
CREATE TABLE TEMP_NEW_01 NOLOGGING
AS
select COL1,COL2,COL3,COL4,COL5,COL6
FROM TEMP_DATA
WHERE COL1 IN
(SELECT COL1 FROM temp_m2
where SHORT_CAPTION in ( select SHORT_CAPTION from t_category
where scat_caption in ('P','V'))
);
Any suggestions for improving this? Thanks!

Instead of using (nested) subqueries, join the tables.
Make sure the columns you use in joins are indexed; if not, create the indexes first:
CREATE INDEX i1_dat_col1 ON temp_data (col1);
CREATE INDEX i1_m2_col1 ON temp_m2 (col1);
CREATE INDEX i2_m2_capt ON temp_m2 (short_caption);
CREATE INDEX i1_cat_capt ON t_category (short_caption);
Gather statistics on all tables and indexes before running the CREATE TABLE statement!
Finally:
CREATE TABLE temp_new_01
AS
SELECT d.col1,
d.col2,
d.col3,
d.col4,
d.col5,
d.col6
FROM temp_data d
JOIN temp_m2 m ON m.col1 = d.col1
JOIN t_category c ON c.short_caption = m.short_caption
WHERE c.scat_caption IN ('P', 'V');
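For a quick sanity check that the join rewrite returns the same rows as the nested IN version, here is a minimal sketch using SQLite (via Python) as a stand-in for Oracle; the tiny tables and values are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE temp_data (col1 TEXT, col2 TEXT);
CREATE TABLE temp_m2 (col1 TEXT, short_caption TEXT);
CREATE TABLE t_category (short_caption TEXT, scat_caption TEXT);
INSERT INTO temp_data VALUES ('a','x'), ('b','y'), ('c','z');
INSERT INTO temp_m2 VALUES ('a','s1'), ('b','s2');
INSERT INTO t_category VALUES ('s1','P'), ('s2','Q');
""")

# Original form: nested IN subqueries.
nested = cur.execute("""
    SELECT col1, col2 FROM temp_data
    WHERE col1 IN (SELECT col1 FROM temp_m2
                   WHERE short_caption IN (SELECT short_caption FROM t_category
                                           WHERE scat_caption IN ('P','V')))
    ORDER BY col1""").fetchall()

# Rewritten form: plain joins.
joined = cur.execute("""
    SELECT d.col1, d.col2
    FROM temp_data d
    JOIN temp_m2 m ON m.col1 = d.col1
    JOIN t_category c ON c.short_caption = m.short_caption
    WHERE c.scat_caption IN ('P','V')
    ORDER BY d.col1""").fetchall()

print(nested)   # [('a', 'x')]
print(joined)   # [('a', 'x')]
```

One caveat on the rewrite: if col1 is not unique in temp_m2 (or short_caption is not unique in t_category), the join version can return duplicate rows where the IN version would not; add DISTINCT in that case.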

Related

In SQL Server, how do I make SQL using OR clauses over two joined tables use the correct indexes?

I have two tables TableA and TableB which are related by AID. AID is the primary key in TableA and a foreign key in TableB.
The following SQL is used to select anything from TableB joined to TableA where ADesc or BDesc are like a certain value.
SQL1:
SELECT *
FROM TableB
INNER JOIN TableA ON TableA.AID = TableB.AID
WHERE ADesc LIKE 'A0A4D1%' OR BDesc LIKE 'A0A4D1%'
Table A has a clustered index on AID (TableA_PK)
Table A has a non-clustered index on ADesc (TableA_I1)
Table B has a clustered index on BID (TableB_PK)
Table B has a non-clustered index on BDesc (TableB_I1)
If I separate the above SQL into two statements (or one statement connected with a UNION, as in SQL2), SQL Server will utilise TableA_I1 and TableB_I1 and restrict on ADesc or BDesc really efficiently.
SQL2:
SELECT *
FROM TableB
INNER JOIN TableA ON TableA.AID = TableB.AID
WHERE ADesc LIKE 'A0A4D1%'
UNION
SELECT *
FROM TableB
INNER JOIN TableA ON TableA.AID = TableB.AID
WHERE BDesc LIKE 'A0A4D1%'
However, SQL1 uses neither TableA_I1 nor TableB_I1 and returns much, much more slowly.
My question is: is there a way I can get SQL Server to execute SQL1 but use the same indexes and a similar execution plan as in SQL2?
I am using SQL Server 2019.
This may seem a strange question, which begs the obvious follow-up: why don't you just change it to a UNION? First, my actual SQL is much more complex and the resulting UNION would be huge, so I wanted to see if I could simplify. Second, I'm just curious as to why SQL Server can't work it out and optimise it itself.
The SQL to create all the tables and indexes is below:
CREATE TABLE TableA
(
AID INT IDENTITY(1,1) NOT NULL,
ADesc VARCHAR(50) NOT NULL
)
GO
CREATE NONCLUSTERED INDEX TableA_I1 ON TableA (ADesc ASC)
GO
CREATE UNIQUE CLUSTERED INDEX TableA_PK ON TableA (AID ASC)
GO
CREATE TABLE TableB
(
BID INT IDENTITY(1,1) NOT NULL,
AID INT NOT NULL,
BDesc VARCHAR(50) NOT NULL
)
GO
CREATE NONCLUSTERED INDEX TableB_I1 ON TableB (BDesc ASC)
GO
CREATE UNIQUE CLUSTERED INDEX TableB_PK ON TableB (BID ASC)
GO
I then threw a few million rows into the tables to make sure there was plenty for it to go at!
Perhaps try experimenting with the following pattern. Using a temporary table with a primary key means SQL Server will have accurate statistics to use.
drop table if exists #aid;
create table #aid(aid int primary key);
insert into #aid(aid)
select aid
from TableA
where ADesc like 'A0A4D1%'
union all
select aid
from TableB
where BDesc like 'A0A4D1%';
select a.<columns>, b.<columns>
from #aid x
join TableA a on a.aid = x.aid
join TableB b on b.aid = x.aid;
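As a portable illustration of why the UNION rewrite returns the same rows as the OR form, here is a small sketch in Python with SQLite standing in for SQL Server (SQLite won't reproduce SQL Server's plan choices, only the result equivalence); the table contents are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE TableA (AID INTEGER PRIMARY KEY, ADesc TEXT);
CREATE TABLE TableB (BID INTEGER PRIMARY KEY, AID INTEGER, BDesc TEXT);
CREATE INDEX TableA_I1 ON TableA (ADesc);
CREATE INDEX TableB_I1 ON TableB (BDesc);
INSERT INTO TableA VALUES (1, 'A0A4D1-foo'), (2, 'other');
INSERT INTO TableB VALUES (10, 1, 'zzz'), (11, 2, 'A0A4D1-bar'), (12, 2, 'zzz');
""")

# SQL1: one statement with an OR across both tables.
or_rows = cur.execute("""
    SELECT TableB.BID
    FROM TableB JOIN TableA ON TableA.AID = TableB.AID
    WHERE ADesc LIKE 'A0A4D1%' OR BDesc LIKE 'A0A4D1%'
    ORDER BY BID""").fetchall()

# SQL2: the same predicate split into a UNION (which also deduplicates).
union_rows = cur.execute("""
    SELECT TableB.BID
    FROM TableB JOIN TableA ON TableA.AID = TableB.AID
    WHERE ADesc LIKE 'A0A4D1%'
    UNION
    SELECT TableB.BID
    FROM TableB JOIN TableA ON TableA.AID = TableB.AID
    WHERE BDesc LIKE 'A0A4D1%'
    ORDER BY BID""").fetchall()

print(or_rows)      # [(10,), (11,)]
print(union_rows)   # [(10,), (11,)]
```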

How to Select all columns of a table except one

My code looks like:
CREATE TABLE tableC AS
(SELECT tableA.*,
ST_Intersection (B.geom, A.geom) as geom2 -- generate geom
FROM tableB, tableA
JOIN tableB
ON ST_Intersects (A.geom, b.geom)
WHERE test.id = 2);
Now it is working, but I have two columns: geom and geom2!
The geom2 column holds the new geometry based on the intersection. So how can I select all of tableA's columns except geom?
Create the table with all the columns and after that drop the geom column and rename the new one:
CREATE TABLE tableC AS
SELECT
    A.*,
    ST_Intersection(B.geom, A.geom) AS geom2 -- generate geom
FROM
    tableA A INNER JOIN tableB B ON ST_Intersects(A.geom, B.geom)
WHERE test.id = 2
;
alter table tableC drop column geom;
alter table tableC rename column geom2 to geom;
The only way you would be able to do this would be to generate a dynamic SQL statement based on the columns within the table that excludes those you don't want. Obviously this will be a lot more effort than simply adding in all the column names.
There are also a lot of very good reasons to never use a select * in a production environment, given how picky SQL often is about the number and format of columns that are returned. By using select * you open yourself up to a changing query result in the future that could potentially break things.
If you have a LOT of columns and you simply don't want to manually type them all out, run the query below for your table and then format the result so you can copy/paste into your script:
SELECT *
FROM information_schema.columns
WHERE table_schema = 'your_schema'
AND table_name = 'your_table'
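If you would rather script the column-list generation than copy/paste, the same idea can be sketched in Python; this version uses SQLite's PRAGMA table_info as a stand-in for information_schema.columns, and the table/column names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE tableA (id INTEGER, name TEXT, geom BLOB)")

# List the table's columns (PRAGMA table_info rows are
# (cid, name, type, notnull, dflt_value, pk)), then drop the unwanted one.
cols = [row[1] for row in cur.execute("PRAGMA table_info(tableA)")]
keep = [c for c in cols if c != "geom"]

query = f"SELECT {', '.join(keep)} FROM tableA"
print(query)   # SELECT id, name FROM tableA
```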

Finding duplicate records in a table in Oracle

Consider the scenario of loading a table from a flat file. The table has no constraints or indexes defined. Somehow the loading was interrupted partway through, and after some time the table was loaded again from the same file, so the records already inserted during the first load were duplicated. How do I find the duplicate rows now? Assume there are 150 columns in the table, so grouping by each and every column is tedious.
A record is truly a duplicate only if all the column values match; it is different (unique) even if just one column has a different value. If your table has no primary key constraint, you must compare all columns.
An alternative would be to do your second load into a new temp table and populate your old table with the records from that temp table which do not exist in the old table. In any case, you have to compare all columns between the two tables to identify truly unique records.
You could also consider adding a primary key to your table and then running your delete query. Check the accepted answer on this link.
You can use ROWID for finding (and then deleting) duplicate rows:
SELECT * FROM table_name A
WHERE A.rowid > ANY (SELECT B.rowid
                     FROM table_name B
                     WHERE A.col1 = B.col1
                       AND A.col2 = B.col2);
Here is a useful link:
http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
Tested... Appears to work...
First we get a list of the table's columns as a comma-separated list:
SELECT wm_concat(column_name)
FROM all_tab_cols
WHERE table_name = 'TABLENAME'
AND column_id IS NOT NULL;
Copy the result into the query below where ResultList appears, and adjust TableName to your table:
WITH CTE AS (SELECT TN.*, ROWNUM RN FROM TableName TN ORDER BY ResultList)
SELECT * FROM CTE A
INNER JOIN CTE B USING (ResultList)
WHERE A.RN <> B.RN;
The above uses a natural join of all the table's columns against the same columns in the same table; since duplicate rows will have different row numbers, the result set lists both offending records.
I got this snippet somewhere along the line for deleting dups:
DELETE FROM TABLE_NAME
WHERE ROWID IN
(SELECT ROWID FROM TABLE_NAME
MINUS
SELECT MIN(ROWID) FROM TABLE_NAME
GROUP BY <column list> );
Note: <column list> lists the columns that are used to determine uniqueness.
SELECT * FROM table_name A
WHERE A.rowid > (SELECT MIN(B.rowid)
                 FROM table_name B
                 WHERE A.col1 = B.col1
                   AND A.col2 = B.col2);
Suppose you have a table dummd (the table into which you loaded the records from the flat file) with multiple columns (say 150, where you are not sure which column is unique or primary) and duplicate rows. To get all the unique records you can use UNION, and then create a view or a new table from the result, like I did with test1:
create table test1
as
select * from dummd
union
select * from dummd
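The standard way to report which combinations are duplicated is GROUP BY over the duplicate-defining columns with HAVING COUNT(*) > 1. A minimal sketch (SQLite via Python, with an invented two-column table standing in for the 150-column case; SQLite also has a rowid, so the keep-the-first-copy delete carries over):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dummd (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO dummd VALUES (?, ?)",
                [("a", "x"), ("a", "x"), ("b", "y")])

# Report each duplicated value combination and how often it occurs.
dups = cur.execute("""
    SELECT col1, col2, COUNT(*) AS cnt
    FROM dummd
    GROUP BY col1, col2
    HAVING COUNT(*) > 1""").fetchall()
print(dups)   # [('a', 'x', 2)]

# Keep one copy per group and delete the rest, keyed on rowid.
cur.execute("""
    DELETE FROM dummd
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM dummd GROUP BY col1, col2)""")
remaining = cur.execute("SELECT COUNT(*) FROM dummd").fetchone()[0]
print(remaining)   # 2
```

For the 150-column case the GROUP BY list is what the wm_concat query above generates for you.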

drop table and insert records from a new select statement (with the correct records)

I would like to drop an existing table and insert new records from a SELECT statement, keeping the columns the same. Old table: (column a, column b); select statement: (select from a, b, c, d with inner joins).
If you need to drop the table completely, you can do this:
DROP TABLE yourtable
SELECT *
INTO yourtable
FROM
(SELECT a, b FROM blah, blah) x
Unless you want to change the schema of your old table, you might try a TRUNCATE TABLE:
TRUNCATE TABLE MyTable
Then you can insert into this table with a SELECT:
INSERT INTO MyTable
(
ColumnA,
ColumnB,
...
)
SELECT
ValueA,
ValueB,
...
FROM
Table1
INNER JOIN Table2
ON Table2.SomeColumn = Table1.SomeColumn
...
On the other hand, if you really want to recreate the table, then you can DROP TABLE and re-create it with a SELECT INTO as Jayvee showed.
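A minimal end-to-end sketch of the empty-then-INSERT…SELECT approach, using SQLite via Python (SQLite has no TRUNCATE, so a plain DELETE plays that role here; all table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE MyTable (ColumnA TEXT, ColumnB TEXT);
INSERT INTO MyTable VALUES ('old', 'old');
CREATE TABLE Table1 (SomeColumn INTEGER, ValueA TEXT);
CREATE TABLE Table2 (SomeColumn INTEGER, ValueB TEXT);
INSERT INTO Table1 VALUES (1, 'a1'), (2, 'a2');
INSERT INTO Table2 VALUES (1, 'b1');
""")

# Empty the table without changing its schema (TRUNCATE equivalent).
cur.execute("DELETE FROM MyTable")

# Repopulate it from a joined SELECT.
cur.execute("""
    INSERT INTO MyTable (ColumnA, ColumnB)
    SELECT ValueA, ValueB
    FROM Table1 JOIN Table2 ON Table2.SomeColumn = Table1.SomeColumn""")

final = cur.execute("SELECT * FROM MyTable").fetchall()
print(final)   # [('a1', 'b1')]
```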

Left Join Lateral and array aggregates

I'm using Postgres 9.3.
I have two tables T1 and T2 and a n:m relation T1_T2_rel between them. Now I'd like to create a view that in addition to the columns of T1 provides a column that, for each record in T1, contains an array with the primary key ids of all related records of T2. If there are no related entries in T2, corresponding fields of this column shall contain null-values.
An abstracted version of my schema would look like this:
CREATE TABLE T1 ( t1_id serial primary key, t1_data int );
CREATE TABLE T2 ( t2_id serial primary key );
CREATE TABLE T1_T2_rel (
t1_id int references T1( t1_id )
, t2_id int references T2( t2_id )
);
Corresponding sample data could be generated as follows:
INSERT INTO T1 (t1_data)
SELECT cast(random()*100 as int) FROM generate_series(0,9) c(i);
INSERT INTO T2 (t2_id) SELECT nextval('T2_t2_id_seq') FROM generate_series(0,99);
INSERT INTO T1_T2_rel
SELECT cast(random()*10 as int) % 10 + 1 as t1_id
, cast(random()*99+1 as int) as t2_id
FROM generate_series(0,99);
So far, I've come up with the following query:
SELECT T1.t1_id, T1.t1_data, agg
FROM T1
LEFT JOIN LATERAL (
SELECT t1_id, array_agg(t2_id) as agg
FROM T1_T2_rel
WHERE t1_id=T1.t1_id
GROUP BY t1_id
) as temp ON temp.t1_id=T1.t1_id;
This works. However, can it be simplified?
A corresponding fiddle can be found here: sql-fiddle. Unfortunately, sql-fiddle does not support Postgres 9.3 (yet) which is required for lateral joins.
[Update] As has been pointed out, a simple left join using a subquery in principle is enough. However, If I compare the query plans, Postgres resorts to sequential scans on the aggregated tables when using a left join whereas index scans are used in the case of the left join lateral.
As @Denis already commented: no need for LATERAL.
Also, your subquery selected the wrong column. This works:
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
SELECT t1_id, array_agg(t2_id) AS t2_ids
FROM t1_t2_rel
GROUP BY 1
) sub USING (t1_id);
SQL fiddle.
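The simplified query can be sketched like so in SQLite (via Python), using group_concat as a rough stand-in for Postgres's array_agg (and a string in place of a real array); the sample data is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t1 (t1_id INTEGER PRIMARY KEY, t1_data INTEGER);
CREATE TABLE t1_t2_rel (t1_id INTEGER, t2_id INTEGER);
INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO t1_t2_rel VALUES (1, 100), (1, 101), (2, 200);
""")

# Plain LEFT JOIN against a grouped subquery; t1 rows without related
# t2 rows come back with NULL in the aggregate column.
rows = cur.execute("""
    SELECT t1.t1_id, t1.t1_data, sub.t2_ids
    FROM t1
    LEFT JOIN (SELECT t1_id, group_concat(t2_id) AS t2_ids
               FROM t1_t2_rel
               GROUP BY t1_id) sub USING (t1_id)
    ORDER BY t1.t1_id""").fetchall()
print(rows)
```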
Performance and testing
Concerning the ensuing sequential scan you mention: If you query the whole table, a sequential scan is often faster. Depends on the version you are running, your hardware, your settings and statistics of cardinalities and distribution of your data. Experiment with selective WHERE clauses like WHERE t1.t1_id < 1000 or WHERE t1.t1_id = 1000 and combine with planner settings to learn about choices:
SET enable_seqscan = off;
SET enable_indexscan = off;
To reset:
RESET enable_seqscan;
RESET enable_indexscan;
Only in your local session, mind you! This related answer on dba.SE has more instructions.
Of course, your settings may be off, too:
Keep PostgreSQL from sometimes choosing a bad query plan