hive "with tbl as" vs "create table tbl as" - hive

Is < with tbl as > much faster than < create table tbl as >?
with tbl as
(
select
id,name
from
a
)
select id from tbl;
create table tbl
as
select
id,name
from
a;
select id from tbl;
If I want to use tbl in many queries, how do I use < with tbl as >?
with tbl as
(
select
id,name
from
a
)
select id from tbl;
select name from tbl;

There's no obvious performance gap.
with tbl as is a common table expression (CTE), which is only accessible within the single query that defines it. So we cannot use a CTE across multiple SQL queries separated by ;.
Favor create temporary table over create table. A temporary table is visible only within the current session and is gone when the session ends.
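To make the scoping difference concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in engine (an assumption made only so the example is runnable; it is not Hive), and the sample rows are invented. The scoping behaviour it shows is the one described above: the CTE disappears after its statement, the temporary table stays for the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, "x"), (2, "y")])

# A CTE is visible only inside the statement that defines it.
ids = conn.execute(
    "WITH tbl AS (SELECT id, name FROM a) SELECT id FROM tbl ORDER BY id"
).fetchall()

# A later statement cannot see the CTE from the previous one.
try:
    conn.execute("SELECT name FROM tbl")
    cte_visible_later = True
except sqlite3.OperationalError:
    cte_visible_later = False

# A temporary table lasts for the whole session (connection), so it can be reused.
conn.execute("CREATE TEMPORARY TABLE tbl AS SELECT id, name FROM a")
ids_again = conn.execute("SELECT id FROM tbl ORDER BY id").fetchall()
names = conn.execute("SELECT name FROM tbl ORDER BY id").fetchall()
```

Whether the temporary table is actually faster than repeating the CTE depends on the engine and the data, so it is worth measuring both.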

Related

Should I Use a Temp Table To Optimize?

I have a somewhat complex query that I need to access for the IDs in order to delete from multiple tables, something along the lines of:
DELETE FROM Table1 WHERE ID IN ( -- Query here -- )
DELETE FROM Table2 WHERE ID IN ( -- Query here -- )
Would selecting the query into a temp table be more efficient than writing out the entire query twice, or is it just visually cleaner?
SELECT ( -- Query here -- ) INTO #Temp
DELETE FROM Table1 WHERE ID IN ( SELECT * FROM #Temp )
DELETE FROM Table2 WHERE ID IN ( SELECT * FROM #Temp )
Also, am open to other suggestions that I may have overlooked.
Thanks in advance
Completely agree with Jeroen Mostert. Try it and see. You could also try using a CTE for the deletion, or a table variable, and select distinct ID values there if your query does not already return distinct IDs. And if you have a lot of data to delete, try creating an index on the temp table.
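As a rough illustration of the temp-table route, the sketch below runs the ID query once, keeps only distinct IDs, and reuses them for both deletes. It uses SQLite via Python's sqlite3 rather than SQL Server, and the tables source, table1, table2 and the flag column are hypothetical stand-ins for the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER, flag INTEGER);
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO source VALUES (1, 1), (2, 0), (3, 1), (3, 1);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (4);
""")

# Run the "complex" query once and keep only distinct IDs.
conn.execute(
    "CREATE TEMP TABLE ids_to_delete AS SELECT DISTINCT id FROM source WHERE flag = 1"
)
# Optional index; only likely to matter when the ID set is large.
conn.execute("CREATE INDEX ix_ids_to_delete ON ids_to_delete (id)")

# Reuse the stored IDs for every target table.
for target in ("table1", "table2"):
    conn.execute(f"DELETE FROM {target} WHERE id IN (SELECT id FROM ids_to_delete)")

left1 = conn.execute("SELECT id FROM table1 ORDER BY id").fetchall()
left2 = conn.execute("SELECT id FROM table2 ORDER BY id").fetchall()
```

This shows the shape of the approach only; whether it beats writing the query twice still has to be measured on the real data.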

SQL: Insert into temp table on nested common table expression

I want to insert into a temp table inside a nested CTE, as in the code below. I can select from the first CTE, x, inside the second expression, but I cannot insert into the temp table.
WITH x AS
(
SELECT * FROM MyTable
),
y AS
(
SELECT * INTO #temp FROM x
)
SELECT * FROM y
I have to do this using a nested CTE as there is other logic to implement. I know I can insert into the temp table outside of the expression. Is there a way to achieve this? Please help.
SQL Server is quite explicit. A query cannot both return a result set and save the results to a table. You can easily do:
WITH x AS (
SELECT * FROM MyTable
)
SELECT x.*
INTO #temp
FROM x;
And then:
SELECT t.*
FROM #temp t;
The first query saves the result set into a temporary table. The second returns the values from the query.

Deleting duplicates rows from redshift

I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
This query is giving me an error.
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is as the syntax for with clause seems to be correct.
Has anybody faced this situation before?
Redshift being what it is (no enforced uniqueness for any column), Ziggy's 3rd option is probably best. Once we decide to go the temp table route it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue you can keep the old table around for a while and use the other methods described here to validate that the row count in the original accounting for duplicates matches the row count in the new.
If you're doing constant loads to such a table you'll want to pause that process while this is going on.
If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).
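The whole-table swap and the row-count check described in this answer can be sketched as follows. This is a SQLite stand-in run through Python's sqlite3, not Redshift, and the payload column and sample rows are invented for illustration; the old table is kept instead of dropped so the counts can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_name (record_indicator INTEGER, payload TEXT);
    INSERT INTO table_name VALUES (1, 'a'), (1, 'a'), (2, 'b'), (3, 'c'), (3, 'c');
""")

# Row count the deduplicated table should end up with.
expected = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table_name)"
).fetchone()[0]

# Build the new table, then swap names; keep the old table for validation.
conn.executescript("""
    BEGIN;
    CREATE TABLE table_name_new AS SELECT DISTINCT * FROM table_name;
    ALTER TABLE table_name RENAME TO table_name_old;
    ALTER TABLE table_name_new RENAME TO table_name;
    COMMIT;
""")

new_count = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
old_count = conn.execute("SELECT COUNT(*) FROM table_name_old").fetchone()[0]
```

Once new_count matches expected, table_name_old can be dropped.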
If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete just those rows:
BEGIN;
-- First identify all the rows that are duplicated
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
That should have worked. Alternatively, you can do:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
This method preserves the permissions and the table definition of original_table.
The most upvoted answer does not preserve permissions on the table or its original definition.
In a real-world production environment this is how you should do it, as it is the safest and easiest approach to execute.
This will cause downtime in production.
Create Table with unique rows
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Backup the original_table
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate the original_table
TRUNCATE original_table;
Insert records from unique_table into original_table
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
To avoid downtime, run the queries below in a transaction and use DELETE instead of TRUNCATE (in Redshift, TRUNCATE commits the transaction it runs in):
BEGIN transaction;
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
Simple answer to this question:
First, create a temporary table from the main table holding the rows where row_number = 1.
Second, delete from the main table all the rows that had duplicates.
Then insert the contents of the temporary table back into the main table.
Queries:
Temporary table
select id,date into #temp_a
from
(select *
from (select a.*,
row_number() over(partition by id order by etl_createdon desc) as rn
from table a
where a.id between 59 and 75 and a.date = '2018-05-24')
where rn =1)a
Deleting all those rows from the main table
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
Inserting all values from the temp table into the main table
insert into table a select * from #temp_a;
When id is not unique, the following deletes every copy of a record in 'tablename' that has a duplicate, so it does not deduplicate the table:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
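The difference between the two outcomes of this delete can be seen in a small sketch. It runs the same shape of statement in SQLite through Python's sqlite3 (assuming SQLite 3.25 or later for window functions), with the partition reduced to a single hypothetical column1 and invented rows.

```python
import sqlite3

DELETE_SQL = """
    DELETE FROM tablename
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY id) AS rnum
            FROM tablename
        ) t
        WHERE t.rnum > 1
    )
"""

def remaining_after_delete(rows):
    """Load rows into a fresh table, run the delete, and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (id INTEGER, column1 TEXT)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", rows)
    conn.execute(DELETE_SQL)
    return conn.execute("SELECT id, column1 FROM tablename ORDER BY id").fetchall()

# id is unique: only the extra copy (id 2) is removed.
unique_ids = remaining_after_delete([(1, "a"), (2, "a"), (3, "b")])

# id is duplicated too: every row with id 1 matches, so both copies are removed.
shared_ids = remaining_after_delete([(1, "a"), (1, "a"), (3, "b")])
```

So the statement only deduplicates when id distinguishes the copies from each other; with fully identical rows, one of the temp-table approaches is needed.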
Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT and UPDATE and a few others are allowed (see WITH clause)
Solution (in my situation):
I did have an id column on my table events that contained duplicate rows and uniquely identifies the record. This column id is the same as your record_indicator.
Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
resulting in the temp table:
id | rownumber | ...
----------------
1 | 1 | ...
1 | 2 | ...
2 | 1 | ...
2 | 2 | ...
Now the duplicates can be deleted by removing the rows having rownumber larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you're done.
with duplicates as
(
select a.*, row_number() over (partition by first_name, last_name, email order by first_name, last_name, email) as rn from contacts a
)
delete from contacts
where contact_id in (
select contact_id from duplicates where rn >1
)

With structure - using multiple select queries without repeating the "with"

I would like to use a with structure with multiple sql select queries. Ex:
;with temptable as ( ... )
select id from temptable
select name from temptable
However, after the first select query is done, SQL Server 2008 doesn't allow me to do this and forces me to write the same with structure again above the second query. Ex:
;with temptable as ( ... )
select id from temptable
;with temptable as ( ... )
select name from temptable
I tried using a comma after the first select query, but it didn't work. How can I use multiple select queries under one with structure in SQL Server 2008?
A common table expression works only for one statement. From the documentation:
Specifies a temporary named result set, known as a common table
expression (CTE). This is derived from a simple query and defined
within the execution scope of a single SELECT, INSERT, UPDATE, or
DELETE statement.
select id from temptable;
select name from temptable;
are two statements, so you cannot use the CTE in the second query.
The alternative is to use temp table:
SELECT .... INTO #temptable FROM ...; -- your query from CTE
SELECT id FROM #temptable;
SELECT name FROM #temptable;
A CTE only exists for the "duration of the query". You can't use the same CTE in two different SELECT statements, as each is a separate query.
You need to store the result set of the CTE in a temp table and then run multiple select statements against the temp table.
;with temptable as ( ... )
select id, name
into #tempResultSet
from temptable
select name from #tempResultSet

the row no in query output

I have a numeric field (say num) in table along with pkey.
select * from mytable order by num
Now, how can I get the row number, in the query output, of a particular row for which I have the pkey?
I'm using SQL Server 2000.
Sounds like you want a row number for each record returned.
In SQL 2000, you can either do this:
SELECT (SELECT COUNT(*) FROM MyTable t2 WHERE t2.num <= t.num) AS RowNo, *
FROM MyTable t
ORDER BY num
which assumes num is unique. If it's not, then you'd have to use the PK field and order by that.
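The uniqueness caveat is easy to see with a small sketch. It uses SQLite via Python's sqlite3 instead of SQL Server 2000, with invented rows; the tie-break on pkey in the second query is one possible reading of "use the PK field", not necessarily what the answer had in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (pkey INTEGER PRIMARY KEY, num INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)", [(10, 30), (11, 10), (12, 20), (13, 20)]
)

# Row number = how many rows have num <= this row's num.
# Rows that share a num value end up with the same number.
by_num = conn.execute("""
    SELECT (SELECT COUNT(*) FROM MyTable t2 WHERE t2.num <= t.num) AS RowNo, t.pkey
    FROM MyTable t
    ORDER BY t.num, t.pkey
""").fetchall()

# Breaking ties on the primary key gives each row a distinct number.
by_num_and_pkey = conn.execute("""
    SELECT (SELECT COUNT(*) FROM MyTable t2
            WHERE t2.num < t.num
               OR (t2.num = t.num AND t2.pkey <= t.pkey)) AS RowNo,
           t.pkey
    FROM MyTable t
    ORDER BY t.num, t.pkey
""").fetchall()
```

With the two rows that share num = 20, the first query numbers both of them 3, while the tie-broken version numbers them 2 and 3.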
Or, use a temp table (or table var):
CREATE TABLE #Results
(
RowNo INTEGER IDENTITY(1,1),
MyField VARCHAR(10)
)
INSERT #Results
SELECT MyField
FROM MyTable
ORDER BY num
SELECT * FROM #Results
DROP TABLE #Results
In SQL 2005, there is a ROW_NUMBER() function you could use which makes life a lot easier.
As I understand your question, you want to get the number of all rows returned, right?
If so, use @@ROWCOUNT.
As Ada points out, this task became a lot easier in SQL Server 2005....
SELECT whatever, RowNumber FROM (
SELECT pk
, whatever
, ROW_NUMBER() OVER(ORDER BY num) AS RowNumber
FROM mytable
) AS numbered
WHERE pk = 23;
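A runnable version of the same lookup, as a sketch only: it uses SQLite through Python's sqlite3 (assuming SQLite 3.25 or later for ROW_NUMBER) rather than SQL Server 2005, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pk INTEGER PRIMARY KEY, num INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(21, 50), (22, 10), (23, 30)])

# Number every row by num first, then pick out the row with the known key.
row = conn.execute("""
    SELECT pk, RowNumber
    FROM (
        SELECT pk, ROW_NUMBER() OVER (ORDER BY num) AS RowNumber
        FROM mytable
    ) AS numbered
    WHERE pk = ?
""", (23,)).fetchone()
```

The numbering has to happen in the inner query; filtering on pk there would leave a single row that is always numbered 1.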