Query to delete rows from table [duplicate] - sql

This question already has answers here:
T-SQL: Selecting rows to delete via joins
(12 answers)
Closed 7 years ago.
I have two tables:
datess, deletess, sample
I wrote a query like this:
DELETE FROM sample
WHERE sample_date_key IN
(SELECT date_key FROM datess WHERE s_date BETWEEN '2015-02-18' and DATE'2015-02-25');
But I have 2 rows and two columns in deletess:
start_date | end_date
------------+----------
2015-02-18 | 2015-02-18
2015-01-18 | 2015-01-18
I want to delete all the rows in sample with dates between start_date and end_date in deletess.
I tried the below code but got error:
ERROR: more than one row returned by a subquery used as an expression
(SELECT date_key FROM datess WHERE s_date BETWEEN (SELECT start_date FROM deletess) AND (SELECT end_date FROM deletess);
I appreciate any help. Thanks!

You can do it using WHERE EXISTS to check whether any row match the condition:
DELETE FROM sample
WHERE sample_date_key IN
(
SELECT date_key
FROM datess
WHERE EXISTS (
SELECT 1
FROM deletess
WHERE s_date BETWEEN start_date
AND end_date
)
);

join is better than in
DELETE s
FROM sample s
INNER JOIN datess d
ON s.sample_date_key = d.date_key
INNER JOIN deletess sd
ON sd.start_date <= d.s_date
AND d.s_date <= sd.end_date;

Related

Extra column looking at where OpenDate > ClosedDate prevoius records

I have a hard struggle with this problem.
I have the following table:
TicketNumber
OpenTicketDate YYYY_MM
ClosedTicketDate YYYY_MM
1
2018-1
2020-1
2
2018-2
2021-2
3
2019-1
2020-6
4
2020-7
2021-1
I would like to create an extra column which would monitor the open tickets at the given OpenTicketDate.
So the new table would look like this:
TicketNumber
OpenTicketDate YYYY_MM
ClosedTicketDate YYYY_MM
OpenTicketsLookingBackwards
1
2018-1
2020-1
1
2
2018-2
2021-2
2
3
2019-1
2020-6
3
4
2020-7
2021-1
2
The logic behind the 4th (extra) column is that it looks at the previous records & current record where the ClosedTicketsDate > OpenTicketDate.
For example ticketNumber 4 has '2' open tickets because there are only 2 ClosedTicketDate records where ClosedTicketDate > OpenTicketDate.
The new column only fills data based on looking at prevoius records. It is backward looking not forward.
Is there anyone who can help me out?
You could perform a self join and aggregate as the following:
Select T.TicketNumber, T.OpenTicketDate, T.ClosedTicketDate,
Count(*) as OpenTicketsLookingBackwards
From table_name T Left Join table_name D
On Cast(concat(T.OpenTicketDate,'-1') as Date) < Cast(concat(D.ClosedTicketDate,'-1') as Date)
And T.ticketnumber >= D.ticketnumber
Group By T.TicketNumber, T.OpenTicketDate, T.ClosedTicketDate
Order By T.TicketNumber
You may also try with a scalar subquery as the following:
Select T.TicketNumber, T.OpenTicketDate, T.ClosedTicketDate,
(
Select Count(*) From table_name D
Where Cast(concat(T.OpenTicketDate,'-1') as Date) <
Cast(concat(D.ClosedTicketDate,'-1') as Date)
And T.ticketnumber >= D.ticketnumber
) As OpenTicketsLookingBackwards
From table_name T
Order By T.TicketNumber
Mostly, joins tend to outperform subqueries.
See a demo.

Count rows in a table based on each row value

I'm struggling with following problem. I want to create a query in spark that runs a query for every row on existing table based on current column value.
Table can be simplified like this:
job_id
start_date
end_date
1
1-1-2000
2-1-2000
2
1-1-2000
3-1-2000
3
2-1-2000
4-1-2000
4
5-1-2000
7-1-2000
I want to create query which adds another column that counts how many jobs have already been started at each rows start date.
Output for this table should look as following
job_id
start_date
end_date
jobs_active_at_start
1
1-1-2000
2-1-2000
2 (active jobs id - 1,2)
2
1-1-2000
3-1-2000
2 (active jobs id - 1,2)
3
2-1-2000
4-1-2000
3 (active jobs id - 1,2,3)
4
5-1-2000
7-1-2000
1 (only job 4 is active)
I've tried to do subquery
%sql
SELECT
t1.id,
(SELECT COUNT(*) FROM table t2 WHERE t2.start_date <= t1.start_date AND t2.end_date >= t1.start_date)
FROM table t1
But databricks returned an error
AnalysisException: Correlated column is not allowed in predicate
I guess this method doesn't have best efficiency either.
What is best approach to tackle such problem?
You can just join the table to itself on the dates.
select
t1.job_id,
t1.start_date,
t1.end_date,
count (t2.job_id)
from
Table1 t1
inner join Table1 t2
on t2.start_date <= t1.start_date AND t2.end_date >= t1.start_date
group by
t1.job_id,
t1.start_date,
t1.end_date;

SQL Server : getting sum of values in "calendar" table without joining

Is it possible to get a the sum of value from the calendar_table to the main_table without joining like below?
select
date, sum(value)
from
main_table
inner join
calendar_table on start_date <= date and end_date >= date
group by
date
I am trying to avoid a join like this because main_table is a very large table with rows that have very large start and end dates, and it is absolutely killing my performance. And I've already indexed both tables.
Sample desired results:
+-----------+-------+
| date | total |
+-----------+-------+
| 7-24-2010 | 11 |
+-----------+-------+
Sample tables
calendar_table:
+-----------+-------+
| date | value |
+-----------+-------+
| 7-24-2010 | 5 |
| 7-25-2010 | 6 |
| ... | ... |
| 7-23-2020 | 2 |
| 7-24-2020 | 10 |
+-----------+-------+
main_table:
+------------+-----------+
| start_date | end_date |
+------------+-----------+
| 7-24-2010 | 7-25-2010 |
| 8-1-2011 | 8-5-2011 |
+------------+-----------+
You want the sum in the calendar table. So, I would recommend an "incremental" approach. This starts by unpivoting the data and putting the value as an increment and decrement in the results:
select c.date, c.value as inc
from main_table m join
calendar_table t
on m.start_date = c.date
union all
select dateadd(day, 1, c.date), - c.value as inc
from main_table m join
calendar_table t
on m.end_date = c.date;
The final step is to aggregate and do a cumulative sum:
select date, sum(inc) as value_on_date,
sum(sum(inc)) over (order by date) as net_value
from ((select c.date, c.value as inc
from main_table m join
calendar_table t
on m.start_date = c.date
) union all
(select dateadd(day, 1, c.date), - c.value as inc
from main_table m join
calendar_table t
on m.end_date = c.date
)
) c
group by date
order by date;
This is processing two rows of data for each row in the master table. Assuming that your time spans are longer than two days typically for each master row, the resulting data processed should be much smaller. And smaller data implies a faster query.
Here's a cross-apply example to possibly work from.
select main_table.date
, CalendarTable.ValueSum
from main_table
CROSS APPLY(
SELECT SUM(value) as ValueSum
FROM calendar_table
WHERE start_date <= main_table.date and main_table.end_date >= date
) as CalendarTable
group by date
You could try something like this ... but be aware, it is still technically 'joined' to the main table. If you look at an execution plan, you will see that there is a join operation of some kind going on.
select
date,
(select sum(value) from calendar_table t where m.start_date <= t.date and m.end_date >= t.date)
from
main_table m
The thing about that query is that the 'main_table' is not grouped as part of the results. You could possibly do that outside the select, but I don't know what you are trying to achieve. If you are grouping just to get the SUM, then perhaps maintaining the 'main_table' in the group is superflous.
As already mentioned, you must perform a join of some sort in order to get data from more than one table in a query.
You did not provide details if the indexes which are important for performance. I suggest the following indexes to optimize query performance.
For calendar_table, make sure you have a unique clustered index (or primary key) on date. Alternatively, a unique nonclustered index on date with the value column included.
A composite index on the main_table start_date and end_date columns may also be beneficial.
Even with optimal indexes, the query will still take some time against a 500M row table (e.g. a couple of minutes) with no additional filter criteria. If you need results in milliseconds, create an indexed view to materialize the join and aggregation results. Be aware the indexed view will add overhead for inserts/deletes on both tables as well as for updates to the value column in order to keep the index consistent with the underlying data.
Below is an indexed view DDL example.
CREATE VIEW dbo.vw_example
WITH SCHEMABINDING
AS
SELECT
date, sum(value) AS value, COUNT_BIG(*) AS countbig
from
dbo.main_table
inner join
dbo.calendar_table on start_date <= date and end_date >= date
group by
date;
GO
CREATE UNIQUE CLUSTERED INDEX cdx ON dbo.vw_example(date);
GO
Depending on your SQL Server edition, the optimizer may be able to use the indexed view automatically so your original query can use the view index without changes. Otherwise, query the view directly and specify a NOEXPAND hint:
SELECT date, value AS total
FROM dbo.vw_example WITH (NOEXPAND);
EDIT:
With the query improvement #GordonLinoff suggested, a non-clustered index on the main_table end_date column will help optimize that query.

Adding in missing dates from results in SQL

I have a database that currently looks like this
Date | valid_entry | profile
1/6/2015 1 | 1
3/6/2015 2 | 1
3/6/2015 2 | 2
5/6/2015 4 | 4
I am trying to grab the dates but i need to make a query to display also for dates that does not exist in the list, such as 2/6/2015.
This is a sample of what i need it to be:
Date | valid_entry
1/6/2015 1
2/6/2015 0
3/6/2015 2
3/6/2015 2
4/6/2015 0
5/6/2015 4
My query:
select date, count(valid_entry)
from database
where profile = 1
group by 1;
This query will only display the dates that exist in there. Is there a way in query that I can populate the results with dates that does not exist in there?
You can generate a list of all dates that are between the start and end date from your source table using generate_series(). These dates can then be used in an outer join to sum the values for all dates.
with all_dates (date) as (
select dt::date
from generate_series( (select min(date) from some_table), (select max(date) from some_table), interval '1' day) as x(dt)
)
select ad.date, sum(coalesce(st.valid_entry,0))
from all_dates ad
left join some_table st on ad.date = st.date
group by ad.date, st.profile
order by ad.date;
some_table is your table with the sample data you have provided.
Based on your sample output, you also seem to want group by date and profile, otherwise there can't be two rows with 2015-06-03. You also don't seem to want where profile = 1 because that as well wouldn't generate two rows with 2015-06-03 as shown in your sample output.
SQLFiddle example: http://sqlfiddle.com/#!15/b0b2a/2
Unrelated, but: I hope that the column names are only made up. date is a horrible name for a column. For one because it is also a keyword, but more importantly it does not document what this date is for. A start date? An end date? A due date? A modification date?
You have to use a calendar table for this purpose. In this case you can create an in-line table with the tables required, then LEFT JOIN your table to it:
select "date", count(valid_entry)
from (
SELECT '2015-06-01' AS d UNION ALL '2015-06-02' UNION ALL '2015-06-03' UNION ALL
'2015-06-04' UNION ALL '2015-06-05' UNION ALL '2015-06-06') AS t
left join database AS db on t.d = db."date" and db.profile = 1
group by t.d;
Note: Predicate profile = 1 should be applied in the ON clause of the LEFT JOIN operation. If it is placed in the WHERE clause instead then LEFT JOIN essentially becomes an INNER JOIN.

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel,and Erwin.
Attempt 1:
However, with Joels answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 there are 400 columns returned, however when I use his solution 600 rows are returned. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joels answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
b.'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields but if there is differences in the data between columns you might have to try a more brute force approch.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY emp_no, eec_planning_unit_cde) rownumber,
a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Inner join don't do an intersection. Le'ts supose this tables:
T1 T2
n s n s
1 A 2 X
2 B 2 Y
2 C
3 D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where not exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
yes, using #JTC second query.