I have a table with these columns:
id (int)
col1 (int)
col2 (varchar)
date1 (date)
col3 (int)
cumulative_col3 (int)
and about 750k rows.
I want to update cumulative_col3 with the sum of col3 over all rows that have the same col1 and col2 and a date1 on or before the row's own date1.
I have indexes on (date1), (date1, col1, col2) and (col1, col2).
I have tried the following query but it takes a long time to complete.
update table_name
set cumulative_col3 = (select sum(s.col3)
from table_name s
where s.date1 <= table_name.date1
and s.col1 = table_name.col1
and s.col2 = table_name.col2);
What can I do to improve the performance of this query?
You can try to calculate the running sum with a window function in a derived table instead:
update table_name
set cumulative_col3 = s.cum_sum
from (
    select id,
           sum(col3) over (partition by col1, col2 order by date1) as cum_sum
    from table_name
) s
where s.id = table_name.id;
The default RANGE frame of the running sum includes ties on date1, which matches the <= in your original query.
This assumes that id is the primary key of the table.
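Note that UPDATE ... FROM is SQL Server/PostgreSQL syntax. If your database is MySQL 8.0+ (an assumption, since the question doesn't say), a sketch of the same idea with UPDATE ... JOIN:
-- MySQL 8.0+ sketch: the derived table is materialized, so updating the
-- same table it reads from is allowed here
update table_name t
join (
    select id,
           sum(col3) over (partition by col1, col2 order by date1) as cum_sum
    from table_name
) s on s.id = t.id
set t.cumulative_col3 = s.cum_sum;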
You might try adding the following index to your table:
CREATE INDEX idx ON table_name (col1, col2, date1, col3);
With the equality columns (col1, col2) first and the range column (date1) after them, this index lets the correlated sum subquery be evaluated with an index seek, and including col3 makes it a covering index.
I'm working with a table that contains 3 columns, all with integer datatypes.
I'm trying to replicate the following PySpark code in SQL:
df = my_table.select('column_1', 'column_2', 'column_3')
df = df.drop_duplicates(['column_1', 'column_2'])
In the above code I'm trying to select three columns and then drop duplicates from only the first two.
I tried using
SELECT
MIN(column_1), MIN(column_2), column_3
FROM my_table
GROUP BY column_3
and it looks like it got the job done, but the output wasn't the same as the PySpark output.
Please advise.
Note: I'm actually writing this query in dbt, so I can't specify a SQL version.
I think you can try grouping by the two columns you want to de-duplicate on and aggregating the third, for example:
SELECT column_1, column_2, MIN(column_3) AS column_3
FROM my_table
GROUP BY column_1, column_2;
Unlike PySpark's drop_duplicates, which keeps an arbitrary row per pair, this keeps the minimum column_3, so the output is deterministic.
I was able to drop duplicates on both col1 and col2 using ROW_NUMBER() in the following query:
SELECT col1, col2, col3
FROM
(
    SELECT
        col1, col2, col3,
        ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col3 DESC) AS row_num
    FROM table_name
) t
WHERE row_num = 1
Note that the ORDER BY has to use a column outside the partition (here col3); ordering by col1 inside a partition on col1 is a no-op and makes the row kept arbitrary.
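If the dbt target happens to be a warehouse that supports QUALIFY (Snowflake, BigQuery, and DuckDB do; this is an assumption about your setup), the same de-duplication can be written without a subquery:
-- Sketch, assuming QUALIFY support; BigQuery additionally requires a
-- WHERE/GROUP BY/HAVING clause alongside QUALIFY (e.g. WHERE TRUE)
SELECT col1, col2, col3
FROM table_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col3 DESC) = 1;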
Quite often I have to do queries like below:
select col1, max(id)
from Table
where col2 = 'value'
and col3 = ( select max(col3)
from Table
where col2 = 'value'
)
group by col1
Are there any other ways to avoid subqueries and temp tables? Basically I need a group by on all the rows with a particular max value. Assuming all proper indices are used.
You can use a window (OLAP) function to achieve this. I would say this solution is marginally better in that your predicates are not duplicated between the main query and the subquery, so you don't violate DRY:
SELECT col1, MAX(id) AS max_id
FROM (
    SELECT col1, id,
           RANK() OVER (ORDER BY col3 DESC) AS irow
    FROM [Table]
    WHERE col2 = 'value'
) subquery
WHERE subquery.irow = 1
GROUP BY col1
Because RANK() assigns 1 to every row tied for the highest col3, the outer query groups exactly the rows with the max value.
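An alternative sketch with a windowed MAX instead of RANK, under the same assumptions (the placeholder table and predicate from the question):
-- Keep rows whose col3 equals the overall max col3 among col2 = 'value',
-- then group, matching the original correlated-subquery semantics
SELECT col1, MAX(id) AS max_id
FROM (
    SELECT col1, id, col3,
           MAX(col3) OVER () AS max_col3
    FROM [Table]
    WHERE col2 = 'value'
) t
WHERE col3 = max_col3
GROUP BY col1;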
I have a SQLite table like this:
Col1 Col2 Col3
1 ABC Bill
2 CDE Fred
3 FGH Jack
4 CDE June
I would like to find the row with a Col2 value of CDE that has the max Col1 value, i.e. in this case June. Or, put another way, the most recently added row with a Col2 value of CDE, since Col1 is an auto-increment column. What SQL query string achieves this? It needs to be efficient, as the query will run many iterations in a loop.
Thanks.
SELECT * FROM table WHERE col2='CDE' ORDER BY col1 DESC LIMIT 1
In case col1 weren't an auto-increment column, it would go somewhat like:
SELECT *, MAX(col1) AS max_col1 FROM table WHERE col2 = 'CDE' GROUP BY col2 LIMIT 1
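Since the query will run many iterations in a loop, a composite index lets SQLite satisfy the ORDER BY ... LIMIT 1 query above with a single index seek; a sketch, assuming the table is really named MyTable:
-- Equality column first, then the ordering column; SQLite can then read
-- the highest col1 for col2 = 'CDE' directly off the end of the index
CREATE INDEX idx_col2_col1 ON MyTable (col2, col1);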
Try this:
SELECT t1.*
FROM table1 t1
INNER JOIN
(
SELECT MAX(col1) MAXID, col2
FROM table1
GROUP BY col2
) t2 ON t1.col1 = t2.maxID AND t1.col2 = t2.col2
WHERE t1.col2 = 'CDE';
SQL Fiddle Demo (the demo is MySQL, but it should work fine with the same syntax in SQLite).
Use a subquery such as:
SELECT Col1, Col2, Col3
FROM table
WHERE Col1 = (SELECT MAX(Col1) FROM table WHERE Col2='CDE')
Add indexes as appropriate, e.g. clustered index on Col1 and another nonclustered index on Col2 to speed up the subquery.
In SQLite 3.7.11 and later, the simplest query would be:
SELECT *, max(Col1) FROM MyTable WHERE Col2 = 'CDE'
As shown by EXPLAIN QUERY PLAN, both this and passingby's query are the most efficient, provided there is an index on Col2.
If you want to see the corresponding values for all Col2 values, use a query like this instead:
SELECT *, max(Col1) FROM MyTable GROUP BY Col2
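To see this yourself, the plan can be inspected in the sqlite3 shell; a sketch, assuming an index on Col2 (the index name is illustrative):
CREATE INDEX idx_col2 ON MyTable (Col2);
EXPLAIN QUERY PLAN
SELECT *, max(Col1) FROM MyTable WHERE Col2 = 'CDE';
With the index in place, the plan should show a search on idx_col2 rather than a full table scan.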
This is my view:
Create View [MyView] as
(
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2
)
I need to add a new column named Id, and this column must be unique, so I thought of adding it as an identity column. I should mention that this view returns a large amount of data, so I need an approach with good performance. Also, since I'm using two select queries with union all, I think this might be somewhat complicated. What is your suggestion?
Use the ROW_NUMBER() function in SQL Server 2008.
Create View [MyView] as
SELECT ROW_NUMBER() OVER( ORDER BY col1 ) AS id, col1, col2, col3
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) AS MyResults
GO
A view is just a stored query and does not contain the data itself, so there is no way to add a stable ID to it. If you need an id for other purposes, like paging for example, you can do something like this:
create view MyView as
(
select row_number() over ( order by col1) as ID, col1 from (
Select col1 From Table1
Union All
Select col1 From Table2
) a
)
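For example, paging over the view could then look like this (illustrative; note the stability caveat below):
-- Page 2 at 10 rows per page, keyed on the generated ID
select ID, col1
from MyView
where ID between 11 and 20;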
There is no guarantee that the rows returned by a query using ROW_NUMBER() will be ordered exactly the same with each execution unless the following conditions are true (see the sketch after this list):
Values of the partitioned column are unique. [partitions are parent-child, like a boss has 3 employees; ignore this one here]
Values of the ORDER BY columns are unique. [if column 1 is unique, row_number should be stable]
Combinations of values of the partition column and ORDER BY columns are unique. [if you need 10 columns in your order by to get unique... go for it to make row_number stable]
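A sketch of that tiebreaker approach, assuming (this is an assumption about your data) that the combination (col1, col2, col3) is unique across the union:
-- A fully deterministic ORDER BY makes the generated id stable between runs
SELECT ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS id,
       col1, col2, col3
FROM (
    Select col1, col2, col3 From Table1
    Union All
    Select col1, col2, col3 From Table2
) AS MyResults;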
There is a secondary issue here, with this being a view: ORDER BY doesn't always work in views (a long-standing SQL quirk). Ignoring the row_number() for a second:
create view MyView as
(
select top 10000000 col1 -- or: top 99.9999999 percent
from (
Select col1 From Table1
Union All
Select col1 From Table2
) a order by col1
)
Using "row_number() over ( order by col1) as ID" is very expensive.
This way is much more efficient in cost:
Create View [MyView] as
(
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table1
Union All
Select ID = isnull(cast(newid() as varchar(40)), '')
, col1
, col2
, col3
From Table2
)
Use ROW_NUMBER() with ORDER BY (SELECT NULL); this will be less expensive and will still get your result.
Create View [MyView] as
SELECT ROW_NUMBER() over (order by (select null)) as id, *
FROM(
Select col1, col2, col3 From Table1
Union All
Select col1, col2, col3 From Table2 ) R
GO
How can I write a statement to accomplish the following?
Let's say a table has 2 columns (both nvarchar) with the following data:
col1   col2
10000  10
10000  20
10001  10
10002  30
10002  40
10002  50
I'd like to keep only the following data:
col1   col2
10000  10
10001  10
10002  30
thus removing the duplicate col1 values (neither of the columns is a primary key) and keeping, for each col1, only the record with the minimal value in the second column.
How can I accomplish this?
This should work for you:
;WITH NotMin AS
(
    SELECT Col1, Col2, MIN(Col2) OVER (PARTITION BY Col1) AS TheMin
    FROM Table1
)
DELETE Table1
--SELECT *
FROM Table1
INNER JOIN NotMin
    ON Table1.Col1 = NotMin.Col1
   AND Table1.Col2 = NotMin.Col2
   AND Table1.Col2 != NotMin.TheMin
This uses a CTE (like a derived table, but cleaner) and the over clause as a shortcut for less code. I also added a commented select so you can see the matching rows (verify before deleting). This will work in SQL 2005/2008.
Thanks,
Eric
Ideally, you'd like to be able to say:
DELETE
FROM tbl
WHERE (col1, col2) NOT IN (SELECT col1, MIN(col2) AS col2 FROM tbl GROUP BY col1)
Unfortunately, that row-value syntax is not allowed in T-SQL, but there is a proprietary extension with a double FROM (using EXCEPT for clarity):
DELETE
FROM tbl
FROM tbl
INNER JOIN (
    SELECT col1, col2 FROM tbl
    EXCEPT
    SELECT col1, MIN(col2) FROM tbl GROUP BY col1
) del ON tbl.col1 = del.col1 AND tbl.col2 = del.col2
In general:
DELETE
FROM tbl
WHERE col1 + '|' + col2 NOT IN (SELECT col1 + '|' + MIN(col2) FROM tbl GROUP BY col1)
Or other workarounds.
Sorry, I misunderstood the question.
SELECT col1, MIN(col2) as col2
FROM table
GROUP BY col1
This of course returns the rows in question, but assuming you can't alter the table to add a unique identifier, you would need to do something like:
DELETE FROM test
WHERE col1 + '|' + col2 NOT IN
(SELECT col1 + '|' + MIN(col2)
FROM test
GROUP BY col1)
Which should work assuming that the pipe character never appears in your set.
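If the delimiter assumption is a concern, an equivalent sketch using EXISTS avoids string concatenation entirely (same test table assumed):
-- Delete every row that has a smaller col2 for the same col1, keeping only
-- the minimal col2 per col1 (nvarchar '<' is lexicographic, matching MIN)
DELETE t
FROM test t
WHERE EXISTS (
    SELECT 1
    FROM test s
    WHERE s.col1 = t.col1
      AND s.col2 < t.col2
);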