Cumulative Summing Values in SQLite - sql

I am trying to perform a cumulative sum of values in SQLite. I initially only needed to sum a single column and had the code
SELECT
t.MyColumn,
(SELECT Sum(r.KeyColumn1) FROM MyTable as r WHERE r.Date < t.Date)
FROM MyTable as t
Group By t.Date;
which worked fine.
Now I wanted to extend this to more columns KeyColumn2 and KeyColumn3 say. Instead of adding more SELECT statements I thought it would be better to use a join and wrote the following
SELECT
t.MyColumn,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM MyTable as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
However this does not give me the correct answer (instead it gives values that are much larger than expected). Why is this and how could I correct the JOIN to give me the correct answer?

You are likely getting what I would call mini-Cartesian products: your Date values are probably not unique and, as a result of the self-join, you are getting matches for each of the non-unique values. After grouping by Date the results are just multiplied accordingly.
To solve this, the left side of the join must be rid of duplicate dates. One way is to derive a table of unique dates from your table:
SELECT DISTINCT Date
FROM MyTable
and use it as the left side of the join:
SELECT
t.Date,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM (SELECT DISTINCT Date FROM MyTable) as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
I noticed that you used t.MyColumn in the SELECT clause, while your grouping was by t.Date. If that was intentional, you may be relying on undefined behaviour there, because the t.MyColumn value would probably be chosen arbitrarily among the (potentially) many in the same t.Date group.
For the purpose of this example, I assumed that you actually meant t.Date, so, I replaced the column accordingly, as you can see above. If my assumption was incorrect, please clarify.

Your join is not working cause he will find way more possibilities to join then your subselect would do.
The join is exploding your table.
The sub select does a sum of all records where the date is lower then the one from the current record.
The join joins every row multiple times aslong as the date is lower then the current record. This mean a single record could do as manny joins as there are records with a date lower. This causes multiple records. And in the end a higher SUM.
If you want the sum from mulitple columns you will have to use 3 sub query or define a unique join.

Related

SQL subselect statement very slow on certain machines

I've got an sql statement where I get a list of all Ids from a table (Machines).
Then need the latest instance of another row in (Events) where the the id's match so have been doing a subselect.
I need to latest instance of quite a few fields that match the id so have these subselects after one another within this single statement so end up with results similar to this...
This works and the results are spot on, it's just becoming very slow as the Events Table has millions of records. The Machine table would have on average 100 records.
Is there a better solution that subselects? Maybe doing inner joins or a stored procedure?
Help appreciated :)
You can use apply. You don't specify how "latest instance" is defined. Let me assume it is based on the time column:
Select a.id, b.*
from TableA a outer apply
(select top(1) b.Name, b.time, b.weight
from b
where b.id = a.id
order by b.time desc
) b;
Both APPLY and the correlated subquery need an ORDER BY to do what you intend.
APPLY is a lot like a correlated query in the FROM clause -- with two convenient enhances. A lateral join -- technically what APPLY does -- can return multiple rows and multiple columns.

Working around unsupported Correlated Where Subqueries in Hive

I'm trying to figure out a work around for the fact HIVE doesn't support correlated subqueries. Ultimately, I've been counting how many items exist in the data each week over the last month, and now I want to know how many items dropped out this week, came back, or are totally new. Wouldn't be too hard if I could use a where subquery but I'm having a tough time thinking of a work around without it.
Select
count(distinct item)
From data
where item in (Select item from data where date <= ("2016-05-10"))
And date between "2016-05-01" and getdate()
Any help would be great. Thank you.
Work around is left join with two result set and where second result set column is null.
ie
Select count (a.item)
from
(select distinct item from data where date between "2016-05-01" and getdate()) a
left join (Select distinct item from data where date <= ("2016-05-10")) b
on a.item =b.item
and b.item is null

Optimize WHERE clause in query

I have the following query:
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date = (select max(a.date) from table1 a where a.serialcode ='49257');
It seems it is retrieving the select max subquery for each join. It takes a lot of time. Is there a way to optimize it? Any help will be appreciated.
Sub selects that end up being evaluated "per row of the main query" can cause tremendous performance problems once you try to scale to larger number of rows.
Sub selects can almost always be eliminated with a data model tweak.
Here's one approach: add a new is_latest to the table to track if it's the max value (and for ties, use other fields like created time stamp or the row ID). Set it to 1 if true, else 0.
Then you can add where is_latest = 1 to your query and this will radically improve performance.
You can schedule the update to happen or add a trigger etc. if you need an automated way of keeping is_latest up to date.
Other approaches involve 2 tables - one where you keep only the latest record and another table where you keep the history.
declare #maxDate datetime;
select #maxDate = max(a.date) from table1 a where a.serialcode ='49257';
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date =#maxDate;
You can optimize this query using indexes. Here are somethat should help: table1(serialcode, serial, date), table1(serialcode, date), and pivots(serialcode).
Note: I find it very strange that you have columns called serial and serialcode in the same table, and the join is on serial.
Since you haven't mentioned which DB you are using, I would answer if it was for Oracle.
You can use WITH clause to take out the subquery and make it perform just once.
WITH d AS (
SELECT max(a.date) max_date from TABLE1 a WHERE a.serialcode ='49257'
)
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
JOIN d on (table2.date = d.max_date)
WHERE table2.serialcode = '49257'
Please note that you haven't qualified date column, so I just assumed it belonged to table1 and not pivots. You can change it. An advise on the same note - always qualify your columns by using table.column format.

Need to make SQL subquery more efficient

I have a table that contains all the pupils.
I need to look through my registered table and find all students and see what their current status is.
If it's reg = y then include this in the search, however student may change from y to n so I need it to be the most recent using start_date to determine the most recent reg status.
The next step is that if n, then don't pass it through. However if latest reg is = y then search the pupil table, using pupilnumber; if that pupil number is in the pupils table then add to count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'N');
This query works, but it is very slow as there are several records in the pupils table.
Just wondering if there is any way of making it more efficient
Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM( SELECT pupilnumber,
startDate,
reg,
rank() over (partition by pupilnumber order by startDate desc) rnk
FROM registered )
WHERE rnk = 1
AND reg = 'Y'
You can look execution plan for this query. It will show you high cost operations. If you see table scan in execution plan you should index them. Also you can try "exists" instead of "in".
This query MIGHT be more efficient for you and hope at a minimum you have indexes per "pupilnumber" in the respective tables.
To clarify what I am doing, the first inner query is a join between the registered table and the pupil which pre-qualifies that they DO Exist in the pupil table... You can always re-add the "partition" reference if that helps. From that, it is grabbing both the pupil AND their max date so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result to the registration table... again by the pupil AND the max date being the same and qualify the final registration status as YES. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentRec = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )

Does Count (*) skew desired results when performing additional join?

My query works fine; however, I need to join another dataset to my query, and I expect that the count(f.*) will break.
Here's the query I start with:
SELECT
MIN(received_date) AS FirstVisit
, patient_id AS PatientID
INTO #LookupTable
FROM F_ACCESSION_DAILY
SELECT
f.doctor AS Doctor
, COUNT(f.*) AS CountNewPatients
, MONTH(firstvisit) AS Month
, YEAR(firstvisit) AS Year
FROM F_ACCESSION_DAILY f
INNER JOIN #LookupTable l ON f.received_date = l.FirstVisit
AND f.patient_id = l.PatientID
GROUP BY f.doctor
, MONTH(firstvisit)
, YEAR(firstvisit)
DROP TABLE #LookupTable
I would like to Join the above query on another table.
The question is *will my count(f.*) stay the same or will it change because I've added a new dataset?*
**How do I make sure that the count(f.*) will remain the same?
Thank you so much for your guidance.
will my count(f.*) stay the same or will it change because I've added a new dataset?*
COUNT(*) counts rows. If you join another table and the number of rows increases, the result of COUNT(*) will increase.
How do I make sure that the count(f.*) will remain the same?
Use COUNT (DISTINCT f.Id).
If there's exactly a 1 row per patient_id in the new table (and you're doing an INNER JOIN) then the count won't change. Otherwise, it will.
You could use an OUTER APPLY (SELECT TOP 1 ....) instead of a JOIN to guarantee that the count won't change.
By the way, it looks like you're missing a GROUP BY patient_id in your first SELECT.
Joins do not "skew" the COUNT(*). The count does exactly what it is advertised to do. The problem is that you may be multiplying the number of rows, without really realizing it.
One way to solve the problem is to do the aggregations at the appropriate level. Sometimes, you have to do it this way -- for instance, when SUMs and AVGs are involved.
For the count, though, you can replace it with:
count(distinct AccessionDailyID)
Even if the rows gets multiplied, then this will work to get your count. By the way, this assumes thatyour table has a unique id for each row.
By the way, you may want to be sure thatyou use LEFT OUTER JOIN rather than INNER JOIN to be sure that you don't lose any rows in the joining process.