Getting peak value of a column in table till this date - sql

I have an Oracle table with the columns {date, id, profit, max_profit}.
I have data in date and profit, and I want max_profit to hold the highest value of profit up to each row's date. I am using the query below:
UPDATE MY_TABLE a
SET a.MAX_PROFIT = (SELECT MAX(b.PROFIT)
                    FROM MY_TABLE b
                    WHERE b.DATE <= a.DATE
                      AND a.id = b.id)
This is giving me the correct result, but I have millions of rows, so the query is taking considerable time. Is there a faster way of doing it?

You can use a MERGE statement with an analytic function:
MERGE INTO my_table dst
USING (
  SELECT ROWID rid,
         MAX( profit ) OVER ( PARTITION BY id ORDER BY "DATE" ) AS max_profit
  FROM   my_table
) src
ON ( src.rid = dst.ROWID )
WHEN MATCHED THEN
  UPDATE SET max_profit = src.max_profit;
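If the running maximum doesn't actually need to be stored, a view using the same analytic function avoids the update entirely. A minimal sketch, assuming the table and column names above (the view name is made up):
CREATE OR REPLACE VIEW my_table_v AS
SELECT t.id,
       t."DATE",
       t.profit,
       -- running maximum of profit per id, up to and including each row's date
       MAX( t.profit ) OVER ( PARTITION BY t.id ORDER BY t."DATE" ) AS max_profit
FROM my_table t;
This way every read recomputes the value, so it can never go stale, at the cost of doing the window computation at query time.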

When you do something like SELECT MAX(...), you're going to scan all the records implicated in the WHERE part of the query, so you want to make getting all those records as easy on the database as possible.
Do you have an index on the table that includes the id and date columns?
Depending on the behavior of this application, if you're doing far fewer updates/inserts than reads (say, a ton of reads during reporting or some other process), a possible performance enhancement is to keep the value you're storing in the max_profit column up to date as the data changes, rather than recomputing it afterwards. Have you considered a separate table that just stores the profit calculation for each possible date?
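For the correlated-subquery UPDATE in the question, an index covering the correlation and filter columns lets the MAX be answered from the index alone. A sketch, assuming the column names above (the index name is made up):
-- lets the MAX(profit) for a given id and date range be resolved from the index
CREATE INDEX my_table_id_date_profit_ix
  ON my_table (id, "DATE", profit);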

Related

Creating a partitioned table from a query in BigQuery does not yield the same result as without partitioning

When I create a table, let's say "orders", with partitioning in the following way, my result gets truncated compared to creating it without partitioning (commenting and uncommenting lines 5 and 6).
I suspect that it might have something to do with the BQ limits (found here) but I can't figure out what. ts is a timestamp field and order_id is a UUID string.
i.e. the COUNT(DISTINCT order_id) on the last line will yield very different results. When partitioned, it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
  `project.dataset.orders`;
CREATE OR REPLACE TABLE
  `project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
  ts,
  order_id,
  SUM(order_value) AS order_value
FROM
  `project.dataset.raw_orders`
GROUP BY
  1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer', I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What number do you get if you run the query below, and which result does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
  SELECT
    ts,
    order_id,
    SUM(order_value) AS order_value
  FROM
    `project.dataset.raw_orders`
  GROUP BY
    1, 2
) t;
It turns out that there's a 60-day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
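For reference, the expiration can be cleared directly on the table so that older partitions are no longer dropped; a sketch, assuming the table name from the question:
-- remove the partition expiration inherited from the dataset default
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = NULL);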

SQL Server: verify that two columns are in the same sort order

I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
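For reference, the LAG-based check those links suggest looks roughly like this (it requires SQL Server 2012 or later, which may be why it isn't recognized here; the table name is just a placeholder):
-- counts rows whose RecordDate is earlier than the previous row's when ordered by ID;
-- zero means the two orderings agree
SELECT COUNT(*) AS out_of_order_rows
FROM (
    SELECT ID,
           RecordDate,
           LAG(RecordDate) OVER (ORDER BY ID) AS prev_date
    FROM dbo.YourTable
) t
WHERE t.prev_date > t.RecordDate;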
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.id < t1.id
                AND t2.recorddate > t1.recorddate);
It'll select all records for which another record with a lower ID and a greater timestamp exists. If the result is empty, you know that no such record exists and the data is the way you want it to be.
Maybe you want to restrict it a bit more by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate. I'm not sure which you want.
Use this:
SELECT ID, RecordDate
FROM tablename t
WHERE (SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
   <> (SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
For each row, it counts how many rows have an ID less than that row's ID and how many rows have a RecordDate less than that row's RecordDate. If these counts are not equal, the row is output.
The result is all the rows that would not be in the same position after sorting by ID as after sorting by RecordDate.
One method uses window functions:
select count(*)
from (select t.*,
             row_number() over (order by id) as seqnum_id,
             row_number() over (order by date, id) as seqnum_date
      from t
     ) t
where seqnum_id <> seqnum_date;
When the count is zero, the two columns have the same ordering. Note that the second ORDER BY includes id: two rows could have the same date, and adding id makes the sort stable, so the comparison is valid even when date has duplicates.
The above solutions are all good, but if both the dates and the IDs increase together and the IDs are consecutive, this should also work:
select modifiedid = t2.id
from yourtable t1
join yourtable t2
  on t1.id = t2.id + 1
 and t1.recordDate < t2.recordDate

SQL loop on duplicate rows to combine into one

I have something to fix in my database; here it is:
I have a table with duplicate rows, like this:
The duplicated columns are IDPatient and IDObjet. The pair should never be duplicated, which is why I put a key on both columns, but it's a bit too late for that, so I have to fix this by combining the duplicate rows into one without losing data, keeping the values in order.
For example, as you can see in the picture, the texte_1 column of each row contains a date: 2010-11-25 and 2011-11-04. The date 2010-11-25 comes before 2011-11-04, so I have to put 2011-11-04 into the texte_2 column of the first row, and keep looping like that over every value in the row, checking each time whether the date is older or not. If it is, I have to replace the value in row one with the one from the second row, keep the replaced value in a temporary variable, and then find the next free column ("Texte_X") in the same row to insert it into, again checking that it isn't older.
I can have multiple duplicate rows in my table, and I know looping in SQL Server is slow, but I would really appreciate a good solution to this.
Here's an example with multiple duplicate rows:
How about a MERGE:
merge mytable as t
using (
    select idPatient, idObject, max(texte_1) dt
    from mytable
    group by idPatient, idObject
) s on t.idPatient = s.idPatient
   and t.idObject = s.idObject
   and t.texte_1 != s.dt
when matched then delete;
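Since the MERGE above deletes rows, it may be worth previewing what would go first. A sketch that lists the rows matching the same condition (same names as above):
-- rows whose texte_1 is not the per-(idPatient, idObject) maximum, i.e. the ones the MERGE would delete
select t.*
from mytable t
join (
    select idPatient, idObject, max(texte_1) dt
    from mytable
    group by idPatient, idObject
) s on t.idPatient = s.idPatient
   and t.idObject = s.idObject
   and t.texte_1 != s.dt;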
You could use the ROW_NUMBER() function and your ID field to order the duplicates, then use PIVOT or self-joins to de-normalize the records, like:
;with cte as (SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY IDPatient, IDObjet ORDER BY ID)
              FROM YourTable
)
SELECT a.IDPatient, a.IDObjet, a.Texte_1, b.Texte_1 AS Texte_2, c.Texte_1 AS Texte_3
FROM cte a
LEFT JOIN cte b
  ON a.IDPatient = b.IDPatient
 AND a.IDObjet = b.IDObjet
 AND b.RN = 2
LEFT JOIN cte c
  ON a.IDPatient = c.IDPatient
 AND a.IDObjet = c.IDObjet
 AND c.RN = 3
WHERE a.RN = 1
This assumes the ID order is sufficient; you could change it to your date field if needed. Since you ultimately want to remove the duplicate lines, you could either run this query into a new table or, after you have used it as the basis of your update, DELETE the records from the cte above where RN > 1, as sketched below.
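A rough sketch of that cleanup step, to be run only after the pivoted values have been copied back onto the surviving (RN = 1) rows:
;with cte as (SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY IDPatient, IDObjet ORDER BY ID)
              FROM YourTable
)
-- deleting through the CTE removes the duplicate rows from the underlying table
DELETE FROM cte
WHERE RN > 1;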
Personally, I would avoid the de-normalized Texte_1-10 structure, and add a new field that's the equivalent of the RN field as part of the key.
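A rough sketch of that normalized shape (the table name, column types, and the Seq column are assumptions; the Texte_1-10 values would move into it, one per row):
CREATE TABLE PatientObjetTexte (
    IDPatient int           NOT NULL,
    IDObjet   int           NOT NULL,
    Seq       int           NOT NULL,  -- plays the role of the RN field above
    Texte     nvarchar(255) NULL,
    CONSTRAINT PK_PatientObjetTexte PRIMARY KEY (IDPatient, IDObjet, Seq)
);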

Trying to use SQL to pull flat data into an array-like answer

I am trying to use SQL to pull flat data into an array-like answer, and need some help.
The flat data is formatted as:
timestamp, unique_id, value
... over and over again on each row down the table I call "temperature_values". When you look at the table, it has lots of rows with a unique_id of "temp_low" and lots of rows with a unique_id of "temp_high". For each timestamp, there is a single row with the "temp_low" unique_id and a single row with the "temp_high" unique_id values. Of course, the timestamp field is the same on each of these rows.
So if I want to query just the "temp_low" or the "temp_high", it's very easy.
But what I'd like to do is have a single SQL statement that returns:
timestamp, temp_low, temp_high
... having the timestamp unique on each result row, so that it's easy to graph the high and low temperatures for each timestamp. I've tried some INNER JOINs on the same table, but I'm not sure that's the correct way to solve this.
Any clues?
TIA - Dave
A self join is a good solution. If temp low and temp high are the only possible unique ids and if the low is truly always less than or equal to the high, you could also do:
SELECT timestamp, min(value) as temp_low, max(value) as temp_high
FROM table_name
GROUP BY timestamp
Edit: by joining the table to itself the following will work (assuming every timestamp has exactly one high row and one low row)
SELECT low.timestamp,
       low.value temp_low,
       high.value temp_high
FROM table_name low
JOIN table_name high
  ON low.timestamp = high.timestamp
WHERE low.unique_id = 'temp_low'
  AND high.unique_id = 'temp_high'
Or assuming every timestamp has at most one high row and one low row but not necessarily both:
SELECT coalesce(low.timestamp, high.timestamp) timestamp,
       low.value temp_low,
       high.value temp_high
FROM (SELECT timestamp, value
      FROM table_name
      WHERE unique_id = 'temp_low') low
FULL OUTER JOIN (SELECT timestamp, value
                 FROM table_name
                 WHERE unique_id = 'temp_high') high
  ON low.timestamp = high.timestamp
This works by joining two sub-queries (q1 and q2), and allows you to ignore any IDs that aren't high or low:
SELECT q1.timestamp AS Time, High, Low
FROM (SELECT timestamp, value AS High
      FROM temps
      WHERE ID = 'temp_high') q1
INNER JOIN (SELECT timestamp, value AS Low
            FROM temps
            WHERE ID = 'temp_low') q2
  ON q1.timestamp = q2.timestamp;
A general reference on JOINs may help here. Assuming the values are separated into separate tables and the timestamps match, you could join them on the timestamp.
SELECT temp_low_table.timestamp AS timestamp,
       temp_low_table.temp_low AS temp_low,
       temp_high_table.temp_high AS temp_high
FROM temp_low_table
INNER JOIN temp_high_table
  ON temp_low_table.timestamp = temp_high_table.timestamp;
A very simple solution that doesn't involve joins is to use a group by + an aggregate function (like min or max) with a case statement:
select timestamp,
       max(case when unique_id = 'temp_low' then value end) as temp_low,
       max(case when unique_id = 'temp_high' then value end) as temp_high
from temperature_values
group by timestamp

Relational division with events in a certain timeframe

I have my table (CTE) definitions and result set here.
The CTE may look strange, but it has been tested and returns the correct results in the most efficient manner I've found yet. The query below should find the person IDs (patid) of people who are taking two or more drugs at the same time. Currently, the query works insofar as it returns the patIDs of the people taking both drugs, but not necessarily both drugs at the same time. Taking both drugs at the same time is indicated by the fillDate of one drug falling before the scriptEndDate of the other drug.
You can see in this partial result set that on line 18 the scriptFillDate is 2009-07-19, which is between the fillDate and scriptEndDate of the same patID from row 2. What constraint do I need to add so I can filter out these unneeded results?
--PatientDrugList is a CTE because eventually parameters might be passed to it
--to alter the selection population
;with PatientDrugList(patid, filldate, scriptEndDate,drugName,strength)
as
(
select rx.patid,rx.fillDate,rx.scriptEndDate,rx.drugName,rx.strength
from rx
),
--the row constructor here will eventually be parameters for a stored procedure
DrugList (drugName)
as
(
select x.drugName
from (values ('concerta'),('fentanyl'))
as x(drugName)
where x.drugName is not null
)
--the row number here is so that I can find the largest date range
--(the largest datediff means the person was on a given drug for a larger
--amount of time); obviously not an optimal solution
--celko-inspired relational division!
select distinct
       row_number() over(partition by pd.patid, drugname
                         order by datediff(day, pd.fillDate, pd.scriptEndDate) desc) as rn
      ,pd.patid
      ,pd.drugname
      ,pd.fillDate
      ,pd.scriptEndDate
from PatientDrugList as pd
where not exists
      (select * from DrugList
       where not exists
             (select * from PatientDrugList as pd2
              where (pd.patid = pd2.patid)
                and (pd2.drugName = DrugList.drugName)))
  and exists
      (select *
       from DrugList
       where DrugList.drugName = pd.drugName)
group by pd.patid, pd.drugName, pd.filldate, pd.scriptEndDate
Wrap your original query in a CTE or, better yet, for performance and stability of the query plan and results, store it in a temp table.
The query below (assuming the CTE option) will give you the overlapping times when both drugs are being taken.
;with tmp as (
    .. your query producing the columns shown ..
)
select *
from tmp a
join tmp b
  on a.patid = b.patid
 and a.drugname <> b.drugname
where a.filldate < b.scriptenddate
  and b.filldate < a.scriptenddate;
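If you also need to see when the overlap happens, the same join can return the overlapping window. A sketch along the same lines (a.drugname < b.drugname just keeps each pair from appearing twice):
;with tmp as (
    .. your query producing the columns shown ..
)
select a.patid,
       a.drugname as drug_a,
       b.drugname as drug_b,
       -- the overlap runs from the later fill date to the earlier script end date
       case when a.filldate > b.filldate then a.filldate else b.filldate end as overlap_start,
       case when a.scriptenddate < b.scriptenddate then a.scriptenddate else b.scriptenddate end as overlap_end
from tmp a
join tmp b
  on a.patid = b.patid
 and a.drugname < b.drugname
where a.filldate < b.scriptenddate
  and b.filldate < a.scriptenddate;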