How to set a row's field to the value from the row closest to it by date? - sql

I have a huge table with 2m+ rows.
The structure is like this:
ThingName (STRING),
Date (DATE),
Value (INT64)
Sometimes Value is null, and I need to fix it by setting it to the non-null Value of the row closest to it by Date for the same ThingName.
I am not an SQL guy at all.
I tried to describe my task with this query (simplified a lot by looking only at previous dates, although I actually need to check future dates too):
update my_tbl as SDP
set SDP.Value = (select SDPI.Value
from my_tbl as SDPI
where SDPI.Date < SDP.Date
and SDP.ThingName = SDPI.ThingName
and SDPI.Value is not null
order by SDPI.Date desc limit 1)
where SDP.Value is null;
Here I try to set the Value of the row being updated to one selected from the same table for the same ThingName; limit 1 leaves only a single result.
But the query editor tells me this:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Actually, I am not sure at all that my task can be solved with just a query.
So, can anyone help me? If this is impossible, tell me so; if it is possible, tell me which SQL constructs may help.

Below is for BigQuery Standard SQL.
In many (if not most) cases you don't want to update your table (as it incurs the extra cost and limitations associated with DML statements); instead you can fill in the 'missing' values in-query, as in the example below:
#standardSQL
SELECT
ThingName,
date,
IFNULL(value,
LAST_VALUE(value IGNORE NULLS)
OVER(PARTITION BY thingname ORDER BY date)
) AS value
FROM `project.dataset.my_tbl`
If for some reason you actually need to update the table, the above statement will not help, as DML's UPDATE does not allow the use of analytic functions, so you need another approach, for example the one below:
#standardSQL
SELECT
t1.ThingName, t1.date,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] AS value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY t1.ThingName, t1.date, t1.value
and now you can use it to update your table, as in the example below:
#standardSQL
UPDATE `project.dataset.my_tbl` t
SET value = new_value
FROM (
SELECT TO_JSON_STRING(t1) AS id,
ARRAY_AGG(t2.Value IGNORE NULLS ORDER BY t2.date DESC LIMIT 1)[OFFSET(0)] new_value
FROM `project.dataset.my_tbl` AS t1
LEFT JOIN `project.dataset.my_tbl` AS t2
ON t2.ThingName = t1.ThingName
AND t2.date <= t1.date
GROUP BY id
)
WHERE TO_JSON_STRING(t) = id

In BigQuery, updates are rather rare. The logic you seem to want is:
select t.* except (value),
coalesce(value,
last_value(value ignore nulls) over (partition by thingname order by date)
) as value
from my_tbl t;
I don't really see a reason to save this back in the table.

Related

Trying to get the greatest value from a customer on a given day

What I need to do: if a customer makes more than one transaction in a day, I need to display the greatest value (and ignore any other values).
The query is pretty big, but the code I inserted below is the focus of the issue. I’m not getting the results I need. The subselect ideally should be reducing the number of rows the query generates since I don’t need all the transactions, just the greatest one, however my code isn’t cutting it. I’m getting the exact same number of rows with or without the subselect.
Note: I don’t actually have a t.* in the actual query; there are just a dozen or so other fields being pulled in. I added the t.* just to simplify the code example.
SELECT
t.*,
(SELECT TOP (1)
t1.CustomerGUID
t1.Value
t1.Date
FROM #temp t1
WHERE t1.CustomerGUID = t.CustomerGUID
AND t1.Date = t.Date
ORDER BY t1.Value DESC) AS “Value”
FROM #temp t
Is there an obvious flaw in my code or is there a better way to achieve the result of getting the greatest value transaction per day per customer?
Thanks
You may want to do as follows:
SELECT
t1.CustomerGUID,
t1.Date,
MAX(t1.Value) AS Value
FROM #temp t1
GROUP BY
t1.CustomerGUID,
t1.Date
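To see why the subselect in the question does not reduce the row count while GROUP BY does, here is a small sketch using Python's built-in sqlite3 as a stand-in for SQL Server (an assumption for runnability). The table name `transactions` stands in for `#temp`, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (CustomerGUID TEXT, Date TEXT, Value INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("c1", "2021-03-01", 50),
        ("c1", "2021-03-01", 80),  # second transaction on the same day
        ("c1", "2021-03-02", 30),
        ("c2", "2021-03-01", 20),
    ],
)

# A scalar subquery in the SELECT list is evaluated once per outer row,
# so the outer query still returns one row per transaction.
per_row = conn.execute(
    """
    SELECT t.CustomerGUID, t.Date,
           (SELECT MAX(t1.Value)
            FROM transactions t1
            WHERE t1.CustomerGUID = t.CustomerGUID
              AND t1.Date = t.Date) AS Value
    FROM transactions t
    """
).fetchall()

# GROUP BY collapses each (customer, day) pair into a single row.
grouped = conn.execute(
    """
    SELECT CustomerGUID, Date, MAX(Value) AS Value
    FROM transactions
    GROUP BY CustomerGUID, Date
    ORDER BY CustomerGUID, Date
    """
).fetchall()

print(len(per_row), "rows with the subselect")
print(len(grouped), "rows with GROUP BY:", grouped)
```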
You can use row_number() as shown below.
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerGUID, Date ORDER BY Value DESC) AS SrNo FROM <YourTable>
) AS ranked
WHERE
SrNo = 1
Sample data would be more helpful.
Try this window function:
MAX(value) OVER(PARTITION BY date,customer ORDER BY value DESC)
It's faster and more efficient.
There are probably many other ways to do it, but this one is simple and works:
select t.*
from (
select
convert(varchar(8), r.date,112) one_day
,max(r.Value) max_sale
from #temp r
group by convert(varchar(8), r.date,112)
) e
inner join #temp t on t.value = e.max_sale and convert(varchar(8), t.date,112) = e.one_day
If you have 2 people who spend the exact same amount and that amount is also the max, you'll get 2 records for that day.
The convert(varchar(8), r.date,112) will perform as desired on date, datetime and datetime2 data types. If your date is a varchar, char, nchar or nvarchar, you'll want to examine the data to find out whether to use left(t.date,10) or left(t.date,8).
If I've understood your requirement correctly, you have stated "greatest value transaction per day per customer". That suggests to me you don't want 1 row per customer in the output but a row per day per customer.
To achieve this you can group on the day like this:
Select t.CustomerGUID, cast(t.date as date) as Daydate,
max(t.value) as value from #temp t group by
t.CustomerGUID, cast(t.date as date);

Populate blank values in a Field with number from last populated value

There is no specific number of blank values. It can be none or many. Here is the current result.
Blank Cells to be Populated:
You can use analytic functions. I think this will work:
select t.*, coalesce(coil, lag(coil ignore nulls) over (order by datetime))
from t;
I know Oracle has supported ignore nulls for a long, long time. I don't quite remember off-hand if ancient versions supported it.
The approach below should work (or hopefully will give you enough to go on). The idea is to join the table to itself: for each row where the column you want to fill is NULL or empty, join to the most recent earlier row in which that column is not NULL.
SELECT YT1.ID, YT2.COIL
FROM Your_Table YT1
INNER JOIN Your_Table YT2 ON YT2.ID =
(SELECT TOP 1 ID FROM Your_Table
WHERE [start_date] < YT1.[start_date]
AND COIL IS NOT NULL
ORDER BY [start_date] DESC)
WHERE YT1.COIL IS NULL OR LEN(YT1.COIL) = 0
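As a rough illustration of the same "most recent earlier non-blank row" idea, here is a sketch using Python's built-in sqlite3 (a stand-in engine chosen so the example runs as-is; it uses LIMIT instead of TOP). The table layout and the sample values are assumptions based on the column names in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, start_date TEXT, coil TEXT)"
)
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        (1, "2019-05-01", "A100"),
        (2, "2019-05-02", None),  # NULL: should inherit A100
        (3, "2019-05-03", ""),    # empty string: should inherit A100
        (4, "2019-05-04", "B200"),
        (5, "2019-05-05", None),  # should inherit B200
    ],
)

# For blank rows, look up the latest earlier row whose coil is populated;
# populated rows keep their own value.
filled = conn.execute(
    """
    SELECT yt1.id,
           CASE
               WHEN yt1.coil IS NULL OR yt1.coil = '' THEN (
                   SELECT yt2.coil
                   FROM your_table yt2
                   WHERE yt2.start_date < yt1.start_date
                     AND yt2.coil IS NOT NULL
                     AND yt2.coil <> ''
                   ORDER BY yt2.start_date DESC
                   LIMIT 1
               )
               ELSE yt1.coil
           END AS coil
    FROM your_table yt1
    ORDER BY yt1.id
    """
).fetchall()

print(filled)
```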

Get average and standard deviation on difference between row values

Given the following table:
CREATE TABLE datapoints
(
id serial NOT NULL,
datasource text,
"timestamp" integer,
value text,
CONSTRAINT datapoints_pkey PRIMARY KEY (id)
)
How can I calculate the average and standard deviation of the difference in timestamp from one row to the next?
What I mean is, if the data looks like this:
timestamp
---------
1385565639
1385565641
1385565643
I would like to calculate the average and standard deviation on the following data:
timestamp difference
--------------------
0
2
2
Is this even possible in a single query?
The first query returns the differences and the second one returns the stddev and avg:
--difference
WITH rn as(
SELECT timestamp , row_number() over (order by timestamp) rown
FROM datapoints
)
SELECT ta.rown, tb.rown, tb.timestamp - ta.timestamp
FROM rn as ta, rn as tb
WHERE tb.rown = ta.rown + 1 ;
--avg, stddev
WITH rn as(
SELECT timestamp , row_number() over (order by timestamp) rown
FROM datapoints
)
SELECT stddev(tb.timestamp - ta.timestamp), avg(tb.timestamp - ta.timestamp)
FROM rn as ta, rn as tb
WHERE tb.rown = ta.rown + 1 ;
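For a quick check of the arithmetic, the sketch below computes the row-to-row differences with Python's built-in sqlite3 and then takes the mean and sample standard deviation with the statistics module. SQLite is only a stand-in here (it has no built-in stddev, unlike PostgreSQL), the fourth sample timestamp is made up, and the query assumes timestamps are distinct.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE datapoints (id INTEGER PRIMARY KEY, "timestamp" INTEGER)'
)
conn.executemany(
    'INSERT INTO datapoints ("timestamp") VALUES (?)',
    [(1385565639,), (1385565641,), (1385565643,), (1385565648,)],
)

# For every row, subtract the largest earlier timestamp; the first row has
# no predecessor, so its difference is NULL and is filtered out.
diffs = [
    row[0]
    for row in conn.execute(
        """
        SELECT diff
        FROM (
            SELECT cur."timestamp" AS ts,
                   cur."timestamp" - (
                       SELECT MAX(prev."timestamp")
                       FROM datapoints prev
                       WHERE prev."timestamp" < cur."timestamp"
                   ) AS diff
            FROM datapoints cur
        )
        WHERE diff IS NOT NULL
        ORDER BY ts
        """
    )
]

average = statistics.mean(diffs)
sample_stddev = statistics.stdev(diffs)  # sample (n - 1) standard deviation
print(diffs, average, sample_stddev)
```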
Unless I misunderstood or oversimplified your question, something like this might be helpful.
select t2.timestamp - t1.timestamp
from
TableName t1
join TableName t2 on
(
t1.timestamp < t2.timestamp
and
not exists (
select null from TableName tMid
where
tMid.timestamp > t1.timestamp and tMid.timestamp < t2.timestamp
)
)
I doubt this is the most efficient thing to do, but you mentioned you want it done in a single query.
Just giving you an idea.
If your IDs are consecutive, you could make the join much simpler
(on t1.ID = t2.ID-1 or something similar).
Then you also need to work out how to include the last/first difference
(maybe try an outer join). I think my query misses that one.
Never mind, it seems I probably misunderstood your question.
This seems useful for your case.
SQL: Show average and min/max within standard deviations

How to run a nested query depending on a condition

Is it possible to run queries depending on a condition?
I mean,
I have a table with id, score, amt, time.
I have to group by id and get the max-score record for every id;
if two records have the same id and score then I have to go by amt, and if the amts are also the same then by time.
Is it possible to do this in a single query?
Thanks in advance.
It's possible if you do a self join. However, the fact that you have two records with the same id suggests that your db might not be normalized. In any event, the general idea is this:
select case
when t1.id = t2.id and t1.score = t2.score then t1.amt
else t1.time end fieldalias
from yourtable t1 join yourtable t2 on something
where whatever
However, this will only work if amt and time are the same datatype. Plus I have no idea what field to use to do your self join.
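Another way to express the tie-break is to sort each id's rows by score, then amt, then time, and keep the first one. The sketch below uses Python's built-in sqlite3 as a stand-in engine. The table name `scores` is hypothetical, `rowid` is SQLite-specific, and it assumes that "larger wins" for all three columns, which the question does not state for amt and time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, score INTEGER, amt INTEGER, time INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [
        (1, 90, 5, 100),
        (1, 90, 7, 90),   # same score as above, higher amt wins
        (1, 80, 9, 200),
        (2, 70, 3, 10),
        (2, 70, 3, 20),   # same score and amt, later time wins
        (3, 60, 1, 5),
    ],
)

# For each row, check whether it is the top row of its id when that id's
# rows are ordered by score, then amt, then time (all descending).
best = conn.execute(
    """
    SELECT s.id, s.score, s.amt, s.time
    FROM scores s
    WHERE s.rowid = (
        SELECT b.rowid
        FROM scores b
        WHERE b.id = s.id
        ORDER BY b.score DESC, b.amt DESC, b.time DESC
        LIMIT 1
    )
    ORDER BY s.id
    """
).fetchall()

print(best)
```

On engines with window functions, ROW_NUMBER() OVER (PARTITION BY id ORDER BY score DESC, amt DESC, time DESC) filtered to 1 would express the same ordering.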

Order by not working in Oracle subquery

I'm trying to return 7 events from a table, from today's date, and have them in date order:
SELECT ID
FROM table
where ID in (select ID from table
where DATEFIELD >= trunc(sysdate)
order by DATEFIELD ASC)
and rownum <= 7
If I remove the 'order by' it returns the IDs just fine and the query works, but it's not in the right order. Would appreciate any help with this since I can't seem to figure out what I'm doing wrong!
(edit) for clarification, I was using this before, and the order returned was really out:
select ID
from TABLE
where DATEFIELD >= trunc(sysdate)
and rownum <= 7
order by DATEFIELD
Thanks
The values for the ROWNUM "function" are applied before the ORDER BY is processed. That is why it doesn't work the way you used it (see the manual for a similar explanation).
When limiting a query using ROWNUM and an ORDER BY is involved, the ordering must be done in an inner select and the limit must be applied in the outer select:
select *
from (
select *
from table
where datefield >= trunc(sysdate)
order by datefield ASC
)
where rownum <= 7
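The difference between the two evaluation orders can be mimicked outside the database. The following is a plain Python illustration, not Oracle code: the list stands for rows in arbitrary storage order, the sample ids and dates are made up, and a limit of 3 is used instead of 7 to keep it short.

```python
# (id, datefield) pairs in the arbitrary order the engine happens to read them
rows = [
    (1, "2024-05-09"),
    (2, "2024-05-02"),
    (3, "2024-05-07"),
    (4, "2024-05-01"),
    (5, "2024-05-05"),
]
limit = 3

# ROWNUM in the same query block as ORDER BY: rows are numbered and cut off
# first, and only the surviving rows are sorted.
limit_then_sort = sorted(rows[:limit], key=lambda r: r[1])

# ORDER BY in an inner select, ROWNUM in the outer select: all rows are
# sorted first, then the first `limit` rows are kept.
sort_then_limit = sorted(rows, key=lambda r: r[1])[:limit]

print(limit_then_sort)
print(sort_then_limit)
```

Only the second variant returns the earliest dates, which is what the inner-select form of the query achieves.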
You cannot use order by in a where id in (select id from ...) kind of subquery. It wouldn't make sense anyway. This condition only checks whether id is in the subquery. If it affects the order of the output, that is only incidental. With different data the query execution plan might be different and the output order would be different as well. Use an explicit order by at the end of the main query.
It is a well-known 'feature' of Oracle that rownum doesn't play nicely with order by. See http://www.adp-gmbh.ch/ora/sql/examples/first_rows.html for more information. In your case you should use something like:
SELECT ID
FROM (select ID, row_number() over (order by DATEFIELD ) r
from table
where DATEFIELD >= trunc(sysdate))
WHERE r <= 7
See also:
http://www.orafaq.com/faq/how_does_one_select_the_top_n_rows_from_a_table
http://www.oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html
http://asktom.oracle.com/pls/asktom/f?p=100:11:507524690399301::::P11_QUESTION_ID:127412348064
See also other similar questions on SO, eg.:
Oracle SELECT TOP 10 records
Oracle/SQL - Select specified range of sequential records
Your outer query can't "see" the ORDER BY in the inner query, and in this case ordering the inner query doesn't make sense because it is only used to create a subset of data for the WHERE of the outer one, so the order of this subset doesn't matter.
Maybe if you explain better what you want to do, we can help you.
ORDER BY clause in subqueries:
The ORDER BY clause is not allowed inside a subquery, with the exception of inline views. If you attempt to include an ORDER BY clause, you receive an error message.
An inline view is a query in the FROM clause.
SELECT t.*
FROM (SELECT id, name FROM student) t