Kusto - All data per id for max date
Hi,
I am struggling with a query and hope someone can help me with this topic. :)
I want to get all data per ID related to the latest timestamp.
My source looks something like this:
| Timestamp | ID   | Other columns |
|-----------|------|---------------|
| Date A    | ID A | other data 1  |
| Date A    | ID A | other data 2  |
| Date B    | ID B | other data 1  |
| Date B    | ID B | other data 2  |
| Date C    | ID A | other data 1  |
| Date C    | ID A | other data 2  |
| Date D    | ID B | other data 1  |
| Date D    | ID B | other data 2  |
As a result I want:
| Timestamp | ID   | Other columns |
|-----------|------|---------------|
| Date A    | ID A | other data 1  |
| Date A    | ID A | other data 2  |
| Date B    | ID B | other data 1  |
| Date B    | ID B | other data 2  |
So for ID A and B (and so on), I want all rows that share the same (max) timestamp for that ID.
I tried: source | summarize arg_max(timestamp) by ID
That results only in:
| Timestamp | ID   | Other columns |
|-----------|------|---------------|
| Date A    | ID A | other data 1  |
| Date B    | ID B | other data 1  |
If I add further columns to the summarize, I get a number of rows that depends on the distinct entries in the other column, but also rows with timestamps that are not the latest.
query:
source | summarize arg_max(timestamp) by ID, other column
result:
| Timestamp | ID   | Other columns |
|-----------|------|---------------|
| Date A    | ID A | other data 1  |
| Date A    | ID A | other data 2  |
| Date B    | ID B | other data 1  |
| Date B    | ID B | other data 2  |
| Date C    | ID A | other data 1  |
| Date C    | ID A | other data 2  |
| Date D    | ID B | other data 1  |
| Date D    | ID B | other data 2  |
I hope that is understandable. I am grateful for any input.
Thanks in advance
Marcus
Option 1.
datatable(Timestamp:int, ID:string, OtherColumns:string)
[
4 ,"A" ,"other data 1"
,4 ,"A" ,"other data 2"
,3 ,"B" ,"other data 1"
,3 ,"B" ,"other data 2"
,2 ,"A" ,"other data 1"
,2 ,"A" ,"other data 2"
,1 ,"B" ,"other data 1"
,1 ,"B" ,"other data 2"
]
| partition hint.strategy=native by ID
(
order by Timestamp desc
| extend rr = row_rank(Timestamp)
| where rr == 1
| project-away rr
)
| Timestamp | ID | OtherColumns |
|-----------|----|--------------|
| 4         | A  | other data 1 |
| 4         | A  | other data 2 |
| 3         | B  | other data 1 |
| 3         | B  | other data 2 |
Option 2.
let t = datatable(Timestamp:int, ID:string, OtherColumns:string)
[
4 ,"A" ,"other data 1"
,4 ,"A" ,"other data 2"
,3 ,"B" ,"other data 1"
,3 ,"B" ,"other data 2"
,2 ,"A" ,"other data 1"
,2 ,"A" ,"other data 2"
,1 ,"B" ,"other data 1"
,1 ,"B" ,"other data 2"
];
t
| summarize Timestamp = max(Timestamp) by ID
| join kind=inner t on ID, Timestamp
| project-away *1
| ID | Timestamp | OtherColumns |
|----|-----------|--------------|
| A  | 4         | other data 1 |
| A  | 4         | other data 2 |
| B  | 3         | other data 1 |
| B  | 3         | other data 2 |
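For readers less familiar with Kusto, the idea behind Option 2 can be sketched outside the query language. arg_max keeps a single row per group, which is why the attempt in the question lost rows; computing the maximum timestamp per ID first and then keeping every row that matches it avoids that. The following is a minimal Python sketch of that two-step logic, not a replacement for the queries above; the function name latest_rows_per_id and the tuple layout are hypothetical and chosen only for illustration.

```python
# Hypothetical sample rows mirroring the datatable above:
# (Timestamp, ID, OtherColumns).
rows = [
    (4, "A", "other data 1"),
    (4, "A", "other data 2"),
    (3, "B", "other data 1"),
    (3, "B", "other data 2"),
    (2, "A", "other data 1"),
    (2, "A", "other data 2"),
    (1, "B", "other data 1"),
    (1, "B", "other data 2"),
]


def latest_rows_per_id(rows):
    """Keep every row whose timestamp equals the maximum timestamp of its ID."""
    # Step 1: comparable to `summarize Timestamp = max(Timestamp) by ID`.
    max_ts = {}
    for ts, id_, _ in rows:
        if id_ not in max_ts or ts > max_ts[id_]:
            max_ts[id_] = ts
    # Step 2: comparable to the inner join on (ID, Timestamp).
    return [row for row in rows if row[0] == max_ts[row[1]]]


result = latest_rows_per_id(rows)
for row in result:
    print(row)
```

Because the filter compares against the per-ID maximum instead of picking one winner per group, all rows that tie on the latest timestamp survive.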
I have the following data:
Row Column 1 Column 2 batch_date
1 Account 1 zipcode 1 11/28/2020
2 Account 1 zipcode 1 11/29/2020
3 Account 1 zipcode 1 11/30/2020
4 Account 1 zipcode 2 12/1/2020
5 Account 1 zipcode 2 12/2/2020
6 Account 1 zipcode 2 12/3/2020
7 Account 1 zipcode 2 12/4/2020
8 Account 1 zipcode 2 12/5/2020
9 Account 1 zipcode 2 12/6/2020
10 Account 1 zipcode 2 12/7/2020
11 Account 1 zipcode 2 12/8/2020
12 Account 1 zipcode 2 12/9/2020
13 Account 1 zipcode 2 12/10/2020
14 Account 1 zipcode 3 12/11/2020
15 Account 1 zipcode 3 12/12/2020
I would like to fetch data for this account for dates when the column2 (zipcode) has been changed.
Output should be like below:
Row Column 1 Column 2 batch_date
1 Account 1 zipcode 1 11/28/2020
2 Account 1 zipcode 2 12/1/2020
3 Account 1 zipcode 3 12/11/2020
How can we do this in BigQuery?
I have already tried the FIRST_VALUE() function, but the query results in a "resources issue".
I also tried a self join, but that does not give the desired output.
Can anybody help with this?
Below is for BigQuery Standard SQL
#standardsql
select * except(changed) from (
select *, column_2 != ifnull(lag(column_2) over win, '') changed
from `project.dataset.table`
window win as (partition by column_1 order by parse_date('%m/%d/%Y', batch_date) asc)
)
where changed
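Applied to the sample data from your question, the output is the three rows where the zipcode first changes: original rows 1, 4 and 14 (batch dates 11/28/2020, 12/1/2020 and 12/11/2020).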
Note: above code assumes your batch_date column is of STRING data type - thus use of parse_date function. If this column is of DATE data type - you don't need this function and can use just batch_date instead of parse_date('%m/%d/%Y', batch_date)
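To make the lag-based comparison easier to follow, here is a minimal Python sketch of the same idea: sort each account's rows by date, compare each zipcode with the previous one, and keep the rows where it differs. The shortened sample, the tuple layout and the helper names parse_batch_date and change_rows are hypothetical; the sketch assumes batch_date is an M/D/YYYY string, as in the note above.

```python
from datetime import datetime

# Hypothetical, shortened sample: (account, zipcode, batch_date as M/D/YYYY).
rows = [
    ("Account 1", "zipcode 1", "11/28/2020"),
    ("Account 1", "zipcode 1", "11/29/2020"),
    ("Account 1", "zipcode 2", "12/1/2020"),
    ("Account 1", "zipcode 2", "12/2/2020"),
    ("Account 1", "zipcode 3", "12/11/2020"),
]


def parse_batch_date(text):
    """Parse an M/D/YYYY string so rows sort chronologically, not as text."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def change_rows(rows):
    """Return the rows where an account's zipcode differs from its previous row."""
    ordered = sorted(rows, key=lambda r: (r[0], parse_batch_date(r[2])))
    previous = {}  # account -> zipcode seen on the previous row
    changed = []
    for account, zipcode, batch_date in ordered:
        # The first row of an account has no predecessor, so it always counts
        # as a change (the SQL gets the same effect from ifnull(lag(...), '')).
        if previous.get(account) != zipcode:
            changed.append((account, zipcode, batch_date))
        previous[account] = zipcode
    return changed


changes = change_rows(rows)
for row in changes:
    print(row)
```

Unlike grouping by zipcode, this comparison with the previous row would also report a change if an account later moved back to an earlier zipcode.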
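I managed to do it with the help of navigation functions in BigQuery:
-- Row is left out of the select list: it is unique per input row,
-- so DISTINCT could not collapse anything with it included.
SELECT DISTINCT c1, c2, FIRST_VALUE(batch_date)
OVER (PARTITION BY c1, c2 ORDER BY batch_date ASC) AS batch_date
FROM table;
I replaced "Column 1" with "c1" and "Column 2" with "c2" for the purpose of the example. This assumes batch_date is a DATE (a string in M/D/YYYY form would not sort chronologically) and that an account never returns to an earlier zipcode.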
Say I have this table:
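| id | timeline |
|----|----------|
| 1  | BASELINE |
| 1  | MIDTIME  |
| 1  | ENDTIME  |
| 2  | BASELINE |
| 2  | MIDTIME  |
| 3  | BASELINE |
| 4  | BASELINE |
| 5  | BASELINE |
| 5  | MIDTIME  |
| 5  | ENDTIME  |
| 6  | MIDTIME  |
| 6  | ENDTIME  |
| 7  | RISK     |
| 7  | RISK     |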
This is what the data looks like, except that the real data has more observations (a few thousand).
How do I get output that looks like this:
| id | timeline |
|----|----------|
| 1  | BASELINE |
| 1  | MIDTIME  |
| 2  | BASELINE |
| 2  | MIDTIME  |
| 5  | BASELINE |
| 5  | MIDTIME  |
How do I select the first two rows of each ID that has both of 2 specific timeline values (BASELINE and MIDTIME)? Notice that id 6 has MIDTIME and ENDTIME, and id 7 has two RISK rows; I don't want these two ids.
I used
SELECT *
FROM df
WHERE id IN (SELECT id FROM df GROUP BY id HAVING COUNT(*)=2)
and got IDs with two timeline values (output below) but don't know how to get rows with only BASELINE and MIDTIME.
| id | timeline |
|----|----------|
| 1  | BASELINE |
| 1  | MIDTIME  |
| 2  | BASELINE |
| 2  | MIDTIME  |
| 5  | BASELINE |
| 5  | MIDTIME  |
| 6  | MIDTIME  | ---- don't want this
| 6  | ENDTIME  | ---- don't want this
| 7  | RISK     | ---- don't want this
| 7  | RISK     | ---- don't want this
Many Thanks.
You can try using exists -
select * from t t1 where timeline in ('BASELINE','MIDTIME') and
exists
(select 1 from t t2 where t1.id=t2.id and timeline in ('BASELINE','MIDTIME')
group by t2.id having count(distinct timeline)=2)
OUTPUT:
id timeline
1 BASELINE
1 MIDTIME
2 BASELINE
2 MIDTIME
5 BASELINE
5 MIDTIME
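The EXISTS query above boils down to two checks: the row's timeline is one of the wanted values, and the row's id has both wanted values somewhere in the table. A minimal Python sketch of that logic follows; the function name rows_with_both and the tuple layout are hypothetical, and the sample is the table from the question.

```python
# Sample data from the question: (id, timeline).
rows = [
    (1, "BASELINE"), (1, "MIDTIME"), (1, "ENDTIME"),
    (2, "BASELINE"), (2, "MIDTIME"),
    (3, "BASELINE"),
    (4, "BASELINE"),
    (5, "BASELINE"), (5, "MIDTIME"), (5, "ENDTIME"),
    (6, "MIDTIME"), (6, "ENDTIME"),
    (7, "RISK"), (7, "RISK"),
]

WANTED = {"BASELINE", "MIDTIME"}


def rows_with_both(rows, wanted=WANTED):
    """Keep wanted-timeline rows for ids that have every wanted timeline."""
    # Collect, per id, which of the wanted timelines occur
    # (the counterpart of count(distinct timeline) in the subquery).
    seen = {}
    for id_, timeline in rows:
        if timeline in wanted:
            seen.setdefault(id_, set()).add(timeline)
    qualifying = {id_ for id_, found in seen.items() if found == set(wanted)}
    # Keep only the wanted rows of qualifying ids.
    return [(id_, t) for id_, t in rows if id_ in qualifying and t in wanted]


selected = rows_with_both(rows)
for row in selected:
    print(row)
```

Ids 3 and 4 drop out because they only have BASELINE, id 6 because it lacks BASELINE, and id 7 because it has neither value.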
I think this query should give you the result you want.
NOTE: As I understand it, you don't want IDs that have an "ENDTIME", yet in your sample data there is an "ENDTIME" for ID 1. I assumed this was an error, so I wrote a query that excludes every id containing "ENDTIME" (or "RISK").
WITH CTE AS
(
SELECT
id
FROM
df
WHERE
timeline IN ('ENDTIME', 'RISK')
)
SELECT
id,
timeline
FROM
df
WHERE
id NOT IN (SELECT id FROM CTE);
There are probably a number of ways to do this. Here's one that picks up the BASELINE and MIDTIME rows for IDs where both exist, ensuring there are only 2 rows per returned ID. Without knowing the ordering of timeline, I don't think it's possible to go further:
SELECT
id
, timeline
FROM (
SELECT
*
, SUM(CASE WHEN timeline = 'BASELINE' THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS BaselineCount
, SUM(CASE WHEN timeline = 'MIDTIME' THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS MidtimeCount
FROM df
WHERE df.timeline IN ('BASELINE', 'MIDTIME')
) subquery
WHERE subquery.BaselineCount > 0
AND subquery.MidtimeCount > 0
GROUP BY
id
, timeline
;
I have a table that records the history of each ID per LOCATION. This table is updated each day to keep track of the history of any change in a certain row(ID). Note: The date field is not in chronological order.
ID Location Count Date (datetime type)
1 A 20 2020-01-15T12:00:00.000
1 A 10 2020-04-15T12:00:00.000
1 A 15 2020-03-15T12:00:00.000
1 B 10 2020-05-15T12:00:00.000
1 B 5 2020-06-15T12:00:00.000
1 B 0 2020-07-15T12:00:00.000
2 A 18 2020-01-15T12:00:00.000
2 A 0 2020-04-15T12:00:00.000
2 A 14 2020-03-15T12:00:00.000
2 B 10 2020-05-15T12:00:00.000
2 B 5 2020-06-15T12:00:00.000
2 B 1 2020-07-15T12:00:00.000
For each unique ID, I need to pull the first instance (oldest date) when the Count value is zero. If a unique ID does not have an instance where its Count value is zero, I need to pull the most current Count value.
Here's what my results should look like below:
ID Location Count Date (datetime type)
1 A 10 2020-04-15T12:00:00.000
1 B 0 2020-07-15T12:00:00.000
2 A 0 2020-04-15T12:00:00.000
2 B 1 2020-07-15T12:00:00.000
I can't seem to wrap my head around how to code this in Google BigQuery.
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE
CASE COUNTIF(count = 0)
WHEN 0 THEN ARRAY_AGG(t ORDER BY date DESC LIMIT 1)
ELSE ARRAY_AGG(t ORDER BY count, date LIMIT 1)
END [OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY id, location
Applied to the sample data from your question, the output is:
Row id location count date
1 1 A 10 2020-04-15 12:00:00 UTC
2 1 B 0 2020-07-15 12:00:00 UTC
3 2 A 0 2020-04-15 12:00:00 UTC
4 2 B 1 2020-07-15 12:00:00 UTC
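The rule from the question (oldest zero-count row if one exists, otherwise the most recent row) can also be written out step by step. Below is a minimal Python sketch under two assumptions: rows are grouped by ID and Location, as in the query above and the expected output, and the dates are ISO-8601 strings in one fixed format so they compare chronologically as text. The function name pick_rows and the tuple layout are hypothetical.

```python
# Sample data from the question: (id, location, count, date).
rows = [
    (1, "A", 20, "2020-01-15T12:00:00.000"),
    (1, "A", 10, "2020-04-15T12:00:00.000"),
    (1, "A", 15, "2020-03-15T12:00:00.000"),
    (1, "B", 10, "2020-05-15T12:00:00.000"),
    (1, "B", 5, "2020-06-15T12:00:00.000"),
    (1, "B", 0, "2020-07-15T12:00:00.000"),
    (2, "A", 18, "2020-01-15T12:00:00.000"),
    (2, "A", 0, "2020-04-15T12:00:00.000"),
    (2, "A", 14, "2020-03-15T12:00:00.000"),
    (2, "B", 10, "2020-05-15T12:00:00.000"),
    (2, "B", 5, "2020-06-15T12:00:00.000"),
    (2, "B", 1, "2020-07-15T12:00:00.000"),
]


def pick_rows(rows):
    """Per (id, location): oldest zero-count row, else the most recent row."""
    groups = {}
    for row in rows:
        groups.setdefault((row[0], row[1]), []).append(row)
    picked = []
    for key in sorted(groups):
        members = groups[key]
        zeros = [r for r in members if r[2] == 0]
        if zeros:
            # At least one zero count: take the oldest such row.
            picked.append(min(zeros, key=lambda r: r[3]))
        else:
            # No zero count: take the most recent row.
            picked.append(max(members, key=lambda r: r[3]))
    return picked


picked = pick_rows(rows)
for row in picked:
    print(row)
```

The sketch filters for zero counts explicitly, whereas the query above sorts by count and date; the two agree as long as counts are never negative.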
Here is my table structure
Id INT
RecId INT
Dated DATETIME
Status INT
and here is my data.
Status table (contains different statuses)
Id Status
1 Created
2 Assigned
Log table (contains logs for the different statuses that a record went through (RecId))
Id RecId Dated Status
1 1 2013-12-09 14:16:31.930 1
2 7 2013-12-09 14:27:26.620 1
3 1 2013-12-09 14:27:26.620 2
3 8 2013-12-10 11:14:13.747 1
3 9 2013-12-10 11:14:13.747 1
3 8 2013-12-10 11:14:13.747 2
I need to generate a report from this data in the following format.
Dated Created Assigned
2013-12-09 2 1
2013-12-10 3 1
Here the row data is calculated per date. Created is calculated as (previous date's Created count - previous date's Assigned count) + today's Created count.
For example, on 2013-12-10 three entries were made to the log table, of which two have the status Created and one has the status Assigned. So in the view that I want to build for the report, for 2013-12-10 the view will return Created as 2 + 1 = 3, where 2 is the number of newly inserted Created records in the log table and 1 is the previous day's remaining record count (Created - Assigned = 2 - 1).
I hope the scenario is clear. Please ask me if further information is required.
Please help me with the sql to construct the above view.
This matches the expected result for the provided sample, but may require more testing.
with CTE as (
select
*
, row_number() over(order by dt ASC) as rn
from (
select
cast(created.dated as date) as dt
, count(created.status) as Created
, count(Assigned.status) as Assigned
, count(created.status)
- count(Assigned.status) as Delta
from LogTable created
left join LogTable assigned
on created.RecId = assigned.RecId
and created.status = 1
and assigned.Status = 2
and created.Dated <= assigned.Dated
where created.status = 1
group by
cast(created.dated as date)
) x
)
select
dt.dt
, dt.created + coalesce(nxt.delta,0) as created
, dt.assigned
from CTE dt
left join CTE nxt on dt.rn = nxt.rn+1
;
Result:
| DT | CREATED | ASSIGNED |
|------------|---------|----------|
| 2013-12-09 | 2 | 1 |
| 2013-12-10 | 3 | 1 |
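The carry-over arithmetic in the query above can be checked by hand with a small script. This is a minimal Python sketch that mirrors the query's approach, not a definitive reading of the requirement: it counts Created rows per day, counts how many of those records were later assigned, and adds the previous reported day's (Created - Assigned) to the current day's Created. It assumes at most one Assigned row per RecId and fixed-format timestamp strings; the function name build_report and the tuple layout are hypothetical.

```python
CREATED, ASSIGNED = 1, 2

# Log rows from the question: (RecId, Dated, Status).
log = [
    (1, "2013-12-09 14:16:31.930", CREATED),
    (7, "2013-12-09 14:27:26.620", CREATED),
    (1, "2013-12-09 14:27:26.620", ASSIGNED),
    (8, "2013-12-10 11:14:13.747", CREATED),
    (9, "2013-12-10 11:14:13.747", CREATED),
    (8, "2013-12-10 11:14:13.747", ASSIGNED),
]


def build_report(log):
    """Return (date, created incl. carry-over, assigned) per creation date."""
    # Assumption: each RecId is assigned at most once.
    assigned_at = {rec: dated for rec, dated, status in log if status == ASSIGNED}
    per_day = {}  # date -> [created count, assigned count]
    for rec, dated, status in log:
        if status != CREATED:
            continue
        counts = per_day.setdefault(dated[:10], [0, 0])
        counts[0] += 1
        # Like the query, an assignment is counted on the record's creation date.
        if rec in assigned_at and dated <= assigned_at[rec]:
            counts[1] += 1
    report = []
    carry = 0  # previous reported day's Created - Assigned
    for day in sorted(per_day):
        created, assigned = per_day[day]
        report.append((day, created + carry, assigned))
        carry = created - assigned
    return report


report = build_report(log)
for line in report:
    print(line)
```

As in the query, only the immediately preceding day's remainder is carried forward; whether remainders should accumulate over several days cannot be decided from a two-day sample.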
Suppose I have a table which has a "CDATE" representing the date when I retrieved the data, a "SECID" identifying the security I retrieved data for, a "SOURCE" designating where I got the data, and the "VALUE" which I got from the source. My data might look as follows:
CDATE | SECID | SOURCE | VALUE
--------------------------------
1/1/2012 1 1 23
1/1/2012 1 5 45
1/1/2012 1 3 33
1/4/2012 2 5 55
1/5/2012 1 5 54
1/5/2012 1 3 99
Suppose I have a HIERARCHY table like the following ("SOURCE" with greatest HIERARCHY number takes precedence):
SOURCE | NAME | HIERARCHY
---------------------------
1 ABC 10
3 DEF 5
5 GHI 2
Now let's suppose I want my results to be picked according to the hierarchy above. So, applying the hierarchy and selecting the source with the greatest HIERARCHY number, I would like to end up with the following:
CDATE | SECID | SOURCE | VALUE
---------------------------------
1/1/2012 1 1 23
1/4/2012 2 5 55
1/5/2012 1 3 99
This joins on your hierarchy and selects the top-ranked source for each date and security.
SELECT CDATE, SECID, SOURCE, VALUE
FROM (
SELECT t.CDATE, t.SECID, t.SOURCE, t.VALUE,
ROW_NUMBER() OVER (PARTITION BY t.CDATE, t.SECID
ORDER BY h.HIERARCHY DESC) as nRow
FROM table1 t
INNER JOIN table2 h ON h.SOURCE = t.SOURCE
) A
WHERE nRow = 1
You can get the results you want with the query below. It combines your data with your hierarchies and ranks the rows by hierarchy, highest first. Note that if a source is repeated for the same date and security, it will return just one of those rows, chosen arbitrarily.
;with rankMyData as (
select
d.CDATE
, d.SECID
, d.SOURCE
, d.VALUE
, row_number() over(partition by d.CDate, d.SECID order by h.HIERARCHY desc) as ranking
from DATA d
inner join HIERARCHY h
on h.source = d.source
)
SELECT
CDATE
, SECID
, SOURCE
, VALUE
FROM rankMyData
where ranking = 1
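Both queries above rely on the same pattern: join each data row to its hierarchy number, then keep the highest-ranked row per (CDATE, SECID). A minimal Python sketch of that pattern, using the sample data from the question, is shown below; the function name top_ranked and the tuple layout are hypothetical.

```python
# Sample data from the question: (CDATE, SECID, SOURCE, VALUE).
data = [
    ("1/1/2012", 1, 1, 23),
    ("1/1/2012", 1, 5, 45),
    ("1/1/2012", 1, 3, 33),
    ("1/4/2012", 2, 5, 55),
    ("1/5/2012", 1, 5, 54),
    ("1/5/2012", 1, 3, 99),
]

# HIERARCHY table reduced to SOURCE -> HIERARCHY number.
hierarchy = {1: 10, 3: 5, 5: 2}


def top_ranked(data, hierarchy):
    """Keep, per (CDATE, SECID), the row whose source ranks highest."""
    best = {}
    for row in data:
        cdate, secid, source, _ = row
        if source not in hierarchy:
            continue  # like the inner join, rows without a hierarchy are dropped
        key = (cdate, secid)
        current = best.get(key)
        if current is None or hierarchy[source] > hierarchy[current[2]]:
            best[key] = row
    # Dicts preserve insertion order, so groups come back in first-seen order.
    return list(best.values())


chosen = top_ranked(data, hierarchy)
for row in chosen:
    print(row)
```

When two rows in a group share the same source, this sketch keeps the first one it encounters, which matches the caveat that the choice between such rows is arbitrary.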