How to avoid incrementing a sequence within the same pandas group, based on datetime

I have a dataframe as follows:
id date day
A 01/01/2023 00:00 01
A 01/01/2023 00:00 01
B 01/01/2023 01:00 01
B 01/01/2023 01:00 01
A 01/01/2023 02:00 01
A 02/01/2023 00:00 02
The output I expect is
id date day count
A 01/01/2023 00:00 01 1
A 01/01/2023 00:00 01 1 (2 rows are 1 because they fall under same group)
B 01/01/2023 01:00 01 1 (this is 1 because ID is different)
B 01/01/2023 01:00 01 1
A 01/01/2023 02:00 01 2 (this is incremented because it happened on same day)
A 02/01/2023 00:00 02 1 (this is 1 because the day has changed)
Grouping is done on id, date and day, and you can assume the dataframe is sorted by id and date.

IIUC use:
df['count'] = df.groupby(['id','day'])['date'].rank(method='dense').astype(int)
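The one-liner can be checked against the sample data; a minimal reproduction, parsing the dd/mm/yyyy strings into real timestamps first:

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "id":   ["A", "A", "B", "B", "A", "A"],
    "date": ["01/01/2023 00:00", "01/01/2023 00:00", "01/01/2023 01:00",
             "01/01/2023 01:00", "01/01/2023 02:00", "02/01/2023 00:00"],
    "day":  ["01", "01", "01", "01", "01", "02"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M")

# Dense-rank the timestamps within each (id, day) group: identical
# timestamps share a rank, and the rank restarts whenever id or day changes.
df["count"] = df.groupby(["id", "day"])["date"].rank(method="dense").astype(int)
print(df["count"].tolist())  # -> [1, 1, 1, 1, 2, 1]
```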


How to aggregate over time elapsed in Netezza SQL?

I use Netezza and
I have table like this.
cert_id date value
-------- ------------ ------
01 2018-01-01 2
01 2018-01-02 1
01 2018-01-03 3
02 2018-02-06 2
02 2018-02-07 1
02 2018-02-08 4
02 2018-02-09 6
And I want the aggregated (over time) table to look like this:
cert_id date value
-------- ------------ ------
01 2018-01-01 2
01 2018-01-02 3
01 2018-01-03 6
02 2018-02-06 2
02 2018-02-07 3
02 2018-02-08 7
02 2018-02-09 13
One approach uses a correlated subquery to find the rolling sums:
SELECT
    cert_id,
    date,
    "value",
    (SELECT SUM(t2."value")
     FROM yourTable t2
     WHERE t1.cert_id = t2.cert_id AND t2.date <= t1.date) AS rolling_sum
FROM yourTable t1
ORDER BY
    cert_id,
    date;
If Netezza supports analytic functions, then here is an even simpler query:
SELECT
    cert_id,
    date,
    "value",
    SUM("value") OVER (PARTITION BY cert_id ORDER BY date) AS rolling_sum
FROM yourTable;
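The window-function query is just a grouped running total; as a cross-check, the same result in pandas (a sketch, assuming the table is already sorted by cert_id and date):

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame({
    "cert_id": ["01", "01", "01", "02", "02", "02", "02"],
    "date": ["2018-01-01", "2018-01-02", "2018-01-03",
             "2018-02-06", "2018-02-07", "2018-02-08", "2018-02-09"],
    "value": [2, 1, 3, 2, 1, 4, 6],
})

# SUM(value) OVER (PARTITION BY cert_id ORDER BY date) is a running total
# per cert_id; with the rows sorted by date, that is a grouped cumulative sum.
df["rolling_sum"] = df.groupby("cert_id")["value"].cumsum()
print(df["rolling_sum"].tolist())  # -> [2, 3, 6, 2, 3, 7, 13]
```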

Link date dimension with an interval fields in fact table

My problem is quite complex, for me anyway! I'll try to explain:
I have the following fact table :
ID Date_From Date_To Time_From Time_To ID_ACTIV
1 01/02/2018 25/05/2018 08:00:00 10:00:00 41
2 01/06/2018 01/07/2018 10:00:00 13:00:00 41
3 01/02/2018 10/02/2018 10:00:00 11:00:00 42
And a normal date dimension I want to link this dimension with fact tab to get this result table in example:
Date hour ACTIV_COUNT
31/01/2018 10 0
01/02/2018 07 0
01/02/2018 08 1
01/02/2018 09 1
01/02/2018 10 2
01/02/2018 11 1
.
.
01/06/2018 10 1
.
.
The only solution I have found is to create a named query with a field populated with every possible date and time in each interval, and then link it with the date and time dimensions.
Is there a better solution?
Thank you in advance.
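The workaround described above (expanding each interval into every date/hour it covers, then counting) can be sketched in pandas. This is only an illustration of the idea, with the fact rows recreated from the question and hour bounds treated as inclusive to match the expected counts:

```python
import pandas as pd

# Hypothetical recreation of the fact table; times kept as integer hours.
fact = pd.DataFrame({
    "ID": [1, 2, 3],
    "Date_From": ["2018-02-01", "2018-06-01", "2018-02-01"],
    "Date_To":   ["2018-05-25", "2018-07-01", "2018-02-10"],
    "Time_From": [8, 10, 10],
    "Time_To":   [10, 13, 11],
    "ID_ACTIV":  [41, 41, 42],
})

# Explode every interval into one row per (date, hour) it covers.
rows = [
    (d.strftime("%d/%m/%Y"), h)
    for r in fact.itertuples(index=False)
    for d in pd.date_range(r.Date_From, r.Date_To, freq="D")
    for h in range(r.Time_From, r.Time_To + 1)
]
grain = pd.DataFrame(rows, columns=["Date", "hour"])
counts = grain.groupby(["Date", "hour"]).size()

print(counts[("01/02/2018", 10)])  # -> 2
```

The zero rows in the expected output (e.g. 01/02/2018 hour 07) would come from left-joining this grain back onto the full date/hour dimension.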

Run a query that provides all data entries from the past week in BigQuery

I have a table with data and one of the columns is titled 'createdAt' and is a timestamp. Is there a query that I can run that selects all of the entries that would have been made in the previous week?
This is the code I have so far. I believe the solution involves a WHERE clause of some kind, but I am not sure how to write it.
#standardSQL
SELECT
Serial,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
In Standard SQL you can try something like this to see if the WHERE clause gets you the correct date range:
SELECT
MIN(createdAt),
MAX(createdAt)
FROM
`dataworks-356fa.FirebaseArchive.testf`
WHERE
EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1
Please note that BigQuery treats Sunday as the first day of the week. I don't know how to change that; it would be interesting if someone does, since in my country we consider Monday to be the first day of the week.
You can use DATE_TRUNC with the WEEK part to find the start of the week for a given date. For example,
#standardSQL
WITH Input AS (
SELECT DATE '2017-06-25' AS date, 1 AS x UNION ALL
SELECT DATE '2017-06-20', 2 UNION ALL
SELECT DATE '2017-06-26', 3 UNION ALL
SELECT DATE '2017-07-11', 4 UNION ALL
SELECT DATE '2017-07-09', 5
)
SELECT
DATE_TRUNC(date, WEEK) AS week,
MAX(x) AS max_x
FROM Input
GROUP BY week;
In your particular case, it would be:
#standardSQL
SELECT
Serial,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH" AND
createdAt >= DATE_TRUNC(CURRENT_DATE(), WEEK)
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
Alternatively, if you are just looking for dates in the past seven days, you can use a query of this form:
#standardSQL
SELECT
Serial,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH" AND
createdAt >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 WEEK)
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
selects all of the entries that would have been made in the previous week?
The query below is for BigQuery Standard SQL and restricts the data to the previous week only (not the current week):
#standardSQL
SELECT
Serial,
SUM(ConnectionTime/3600) AS Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`,
UNNEST([DATE_SUB(CURRENT_DATE(), INTERVAL CAST(FORMAT_DATE('%w', CURRENT_DATE()) AS INT64) DAY)]) AS first_day_of_week
WHERE Model = 'BlueBox-pH'
AND createdAt
BETWEEN DATE_SUB(first_day_of_week, INTERVAL 7 DAY)
AND DATE_SUB(first_day_of_week, INTERVAL 1 DAY)
GROUP BY Serial
-- ORDER BY Serial
-- LIMIT 1000
To understand how the past-week logic works, run the query below:
#standardSQL
WITH dates AS (
SELECT createdAt
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-13', INTERVAL 1 DAY)) AS createdAt
)
SELECT
createdAt,
FORMAT_DATE('%a', createdAt) AS weekday,
FORMAT_DATE('%U', createdAt) AS week_start_Sunday,
FORMAT_DATE('%W', createdAt) AS week_start_Monday,
FORMAT_DATE('%V', createdAt) AS week_start_Monday_prorated,
DATE_SUB(createdAt, INTERVAL weekday_num DAY) AS first_day_of_week_Sunday,
DATE_SUB(createdAt, INTERVAL weekday_num - 1 DAY) AS first_day_of_week_Monday,
DATE_SUB(DATE_SUB(createdAt, INTERVAL weekday_num DAY), INTERVAL 7 DAY) AS first_day_of_prev_week_Sunday,
DATE_SUB(DATE_SUB(createdAt, INTERVAL weekday_num - 1 DAY), INTERVAL 7 DAY) AS first_day_of_prev_week_Monday
FROM dates, UNNEST([CAST(FORMAT_DATE('%w', createdAt) AS INT64)]) AS weekday_num
ORDER BY createdAt
the output is -
createdAt   weekday  week_start_  week_start_  week_start_      first_day_of_  first_day_of_  first_day_of_prev_  first_day_of_prev_
                     Sunday       Monday       Monday_prorated  week_Sunday    week_Monday    week_Sunday         week_Monday
---------------------------------------------------------------------------------------------------
2017-01-01 Sun 01 00 52 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-02 Mon 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-03 Tue 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-04 Wed 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-05 Thu 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-06 Fri 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-07 Sat 01 01 01 2017-01-01 2017-01-02 2016-12-25 2016-12-26
2017-01-08 Sun 02 01 01 2017-01-08 2017-01-09 2017-01-01 2017-01-02
2017-01-09 Mon 02 02 02 2017-01-08 2017-01-09 2017-01-01 2017-01-02
2017-01-10 Tue 02 02 02 2017-01-08 2017-01-09 2017-01-01 2017-01-02
2017-01-11 Wed 02 02 02 2017-01-08 2017-01-09 2017-01-01 2017-01-02
2017-01-12 Thu 02 02 02 2017-01-08 2017-01-09 2017-01-01 2017-01-02
2017-01-13 Fri 02 02 02 2017-01-08 2017-01-09 2017-01-01 2017-01-02
As you can see, my answer uses the first_day_of_week_Sunday logic to calculate first_day_of_week.
If, like @Wouter, you consider Monday to be the first day of the week in your country, use the first_day_of_week_Monday logic instead.
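The first_day_of_week arithmetic above can be mirrored in plain Python; previous_week_bounds and its convention flag are illustrative names, not part of BigQuery:

```python
from datetime import date, timedelta

def previous_week_bounds(today, week_starts_on_sunday=True):
    """Return (first_day, last_day) of the week before the one containing today."""
    # weekday(): Monday == 0 ... Sunday == 6.  Convert to "days since the
    # start of the current week" for the chosen convention.
    if week_starts_on_sunday:
        days_into_week = (today.weekday() + 1) % 7   # Sunday -> 0
    else:
        days_into_week = today.weekday()             # Monday -> 0
    first_day_of_week = today - timedelta(days=days_into_week)
    # The previous week runs from 7 days before that start to the day before it.
    return (first_day_of_week - timedelta(days=7),
            first_day_of_week - timedelta(days=1))

# 2017-01-10 was a Tuesday; the previous Sunday-based week is Jan 1 - Jan 7,
# matching the first_day_of_prev_week_Sunday column in the table above.
print(previous_week_bounds(date(2017, 1, 10)))
```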

how to find the date difference in hours between two records with nearest datetime value and it must be compared in same group

How can I find the difference in hours between each record and the record with the nearest datetime value within the same group?
Sample Data as follows:
Select * from tblGroup
Group FinishedDatetime
1 03-01-2009 00:00
1 13-01-2009 22:00
1 08-01-2009 03:00
2 01-01-2009 10:00
2 13-01-2009 20:00
2 10-01-2009 10:00
3 27-10-2008 00:00
3 29-10-2008 00:00
Expected Output :
Group FinishedDatetime Hours
1 03-01-2009 00:00 123
1 13-01-2009 22:00 139
1 08-01-2009 03:00 117
2 01-01-2009 10:00 216
2 13-01-2009 20:00 82
2 10-01-2009 10:00 82
3 27-10-2008 00:00 48
3 29-10-2008 00:00 48
Try this:
SELECT t1.[Group], DATEDIFF(HOUR, z.FinishedDatetime, t1.FinishedDatetime) AS Hours
FROM tblGroup t1
OUTER APPLY (SELECT TOP 1 *
             FROM tblGroup t2
             WHERE t2.[Group] = t1.[Group]
               AND t2.FinishedDatetime < t1.FinishedDatetime
             ORDER BY FinishedDatetime DESC) z
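Note that the OUTER APPLY above only looks backwards to the immediately preceding row, while the question asks for the nearest neighbour in either direction. A pandas sketch of the symmetric version (column names taken from the question, the suspected 10:01-2009 typo read as 10-01-2009):

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 2, 3, 3],
    "FinishedDatetime": pd.to_datetime([
        "2009-01-03 00:00", "2009-01-13 22:00", "2009-01-08 03:00",
        "2009-01-01 10:00", "2009-01-13 20:00", "2009-01-10 10:00",
        "2008-10-27 00:00", "2008-10-29 00:00",
    ]),
})

# Sort within each group so adjacent rows are the candidate neighbours.
df = df.sort_values(["Group", "FinishedDatetime"]).reset_index(drop=True)
g = df.groupby("Group")["FinishedDatetime"]
prev_gap = g.diff()          # NaT for the first row of each group
next_gap = g.diff(-1).abs()  # NaT for the last row of each group

# The nearest neighbour is whichever adjacent gap is smaller (NaT is skipped).
df["Hours"] = pd.concat([prev_gap, next_gap], axis=1).min(axis=1) / pd.Timedelta(hours=1)
print(df)
```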

SQL How Many Employees Are Working, Group By Hour

I have a table, a timetable, with check-in and check-out times of the employees:
ID Date Check-in Check out
1 1-1-2011 11:00 18:00
2 1-1-2011 11:00 19:00
3 1-1-2011 16:00 18:30
4 1-1-2011 17:00 20:00
Now I want to know how many employees are working, every (half) hour.
The result I want to see:
Hour Count
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 2.5
19 1
Each 'Hour' should be read as 'until the next full hour', e.g. 11 -> 11:00 - 12:00.
Any ideas?
Build an additional table, called Hours, containing the following data:
h
00:00
00:30
01:00
...
23:30
then, run
SELECT h AS 'hour', COUNT(ID) AS 'count'
FROM timetable, hours
WHERE [Check_in] <= h AND h <= [Check_out]
GROUP BY h
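The half-hour Hours table and the range join above can be sketched in pandas, using minutes since midnight for the slots (an illustrative encoding, not from the original answer; check-out is treated as exclusive so an 18:30 check-out covers the 18:00 slot but not the 18:30 one):

```python
import pandas as pd

def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    h, m = map(int, hhmm.split(":"))
    return 60 * h + m

# Timetable from the question.
tt = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "check_in":  [minutes(t) for t in ["11:00", "11:00", "16:00", "17:00"]],
    "check_out": [minutes(t) for t in ["18:00", "19:00", "18:30", "20:00"]],
})

# The suggested Hours table: one slot per half hour of the day.
slots = range(0, 24 * 60, 30)

# Headcount per slot: who is present at the slot's start.
per_slot = pd.Series(
    {s: ((tt["check_in"] <= s) & (s < tt["check_out"])).sum() for s in slots}
)

# Average the two half-hour slots in each hour to get the fractional
# counts from the question, e.g. 2.5 for 18:00-19:00.
per_hour = per_slot.groupby(per_slot.index // 60).mean()
print(per_hour[18])  # -> 2.5
```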