Count different actions within one hour in python - pandas

I am starting to work with time series. I have a series of bank transfers made by one user to different countries. The most frequent destination country is X, but there are also transfers to the countries Y and Z. Let's say:
date id country
2020-01-01T00:00:00.000Z id_01 X
2020-01-01T00:20:00.000Z id_02 X
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:35:00.000Z id_04 X
2020-01-01T00:45:00.000Z id_05 Z
2020-01-01T01:00:00.000Z id_06 X
2020-01-01T10:20:00.000Z id_07 X
2020-01-01T10:25:00.000Z id_08 X
2020-01-01T13:00:00.000Z id_09 X
2020-01-01T18:45:00.000Z id_10 Z
2020-01-01T18:55:00.000Z id_11 X
Since the most frequent country is X, I would like to count, over a rolling one-hour window across the whole list of events, how many transactions have been made to countries other than X.
The format of the expected output for this particular case would be:
date id country
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:45:00.000Z id_05 Z
Starting from 2020-01-01T00:00:00.000Z, within one hour there are two non-X transactions (one Y and one Z). Starting from 2020-01-01T00:20:00.000Z, within one hour, there are the same two transactions, and so on. Starting from 2020-01-01T10:20:00.000Z, within one hour, all transfers go to X. Starting from 2020-01-01T18:45:00.000Z, within one hour, there is only one Z.
I am trying with a double for loop and .value_counts(), but I'm not sure of what I am doing.

Have you considered using a time-series database for this? It could make your life easier if you are doing a lot of event-based aggregations with arbitrary time intervals. Time-series databases abstract this for you so all you need is to send a query and get the results into pandas. It's also going to run considerably faster.
For example, hourly aggregations can be done using the following syntax in QuestDB:
select timestamp, country, count() from yourTable SAMPLE BY 1h
This will return results like this:
| timestamp | country | count |
| 2020-06-22T00:00:00 | X | 234 |
| 2020-06-22T00:00:00 | Y | 493 |
| 2020-06-22T01:00:00 | X | 12 |
| 2020-06-22T01:00:00 | Y | 66 |
You can adjust this to monthly, weekly, or 5-minute resolution without having to rewrite your logic; all you need to do is change the 1h to 1M, 7d, or 5m, or pass it as an argument.
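For comparison, the same hourly-bucket-per-country count can be sketched in pandas (using a shortened version of the sample data from the question; this is an illustrative equivalent, not QuestDB output):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01T00:00:00", "2020-01-01T00:20:00",
             "2020-01-01T00:25:00", "2020-01-01T00:35:00",
             "2020-01-01T00:45:00", "2020-01-01T01:00:00"],
    "id": ["id_01", "id_02", "id_03", "id_04", "id_05", "id_06"],
    "country": ["X", "X", "Y", "X", "Z", "X"],
})
df["date"] = pd.to_datetime(df["date"])

# rough equivalent of SAMPLE BY 1h with a per-country count
counts = df.groupby(["country", pd.Grouper(key="date", freq="1h")]).size()
print(counts)
```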
Now, to get results one hour before and after the timestamp of your target transaction, you can add a timestamp interval search to the above. For example, assuming your target transaction happened at 2010-01-01T06:47:00.000000Z, the resulting query would be:
select timestamp, country, count() from yourTable
where timestamp = '2010-01-01T05:47:00.000000Z;2h'
sample by 1h;
If this is something which would work for you, there is a tutorial on how to run this type of query in QuestDB and get the results into pandas here

IIUC, you can select only the non-X rows, then use diff once forward and once backward, and keep the rows where either neighbouring non-X transaction is within a Timedelta of 1h.
import pandas as pd

# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# mask non-X rows and select only these
mX = df['country'].ne('X')
df_ = df[mX].copy()
# mask within an hour before or after; take abs() of the backward diff,
# since diff(-1) yields negative timedeltas that would always pass .le()
m1H = (df_['date'].diff().abs().le(pd.Timedelta(hours=1)) |
       df_['date'].diff(-1).abs().le(pd.Timedelta(hours=1)))
# select only the rows meeting both criteria (non-X and within 1h)
df_ = df_[m1H]
print(df_)
date id country
2 2020-01-01 00:25:00+00:00 id_03 Y
4 2020-01-01 00:45:00+00:00 id_05 Z
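For reference, here is the whole approach as a self-contained sketch on the sample data (the backward diff is taken as an absolute value, since diff(-1) produces negative timedeltas that would otherwise always satisfy a le comparison):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01T00:00:00Z", "2020-01-01T00:20:00Z",
             "2020-01-01T00:25:00Z", "2020-01-01T00:35:00Z",
             "2020-01-01T00:45:00Z", "2020-01-01T01:00:00Z",
             "2020-01-01T10:20:00Z", "2020-01-01T10:25:00Z",
             "2020-01-01T13:00:00Z", "2020-01-01T18:45:00Z",
             "2020-01-01T18:55:00Z"],
    "id": [f"id_{i:02d}" for i in range(1, 12)],
    "country": ["X", "X", "Y", "X", "Z", "X", "X", "X", "X", "Z", "X"],
})
df["date"] = pd.to_datetime(df["date"])

# keep only non-X rows
df_ = df[df["country"].ne("X")].copy()

# a non-X row qualifies if its nearest non-X neighbour is within one hour
one_hour = pd.Timedelta(hours=1)
m1H = (df_["date"].diff().abs().le(one_hour) |
       df_["date"].diff(-1).abs().le(one_hour))
df_ = df_[m1H]
print(df_)
```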

You can try:
df['date'] = pd.to_datetime(df.date)
(df.country != 'X').groupby(by=df.date.dt.hour).sum()
First it turns your date column into a datetime. Then it tests whether country is not 'X', groups by hour, and sums the number of transfers to countries other than 'X'. Note that the groups are clock hours, not a rolling elapsed-time window. Hope it solves your problem!
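Run against a subset of the sample data, this gives per-clock-hour counts of non-X transfers; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01T00:25:00Z", "2020-01-01T00:45:00Z",
             "2020-01-01T01:00:00Z", "2020-01-01T18:45:00Z"],
    "country": ["Y", "Z", "X", "Z"],
})
df["date"] = pd.to_datetime(df["date"])

# count non-X transfers per clock hour
counts = (df.country != "X").groupby(by=df.date.dt.hour).sum()
print(counts)
```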

Related

How to get correct min, max date for each customer's changing label in wide format in BigQuery?

I have a table that records customer purchases, for example:
| customer_id | label | date | purchase_id | price |
| 2 | A | 2022-01-01 | asd | 10 |
| 3 | A | 2022-01-01 | asdf | 5 |
| 4 | B | 2022-02-04 | asdfg | 200 |
| 2 | A | 2022-01-03 | asdjg | 4 |
| 3 | B | 2022-02-01 | dfs | 20 |
| 2 | G | 2022-04-05 | fdg | 40 |
| 2 | G | 2022-04-10 | fdg | 40 |
| 2 | A | 2022-06-06 | fgd | 20 |
I want to see how many days/money each customer has spent in each label, so far what I'm doing is:
SELECT
customer_id,
label,
COUNT(DISTINCT purchase_id) as orders_count,
SUM(price) as total_spent,
min(date) as first_date,
max(date) as last_date,
DATE_DIFF(max(date), min(date), DAY) as days
FROM
TABLE
WHERE
date > '2022-01-01'
GROUP BY
customer_id,
label
which gives me a long table, like this:
| customer_id | label | orders_count | total_spent | first_date | last_date | days |
| 2 | A | 3 | 34 | 2022-01-01 | 2022-06-06 | 180 |
| 2 | G | 1 | 40 | 2022-04-05 | 2022-04-10 | 5 |
etc
Just for simplicity I show a few columns, but customers place orders all the time. The issue with the above is that, for example, customer 2 starts with label A, changes to G, then goes back to A, and this is not visible in the results table (min(date) is correct, but max(date) is taken from their second stretch of A). I would also prefer to have the result in wide format. Ideally, columns called next_label_{i} holding the values for each label change would be best for me.
Could you advise me on a way of a) dealing with this label change (a later label being the same as an earlier label), and b) producing it in wide format?
Thanks
edit:
example output (correct dates, wide format) [columns would go as wide as the max number of unique labels for any customer]
| customer_id | first_label | first_first_date | first_last_date | first_total_spent | first_days | next_label | next_first_date | next_last_date | next_days | next_label_2 | next_first_date_2 | next_last_date_2 | next_days_2 |
| 2 | A | 2022-01-01 | 2022-01-03 | 2 | 14 | G | 2022-04-05 | 2022-04-05 | 0 | A | 2022-06-06 | 2022-06-06 | 0 |
etc
Sorry this is not exactly accurate (missing the orders_count and total_spent); it's a pain to format it here, but hopefully you get the idea. In principle, it's something like applying Python's pivot_table to the previous dataset.
Alternatively, I'd be glad for just a solution in the long format that distinguishes between a customer's label and the same customer's repeated label ( as in customer 2 who starts with A and after changing to G, returns to A)
Could you advise me of ... b) a way to produce it into a wide format?
First, I want to say that I hope you have a really good reason to want that output, as it is usually not considered best practice and is rather left for the presentation layer to handle.
With that in mind - consider below approach
select * from (
select customer_id, offset, purchase.*
from (
select customer_id,
array_agg((struct(label, date, purchase_id, price)) order by date) purchases
from your_table
group by customer_id
), unnest(purchases) purchase with offset
order by customer_id, offset
)
pivot (
any_value(label) label,
any_value(date) date,
any_value(purchase_id) purchase_id,
any_value(price) price
for offset in (0,1,2,3,4,5)
)
if applied to sample data in your question - output is
Note: the above makes the simplifying assumption that you know the max number of steps (in this case I used 6 - from 0 till 5). There are plenty of posts here on SO that show how to use the same technique to make it dynamic. I do not want to duplicate them as it is against SO policies. So, just do your extra homework on this :o)
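If you end up post-processing in pandas anyway (the question mentions pivot_table), the same run-then-pivot idea can be sketched as below. The run column is a hypothetical helper that increments whenever a customer's label changes, which also handles part (a), distinguishing a repeated label from its earlier occurrence:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [2, 3, 4, 2, 3, 2, 2, 2],
    "label": ["A", "A", "B", "A", "B", "G", "G", "A"],
    "date": ["2022-01-01", "2022-01-01", "2022-02-04", "2022-01-03",
             "2022-02-01", "2022-04-05", "2022-04-10", "2022-06-06"],
    "purchase_id": ["asd", "asdf", "asdfg", "asdjg", "dfs", "fdg", "fdg", "fgd"],
    "price": [10, 5, 200, 4, 20, 40, 40, 20],
})
df = df.sort_values(["customer_id", "date"])

# start a new run whenever the label differs from the previous row
# of the same customer
df["run"] = (df.groupby("customer_id")["label"]
               .transform(lambda s: (s != s.shift()).cumsum()))

runs = (df.groupby(["customer_id", "run"])
          .agg(label=("label", "first"),
               first_date=("date", "min"),
               last_date=("date", "max"),
               orders_count=("purchase_id", "nunique"),
               total_spent=("price", "sum")))

# one row per customer, one column group per run (wide format)
wide = runs.unstack("run")
print(wide)
```

Customer 2 then gets three runs (A, G, A) rather than two collapsed labels.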

Bring all items for a key using a date filter

I have a list of documents that share the same key but have different dates, and I'm trying to retrieve all items for a key using a date filter. I'm using Tableau Desktop.
For example, my table is:
| Document | Date | Key |
| A | 01/01/2021 | X |
| B | 01/02/2021 | X |
| C | 01/03/2021 | X |
| D | 01/04/2021 | X |
| E | 01/05/2021 | X |
| F | 01/06/2021 | X |
| G | 01/07/2021 | Y |
| H | 01/08/2021 | Y |
If I filter on feb/2021, since my key X has the date 01/02/2021, the results should be:
| Document | Date | Key |
| A | 01/01/2021 | X |
| B | 01/02/2021 | X |
| C | 01/03/2021 | X |
| D | 01/04/2021 | X |
| E | 01/05/2021 | X |
| F | 01/06/2021 | X |
If instead I filter on the date aug/2021, it should be:
| Document | Date | Key |
| G | 01/07/2021 | Y |
| H | 01/08/2021 | Y |
What I tried: I created a date parameter "Insert Date" to insert a single date, and I created a calculated field "Select Date" using FIXED, like the code below:
{ FIXED [Key] : MAX([Date] = [Insert Date])}
I got it working for a single day, but I need it to work for an entire month.
It sounds like you want to first identify the set of keys that show up at least once within your selected date range, and then include all records for the identified keys, regardless of the date on each record.
If so, you don't want to filter record based on the dates, but you do want the user to specify a date (or date range). So instead of using a filter control for the date, use a parameter - either to allow the user to select a day within the month you want, or have two parameters to select the start and end of a range.
Then for the filter, define an aggregate calculation or a set that determines whether a key occurs in the proposed time frame. A set named, say, KEY_OF_INTEREST, based on the Key field and defined by a condition similar to the following, should work:
MAX([Date] >= [START DATE PARAM] and [Date] <= [END DATE PARAM])
That expression is True if at least one record falls within the specified date range, so the set will include exactly the Keys that have at least one document record in the time range.
Then just use the set to filter to the interesting keys.
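Outside Tableau, the same keep-the-whole-key logic can be sketched in pandas (column names follow the sample table; the start and end bounds play the role of the two parameters):

```python
import pandas as pd

df = pd.DataFrame({
    "Document": list("ABCDEFGH"),
    "Date": ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01",
             "2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01"],
    "Key": ["X"] * 6 + ["Y"] * 2,
})
df["Date"] = pd.to_datetime(df["Date"])

start, end = pd.Timestamp("2021-02-01"), pd.Timestamp("2021-02-28")

# keep every row whose Key has at least one Date inside the range
mask = df.groupby("Key")["Date"].transform(
    lambda s: s.between(start, end).any())
result = df[mask]
print(result)
```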

Combining data from one column into one with multiplication

I am trying to find a way to add up amounts with the same ID but different units, while also performing multiplication or division on them before adding them together.
The column time describes the amount of time spent doing a certain task.
There are four different values the time column can have which are:
- Uur (which stands for Hours)
- Minuut (which stands for Minutes)
- Etmaal (which stands for 24 hours)
- Dagdeel (which stands for 4 hours)
What I'd like is to transform them all into hours which should eventually return the row:
ID | Amount | Time |
---------------------
82 | 1690634 | Uur |
So only one unit remains.
This means rows that contain minuut will have their amount divided by 60
Rows that contain etmaal will have their amount multiplied by 24
and rows that contain dagdeel will have their amount multiplied by 4
I think the logic is:
select id,
(case when time = 'minuut' then amount / 60
when time = 'etmaal' then amount * 24
when time = 'dagdeel' then amount * 4
when time = 'uur' then amount
end) as amount,
time
from table
Please use the query below:
select id,
case when time = 'minuut' then amount/60
when time = 'etmaal' then amount*24
when time = 'dagdeel' then amount*4
else amount end as amount,
time
from table;
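The same convert-then-sum step can be sketched in pandas (the amounts here are made up for illustration, not the figures from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [82, 82, 82, 82],
    "Amount": [60, 2, 1, 5],
    "Time": ["Minuut", "Etmaal", "Dagdeel", "Uur"],
})

# conversion factor from each unit to hours
to_hours = {"Uur": 1, "Minuut": 1 / 60, "Etmaal": 24, "Dagdeel": 4}
df["Hours"] = df["Amount"] * df["Time"].map(to_hours)

# one row per ID, everything expressed in hours
out = df.groupby("ID", as_index=False)["Hours"].sum()
print(out)
```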

Creating a Time Range in SQL

I want to make a table (Table A) in Hive that has three columns. This table has times starting from 5AM and ending at 2AM the next day. Each row is a 5 minute increment from the previous row.
The first two columns are this (and I don't know how to generate this).
start_time | end_time
5:00:00 | 5:05:00
5:05:01 | 5:10:00
...
23:55:01 | 00:00:00
...
1:55:01 | 02:00:00
Does anyone know how to do the above?
To shed some background:
Once I have Table A created, I want to use another table (Table B) that I have with epoch times for each record representing a customer visit, extract the necessary hour/minute/second information, and then provide a sum count of visitors for each time interval in a third column of Table A, say, "customer_count".
I think I know to do the calculation for "customer_count" column for Table A, however, what I need help with is making the first two columns in Table A.
You could do it the other way around:
Crop from table B the dates you are interested in
Group by 5 minute increments (calculated by (time-start_time) / 60 / 5 assuming the epoch is in seconds)
Then turn the increments back into dates and calculate the second end_time column
Something like this:
select from_unixtime(<start time> + period*60*5) as start_time,
       from_unixtime(<start time> + (period+1)*60*5) as end_time,
       cnt
from (select floor((time - <start time>)/(60*5)) as period,
             count(*) as cnt
      from tableB
      where time >= <start time> and time <= <end time>
      group by floor((time - <start time>)/(60*5))) t
Note that you won't receive times with zero count (no visits during a period)
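If the epoch times end up in Python anyway, the same 5-minute bucketing can be sketched in pandas (the epoch values are invented for illustration):

```python
import pandas as pd

# visit times as epoch seconds (illustrative: 05:00, 05:01, 05:05:01, 05:15)
epochs = pd.Series([18000, 18060, 18301, 18900])
times = pd.to_datetime(epochs, unit="s")

# floor each visit to the start of its 5-minute bucket and count
starts = times.dt.floor("5min")
counts = starts.value_counts().sort_index()
print(counts)
```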

Performing math on SELECT result rows

I have a table that houses customer balances and I need to be able to see when accounts figures have dropped by a certain percentage over the previous month's balance per account.
My output consists of an account id, year_month combination code, and the month ending balance. So I want to see if February's balance dropped by X% from January's, and if January's dropped by the same % from December. If it did drop then I would like to be able to see what year_month code it dropped in, and yes I could have 1 account with multiple drops and I hope to see that.
Does anyone have any ideas on how to perform this within SQL?
EDIT: Adding some sample data as requested. On the table I am looking at I have year_month as a column, but I do have access to get the last business day date per month as well
account_id | year_month | ending balance
1 | 2016-1 | 50000
1 | 2016-2 | 40000
1 | 2016-3 | 25
Output that I would like to see is the year_month code when the ending balance has at least a 50% decline from the previous month.
First, I would recommend making Year_Month a yyyy-mm-dd date for this calculation. Then join the table to itself, matching each row to the same account's prior-month row, and perform your calculation in the select. So you could do something like this:
SELECT x.*,
x.EndingBalance - y.EndingBalance
FROM Balances x
INNER JOIN Balances y ON x.AccountID = y.AccountID
and y.YearMonth = DATEADD(month, -1, x.YearMonth)
WHERE x.EndingBalance <= y.EndingBalance * 0.5