DOB field in Pyspark - apache-spark-sql

I want to calculate age from a DOB field, but in my code I am hard-coding the end date; I need to do it dynamically, as today - DOB. Similarly, I also want to calculate duration from start_date. My data frame looks like:
id dob start_date
77 30/09/1990 2019-04-13 15:27:22
65 15/12/1988 2018-12-26 23:28:12
3 08/12/2000 2018-12-26 23:28:17
What I have so far, for the age calculation:
import pandas as pd
import pyspark.sql.functions as F
df = df.withColumn('dob', F.to_date(F.unix_timestamp(F.col('dob'), 'dd/MM/yyyy').cast("timestamp")))
end_date = '3/09/2019'
end_date = pd.to_datetime(end_date, format="%d/%m/%Y")
df = df.withColumn('end_date', F.unix_timestamp(F.lit(end_date), 'dd/MM/yyyy').cast("timestamp"))
df = df.withColumn('age', F.datediff(F.col('end_date'), F.col('dob')) / 365)
df = df.withColumn('age', F.round(df['age'], 0))
For the duration calculation:
end_date_1 = '2019-09-30'
end_date_1 = pd.to_datetime(end_date_1, format="%Y-%m-%d")
df = df.withColumn('end_date_1', F.unix_timestamp(F.lit(end_date_1), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
df = df.withColumn('duration', F.datediff(F.col('end_date_1'), F.col('start_date')))
In the two snippets above I have hard-coded two values: end_date = '3/09/2019' and end_date_1 = '2019-09-30'. But I want both to be based on today's date. How do I do that in PySpark?

You can use date.today() to get today's date, since you are using Python with Spark.
More information about date formatting can be found in the official Python documentation.
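For example, here is a rough sketch of how the hard-coded literals could be replaced (assuming df is the frame from the question with dob already parsed; F.current_date() is Spark's built-in current-date function, and date.today() is the plain-Python route mentioned above):
from datetime import date
import pyspark.sql.functions as F
# Option 1: build the end-date literal from date.today() instead of a hard-coded string
today = date.today()
df = df.withColumn('end_date', F.lit(today.strftime('%Y-%m-%d')).cast('date'))
# Option 2: let Spark supply the current date directly
df = df.withColumn('end_date', F.current_date())
# Age in (approximate) 365-day years and duration in days, measured up to today
df = df.withColumn('age', F.round(F.datediff(F.col('end_date'), F.col('dob')) / 365, 0))
df = df.withColumn('duration', F.datediff(F.col('end_date'), F.col('start_date')))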

Related

Pandas get number of row for specific date with time as index

I have a dataframe with datetime as index (the values in the columns don't matter). I would like to extract the row number where the index is equal to a specific date (2021/06/25 16:00:00). How can I do that? The only way I found was to add a count column and use loc, but I wondered if there was a better way to do it.
import pandas as pd
from datetime import date, datetime, timedelta
sdate = datetime(2021,6,5,14) # Start date
edate = datetime(2021,6,30) # End Date
date_to_find = datetime(2021,6,25,16)
df = pd.DataFrame(index=pd.date_range(sdate, edate, freq='H')) # Create a dataframe with date as index, every hour
df.insert(0, 'row_num', range(0,len(df))) # here you insert the row count
df.loc[df.index == date_to_find]['row_num'][0] # Grab the row number for which the index equals the date to find
You can try
df.index.get_loc(date_to_find)
which is faster and more readable.
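As a quick check on the DataFrame built above (same sdate, edate and date_to_find as in the question), both approaches should return the same position:
pos_loop = df.loc[df.index == date_to_find]['row_num'][0]  # count-column + loc approach
pos_direct = df.index.get_loc(date_to_find)                # direct index lookup
assert pos_loop == pos_direct                              # 482 for this hourly range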

Date column shifts and is no longer callable

I am using pandas groupby to group duplicate dates by their pm25 values to get one average. However when I use the groupby function, the structure of my dataframe changes, and I can no longer call the 'Date' Column.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
Here is my current code:
Before using df.groupby, my df looks like:
[screenshot: df before groupby]
I use groupby:
df.groupby('Date').mean('pm25')
print(df)
[screenshot: df after groupby]
And after, I cannot call the 'Date' column anymore or sort the column
print(df['Date'])
Returns just
KeyError: 'Date'
Please help, or please let me know what else I can provide.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
This is because your Date column type is string, not datetime. In string comparison, the third character 1 of 1/10/19 sorts before the third character 2 of 1/2/19. If you want to keep the original order, you can do the following:
df['Date'] = pd.to_datetime(df['Date']) # Convert Date column to datetime type
df['Date'] = df['Date'].dt.strftime('%m/%d/%y') # Convert datetime to other formats (but the dtype of column will be string)
And after, I cannot call the 'Date' column anymore or sort the column
This is because after grouping by the Date column, the returned dataframe uses the Date groups as its index:
pm25
Date
01/01/19 8.50
01/02/19 9.20
01/03/19 7.90
01/04/19 8.90
01/05/19 6.00
After doing df.groupby('Date').mean('pm25'), the returned dataframe above means that the mean pm25 value of the 01/01/19 group is 8.50, and so on.
If you want to retrieve the Date column back from the index, you can reset_index() after groupby,
df.groupby('Date').mean('pm25').reset_index()
which gives
Date pm25
0 01/01/19 8.50
1 01/02/19 9.20
2 01/03/19 7.90
3 01/04/19 8.90
4 01/05/19 6.00
5 01/06/19 6.75
6 01/11/19 8.50
7 01/12/19 9.20
8 01/21/19 9.20
Or specify the as_index argument of pandas.DataFrame.groupby() to False
df.groupby('Date', as_index=False).mean('pm25')
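Putting the two fixes together, a minimal sketch (assuming 'Date' starts out as m/d/yy strings and 'pm25' is numeric):
import pandas as pd
# Parse the strings so sorting is chronological instead of lexicographic
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
# Average pm25 per date, keeping 'Date' as a regular column rather than the index
daily = df.groupby('Date', as_index=False)['pm25'].mean()
daily = daily.sort_values('Date')  # chronological order
print(daily['Date'])               # the column is still directly accessible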

Count different actions within one hour in python

I am starting to work with time series. I have one for a user doing bank transfers to different countries; the most frequent destination country is X, but there are also transfers to countries Y and Z. Let's say:
date id country
2020-01-01T00:00:00.000Z id_01 X
2020-01-01T00:20:00.000Z id_02 X
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:35:00.000Z id_04 X
2020-01-01T00:45:00.000Z id_05 Z
2020-01-01T01:00:00.000Z id_06 X
2020-01-01T10:20:00.000Z id_07 X
2020-01-01T10:25:00.000Z id_08 X
2020-01-01T13:00:00.000Z id_09 X
2020-01-01T18:45:00.000Z id_10 Z
2020-01-01T18:55:00.000Z id_11 X
Since the most frequent country is X, I would like to count, iterating over the whole list of events, how many transactions have been made within one hour to countries other than X.
The format of the expected output for this particular case would be:
date id country
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:45:00.000Z id_05 Z
Starting from 2020-01-01T00:00:00.000Z, within one hour there are two Y, Z transactions. Then starting from 2020-01-01T00:20:00.000Z, within one hour, there are the same transactions, and so on. Then, starting from 2020-01-01T10:20:00.000Z, within one hour, all are X. Starting from 2020-01-01T18:45:00.000Z, within one hour, there is only one Z.
I am trying with a double for loop and .value_counts(), but I'm not sure what I am doing.
Have you considered using a time-series database for this? It could make your life easier if you are doing a lot of event-based aggregations with arbitrary time intervals. Time-series databases abstract this for you so all you need is to send a query and get the results into pandas. It's also going to run considerably faster.
For example hourly aggregations can be done using the following syntax in QuestDB.
select timestamp, country, count() from yourTable SAMPLE BY 1h
this will return results like this
| timestamp | country | count |
| 2020-06-22T00:00:00 | X | 234 |
| 2020-06-22T00:00:00 | Y | 493 |
| 2020-06-22T01:00:00 | X | 12 |
| 2020-06-22T01:00:00 | Y | 66 |
You can adjust this to monthly, weekly or 5-minute resolution without rewriting your logic; all you need to do is change the 1h to 1M, 7d or 5m, or pass it as an argument.
Now, to get results one hour before and after the timestamp of your target transaction, you could add a timestamp interval search to the above. For example, assuming your target transaction happened on 2010-01-01T06:47:00.000000Z, the resulting search would be
select hour, country, count() from yourTable
where timestamp = '2010-01-01T05:47:00.000000Z;2h'
sample by 1h;
If this is something which would work for you, there is a tutorial on how to run this type of query in QuestDB and get the results into pandas here
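One hedged way to pull the query results into pandas, assuming a local QuestDB instance with its default Postgres wire protocol settings (the port, user, password and database name below are the documented defaults, so treat them as assumptions for your setup):
import pandas as pd
import psycopg2
# QuestDB speaks the Postgres wire protocol, so a standard DBAPI connection works
conn = psycopg2.connect(host='localhost', port=8812, user='admin', password='quest', dbname='qdb')
results = pd.read_sql("select timestamp, country, count() from yourTable SAMPLE BY 1h", conn)
conn.close()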
IIUC, you can select only the non-X rows, then use diff once forward and once backward, and keep the rows where either difference is within a Timedelta of 1 hour.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# mask not X and select only these rows
mX = df['country'].ne('X')
df_ = df[mX].copy()
# mask within an hour before or after; abs() so the backward diff, which is negative, is compared correctly
m1H = (df_['date'].diff().abs().le(pd.Timedelta(hours=1)) |
       df_['date'].diff(-1).abs().le(pd.Timedelta(hours=1)))
# select only the rows meeting both criteria (not X and within 1H)
df_ = df_[m1H]
print(df_)
date id country
2 2020-01-01 00:25:00+00:00 id_03 Y
4 2020-01-01 00:45:00+00:00 id_05 Z
You can try:
df['date'] = pd.to_datetime(df.date)
(df.country != 'X').groupby(by=df.date.dt.hour).sum()
This first turns your date column into a datetime. Then it tests whether country is 'X', groups by hour, and sums the number of rows whose country is different from 'X'. Note that groups are based on the hour of the day, not on rolling elapsed time. Hope it solves your problem!
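For reference, on the sample data above that expression should return something along these lines (the hour of day is the group key; the value is the number of non-X transfers in that hour):
date
0     2
1     0
10    0
13    0
18    1
Name: country, dtype: int64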

Visualizing headcount data over a particular time period

I have a data visualization question that I would like to get some input on. I'm currently using python pandas to clean up a data set then subsequently uploading it in SISENSE for use. What I am trying to do is visualize active jobs grouped by week/month based on the start and end dates of particular assignments. For example, I have a set of jobs with the following start dates, organized in rows within a dataframe:
Job ID Start Date End Date
Job 1 5/25/2020 6/7/2020
Job 2 5/25/2020 5/31/2020
For the week of 5/25/2020 I have two active jobs, and for the week of 6/1/2020 I have 1 active job. The visualization should look like a bar chart with the x axis being the week/time period and y axis being the count of active jobs.
How can I best organize this into a data frame and visualize it?
Something like:
df = pd.DataFrame({'Job ID': [1,2], 'Start Date': ['5/25/2020', '5/25/2020'], 'End Date': ['6/7/2020', '5/31/2020']})
You could then apply a function to generate a new 'Week Beginning' column; take a look at Get week start date (Monday) from a date column in Python (pandas)? for a solution.
import datetime as dt
# Make sure 'Start Date' contains dates as datetime objects
df['Start Date'] = pd.to_datetime(df['Start Date'])
# 'daysoffset' will contain the weekday, as an integer
df['daysoffset'] = df['Start Date'].apply(lambda x: x.weekday())
# We apply, row by row (axis=1), a timedelta operation
df['Week Beginning'] = df.apply(lambda x: x['Start Date'] - dt.timedelta(days=x['daysoffset']), axis=1)
and then group by this week beginning
count = df.groupby('Week Beginning').count()
Following this, you could plot using
count_by_job_id = count['Job ID']
pd.DataFrame(count_by_job_id).plot.bar()
You will need a custom SQL in the Sisense Elasticube to make this work easily. You will then join your dataframe table with the dim_dates table (Excel file from the link below).
This is similar to scenario described here : https://support.sisense.com/hc/en-us/articles/230644208-Date-Dimension-File
Your custom SQL will be something like this:
Select JobID,
CAST(startdate as date) as Startdate,
CAST(enddate as date) as Enddate,
C.RECORD_DATE AS week_start
FROM JOB j
JOIN tbl_Calendar C ON c.RECORD_DATE BETWEEN j.StartDate and j.EndDate
WHERE DATENAME(DW,C.RECORD_DATE) = 'MONDAY'
Then you can just create a column chart, drop the field week_start (which you can format in a few different ways) under the Categories section, and drop the field count(JobID) under the Values section.
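If you want to prototype the same calendar-join idea in pandas before wiring it into Sisense, here is a rough sketch (column names taken from the question; the W-MON frequency plays the role of the Monday rows in the calendar table):
import pandas as pd
df = pd.DataFrame({'Job ID': ['Job 1', 'Job 2'],
                   'Start Date': ['5/25/2020', '5/25/2020'],
                   'End Date': ['6/7/2020', '5/31/2020']})
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# One row per (job, Monday) for every Monday that falls inside the job's date range
df['week_start'] = df.apply(lambda r: pd.date_range(r['Start Date'], r['End Date'], freq='W-MON'), axis=1)
active = df.explode('week_start')
# Active jobs per week, ready for a bar chart
counts = active.groupby('week_start')['Job ID'].count()
counts.plot.bar()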

How to calculate the sum of rows with the maximum date country wise

I am trying to calculate the sum of the rows with the maximum date per country; if a country has more than one province, it should add up the confirmed cases for that maximum date. For example, this is the input that I have:
[screenshot: input data]
and the output should be:
[screenshot: expected output]
So the output for China is 90, which is the sum of Tianjin and Xinjiang for the maximum date, 02-03-2020.
And since Argentina does not have any province, its output is 20 for its highest date, obtained in the same way as above.
The strategy is to sort the values so that the most recent date comes first within each Country/Region-Province/State pair, then roll up the dataset twice, keeping only the max-date row between roll-ups.
First, sorting to put most recent dates at the top of each group:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False))
Date Country/Region Province/State Confirmed
3 02-03-2020 China Xinjiang 70
2 01-03-2020 China Xinjiang 30
1 02-03-2020 China Tianjin 20
0 01-03-2020 China Tianjin 10
Then rolling up to Country/Region-Province/State and taking the most recent date:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first())
Date Confirmed
Country/Region Province/State
China Tianjin 02-03-2020 20
Xinjiang 02-03-2020 70
Finally, rolling up again to just Country/Region:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first()
.groupby('Country/Region').sum())
Confirmed
Country/Region
China 90
If you fill in your empty province values, you can use one groupby to pull out the rows with the latest Date and then another groupby to get the sum of the values.
input['Date'] = pd.to_datetime(input['Date'], format="%d-%m-%Y")
input = input.fillna("dummy")
input.loc[input.groupby(["Country/Region", "Province/State"]).Date.idxmax()].groupby(["Country/Region"])["Confirmed"].sum()
A basic solution is to chain a few pandas methods that jointly solve your problem:
First we convert the Date column to a datetime column (basically, a string-to-datetime type conversion).
Then we sort by the 'Date' column.
Then we .groupby() Country/Region and Date, aggregating the 'Confirmed' column with sum. Finally we drop_duplicates(), keeping the last row, which gives you the latest value for each country.
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by='Date', inplace=True)
df.groupby(['Country/Region', 'Date']).agg({'Confirmed': 'sum'}).reset_index().drop_duplicates(subset='Country/Region', keep='last')
df.groupby(['Date', 'Country/Region'],as_index = False).sum().groupby(['Country/Region']).agg('last').reset_index()
Country/Region Date Confirmed
0 Argentina 2020-03-02 20
1 China 2020-03-02 90
2 France 2020-03-02 70