I have a data visualization question that I would like some input on. I'm currently using Python pandas to clean up a data set and then uploading it to Sisense for use. What I am trying to do is visualize active jobs grouped by week/month, based on the start and end dates of particular assignments. For example, I have a set of jobs with the following start dates, organized in rows within a dataframe:
Job ID Start Date End Date
Job 1 5/25/2020 6/7/2020
Job 2 5/25/2020 5/31/2020
For the week of 5/25/2020 I have two active jobs, and for the week of 6/1/2020 I have one active job. The visualization should be a bar chart with the x-axis being the week/time period and the y-axis being the count of active jobs.
How can I best organize this into a data frame and visualize it?
Something like:
df = pd.DataFrame({'Job ID': [1,2], 'Start Date': ['5/25/2020', '5/25/2020'], 'End Date': ['6/7/2020', '5/31/2020']})
You could then apply a function to generate a new column, 'Week Beginning'; take a look at "Get week start date (Monday) from a date column in Python (pandas)?" for a Python solution.
import datetime as dt
# Convert 'Start Date' to datetime objects
df['Start Date'] = pd.to_datetime(df['Start Date'])
# 'daysoffset' will contain the weekday, as an integer (Monday=0)
df['daysoffset'] = df['Start Date'].apply(lambda x: x.weekday())
# We apply, row by row (axis=1), a timedelta operation to get the Monday of each start week
df['Week Beginning'] = df.apply(lambda x: x['Start Date'] - dt.timedelta(days=x['daysoffset']), axis=1)
and then group by this week beginning:
count = df.groupby('Week Beginning').count()
Following this, you could plot using
count_by_job_id = count['Job ID']
pd.DataFrame(count_by_job_id).plot.bar()
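Note that grouping on the week a job starts only counts each job once, in its starting week. If, as in the example, a job should count as active in every week between its start and end dates (so the week of 6/1 also shows one active job), a rough sketch of that expansion, assuming the same df as above, could look like this:
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# one list entry per Monday-based week the job overlaps
df['Week Beginning'] = df.apply(
    lambda r: list(pd.date_range(r['Start Date'] - pd.Timedelta(days=r['Start Date'].weekday()),
                                 r['End Date'], freq='W-MON')),
    axis=1)
active_per_week = df.explode('Week Beginning').groupby('Week Beginning')['Job ID'].count()
active_per_week.plot.bar()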
You will need custom SQL in the Sisense ElastiCube to make this work easily. You then join your dataframe table with a dim_dates table (the Excel file from the link below).
This is similar to the scenario described here: https://support.sisense.com/hc/en-us/articles/230644208-Date-Dimension-File
Your custom SQL will be something like this:
SELECT j.JobID,
       CAST(j.StartDate AS date) AS StartDate,
       CAST(j.EndDate AS date) AS EndDate,
       c.RECORD_DATE AS week_start
FROM Job j
JOIN tbl_Calendar c ON c.RECORD_DATE BETWEEN j.StartDate AND j.EndDate
WHERE DATENAME(DW, c.RECORD_DATE) = 'Monday'
Then you can just create a column chart, drop the week_start field (which you can format in a few different ways) into the Categories section, and drop count(JobID) into the Values section.
I have a dataframe with a datetime index (the values in the columns don't matter). I would like to extract the row number where the index equals a specific date (2021/06/25 16:00:00). How can I do that? The only way I found was to add a count column and use loc, but I wondered if there was a better way to do it.
import pandas as pd
from datetime import date, datetime, timedelta
sdate = datetime(2021,6,5,14) # Start date
edate = datetime(2021,6,30) # End Date
date_to_find = datetime(2021,6,25,16)
df = pd.DataFrame(index=pd.date_range(sdate, edate, freq='H')) # Create a dataframe with date as index, every hour
df.insert(0, 'row_num', range(0,len(df))) # here you insert the row count
df.loc[df.index == date_to_find]['row_num'][0] # Grab the row number for which the index equals the date to find
You can try:
df.index.get_loc(date_to_find)
which is faster and more readable.
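As a quick check (reusing the df and date_to_find built above), both approaches return the same position:
pos = df.index.get_loc(date_to_find)   # integer position of the exact match
assert pos == df.loc[df.index == date_to_find]['row_num'][0]
print(pos)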
I have imported the tips data set from seaborn and tried to find the maximum bill amount for lunch and dinner on Saturday and Sunday.
I tried the code below but get an error:
pd.pivot_table(df, values=df['total_bill'], index=df['day'],
columns=df['time'], aggfunc='max')
The error comes from passing Series objects where pivot_table expects column labels. Pass the column names as strings, and to restrict the index to 'Sat' and 'Sun', use loc on the pivot you created:
pd.pivot_table(df, values='total_bill',index='day',
columns='time', aggfunc='max').loc[['Sat', 'Sun']]
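For a self-contained version (assuming the standard seaborn tips dataset), that would be:
import seaborn as sns
import pandas as pd

df = sns.load_dataset('tips')   # columns include total_bill, day, time
result = pd.pivot_table(df, values='total_bill', index='day',
                        columns='time', aggfunc='max').loc[['Sat', 'Sun']]
print(result)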
I am starting to work with time series. I have one for a user making bank transfers to different countries; the most frequent destination country is X, but there are also transfers to countries Y and Z. Let's say:
date id country
2020-01-01T00:00:00.000Z id_01 X
2020-01-01T00:20:00.000Z id_02 X
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:35:00.000Z id_04 X
2020-01-01T00:45:00.000Z id_05 Z
2020-01-01T01:00:00.000Z id_06 X
2020-01-01T10:20:00.000Z id_07 X
2020-01-01T10:25:00.000Z id_08 X
2020-01-01T13:00:00.000Z id_09 X
2020-01-01T18:45:00.000Z id_10 Z
2020-01-01T18:55:00.000Z id_11 X
Since the most frequent country is X, I would like to count, iterating over the whole list of events, how many transactions to countries other than X occur within one hour of each other.
The format of the expected output for this particular case would be:
date id country
2020-01-01T00:25:00.000Z id_03 Y
2020-01-01T00:45:00.000Z id_05 Z
Starting from 2020-01-01T00:00:00.000Z, within one hour there are two non-X transactions (one Y and one Z). Then, starting from 2020-01-01T00:20:00.000Z, within one hour, there are the same two transactions, and so on. Starting from 2020-01-01T10:20:00.000Z, within one hour, all transfers are to X. Starting from 2020-01-01T18:45:00.000Z, within one hour, there is only one Z.
I am trying a double for loop with .value_counts(), but I'm not sure I'm on the right track.
Have you considered using a time-series database for this? It could make your life easier if you are doing a lot of event-based aggregations with arbitrary time intervals. A time-series database abstracts this for you, so all you need is to send a query and pull the results into pandas. It will also typically run considerably faster.
For example, hourly aggregations can be done with the following syntax in QuestDB:
select timestamp, country, count() from yourTable SAMPLE BY 1h
This will return results like this:
| timestamp | country | count |
| 2020-06-22T00:00:00 | X | 234 |
| 2020-06-22T00:00:00 | Y | 493 |
| 2020-06-22T01:00:00 | X | 12 |
| 2020-06-22T01:00:00 | Y | 66 |
You can adjust this to monthly, weekly, or 5-minute resolution without rewriting your logic; all you need to do is change the 1h to 1M, 7d, or 5m, or pass it as an argument.
Now, to get results one hour before and after the timestamp of your target transaction, you could add a timestamp interval search to the above. For example, assuming your target transaction happened at 2010-01-01T06:47:00.000000Z, the resulting search would be:
select timestamp, country, count() from yourTable
where timestamp = '2010-01-01T05:47:00.000000Z;2h'
sample by 1h;
If this is something that would work for you, there is a tutorial here on how to run this type of query in QuestDB and get the results into pandas.
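To give an idea of the pandas side, here is a rough sketch using QuestDB's HTTP CSV export endpoint (this assumes a local instance on the default HTTP port 9000; adjust the host, port, and table name to your setup):
import io
import requests
import pandas as pd

query = "select timestamp, country, count() from yourTable SAMPLE BY 1h"
resp = requests.get("http://localhost:9000/exp", params={"query": query})
resp.raise_for_status()
# the endpoint returns CSV, which loads straight into a DataFrame
counts = pd.read_csv(io.StringIO(resp.text))
print(counts.head())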
IIUC, you can select only the rows whose country is not X, then use diff once forward and once backward, and keep the rows where either gap is within a Timedelta of 1 hour (i.e. another non-X transaction occurs within one hour before or after).
#convert to datetime
df['date'] = pd.to_datetime(df['date'])
#mask not X and select only these rows
mX = df['country'].ne('X')
df_ = df[mX].copy()
# mask: another non-X transaction within an hour before or after
m1H = (df_['date'].diff().abs().le(pd.Timedelta(hours=1)) |
       df_['date'].diff(-1).abs().le(pd.Timedelta(hours=1)))
# select only the rows meeting both criteria (not X, and within 1H)
df_ = df_[m1H]
print (df_)
date id country
2 2020-01-01 00:25:00+00:00 id_03 Y
4 2020-01-01 00:45:00+00:00 id_05 Z
You can try:
df['date'] = pd.to_datetime(df.date)
(df.country != 'X').groupby(by=df.date.dt.hour).sum()
First it turns your date column into a datetime. Then it tests whether country is 'X', groups by the hour of the day, and sums the number of rows whose country differs from 'X'. Note that the groups are based on the hour of the day, not on a rolling elapsed-time window. Hope it solves your problem!
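If you do need a rolling elapsed-time window rather than hour-of-day buckets, one possible sketch (assuming the question's df, sorted by date) is a time-based rolling sum over a boolean "not X" flag:
df['date'] = pd.to_datetime(df['date'])
# 1 for non-X transfers, 0 for X, indexed by the (sorted) event time
non_x = (df['country'] != 'X').astype(int)
non_x.index = df['date']
# number of non-X transfers in the hour ending at each event
rolling_counts = non_x.rolling('1H').sum()
print(rolling_counts)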
I want to calculate age from the DOB field, but in my code I am hardcoding the reference date; I need it done dynamically, as today - DOB. Similarly, I also want to calculate duration from start_date. My data frame looks like this:
id dob start_date
77 30/09/1990 2019-04-13 15:27:22
65 15/12/1988 2018-12-26 23:28:12
3 08/12/2000 2018-12-26 23:28:17
What I have so far, for the age calculation:
df = df.withColumn('dob', F.to_date(F.unix_timestamp(F.col('dob'), 'dd/MM/yyyy').cast("timestamp")))
end_date = '3/09/2019'
end_date = pd.to_datetime(end_date, format="%d/%m/%Y")
df = df.withColumn('end_date', F.unix_timestamp(F.lit(end_date), 'dd/MM/yyyy').cast("timestamp"))
df = df.withColumn('age', (F.datediff(F.col('end_date'), F.col('dob'))) / 365)
df = df.withColumn('age', F.round(df['age'], 0))
For the duration calculation:
end_date_1 = '2019-09-30'
end_date_1 = pd.to_datetime(end_date_1, format="%Y-%m-%d")
df = df.withColumn('end_date_1', F.unix_timestamp(F.lit(end_date_1), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
df = df.withColumn('duration', F.datediff(F.col('end_date_1'), F.col('start_date')))
In the two snippets above I have hardcoded two values: end_date = '3/09/2019' and end_date_1 = '2019-09-30'. I want to base both on today's date instead. How do I do that in PySpark?
You can use date.today() to get today's date, since you are using Python with Spark.
More information about the desired date format can be found in the official Python documentation.
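As a rough sketch of how that could look with the columns from the question (illustrative only, not tested against your data):
from datetime import date
import pyspark.sql.functions as F

today = date.today().strftime("%Y-%m-%d")

df = df.withColumn('end_date', F.to_date(F.lit(today), 'yyyy-MM-dd'))
df = df.withColumn('age', F.round(F.datediff(F.col('end_date'), F.col('dob')) / 365, 0))
df = df.withColumn('duration', F.datediff(F.col('end_date'), F.col('start_date')))
# Alternatively, F.current_date() gives today's date directly on the Spark side.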
Table format is as follows:
Date ID subID value
-----------------------------
7/1/1996 100 1 .0543
7/1/1996 100 2 .0023
7/1/1996 200 1 -.0410
8/1/1996 100 1 -.0230
8/1/1996 200 1 .0121
I'd like to apply STDEV to the value column where date falls within a specified range, grouping on the ID column.
Desired output would look something like this:
DateRange ID std_v
1 100 .0232
2 100 .0323
1 200 .0423
One idea I've had that works, but is clunky, involves creating an additional column (which I've called 'partition') to identify a group of values over which STDEV is taken (by using the OVER function with PARTITION BY applied to the 'partition' and 'ID' variables).
Creating the partition variable involves a prior CASE statement in which a given record is assigned a partition based on its date falling within a given range, i.e.,
...
, partition = CASE
WHEN date BETWEEN '7/1/1996' AND '10/1/1996' THEN 1
WHEN date BETWEEN '10/1/1996' AND '1/1/1997' THEN 2
...
Ideally, I'd be able to apply STDEV with the OVER function, partitioning on the ID variable and on variable date ranges (e.g., the trailing 3 months for a given reference date). Once this works for the 3-month period described above, I'd like to make the date range itself a variable, creating an additional '#dateRange' variable at the start of the program so I can run this for 2-, 3-, 6-, etc. month ranges.
I ended up coming upon a solution to my question.
You can join the original table to a second table consisting of a unique list of the dates in the first table, applying a BETWEEN clause to specify the desired range.
Sample query below.
Initial table, with columns (#excessRet):
Date, ID, subID, value
Second table, a unique list of dates in the previous, with columns (#dates):
Date
SELECT d.date, er.id, STDEV(er.value)
FROM #dates d
INNER JOIN #excessRet er
    ON er.date BETWEEN DATEADD(m, -36, d.date) AND d.date
GROUP BY d.date, er.id
ORDER BY er.id, d.date
To achieve the next step referenced above (making the range variable), simply declare a variable at the outset and replace 36 with it.
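If you ever need the same calculation on the pandas side (for example while still cleaning the data in Python), a rough equivalent of the date-list join, with the window length as a variable, might look like this (the sample rows from the post are used to build a hypothetical excess_ret DataFrame):
import pandas as pd

months = 36  # the date-range variable described above

excess_ret = pd.DataFrame({
    'Date': pd.to_datetime(['7/1/1996', '7/1/1996', '7/1/1996', '8/1/1996', '8/1/1996']),
    'ID': [100, 100, 200, 100, 200],
    'subID': [1, 2, 1, 1, 1],
    'value': [.0543, .0023, -.0410, -.0230, .0121],
})

rows = []
for d in excess_ret['Date'].drop_duplicates().sort_values():
    # trailing window: dates within `months` months up to and including d
    window = excess_ret[(excess_ret['Date'] >= d - pd.DateOffset(months=months)) &
                        (excess_ret['Date'] <= d)]
    rows.append(window.groupby('ID')['value'].std().rename(d))

result = pd.DataFrame(rows)   # one row per reference date, one column per ID
print(result)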