pandas: get the third business day of the month

I have a date range
pd.bdate_range("2001-01-01", "2018-01-01")
and I want to find the third business day of each month (ignoring holidays for now). How do I do that?

As you are already working with business dates, you could resample to the start of the business month ('BMS') and add an offset of 3 business days:
>>> idx = pd.bdate_range("2001-01-01", "2018-01-01")
>>> pd.Series(index=idx, dtype=float).resample('BMS').first().index + pd.offsets.BDay(3)
DatetimeIndex(['2001-01-04', '2001-02-06', '2001-03-06', '2001-04-05',
'2001-05-04', '2001-06-06', '2001-07-05', '2001-08-06',
'2001-09-06', '2001-10-04',
...
'2017-04-06', '2017-05-04', '2017-06-06', '2017-07-06',
'2017-08-04', '2017-09-06', '2017-10-05', '2017-11-06',
'2017-12-06', '2018-01-04'],
dtype='datetime64[ns]', length=205, freq=None)
You'll find further details on how to work with dates in pandas in the documentation.
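Note that 'BMS' already falls on the first business day of the month, so adding BDay(3) as above actually lands on the fourth business day (2001-01-04 in the output; the third business day of January 2001 is 2001-01-03). For the third business day, offset by two business days instead. A minimal sketch:
import pandas as pd
# business-day index from the question
idx = pd.bdate_range("2001-01-01", "2018-01-01")
# first business day of each month ('BMS')...
bms = pd.Series(index=idx, dtype=float).resample('BMS').first().index
# ...plus two more business days = the third business day of each month
third_bday = bms + pd.offsets.BDay(2)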

Plan: group by year and month, then pick the third with nth().
This example will be easier with a series:
dates = pd.Series(pd.bdate_range("2001-01-01", "2018-01-01"))
dates.groupby((dates.dt.year, dates.dt.month)).nth(3)
Partial output:
2001 1 2001-01-04
2 2001-02-06
3 2001-03-06
4 2001-04-05
5 2001-05-04
6 2001-06-06
7 2001-07-05
8 2001-08-06
9 2001-09-06
10 2001-10-04
11 2001-11-06
12 2001-12-06
2002 1 2002-01-04
2 2002-02-06
3 2002-03-06
4 2002-04-04
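One caveat: nth() is zero-indexed, so nth(3) above returns the fourth business day of each month (which is why the dates match the BDay(3) output in the previous answer). For the third business day, use nth(2). A minimal sketch:
import pandas as pd
dates = pd.Series(pd.bdate_range("2001-01-01", "2018-01-01"))
# nth() is zero-indexed: nth(2) picks the third row of each (year, month) group
third = dates.groupby([dates.dt.year, dates.dt.month]).nth(2)
Depending on your pandas version, the result is indexed either by the (year, month) group keys or (in pandas 2.0 and later) by the positions of the selected rows.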

Related

Ensuring years and months run consecutively as part of data cleaning

I have 2 datasets:
rainfall per month (mm) from 1982-01 to 2022-08
no. of rainy days per month per year from 1982-01 to 2022-08.
     month  no_of_rainy_days
0  1982-01                10
1  1982-02                 5
2  1982-03                11
3  1982-04                14
4  1982-05                10

     month  total_rainfall
0  1982-01           107.1
1  1982-02            27.8
2  1982-03           160.8
3  1982-04           157.0
4  1982-05           102.2
Qn 1: As part of ensuring data integrity, how do I ensure that the dates run consecutively, i.e. that 1982-01 is followed by 1982-02 and does not skip to 1982-03?
I am unsure how to perform this check and have searched online. Is it common practice to assume that the years and months run consecutively?
First, separate the year from the month.
df.rename(columns={"month": "ym"}, inplace=True)
df[["year", "month"]] = df["ym"].astype(str).str.split("-", expand=True)
Then you can group the dataframe by year and count the number of observations (rows) per year.
observations_per_year = df["year"] \
    .groupby(df["year"]) \
    .agg("count") \
    .reset_index(name="observations")
observations_per_year[observations_per_year["observations"] < 12]
Any years with fewer than 12 observations will be displayed like so:
year observations
0 1982 11
4 1986 11
5 1987 11
6 1988 10
11 1993 11
Given the limited detail provided, I made some assumptions about your data:
Each data set will not have more than one row for any month of the year (i.e., a maximum of 12 rows/observations per year).
Each dataframe contains a single observation per row, as shown in your examples (so you would do this for each dataframe prior to merging them). As such, counting rows per year-month is an accurate means of counting the number of observations for any given month.
The sorted order of the data is irrelevant (you can later sort by year-month if needed).
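If you also want to check directly that the months run consecutively (rather than only counting rows per year), you can compare the observed months against a complete monthly range. A minimal sketch, assuming the "ym" column from the rename above parses cleanly as year-month strings; run it on each dataframe before merging:
import pandas as pd
# parse the "1982-01"-style strings into monthly periods
months = pd.PeriodIndex(pd.to_datetime(df["ym"], format="%Y-%m").dt.to_period("M"))
# build the full expected range and report any gaps or duplicates
expected = pd.period_range(months.min(), months.max(), freq="M")
print("missing months:", list(expected.difference(months)))   # empty list => no gaps
print("duplicated months:", months.duplicated().any())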

How to join columns in Julia?

I have opened a dataframe in Julia where I have 3 columns like this:
day month year
1 1 2011
2 4 2015
3 12 2018
How can I make a new column called date that goes:
day month year date
1 1 2011 1/1/2011
2 4 2015 2/4/2015
3 12 2018 3/12/2018
I was trying with this:
df[!,:date]= df.day.*"/".*df.month.*"/".*df.year
but it didn't work.
In R I would do:
df$date=paste(df$day, df$month, df$year, sep="/")
Is there anything similar?
Thanks in advance!
Julia has a built-in Date type in its Dates standard library:
julia> using Dates
julia> df[!, :date] = Date.(df.year, df.month, df.day)
3-element Vector{Date}:
2011-01-01
2015-04-02
2018-12-03

Quickly fill cells with datetime based on column name in pandas?

I need to convert my cumbersome column headers into a datetime for every cell in that column. For example, I need the datetime "2001-10-06 6:00" from the column header 20011006_6_blah_blah_blah. I have a column of other datetimes that I will eventually be using to do some calculations.
Construction of an example df:
import datetime
import numpy as np
import pandas as pd
date_rng0 = pd.date_range(start=datetime.date(2001, 10, 1), end=datetime.date(2001, 10, 7), freq='D')
date_rng1 = pd.date_range(start=datetime.date(2001, 10, 5), end=datetime.date(2001, 10, 8), freq='D')
drstr0 = [str(i.year) + str(i.month) + str(i.day) + '_blah' for i in date_rng0]
drstr1 = [str(i.year) + str(i.month) + str(i.day) + '_blah' for i in date_rng1]
# make a zero df
arr = np.zeros((len(date_rng0), len(date_rng1)))
df = pd.DataFrame(arr, index=drstr0, columns=drstr1)
First I copy all the column names into the cells, column by column. This is very slow with my data:
for c in df.columns:
    df[c] = c
Then I convert them to datetime using an atrocious-looking lambda mess:
for c in df.columns:
    df.loc[:, c] = df.loc[:, c].apply(lambda x: datetime.date(int(x.split('_')[0][:4]),
                                                              int(x.split('_')[0][4:6]),
                                                              int(x.split('_')[0][6:])))
Then I make a datetime column using a similar lambda function:
df['date_time']=df.index
df['date_time']=df.loc[:,'date_time'].apply(lambda x: datetime.date(int(x.split('_')[0][:4]),int(x.split('_')[0][4:6]),int(x.split('_')[0][6:])))
df.head()
gives:
2001105_blah 2001106_blah 2001107_blah 2001108_blah date_time
2001101_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-01
2001102_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-02
2001103_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-03
2001104_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-04
2001105_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-05
Then I can do a little math:
ndf = df.copy()
for c in df.columns:
    ndf.loc[:, c] = df.loc[:, c] - df.loc[:, 'date_time']
Which gives what I am ultimately after:
2001105_blah 2001106_blah 2001107_blah 2001108_blah date_time
2001101_blah 4 days 5 days 6 days 7 days 0 days
2001102_blah 3 days 4 days 5 days 6 days 0 days
2001103_blah 2 days 3 days 4 days 5 days 0 days
2001104_blah 1 days 2 days 3 days 4 days 0 days
2001105_blah 0 days 1 days 2 days 3 days 0 days
The problem is, this process has never completed using my 2,000 x 30,000 dataframe despite walking away for 30 min. I feel like I am doing something wrong. Any suggestions to improve the efficiency?
You can try with ' '.join, str.split, and pd.to_datetime:
# add a new column whose value is all the column names joined into one string
df['temp'] = ' '.join(df.columns.astype(str))
# expand that string back out into a dataframe, one column name per cell
temp = df['temp'].str.split(expand=True)
# rename the columns with the original names
temp.columns = df.columns[:-1]
# parse the index to datetime
index = pd.to_datetime(df.index.str.split('_').str[0], format='%Y%m%d').to_numpy()
# subtract the index from each column
newdf = temp.apply(lambda x: pd.to_datetime(x.str.split('_').str[0], format='%Y%m%d') - index)
# keep only the rows where all values are non-negative
newdf = newdf[newdf.apply(lambda x: x >= pd.Timedelta(0)).all(1)]
Output:
print(newdf)
2001105_blah 2001106_blah 2001107_blah 2001108_blah
2001101_blah 4 days 5 days 6 days 7 days
2001102_blah 3 days 4 days 5 days 6 days
2001103_blah 2 days 3 days 4 days 5 days
2001104_blah 1 days 2 days 3 days 4 days
2001105_blah 0 days 1 days 2 days 3 days
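For a 2,000 x 30,000 frame it may be faster still to skip the per-column apply entirely and let NumPy broadcast the two parsed date arrays against each other. A minimal sketch, assuming df as originally constructed in the question (before the date_time column is added), with the date encoded before the first underscore of every index and column label:
import numpy as np
import pandas as pd
# parse the date part of the column labels and of the index in one pass
col_dates = pd.to_datetime(df.columns.str.split('_').str[0], format='%Y%m%d')
row_dates = pd.to_datetime(df.index.str.split('_').str[0], format='%Y%m%d')
# one broadcast subtraction replaces the column-by-column loop/apply
diffs = pd.DataFrame(col_dates.values[np.newaxis, :] - row_dates.values[:, np.newaxis],
                     index=df.index, columns=df.columns)
# optionally keep only rows where every difference is non-negative, as above
diffs = diffs[(diffs >= pd.Timedelta(0)).all(axis=1)]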

Tableau: How to get moving average with respect to day of week in last 4 weeks?

E.g., if I have the data below:
        S  M  T  W  T  F  S
Week 1  2  5  6  7  5  5  3
Week 2  4  5  7  2  4  3  2
Week 3  4  5  2  1  2  7  8
If today is Monday, my average will be (5+5+5)/3, which is 5. Tomorrow it will be (6+7+2)/3, which is 5 again, and the day after it will be (7+2+1)/3, which is 3.33.
How to get this in Tableau?
First, you can use "Weekday" as a column or row (by right-clicking on the date).
Then you can simply add a "Moving Average" table calculation with "Week of [Date]" as the computing dimension.
[Screenshot: Table Calculation specifics]
[Screenshot: Result]
Data source used: Tableau Sample Superstore.
You can do the following:
Columns: Week(Order Date)
Rows: Weekday(Order Date)
Put Sales on Text.
Right-click Sales > Quick Table Calculation > Moving Average
Right-click Sales > Edit quick table calculation and set the following:
Moving along: "Table across"
Previous values: 4

How to get a moving average in SQL

If I have data from week 1 to week 52 and I want a 4-week moving average of the preceding weeks, how can I write a SQL query for this? For example, for week 5 I want the week 1-week 4 average, for week 6 the week 2-week 5 average, and so on.
I have the columns week and target_value in table A.
Sample data is like this:
Week target_value
1 20
2 10
3 10
4 20
5 60
6 20
So the output I want will start from week 5, since a full four preceding weeks (week 1-week 4) is only available from then on.
Output data will look like:
Week Output
5 15 (20+10+10+20)/4=15 Moving Average week1-week4
6 25 (10+10+20+60)/4=25 Moving Average week2-week5
The data is in hive but I can move it to oracle if it is simpler to do this there.
SELECT
    A.Week,
    (SELECT ISNULL(AVG(B.target_value), A.target_value)
     FROM tblA B
     WHERE B.Week < A.Week
       AND B.Week >= (A.Week - 4)
    ) AS Moving_Average
FROM tblA A
The ISNULL keeps you from getting a NULL for your first week, since there is no week 0 (ISNULL is SQL Server syntax; in Hive or Oracle use COALESCE instead). If you want it to be NULL, then just leave the ISNULL function out.
If you want it to start at week 5 only, then add the following line to the end of the SQL that I wrote:
WHERE A.Week > 4
Results:
Week Moving_Average
1 20
2 20
3 15
4 13
5 15
6 25