pandas resample: what is the '3M' equivalent of 'Q'?

I have a time series, e.g.:
import pandas as pd
df = pd.DataFrame.from_dict({'count': {
    pd.Timestamp('2016-02-29'): 1, pd.Timestamp('2016-03-31'): 2,
    pd.Timestamp('2016-04-30'): 4, pd.Timestamp('2016-05-31'): 8,
    pd.Timestamp('2016-06-30'): 16, pd.Timestamp('2016-07-31'): 32,
}})
df
I can resample it to get counts per quarter with, e.g.:
df.resample('Q').agg('sum')
I am trying to do the same with '3M', but no matter what I try I fail to get the same result. For example:
df.resample('3M', closed='right', origin='start', label='right').agg('sum')
gives a different binning.
How can I achieve the result of resample('Q') using resample('3M')?
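One reading (a hedged sketch, not a verified answer): 'Q' bins are anchored to calendar quarter ends, while '3M' bins are anchored to the first timestamp in the data, so the two only coincide when the series happens to start at a quarter boundary. Padding the index back to a quarter-end month (2015-12-31 here, a choice specific to this sample) appears to make '3M' line up with 'Q':
# Sketch: anchor the data at a quarter boundary so the '3M' bins coincide
# with the 'Q' bins (sums 3, 28, 32 at 2016-03-31, 2016-06-30, 2016-09-30);
# the padded rows contribute 0, apart from a possible leading empty bin.
padded = df.reindex(pd.date_range('2015-12-31', '2016-07-31', freq='M'), fill_value=0)
padded.resample('3M', closed='right', label='right').sum()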

Related

How to define NetworkX graph from pandas dataframe having multiple columns

I have a pandas dataframe that captures whether an invoice has been raised as a dispute or not, based on some characteristics. I would like to run community detection on top of this to search for patterns, but I am confused about how to create a graph from it. I tried the following:
import pandas as pd
import networkx as nx
from itertools import combinations as comb
data = [[4321, 543, 765, 3, 2014, 54, 0, 1, 0, 1, 0], [2321, 657, 654, 7, 2017, 59, 1, 0, 1, 0, 1]]
df = pd.DataFrame(data, columns = ['NetValueInDocCurr', 'NetWeight', 'Volume', 'BillingItems', 'FISCALYEAR', 'TaxAmtInDocCurr', 'Description_Bulk', 'Description_Car_Care', 'Description_Packed', 'Description_Services', 'Final_Dispute'])
edges = set(comb(df.columns,2))
G = nx.Graph()
G.add_edges_from(edges)
My current assumption is to define each column name as a node, the pairwise relationship between all columns as an edge, and the column values as edge weights. Is this the right approach? If yes, any help on the code to define the weights? My idea is to start with a complete graph and use divisive methods like Girvan-Newman.
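One possibility (a sketch only; the weight definition is an assumption, since the question leaves it open) is to weight each column pair by the absolute Pearson correlation between the two columns:
# Weight each column pair by absolute correlation (an assumed definition of
# "relationship"; swap in whatever measure fits the data). Note that with
# only the two sample rows every correlation is degenerate (|r| = 1), so
# real data is needed for the weights to be meaningful.
corr = df.corr().abs()
G = nx.Graph()
for u, v in comb(df.columns, 2):
    G.add_edge(u, v, weight=corr.loc[u, v])

# Girvan-Newman then peels the complete graph apart by removing
# high-betweenness edges; next() yields the first (coarsest) split.
from networkx.algorithms.community import girvan_newman
top_level_communities = next(girvan_newman(G))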

How to convert Multi-Index into a Heatmap

I am new to Pandas/Python. I have managed to make an index like the one below:
MultiIndex([( 1,  1, 4324),
            ( 1,  2, 8000),
            ( 1,  3, 8545),
            ( 1,  4, 8544),
            ( 1,  5, 7542),
            ...
            (12, 30, 7854),
            (12, 31, 7511)],
           names=['month', 'day', 'count'], length=366)
I'm struggling to find out how I can store the first number (the 1-12 one) in one list, the second number (the 1-31 values) in another list, and the third number (scores 0-9000) in a third, separate list.
I am trying to build a heatmap with month x day on the axes, using count as the values, and failing horribly! I am assuming I have to separate month, day and count into separate lists to make the heatmap?
data1 = pd.read_csv("a2data/Data1.csv")
data2 = pd.read_csv("a2data/Data2.csv")
merged_df = pd.concat([data1, data2])
merged_df.set_index(['month', 'day'], inplace=True)
merged_df.sort_index(inplace=True)
# ['count'] selects the column; bare .count would resolve to the GroupBy method
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().reset_index()
merged_df2.set_index(['month', 'day', 'count'], inplace=True)
# struggling here to separate out month, day and count in order to make a heatmap
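(Side note: if the separate lists are still wanted, each level can be read straight off the MultiIndex; a small sketch, assuming merged_df2 as built above:)
# Each index level can be extracted directly; no manual splitting is needed.
months = merged_df2.index.get_level_values('month').tolist()
days = merged_df2.index.get_level_values('day').tolist()
counts = merged_df2.index.get_level_values('count').tolist()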
Are you looking for:
# let's start here; ['count'] selects the column (.count is the GroupBy method)
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean()
# use seaborn
import seaborn as sns
sns.heatmap(merged_df2.unstack('day'))
Output:
Or you can use plt:
import matplotlib.pyplot as plt
import numpy as np
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().unstack('day')
plt.imshow(merged_df2)
plt.xticks(np.arange(merged_df2.shape[1]), merged_df2.columns)
plt.yticks(np.arange(merged_df2.shape[0]), merged_df2.index)
plt.show()
which gives:

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (this took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
If you create a DataFrame from the list column, you can compare it with DataFrame.eq (note axis=0, so the Series aligns by row) and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea: using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
import numpy as np
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
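Another possibility worth timing (a sketch only; not benchmarked here) is to explode the list column, so the membership test becomes a vectorized equality plus a per-row any():
# explode() gives each list element its own row while keeping the original
# row label, so we compare against the value realigned to those labels.
exploded = df["accepted_values"].explode()
matches = exploded.eq(df["value"].reindex(exploded.index))
are_in_accepted_values = matches.groupby(level=0).any()
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")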

Rolling means in Pandas dataframe

I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling means; to be more specific, the average of the differences between a set of long-term means (lst) and a set of shorter ones (lst_2). I am trying to combine the calculation with a double for loop as follows:
import pandas as pd
import numpy as np

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]
    lst_2 = [5, 10, 15, 20]
    result = pd.DataFrame(np.sum([calc(df, T, t) for T in lst for t in lst_2])) / (len(lst) + len(lst_2))
    return result

def calc(df, T, t):
    roll = pd.DataFrame(np.sign(df.rolling(t).mean() - df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100, ... 20 and 300); I take the sign of each difference, and I want the average of these signs at each point in time. Ideally the result would be a single DataFrame.
I get the error cannot copy sequence with size 3951 to array axis with dimension 1056 when the double loop runs. I understand that, because of the different rolling windows T and t, the DataFrames do not line up when they are converted to an array (by np.sum), but I thought NaN would be used to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following DataFrame:
df = pd.DataFrame({'A': [100,101.4636,104.9477,106.7089,109.2701,111.522,113.3832,113.8672,115.0718,114.6945,111.7446,108.8154]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
df=df.pct_change()
and I have the following 2 sets of means I need to compute:
lst=[8,10]
lst_1=[3,4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
2/
I redo step 1 with the remaining combinations of differences: 3-10, 4-8, 4-10. So I get 4 roll DataFrames overall.
roll_3_8 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
roll_3_10 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
roll_4_8 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
roll_4_10 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
3/
Now that I have all the diffs, I simply want their average, so I sum the 4 rolling DataFrames and divide by 4 (the number of differences computed). The result should be (before dropping any NaN values):
result = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0,1,2,3,4,5,6,7,8,9,10,11])
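One way to make this run (a sketch under the question's own definitions, not a posted answer): let pandas add the frames, so the indexes stay aligned and NaNs propagate, instead of handing the list to np.sum. Note the divisor, too: "20 differences" implies len(lst) * len(lst_2) frames, not len(lst) + len(lst_2).
# Sum the sign-frames with pandas arithmetic (index-aligned, NaN-propagating)
# rather than np.sum, which first tries to pack them all into one ndarray.
frames = [calc(df, T, t) for T in lst for t in lst_2]
result = sum(frames) / len(frames)  # len(frames) == len(lst) * len(lst_2)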

Pandas timestamp doesn't play nicely with numpy datetime?

I'm just beginning to use pandas 0.9 and am seeing unexpected behavior with pandas timestamps. After setting an index with datetimes, the time index doesn't seem to convert correctly to anything else. I'm probably not using it correctly, so please set me straight:
from pandas import *
import datetime
version.version
# 0.9.1
np.version.version
# 1.6.2
ndx = ['a','b','b']
date = [datetime.datetime(2013, 2, 16, 15, 0), datetime.datetime(2013, 2, 16, 11, 0),datetime.datetime(2013, 2, 16, 2, 0)]
vals = [1,2,3,]
df = DataFrame({'ndx':ndx,'date':date,'vals':vals})
df2=df.groupby(['ndx','date']).sum()
df2.index.get_level_values('date')
# array([1970-01-16 143:00:00, 1970-01-16 130:00:00, 1970-01-16 139:00:00], dtype=datetime64[ns])
df.set_index([ndx,date]).reset_index()['level_1'].unique() # fetch from index
# array([1970-01-16 143:00:00, 1970-01-16 139:00:00, 1970-01-16 130:00:00], dtype=datetime64[ns])
df.set_index([ndx,date]).reset_index()['date'].unique() # fetch from column
# array([2013-02-16 15:00:00, 2013-02-16 11:00:00, 2013-02-16 02:00:00], dtype=object)
I wouldn't expect anything with 1970 as a result of these operations. Thoughts?
This is a numpy bug: numpy 1.6 mis-renders datetime64[ns] values (hence the 1970 dates and the impossible hours), while the underlying data are fine, as the correct 2013 dates fetched from the column show. Upgrading numpy to 1.7+ resolves it. See the following:
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#numpy-datetime64-dtype-and-1-6-dependency
https://github.com/pydata/pandas/issues/2872