pn = Panel(randn(4,3,3), items=['a','a','b','b'], major_axis=np.arange(0,3), minor_axis=np.arange(0,3))
I'd like to average repeated items.
But...
pn.groupby(level=0)
TypeError: groupby() got an unexpected keyword argument 'level'
pn.groupby(axis='items')
TypeError: groupby() takes at least 2 arguments (2 given)
which I don't fully understand...
Is there another way out?
You need to pass a mapping function (which in this case is an identity function).
In 0.14.1 and prior, the API for Panel.groupby is different from Series/DataFrame.groupby (as it has not been updated); 0.15.0 will address this.
In [6]: pn = Panel(randn(4,3,3), items=['a','a','b','b'], major_axis=np.arange(0,3), minor_axis=np.arange(0,3))
In [7]: pn
Out[7]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: a to b
Major_axis axis: 0 to 2
Minor_axis axis: 0 to 2
Shown converted to a frame, just for clarity:
In [10]: pn.to_frame()
Out[10]:
a a b b
major minor
0 0 0.149552 0.149552 -3.750202 -3.750202
1 -1.354459 -1.354459 0.744473 0.744473
2 -0.183734 -0.183734 0.081965 0.081965
1 0 -1.946947 -1.946947 0.039065 0.039065
1 -0.648491 -0.648491 -0.141705 -0.141705
2 -1.581965 -1.581965 -0.628115 -0.628115
2 0 -1.280040 -1.280040 -0.556467 -0.556467
1 -0.093943 -0.093943 0.722981 0.722981
2 -0.207690 -0.207690 0.914684 0.914684
In [11]: pn.groupby(lambda x: x, axis='items').mean().to_frame()
Out[11]:
a b
major minor
0 0 0.733896 -1.814611
1 -1.021487 0.182690
2 -0.791080 -0.040136
1 0 -1.141415 -0.445435
1 -0.678486 -0.395380
2 -1.504996 0.172791
2 0 -0.405256 -0.999300
1 0.001912 0.272143
2 -0.987270 0.154344
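For reference, Panel was deprecated and later removed from pandas; a minimal sketch of the same "average the duplicated items" idea without Panel, reusing the pn defined above:
frame = pn.to_frame()                          # columns are the (duplicated) item labels
averaged = frame.T.groupby(level=0).mean().T   # average the columns that share a label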
Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-05', inclusive='left')
values = np.arange(0, len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2},
                  index=timestamps)
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
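If the frame is large, a vectorized alternative is possible with numpy's sliding_window_view. A minimal sketch, assuming numpy >= 1.20 and reusing df and the column naming from above:
import numpy as np
window_size = 3  # two timesteps plus one target column
windows = np.lib.stride_tricks.sliding_window_view(df.to_numpy(), window_size, axis=0)
# windows has shape (n_windows, n_columns, window_size)
n_windows, n_cols = windows.shape[0], windows.shape[1]
flat = windows.reshape(n_windows * n_cols, window_size)
cols = [f"timestep_{j}" for j in range(1, window_size)] + ["target"]
new_df = pd.DataFrame(flat,
                      index=pd.MultiIndex.from_product([range(n_windows), df.columns]),
                      columns=cols)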
df
            A  idx1  idx2
2022/1/1    0     2     4
2022/1/2    1     1     3
2022/1/3    2     0     3
2022/1/4    3     3     4
2022/1/5    4     0     4
expected df
            A  idx1  idx2  lowest
2022/1/1    0     2     4       2
2022/1/2    1     1     3       1
2022/1/3    2     0     3       0
2022/1/4    3     3     4       3
2022/1/5    4     0     4       0
Goal
Assign a lowest column derived from column A using positional (iloc) slicing, where the start index comes from idx1.values and the end index from idx2.values, as below:
df['lowest'] = df.A.iloc[df.idx1.values:df.idx2.values].min()
But I get TypeError: cannot do positional indexing on DatetimeIndex with these indexers.
I don't want to change the original index.
Also, there may be millions of rows, so speed matters and a numpy-based method is welcome.
The fastest way I can think of is the following; I timed it and it was about 10 times faster than a plain apply:
First convert column A to a numpy array. Then still apply along axis=1, but with raw=True, which passes each row as a numpy array. See the documentation:
the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
arr = df['A'].to_numpy()
df['lowest'] = df.apply(lambda row: arr[row[1]:row[2]].min(), axis=1, raw=True)
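If you want to avoid apply entirely, a minimal sketch of the same idea with raw numpy arrays and a list comprehension (assuming idx1 < idx2 on every row):
arr = df['A'].to_numpy()
starts = df['idx1'].to_numpy()
ends = df['idx2'].to_numpy()
# slice the raw array once per row; no pandas row objects are created
df['lowest'] = [arr[s:e].min() for s, e in zip(starts, ends)]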
IIUC, you can try it like this:
df['lowest'] = df.apply(lambda row: df.iloc[row['idx1']:row['idx2'], 0].min(), axis=1)
print(df)
Output:
A idx1 idx2 lowest
2022-01-01 0 2 4 2
2022-01-02 1 1 3 1
2022-01-03 2 0 3 0
2022-01-04 3 3 4 3
2022-01-05 4 0 4 0
Calculate the mean of the values in one row, grouped by the labels in another row:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to implement this. Thanks
This is a simple groupby problem ...
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index were unique you wouldn't need any groupby; just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note the duplicate rows (which happen to have the same mean because I lazily reused the same row contents); in general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult on yourself by constructing the dataframe so that the values you want to average and the labels you want to group by end up in different rows.
Option 1
groupby
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill, using np.bincount; it works neatly because your grouping values are 0 and 1. I'd have a solution even if they weren't, but this makes it simpler.
I wanted to use the raw lists A and B
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
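For reference, a minimal sketch of the same weighted-bincount trick with more than two labels (the values/labels names are just for illustration; labels must be non-negative integers):
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 1, 1, 2, 2]
sums = np.bincount(labels, weights=values)   # per-label sums: [3., 7., 11.]
counts = np.bincount(labels)                 # per-label counts: [2, 2, 2]
pd.Series(sums / counts)                     # per-label means: 1.5, 3.5, 5.5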
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64
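Note that the level= keyword on reductions such as mean was removed in later pandas versions; an equivalent sketch on modern pandas:
pd.Series(A, index=B).groupby(level=0).mean()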
This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
There may also be a prior step in which the 1s are NaN, and fillna(1) is then used to insert 1s where the strings were.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and the codes can recombine, so doing this with a dict and replace() would take about as long as doing it by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.
This is on 0.13.1
Docs here and here.
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
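For reference, convert_objects was deprecated and later removed; on modern pandas the same coerce-then-fill idea can be sketched with pd.to_numeric (column names as in the example above):
for col in ['T_Opp', 'T_Eval']:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # non-numeric strings become NaN
df = df.fillna(1)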
I want to use a MultiIndex with dates as one of the hierarchical index types. I also want to save the DataFrame as a frame_table, so that I can select subsets from disk without loading the whole thing. I currently get an error: TypeError: [date] is not implemented as a table column and I was wondering if I am using the multiindex incorrectly, or this is indeed a limitation of Pandas. Thanks!
import pandas as pd, numpy, datetime
print pd.__version__ #-> 0.13.0rc1
idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
with pd.get_store('test1.h5') as f:
    f.put('trials', df1)  #-> OK
with pd.get_store('test2.h5') as f:
    f.put('trials', df1, data_columns=True, format='t')  #-> TypeError: [date] is not implemented as a table column
Use datetime.datetime, as these types can be stored efficiently. Docs are here for an example of storing a multi-index frame in an HDFStore.
When storing a multi-index, you MUST specify names for the levels (HDFStore currently won't warn you if the levels are unnamed; this will be addressed in the next release).
In [20]: idx1 = pd.MultiIndex.from_tuples([(datetime.datetime(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)],names=['date','s','t'])
In [21]: df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
You need to store it as a table (put stores in the Fixed format unless append or a table format is specified).
In [22]: df1.to_hdf('test.h5','df',mode='w',format='table')
In [23]: pd.read_hdf('test.h5','df')
Out[23]:
a b
date s t
2013-12-01 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[12 rows x 2 columns]
Sample selection
In [8]: pd.read_hdf('test.h5','df',where='date=20131202')
Out[8]:
a b
date s t
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[6 rows x 2 columns]
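Range-style selections against the named date level should also work; a sketch using the same file and where syntax as above:
pd.read_hdf('test.h5','df',where='date>=20131202')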