Is there a simple way to convert an xarray DataArray to a pandas DataFrame, where I can prescribe which dimensions get turned into index/columns? For example, suppose I have a DataArray
import xarray as xr
weather = xr.DataArray(
    name='weather',
    data=[['Sunny', 'Windy'], ['Rainy', 'Foggy']],
    dims=['date', 'time'],
    coords={
        'date': ['Thursday', 'Friday'],
        'time': ['Morning', 'Afternoon'],
    }
)
which results in:
<xarray.DataArray 'weather' (date: 2, time: 2)>
array([['Sunny', 'Windy'],
       ['Rainy', 'Foggy']], dtype='<U5')
Coordinates:
  * date     (date) <U8 'Thursday' 'Friday'
  * time     (time) <U9 'Morning' 'Afternoon'
Suppose I now want to move that to a pandas DataFrame indexed by date, with columns time. I can kind of do this by using .to_dataframe() and then .unstack() on the resulting dataframe:
>>> weather.to_dataframe().unstack()
            weather        
time      Afternoon Morning
date                       
Friday        Foggy   Rainy
Thursday      Windy   Sunny
However, pandas sorts the labels, so rather than Morning followed by Afternoon, I get Afternoon followed by Morning. I was rather hoping there would be an API like
weather.to_dataframe(index_dims=[...], column_dims=[...])
which could do this reshaping for me, without me having to re-sort my indices and columns afterwards.
In xarray 0.16.1, dim_order was added to .to_dataframe. Does this do what you're looking for?
xr.DataArray.to_dataframe(
    self,
    name: Hashable = None,
    dim_order: List[Hashable] = None,
) -> pandas.core.frame.DataFrame
Docstring:
Convert this array and its coordinates into a tidy pandas.DataFrame.
The DataFrame is indexed by the Cartesian product of index coordinates
(in the form of a :py:class:`pandas.MultiIndex`).
Other coordinates are included as columns in the DataFrame.
Parameters
----------
name
    Name to give to this array (required if unnamed).
dim_order
    Hierarchical dimension order for the resulting dataframe.
    Array content is transposed to this order and then written out as flat
    vectors in contiguous order, so the last dimension in this list
    will be contiguous in the resulting DataFrame. This has a major
    influence on which operations are efficient on the resulting
    dataframe.

    If provided, must include all dimensions of this DataArray. By default,
    dimensions are sorted according to the DataArray dimensions order.
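For instance, a minimal sketch (dim_order requires xarray >= 0.16.1):

```python
import xarray as xr

weather = xr.DataArray(
    name='weather',
    data=[['Sunny', 'Windy'], ['Rainy', 'Foggy']],
    dims=['date', 'time'],
    coords={'date': ['Thursday', 'Friday'],
            'time': ['Morning', 'Afternoon']},
)

# dim_order controls the order of the MultiIndex levels in the tidy frame
df = weather.to_dataframe(dim_order=['time', 'date'])
```

Note that dim_order only rearranges the level hierarchy; it does not stop pandas from sorting the labels in a later unstack().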
If you want to make sure that coordinate labels stay in a particular (non-default) order, you can explicitly set their type to be a CategoricalIndex:
In [25]: import xarray as xr, pandas as pd
    ...: weather = xr.DataArray(
    ...:     name='weather',
    ...:     data=[['Sunny', 'Windy'], ['Rainy', 'Foggy']],
    ...:     dims=['date', 'time'],
    ...:     coords={
    ...:         'date': pd.CategoricalIndex(['Thursday', 'Friday'], ordered=True),
    ...:         'time': pd.CategoricalIndex(['Morning', 'Afternoon'], ordered=True),
    ...:     }
    ...: )
This will be preserved when converting to pandas:
In [26]: weather.to_dataframe().unstack()
Out[26]:
          weather          
time      Morning Afternoon
date                       
Thursday    Sunny     Windy
Friday      Rainy     Foggy
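The same mechanism can be checked in pure pandas. A small sketch (passing categories= explicitly, to be unambiguous about the intended order):

```python
import pandas as pd

# Ordered categoricals encode the desired label order in the index itself
date_idx = pd.CategoricalIndex(['Thursday', 'Friday'],
                               categories=['Thursday', 'Friday'], ordered=True)
time_idx = pd.CategoricalIndex(['Morning', 'Afternoon'],
                               categories=['Morning', 'Afternoon'], ordered=True)
idx = pd.MultiIndex.from_product([date_idx, time_idx], names=['date', 'time'])
df = pd.DataFrame({'weather': ['Sunny', 'Windy', 'Rainy', 'Foggy']}, index=idx)

# unstack orders rows and columns by category order, not alphabetically
wide = df.unstack('time')
```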
I know this question has been asked before. The answer is as follows:
df.resample('M').agg({'col1': np.sum, 'col2': np.mean})
But I have 27 columns, and I want to sum the first 25 and average the remaining two. Do I really have to write out ('col1': np.sum, ..., 'col25': np.sum) for the 25 columns and ('col26': np.mean, 'col27': np.mean) for the other two?
My dataframe contains hourly data and I want to convert it to monthly data. I tried something like the following, but I know it is nonsense:
for i in col_list:
    df = df.resample('M').agg({i-2: np.sum, 'col26': np.mean, 'col27': np.mean})
Is there any shortcut for this situation?
You can try this, no for loop needed:
sum_col = ['col1', 'col2', 'col3', 'col4', ...]
sum_df = df.resample('M')[sum_col].sum()

mean_col = ['col26', 'col27']
mean_df = df.resample('M')[mean_col].mean()

df = sum_df.join(mean_df)
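Alternatively, the whole mapping can be built programmatically and passed to a single .agg call. A sketch with a small made-up hourly frame (the column names are just illustrative):

```python
import pandas as pd

# Toy hourly frame standing in for the 27-column one
rng = pd.date_range('2020-01-01', periods=48, freq='h')
df = pd.DataFrame({'col1': 1.0, 'col2': 2.0, 'col26': 3.0, 'col27': 4.0}, index=rng)

mean_col = ['col26', 'col27']
sum_col = [c for c in df.columns if c not in mean_col]

# Build the dict once instead of writing 27 entries by hand
agg_map = {**{c: 'sum' for c in sum_col}, **{c: 'mean' for c in mean_col}}
monthly = df.resample('M').agg(agg_map)
```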
I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling mean. To be more specific, the average of the difference between a long-term mean (lst) and a smaller-one (lst_2). I am trying to combine the calculation with a double for loop as follows:
import pandas as pd
import numpy as np

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]
    lst_2 = [5, 10, 15, 20]
    result = pd.DataFrame(np.sum([calc(df, T, t) for T in lst for t in lst_2])) / (len(lst) + len(lst_2))
    return result

def calc(df, T, t):
    roll = pd.DataFrame(np.sign(df.rolling(t).mean() - df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100 ... 20 and 300); I take the sign of the difference and I want the average of these differences at each point in time. Ideally the result would be a dataframe result.
I get the error cannot copy sequence with size 3951 to array axis with dimension 1056 when the double for loop runs. Obviously I understand that, due to the different rolling windows T and t, the dimensions of the dataframes are not equal when it comes to the array conversion (with np.sum), but I thought it would insert NaN to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following
dataframe:
df = pd.DataFrame({'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522,
                         113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]},
                  index=range(12))
df = df.pct_change()
and I have the following 2 sets of mean I need to compute:
lst=[8,10]
lst_1=[3,4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': [np.nan]*8 + [-1, -1, -1, -1]}, index=range(12))
2/
I redo step 1 with the combination of differences 3-10 ; 4-8 ; 4-10. So I get overall 4 roll dataframes.
roll_3_8  = pd.DataFrame({'A': [np.nan]*8  + [-1, -1, -1, -1]}, index=range(12))
roll_3_10 = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]}, index=range(12))
roll_4_8  = pd.DataFrame({'A': [np.nan]*8  + [-1, -1, -1, -1]}, index=range(12))
roll_4_10 = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]}, index=range(12))
3/
Now that I have all the diffs, I simply want the average of them, so I sum all the 4 rolling dataframes, and I divide it by 4 (number of differences computed). The results should be (before dropping all N/A values):
result = pd.DataFrame({'A': [np.nan]*10 + [-1, -1]}, index=range(12))
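For what it's worth, the steps above can be sketched without converting to NumPy arrays at all: adding DataFrames aligns on the index and keeps the NaNs, which sidesteps the size mismatch. A sketch using the example numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522,
                         113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]})
df = df.pct_change()

lst, lst_1 = [8, 10], [3, 4]

# One sign-of-difference frame per (T, t) pair; all share the same index
rolls = [np.sign(df.rolling(t).mean() - df.rolling(T).mean())
         for T in lst for t in lst_1]

# Summing DataFrames aligns on the index and propagates NaN, so no padding needed
result = sum(rolls) / len(rolls)
```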
I am trying to understand how the date-related features of indexing in pandas work.
If I have this data frame:
import pandas as pd
from pandas import DataFrame
from numpy.random import randn

dates = pd.date_range('6/1/2000', periods=12, freq='M')
df1 = DataFrame(randn(12, 2), index=dates, columns=['A', 'B'])
I know that we can extract records from 2000 using df1['2000'] or a range of dates using df1['2000-09':'2001-03'].
But suppose instead I have a dataframe with a multi-index
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = DataFrame(randn(12, 2), index=index, columns=['C', 'D'])
Is there a way to extract rows with a year 2000 as we did with a single index? It appears that df2.xs('2000-06-30') works for accessing a particular date, but df2.xs('2000') does not return anything. Is xs not the right way to go about this?
You don't need to use xs for this; you can index using .loc.
One of the examples you tried would then look like df2.loc['2000-09':'2001-03']. The only problem is that the 'partial string parsing' feature does not yet work with a MultiIndex, so you have to provide actual datetimes:
In [17]: df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
Out[17]:
C D
date id
2000-09-30 K -0.441505 0.364074
2000-10-31 H 2.366365 -0.404136
2000-11-30 I 0.371168 1.218779
2000-12-31 J -0.579180 0.026119
2001-01-31 K 0.450040 1.048433
2001-02-28 H 1.090321 1.676140
2001-03-31 I -0.272268 0.213227
But note that in this case pd.Timestamp('2001-03') would be interpreted as 2001-03-01 00:00:00 (an actual moment in time). Therefore, you have to adjust the start/stop values a little bit.
A selection for a full year (eg df1['2000']) would then become df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')] or df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')].
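To illustrate (random data, so only the shape is checked; the boolean mask on get_level_values is shown as an alternative that sidesteps the slicing subtleties entirely):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('6/1/2000', periods=12, freq='M')
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])

# Slice the (sorted) first level with explicit Timestamps
year_2000 = df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')]

# Version-robust alternative: boolean mask on the level values
year_2000_alt = df2[df2.index.get_level_values('date').year == 2000]
```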
I am trying to use a slicer on a pandas dataframe with a MultiIndex:
import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn

dates = pd.date_range('6/30/2000', periods=12, freq='M')
index = MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
df = DataFrame(randn(12, 4), index=index, columns=['A', 'B', 'C', 'D'])
I would like to get the rows where id='H'. From this comment in a related question I asked, and from reading the documentation, I thought this:
df.loc[(slice(None), 'H'),:]
or perhaps this:
df.loc[(slice(None), ['H']),:]
would work. The first one returns this error:
IndexError: index 12 is out of bounds for axis 1 with size 12
and the second one gives this error:
IndexError: indices are out-of-bounds
From looking at other questions, I thought perhaps I need to sort by the 2nd-level index before trying to slice. I'm not really sure what I'm doing here, but I tried to use df.sort_index() but am having trouble with the syntax. I'm also not sure whether that is even the issue here.
You've got a problem in your line
index = MultiIndex.from_arrays([dates, list('HIJKHIJKHIJKHIJK')], names=['date', 'id'])
The arrays should be the same length: you have 16 letters but only 12 dates. That this doesn't raise an error is probably a bug.
If you fix that then either of
In [58]: df.xs('H', level='id')
Out[58]:
A B C D
date
2000-06-30 -0.645203 0.965441 0.150037 -0.083979
2000-10-31 -1.222954 0.498284 -1.249005 -1.664407
2001-02-28 -0.941248 2.025381 0.450256 1.182266
In [59]: df.loc[(slice(None), 'H'), :]
Out[59]:
A B C D
date id
2000-06-30 H -0.645203 0.965441 0.150037 -0.083979
2000-10-31 H -1.222954 0.498284 -1.249005 -1.664407
2001-02-28 H -0.941248 2.025381 0.450256 1.182266
should work, depending on whether you'd like to drop the id level.
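A sketch with the lengths fixed (12 ids for 12 dates), also showing pd.IndexSlice, which some find more readable than slice(None):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('6/30/2000', periods=12, freq='M')
# 12 ids to match the 12 dates (the 16-letter list was the bug)
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df = pd.DataFrame(np.random.randn(12, 4), index=index, columns=['A', 'B', 'C', 'D'])

# sort_index makes label-based slicing on inner levels reliable
df = df.sort_index()

idx = pd.IndexSlice
h_rows = df.loc[idx[:, 'H'], :]
```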