common_index is size 8783 and Series[common_index] is 8784, how is this even possible? - pandas

common_index is a DatetimeIndex of length 8783.
When I use it to filter a Series object like this, Series[common_index], I get a Series object of size 8784.
How is this even possible?

Yes, it is possible if you select with duplicated values:
s = pd.Series(range(3), index=[0,1,5])
print (s[[0,1,0,5,5,5]])
0 0
1 1
0 0
5 2
5 2
5 2
dtype: int64
print (pd.unique([0,1,0,5,5,5]))
[0 1 5]
print (s[pd.unique([0,1,0,5,5,5])])
0 0
1 1
5 2
dtype: int64
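
Note that in the scenario from the question the result (8784) is longer than the selector (8783), which points at duplicates in the Series' own index rather than in common_index: a label that appears twice in the Series comes back twice even when it is selected once. A minimal sketch of that case (the names mirror the question and the data is made up):
s = pd.Series(range(3), index=[0, 1, 1])
print (s[[0, 1]])
0 0
1 1
1 2
dtype: int64
print (s.index.duplicated().sum())
1
So checking s.index.duplicated() (or common_index.duplicated()) is usually enough to see where the extra row comes from.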

Groupby and multiindexes - how to organize data with irregular sizes?

I am trying to organize 3D data collected from several participants, with a different number of samples for each participant. Each participant has a unique session and seat index in the experiment. For each participant i, I have a 3D array composed of Ni images (height*width).
I first tried creating a Dataset of participants, but I ended up with many NaNs because participants have different numbers of samples along the same dimension (the sample dim). I then switched to a single DataArray containing all my participants' data concatenated along one dimension I call depth. This dimension is then associated with a MultiIndex coordinate combining the session, seat and sample coordinates:
<xarray.DataArray (depth: 52, height: 4, width: 4)>
array([[[0.92337111, 0.86505447, 0.08541727, 0.74850848],
[0.02336959, 0.0495726 , 0.98745956, 0.58831929],
[0.62128185, 0.7732787 , 0.27716268, 0.83634779],
[0.08146719, 0.35851012, 0.44170263, 0.74338872]],
...
[[0.4365896 , 0.23527988, 0.86891853, 0.94486637],
[0.20884748, 0.81012315, 0.61542411, 0.76706922],
[0.33391262, 0.88955315, 0.25329999, 0.35803887],
[0.49586615, 0.94767265, 0.40868892, 0.42393425]]])
Coordinates:
* height (height) int64 0 1 2 3
* width (width) int64 0 1 2 3
* depth (depth) MultiIndex
- session (depth) int64 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 3 3 3 3 3 3 3 3 3 3 3 3
- seat (depth) int64 0 0 0 0 0 1 1 1 1 1 1 0 0 ... 0 0 0 0 0 1 1 1 1 1 1 1
- sample (depth) int64 0 1 2 3 4 0 1 2 3 4 5 0 1 ... 1 2 3 4 5 0 1 2 3 4 5 6
However, I find this solution not really usable, for several reasons:
each time I want to perform a groupby, I have to reset the index and recreate one with the coordinates I want to group by, since xarray does not support multiple groupbys on the same dim:
da = da.reset_index('depth')
da = da.set_index(depth=['session', 'seat'])
da.groupby('depth').mean()
the result of the code above is not perfect, as it does not preserve the MultiIndex level names:
<xarray.DataArray (depth: 8, height: 4, width: 4)>
array([[[0.47795382, 0.67322777, 0.12946181, 0.48983815],
[0.33895882, 0.46772217, 0.62886196, 0.55970122],
[0.57370573, 0.47272117, 0.31529004, 0.63230245],
[0.63230284, 0.5352105 , 0.65805407, 0.65274841]],
...
[[0.55672404, 0.37963945, 0.57334768, 0.64853806],
[0.46608072, 0.39506509, 0.66339553, 0.71447367],
[0.58989461, 0.66066485, 0.53271228, 0.43036214],
[0.44163921, 0.54990042, 0.4229631 , 0.5941268 ]]])
Coordinates:
* height (height) int64 0 1 2 3
* width (width) int64 0 1 2 3
* depth (depth) MultiIndex
- depth_level_0 (depth) int64 0 0 1 1 2 2 3 3
- depth_level_1 (depth) int64 0 1 0 1 0 1 0 1
I can use sel only on fully indexed data (i.e. by using session, seat and sample in the depth index), so I end up re-indexing my data again and again.
I find using hvplot on such a DataArray not really straightforward (skipping the details here for easier reading of this already long post).
Is there something I am missing? Is there a better way to organize my data? I tried to create multiple indexes on the same dim for convenience, but without success.
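For reference, a minimal runnable sketch of the setup described above (shapes and level values are made up; the groupby workaround is the one shown in the question, and recent xarray versions may prefer xr.Coordinates.from_pandas_multiindex when building the MultiIndex coordinate):
import numpy as np
import pandas as pd
import xarray as xr

# Tiny made-up version of the layout: a depth dimension carrying a
# (session, seat, sample) MultiIndex, plus height and width.
idx = pd.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0)],
    names=['session', 'seat', 'sample'])
da = xr.DataArray(np.random.rand(len(idx), 4, 4),
                  dims=('depth', 'height', 'width'),
                  coords={'depth': idx})

# Workaround from the question: rebuild the index with only the levels to
# group on, then group over the combined depth coordinate.
per_seat = (da.reset_index('depth')
              .set_index(depth=['session', 'seat'])
              .groupby('depth')
              .mean())
# As noted above, the level names come back as depth_level_0 / depth_level_1.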

calculate the mean of one row according to its label

calculate the mean of the values in one row according to its label:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to implement this. Thanks
This is a simple groupby problem ...
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note: there are duplicate rows (which happen to have the same mean, since I lazily used the same row contents); in general, we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult on yourself by constructing the dataframe in such a way that the things you want to take the mean of and the things you want to use as labels end up as different rows.
Option 1
groupby
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill, using np.bincount, and made simpler because your grouping values are 0 and 1 (I'd have a solution even if they weren't, but this keeps it simple).
I wanted to use the raw lists A and B
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64
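A follow-up note on Option 3: in newer pandas versions the level argument to mean has been deprecated (and later removed), so the same idea is written with an explicit groupby; a minimal sketch using the raw lists A and B:
pd.Series(A).groupby(B).mean()
0 3.0
1 8.0
dtype: float64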

Combining two dataframes along Index in Pandas

I have two dataframes pin1 and pin2 of different sizes with different indexes, i.e. pin1 has index values ['2', '4'] and pin2 has ['0', '1', '7']. I would like to combine both along the index to form ['0', '1', '2', '4', '7']. I tried merging with an 'outer' join, but it changes the index values to ['0', '1', '2', '3', '4'].
In [1]: pin1= pd.Series(np.random.randn(2), index=['2', '4'])
In [2]: pin2= pd.Series(np.random.randn(3), index=['0', '1', '7'])
In [3]: pin3=pd.merge(pin1,pin2,how='outer')
In [4]: pin3
Out [4]:
0 0.2941
1 0.2869
2 1.7098
3 -0.2126
4 0.2696
expected output:
Out [4]:
0 0.2941
1 0.2869
2 1.7098
4 -0.2126
7 0.2696
If the sets of indices are disjoint, you can use pd.concat:
pd.concat([pin1, pin2]).sort_index()
Using combine_first
In [3732]: pin1.combine_first(pin2)
Out[3732]:
0 -0.820341
1 0.492719
2 -0.785723
4 -1.815021
7 2.027267
dtype: float64
Or, append
In [3734]: pin1.append(pin2).sort_index()
Out[3734]:
0 -0.820341
1 0.492719
2 -0.785723
4 -1.815021
7 2.027267
dtype: float64
Details
In [3735]: pin1
Out[3735]:
2 -0.785723
4 -1.815021
dtype: float64
In [3736]: pin2
Out[3736]:
0 -0.820341
1 0.492719
7 2.027267
dtype: float64
Or using align
pin1.align(pin2,join='outer')[0].fillna(pin1.align(pin2,join='outer')[1])
Out[991]:
0 -0.278627
1 0.009388
2 -0.655377
4 0.564739
7 0.793576
dtype: float64

Pandas .count() puts in first row name out of nowhere?

I have a pandas dataframe, where the first row is called school and the last row is called passed, and it has only numbers 1 and 0.
I simply wanted to count how often 1 or 0 occurs in that row.
I went with:
n_passed = df[df.passed==1].count()
The funny thing is, it gives me the correct number, but also outputs 'school', for a reason that is beyond me.
school 265
Can anyone shed light on this?
IIUC, you mean not rows but the columns passed and school. Then you can use value_counts on the column passed:
print df
school aa bb passed
0 1 0 1 1
1 0 1 0 0
2 1 1 0 1
3 0 0 1 1
n_passed1 = df.passed[df.passed==1].value_counts()
print n_passed1
1 3
Name: passed, dtype: int64
n_passed0 = df.passed[df.passed==0].value_counts()
print n_passed0
0 1
Name: passed, dtype: int64
But I think the best is to use:
n_passed1 = df.passed.value_counts()
print n_passed1
1 3
0 1
Name: passed, dtype: int64
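To spell out why 'school' showed up: df[df.passed==1] keeps every column of the matching rows, and .count() then reports the number of non-null values per column, so school 265 is simply that per-column count for the school column. If all you want is the number of rows where passed is 1 or 0, a boolean sum is enough; a minimal sketch against the same df:
n_passed1 = (df.passed == 1).sum()
n_passed0 = (df.passed == 0).sum()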

Pandas: how to use convert_objects to replace strings with NaN values

This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
There may also be a step prior to this, where the 1s are NaN and fillna(1) is then used to insert 1s where the strings were.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and that the codes can recombine, so doing this with a dict and replace() would take about the same amount of time as doing it by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.
This is on 0.13.1 (see the convert_objects docs).
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
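As a note for newer pandas versions: convert_objects was later deprecated and removed, and pd.to_numeric with errors='coerce' is the usual replacement. A minimal sketch of the same coerce-then-fillna step, using the column names from the example above:
for col in ['T_Opp', 'T_Eval']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(1)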