Python Pandas Where Condition Is Not Working

I have created a where condition in Python:
filter = data['Ueber'] > 2.3
data[filter]
Here is my dataset:
Saison Spieltag Heimteam ... Ueber Unter UeberUnter
0 1819 3 Bayern München ... 1.30 3.48 Ueber
1 1819 3 Werder Bremen ... 1.75 2.12 Unter
2 1819 3 SC Freiburg ... 2.20 1.69 Ueber
3 1819 3 VfL Wolfsburg ... 2.17 1.71 Ueber
4 1819 3 Fortuna Düsseldorf ... 1.46 2.71 Ueber
Unfortunately, my greater-than condition is not working. What's the problem?
Thanks

Just for the sake of clarity: if the column you are comparing really contains floats, the conditional check should work.
Example DataFrame:
>>> df = pd.DataFrame({'num': [-12.5, 60.0, 50.0, -25.10, 50.0, 51.0, 71.0]}, dtype=float)
>>> df
num
0 -12.5
1 60.0
2 50.0
3 -25.1
4 50.0
5 51.0
6 71.0
Conditional check to compare:
>>> df['num'] > 50.0
0 False
1 True
2 False
3 False
4 False
5 True
6 True
Name: num, dtype: bool
Result:
>>> df[df['num'] > 50.0]
num
1 60.0
5 51.0
6 71.0
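If the comparison instead appears to do nothing, a common culprit is that the column is stored as strings (object dtype), in which case > compares lexicographically rather than numerically. A minimal sketch of how to check and convert (the dtype problem is an assumption here, since the question doesn't show the raw data):

import pandas as pd

# Inspect the dtype first; 'object' usually means strings
print(data['Ueber'].dtype)

# Convert to numeric if needed; unparseable entries become NaN
data['Ueber'] = pd.to_numeric(data['Ueber'], errors='coerce')

# The boolean mask should now behave as expected
print(data[data['Ueber'] > 2.3])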

Related

Pandas concatenate dataframe with multiindex retaining index names

I have a list of DataFrames, where each DataFrame in the list looks like this:
dfList[0]
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
dfList[1]
monthNum 1 2
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
dfList[0].index
Float64Index([2.0, 3.0, 4.0], dtype='float64', name='G1')
dfList[0].columns
Int64Index([1, 2], dtype='int64', name='monthNum')
I am trying to achieve the following in a dataframe Final_Combined_DF:
monthNum 1 2
G1
2.0 0.05 -0.16
3.0 1.17 0.07
4.0 9.06 0.83
G2
21.0 0.25 0.26
31.0 1.27 0.27
41.0 9.26 0.23
I tried doing different combinations of:
pd.concat(dfList, axis=0)
but it has not given me the desired output. I am not sure how to go about this.
We can use pd.concat with keys, taking the Index.name from each DataFrame, to add a new outer index level in the final frame:
final_combined_df = pd.concat(
    df_list, keys=[d.index.name for d in df_list]
)
final_combined_df:
monthNum 0 1
G1 2.0 4 7
3.0 7 1
4.0 9 5
G2 21.0 8 1
31.0 1 8
41.0 2 6
Setup Used:
import numpy as np
import pandas as pd
np.random.seed(5)
df_list = [
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([2.0, 3.0, 4.0], name='G1')),
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([21.0, 31.0, 41.0], name='G2'))
]
df_list:
[monthNum 0 1
G1
2.0 4 7
3.0 7 1
4.0 9 5,
monthNum 0 1
G2
21.0 8 1
31.0 1 8
41.0 2 6]
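Once the keys are in place, each original frame can be pulled back out of the combined result via the outer index level; a brief illustration:

# Select all rows belonging to one group via the new outer level
print(final_combined_df.loc['G1'])

# Equivalent, using xs
print(final_combined_df.xs('G2'))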

fill_value in the pandas shift doesn't work with groupby

I need to shift a column in a pandas DataFrame for every name, filling the resulting NAs with a predefined value. Below is a code snippet run with Python 2.7.
import pandas as pd
d = {'Name': ['Petro', 'Petro', 'Petro', 'Petro', 'Petro', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta'],
     'Month': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
     'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]}
data = pd.DataFrame(d)
data['ValueLag'] = data.groupby('Name').Value.shift(-1, fill_value = 20)
print data
After running the code above I get the following output:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 NaN
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 NaN
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 NaN
It looks like fill_value did not work here, while I need the NaN to be filled with some number, let's say 4.
Or, to tell the whole story, I need the last value to be extended, like this:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 26.4
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 27.0
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 23.0
Is there a way to fill forward with the last value (or backward with the first value, when shifting a positive number of periods)?
fill_value is most likely being ignored here: support for fill_value in GroupBy.shift was only added in later pandas versions, and older ones drop it silently. Since you want the last value carried forward anyway, a forward fill after the shift does the job:
data['ValueLag'] = data.groupby('Name').Value.shift(-1).ffill()
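Note that .ffill() operates on the whole Series, which is fine here because each name's rows are contiguous; otherwise a value could leak from one group into the next. A group-safe variant (a sketch, not from the original answer) fills each group's trailing NaN with that row's own Value, which is exactly the desired output; use .fillna(4) instead for a constant:

# Fill each group's trailing NaN with that row's own Value (group-safe)
data['ValueLag'] = data.groupby('Name')['Value'].shift(-1).fillna(data['Value'])

# Or fill with a constant instead:
# data['ValueLag'] = data.groupby('Name')['Value'].shift(-1).fillna(4)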

The union of the intersection of rows of n dataframes

Say I have n dataframes, in this example n = 3.
**df1**
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
5 True 0 26.0
**df2**
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 True 2 19.0
**df3**
A B C
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5
5 True 0 27.50
**dfn** ...
I want one dataframe that includes all the rows whose Column C value appears in every dataframe. So this is a kind of union of the intersection of the dataframes on a column, in this case Column C. For the above dataframes, the rows with 19.0, 26.0 and 27.5 don't make it into the final dataframe, which is:
**Expected df**
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5
So a row lives on to the final dataframe, if and only if, the value in Column C is seen in all dataframes.
Reproducible code:
import pandas as pd
df1 = pd.DataFrame({'A': [True, True, False, False, False, True], 'B': [3, 1, 2, 4, 1, 0],
                    'C': [21.0, 23.0, 25.0, 25.5, 25.0, 26.0]})
df2 = pd.DataFrame({'A': [True, True, False, False, False], 'B': [3, 1, 2, 4, 2],
                    'C': [21.0, 23.0, 25.0, 25.5, 19.0]})
df3 = pd.DataFrame({'A': [True, True, False, False, False, True], 'B': [3, 2, 2, 1, 4, 0],
                    'C': [21.0, 23.0, 25.0, 25.0, 25.5, 27.5]})
dfn = ...
dfn = ...
The straightforward approach seems to be to compute the (n-way intersection) common C values (as a set/list), then filter with .isin:
common_C_values = set.intersection(set(df1['C']), set(df2['C']), set(df3['C']))
df_all = pd.concat([df1, df2, df3])
df_all = df_all[df_all['C'].isin(common_C_values)]
You can use pd.concat:
# merge column C from all DataFrames side by side
df_C = pd.concat([df1, df2, df3], axis=1)['C']
# concat all DataFrames vertically
df_all = pd.concat([df1, df2, df3])
# keep only rows whose C value appears in every DataFrame's C column
df_all.loc[df_all.apply(lambda x: df_C.eq(x.C).sum().all(), axis=1)]
Out[105]:
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
0 True 3 21.0
1 True 2 23.0
2 False 2 25.0
3 False 1 25.0
4 False 4 25.5
For simplicity, store your dataframes in a list. We'll make use of set operations to speed this up as much as possible.
df_list = [df1, df2, df3, ...]
common_idx = set.intersection(*[set(df['C']) for df in df_list])
print(common_idx)
{21.0, 23.0, 25.0, 25.5}
Thanks to @smci for the improvement! set.intersection finds the values common to all the frames. Finally, call pd.concat to join the dataframes vertically, and then use query to filter on the common values obtained in the previous step (note the @ prefix, which lets query reference the local variable):
pd.concat(df_list, ignore_index=True).query('C in @common_idx')
A B C
0 True 3 21.0
1 True 1 23.0
2 False 2 25.0
3 False 4 25.5
4 False 1 25.0
5 True 3 21.0
6 True 1 23.0
7 False 2 25.0
8 False 4 25.5
9 True 3 21.0
10 True 2 23.0
11 False 2 25.0
12 False 1 25.0
13 False 4 25.5
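For completeness, the whole recipe generalizes to any number of frames; here is a small helper (a sketch with a hypothetical name, not part of either original answer):

def union_of_intersection(dfs, col='C'):
    # values of `col` present in every frame (n-way intersection)
    common = set.intersection(*(set(df[col]) for df in dfs))
    # stack all frames, then keep only rows whose value survives
    combined = pd.concat(dfs, ignore_index=True)
    return combined[combined[col].isin(common)]

result = union_of_intersection([df1, df2, df3])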

how to create a time series column and reindex in pandas?

I'm a new pandas learner. I have a 5-row dataframe as follows:
A B C D
0 2.34 3.16 99.0 3.2
1 2.1 55.5 77.5 1
2 22.1 54 89 33
3 23 1.24 4.7 5
4 45 2.5 8.7 99
I want to replace the index 0, 1, ..., 4 with a new index 1 to 5. My expected output is:
A B C D
1 2.34 3.16 99.0 3.2
2 2.1 55.5 77.5 1
3 22.1 54 89 33
4 23 1.24 4.7 5
5 45 2.5 8.7 99
What I did is create a new column:
new_index = pd.DataFrame({'#': range(1, 5 + 1 ,1)})
Then I tried to reindex:
df.reindex(new_index)
But I got error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What should I do to relabel the index? Thanks.
Use set_index. (reindex fails here because it aligns existing rows to the given labels rather than relabeling them, and passing a DataFrame where a 1-D sequence of labels is expected triggers the dimension error.)
In [5081]: df.set_index([range(1, 6)])
Out[5081]:
A B C D
1 2.34 3.16 99.0 3.2
2 2.10 55.50 77.5 1.0
3 22.10 54.00 89.0 33.0
4 23.00 1.24 4.7 5.0
5 45.00 2.50 8.7 99.0
Or set values of df.index
In [5082]: df.index = range(1, 6)
In [5083]: df
Out[5083]:
A B C D
1 2.34 3.16 99.0 3.2
2 2.10 55.50 77.5 1.0
3 22.10 54.00 89.0 33.0
4 23.00 1.24 4.7 5.0
5 45.00 2.50 8.7 99.0
Details
Original df
In [5085]: df
Out[5085]:
A B C D
0 2.34 3.16 99.0 3.2
1 2.10 55.50 77.5 1.0
2 22.10 54.00 89.0 33.0
3 23.00 1.24 4.7 5.0
4 45.00 2.50 8.7 99.0
You need .values
df.index=df.index.values+1
df
Out[141]:
A B C D
1 2.34 3.16 99.0 3.2
2 2.10 55.50 77.5 1.0
3 22.10 54.00 89.0 33.0
4 23.00 1.24 4.7 5.0
5 45.00 2.50 8.7 99.0
As per @Zero:
df.index += 1
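Another equivalent spelling (a sketch; it assumes you simply want fresh 1-based integer labels, regardless of what the current index holds):

# Assign a fresh 1-based integer index
df.index = pd.RangeIndex(start=1, stop=len(df) + 1)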

Efficient method for using formulas in a pandas dataframe

I am trying to add a column to a dataframe based on a formula. I don't think my current solution is very pythonic or efficient, so I am looking for faster options.
I have a table with 3 columns:
import pandas as pd
df = pd.DataFrame([
    [1, 1, 20.0],
    [1, 2, 50.0],
    [1, 3, 30.0],
    [2, 1, 30.0],
    [2, 2, 40.0],
    [2, 3, 30.0],
], columns=['seg', 'reach', 'len'])
# print df
df
seg reach len
0 1 1 20.0
1 1 2 50.0
2 1 3 30.0
3 2 1 30.0
4 2 2 40.0
5 2 3 30.0
# Formula here (df.ix is long deprecated; df.loc does the same thing)
for index, row in df.iterrows():
    if row['reach'] == 1:
        df.loc[index, 'cumseglen'] = row['len'] * 0.5
    else:
        df.loc[index, 'cumseglen'] = (df.loc[index - 1, 'cumseglen']
                                      + 0.5 * (df.loc[index - 1, 'len'] + row['len']))
# print final results
df
seg reach len cumseglen
0 1 1 20.0 10.0
1 1 2 50.0 45.0
2 1 3 30.0 85.0
3 2 1 30.0 15.0
4 2 2 40.0 50.0
5 2 3 30.0 85.0
How can I improve the efficiency of the formula step?
To me this looks like a group-by operation. That is, within each "segment" group, you want to apply some operation to that group.
Here's one way to perform your calculation from above, using a group-by and some cumulative sums within each group:
import numpy as np

def cumulate(group):
    # half of the running total of `len` within the group
    cuml = 0.5 * np.cumsum(group)
    # add the previous row's half-cumsum (0 for the first row of each group)
    return cuml + cuml.shift(1).fillna(0)

df['cumseglen'] = df.groupby('seg')['len'].apply(cumulate)
print(df)
The result:
seg reach len cumseglen
0 1 1 20.0 10.0
1 1 2 50.0 45.0
2 1 3 30.0 85.0
3 2 1 30.0 15.0
4 2 2 40.0 50.0
5 2 3 30.0 85.0
Algorithmically, this is not exactly the same as what you wrote, but under the assumption that the "reach" column starts from 1 at the beginning of each new segment indicated by the "seg" column, this should work.
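For larger frames the apply can be dropped entirely in favor of two vectorized groupby operations; a sketch, assuming a pandas version where GroupBy.shift accepts fill_value:

# half the running total of `len` within each segment
half_cumsum = 0.5 * df.groupby('seg')['len'].cumsum()
# add the previous row's value within the same segment (0 at each segment start)
df['cumseglen'] = half_cumsum + half_cumsum.groupby(df['seg']).shift(1, fill_value=0)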