Pandas Multiindex - Select from List - pandas

I have the following DataFrame with Index Date and ID
               V1 V2
Date       ID
01.01.2010 1    x  y
01.01.2010 2    x  y
02.01.2010 1    x  y
...
I was able to select a date range with
df.loc[ slice(start, end) ]
But I need to filter the data based on a list of IDs. For example:
allowed = [1, 5]
df.loc[ allowed ]
How is this possible?

Use Index.isin to check for membership along the index. For a MultiIndex, explicitly supply the level name or the level position on which the check should be performed.
df.loc[df.index.isin(allowed, level='ID')] # level name is specified
(Or)
df.loc[df.index.isin(allowed, level=1)] # level position is specified
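As a minimal, self-contained sketch (the toy data here is assumed, mirroring the question's layout):

import pandas as pd

df = pd.DataFrame(
    {'V1': ['x'] * 4, 'V2': ['y'] * 4},
    index=pd.MultiIndex.from_tuples(
        [('01.01.2010', 1), ('01.01.2010', 2),
         ('02.01.2010', 1), ('02.01.2010', 5)],
        names=['Date', 'ID'],
    ),
)

allowed = [1, 5]
print(df.loc[df.index.isin(allowed, level='ID')])  # keeps only IDs 1 and 5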

You can also use the .query() method; the @ prefix lets query reference a Python variable from the enclosing scope:
In [62]: df
Out[62]:
              V1 V2
Date       ID
2010-01-01 1   x  y
           2   x  y
2010-02-01 1   x  y

In [63]: allowed = [1, 5]

In [64]: df.query("ID in @allowed")
Out[64]:
              V1 V2
Date       ID
2010-01-01 1   x  y
2010-02-01 1   x  y

Related

df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
   a  b  c
0  x  x  x
1  y  y  y
2  z  z  z
I would like to apply a function to each row of the dataframe. That function creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe will look something like this:
   a  b  c
0  x  x  x
1  x  x  x
2  y  y  y
3  y  y  y
4  y  y  y
5  z  z  z
6  z  z  z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function, and for each row the dataframe created by my_func is correct. However, at the end, when the last row returns, I get an error:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply is trying to return a Series, not the group of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
which just creates single-row groups and applies the same function. This, on the other hand, works.
Why?
Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
   a  b  c
0  1  1  4
1  2  2  4
2  3  3  4
Doing:
new_df = (df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
   a  b  c
0  1  1  4
1  1  1  4
2  1  1  4
3  2  2  4
4  2  2  4
5  3  3  4
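Alternatively, the row duplication can be done without any row-wise apply at all, using Index.repeat (a sketch, assuming a and c stay integer-valued as above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [4, 4, 4]})

# repeat each row label (c - a) times, select those rows, then renumber
new_df = df.loc[df.index.repeat(df['c'] - df['a'])].reset_index(drop=True)
print(new_df)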

create a new column based on cumulative occurrences of a specific value in another column pandas

I want to count the number of occurrences of one specific value (string) in one column and write it down in another column cumulatively.
For example, counting the cumulative number of Y values here:
col_1  new_col
Y      1
Y      2
N      2
Y      3
N      3
I wrote this code but it gives me the final number instead of cumulative frequencies.
df['new_col'] = 0
df['new_col'] = df.loc[df.col_1 == 'Y'].count()
To count both values cumulatively you can use:
df['new_col'] = (df
    .groupby('col_1')
    .cumcount().add(1)
    .cummax()
)
If you want to focus on 'Y':
df['new_col'] = (df
    .groupby('col_1')
    .cumcount().add(1)
    .where(df['col_1'].eq('Y'))
    .ffill()
    .fillna(0, downcast='infer')
)
Output:
  col_1  new_col
0     Y        1
1     Y        2
2     N        2
3     Y        3
4     N        3
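For counting just 'Y', a shorter equivalent (a minimal sketch on the same toy data) is a cumulative sum over a boolean mask, since True counts as 1 and False as 0:

import pandas as pd

df = pd.DataFrame({'col_1': ['Y', 'Y', 'N', 'Y', 'N']})

# running count of 'Y' seen so far; N rows simply carry the count forward
df['new_col'] = df['col_1'].eq('Y').cumsum()
print(df)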

drop consecutive duplicates of groups

I am removing consecutive duplicates in groups in a dataframe. I am looking for a faster way than this:
def remove_consecutive_dupes(subdf):
    dupe_ids = ["A", "B"]
    is_duped = (subdf[dupe_ids].shift(-1) == subdf[dupe_ids]).all(axis=1)
    subdf = subdf[~is_duped]
    return subdf

# dataframe with columns key, A, B
df.groupby("key").apply(remove_consecutive_dupes).reset_index()
Is it possible to remove these without grouping first? Applying the above function to each group individually takes a lot of time, especially if the group count is like half the row count. Is there a way to do this operation on the entire dataframe at once?
A simple example for the algorithm if the above was not clear:
input:
  key  A  B
0   x  1  2
1   y  1  4
2   x  1  2
3   x  1  4
4   y  2  5
5   x  1  2
output:
  key  A  B
0   x  1  2
1   y  1  4
3   x  1  4
4   y  2  5
5   x  1  2
Row 2 was dropped because A=1 B=2 was also the previous row in group x.
Row 5 will not be dropped because it is not a consecutive duplicate in group x.
According to your code, you drop a row only if it directly follows a row with the same values when the rows are grouped by the key; rows with another key in between do not influence this logic. At the same time, you want to preserve the original order of the records.
I guess the biggest influence on the runtime is the Python-level call of your function for each group, not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)

# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this results in:
               key  A  B
original_order
0                x  1  2
1                y  1  4
3                x  1  4
4                y  2  5
If you don't like the index name above (original_order), you just need to add the following line to remove it:
df.index.name = None
Test data:
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df
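On the question of doing this for the entire dataframe at once: a vectorized sketch (on the toy data above) is to use groupby(...).shift(), which computes each row's predecessor within its key group in one pass and preserves the original order without sorting:

import pandas as pd

df = pd.DataFrame({
    "key": ["x", "y", "x", "x", "y", "x"],
    "A":   [1, 1, 1, 1, 2, 1],
    "B":   [2, 4, 2, 4, 5, 2],
})

# previous row of the same key group, computed for the whole frame at once
prev = df.groupby("key")[["A", "B"]].shift()
# a row is a consecutive duplicate if it equals that previous row in A and B
is_dupe = df[["A", "B"]].eq(prev).all(axis=1)
print(df[~is_dupe])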

Conditional frequency of elements within lists in pandas data frame

I have a data frame in pandas like this:
STATUS  FEATURES
A       [x, y, z]
A       [t, y]
B       [x, p, t]
B       [x, p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS  FEATURES  FREQUENCY
A       x         1
A       y         2
A       z         1
A       t         1
B       x         2
B       t         1
B       p         2
Let us do explode, then groupby with size:
s = df.explode('FEATURES').groupby(['STATUS', 'FEATURES']).size().reset_index(name='FREQUENCY')
Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
            .groupby('STATUS')['FEATURES']
            .value_counts()
            .reset_index(name='FREQUENCY'))
print(new_df)
Output
  STATUS FEATURES  FREQUENCY
0      A        y          2
1      A        t          1
2      A        x          1
3      A        z          1
4      B        p          2
5      B        x          2
6      B        t          1
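For reference, a self-contained runnable version (the FEATURES column is assumed to hold Python lists):

import pandas as pd

df = pd.DataFrame({
    "STATUS": ["A", "A", "B", "B"],
    "FEATURES": [["x", "y", "z"], ["t", "y"], ["x", "p", "t"], ["x", "p"]],
})

# one row per (status, feature) pair, then count the pairs
out = (df.explode("FEATURES")
         .groupby(["STATUS", "FEATURES"])
         .size()
         .reset_index(name="FREQUENCY"))
print(out)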

how to iterate a Series with multiindex in pandas

I am a beginner with pandas, and I want to implement a decision tree algorithm with it. First, I read the test data into a pandas.DataFrame, which looks like this:
In [4]: df = pd.read_csv('test.txt', sep = '\t')
In [5]: df
Out[5]:
  Chocolate Vanilla Strawberry Peanut
0         Y       N          Y      Y
1         N       Y          Y      N
2         N       N          N      N
3         Y       Y          Y      Y
4         Y       Y          N      Y
5         N       N          N      N
6         Y       Y          Y      Y
7         N       Y          N      N
8         Y       N          Y      N
9         Y       N          Y      Y
Then I group by 'Peanut' and 'Chocolate', and what I get is:
In [15]: df2 = df.groupby(['Peanut', 'Chocolate'])
In [16]: serie1 = df2.size()
In [17]: serie1
Out[17]:
Peanut  Chocolate
N       N            4
        Y            1
Y       Y            5
dtype: int64
Now, the type of serie1 is Series. I can access the values of serie1, but I cannot get the values of 'Peanut' and 'Chocolate'. How can I get the counts in serie1 and the values of 'Peanut' and 'Chocolate' at the same time?
You can use the index:
>>> serie1.index
MultiIndex(levels=[[u'N', u'Y'], [u'N', u'Y']],
           labels=[[0, 0, 1], [0, 1, 1]],
           names=[u'Peanut', u'Chocolate'])
You can obtain the values of the column names and the levels. Note that the labels refer to positions within levels (in modern pandas, MultiIndex.labels has been renamed to codes). So, for example, for 'Peanut' the first label is levels[0][labels[0][0]], which is 'N'. The last label of 'Chocolate' is levels[1][labels[1][2]], which is 'Y'.
I created a small example which loops through the indexes and prints all data:
#loop the rows
for i in range(len(serie1)):
print "Row",i,"Value",serie1.iloc[i],
#loop the columns
for j in range(len(serie1.index.names)):
print "Column",serie1.index.names[j],"Value",serie1.index.levels[j][serie1.index.labels[j][i]],
print
Which results in:
Row 0 Value 4 Column Peanut Value N Column Chocolate Value N
Row 1 Value 1 Column Peanut Value N Column Chocolate Value Y
Row 2 Value 5 Column Peanut Value Y Column Chocolate Value Y
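In modern pandas, a simpler way is Series.items(), which yields each index tuple together with its value (a small sketch rebuilding the grouped sizes):

import pandas as pd

serie1 = pd.Series(
    [4, 1, 5],
    index=pd.MultiIndex.from_tuples(
        [("N", "N"), ("N", "Y"), ("Y", "Y")],
        names=["Peanut", "Chocolate"],
    ),
)

# each iteration yields ((peanut, chocolate), count)
for (peanut, chocolate), count in serie1.items():
    print(f"Peanut={peanut} Chocolate={chocolate} Count={count}")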