I am a beginner with pandas, and I want to implement a decision tree algorithm with it. First, I read the test data into a pandas.DataFrame, which looks like this:
In [4]: df = pd.read_csv('test.txt', sep='\t')
In [5]: df
Out[5]:
  Chocolate Vanilla Strawberry Peanut
0         Y       N          Y      Y
1         N       Y          Y      N
2         N       N          N      N
3         Y       Y          Y      Y
4         Y       Y          N      Y
5         N       N          N      N
6         Y       Y          Y      Y
7         N       Y          N      N
8         Y       N          Y      N
9         Y       N          Y      Y
Then I group by 'Peanut' and 'Chocolate', and what I get is:
In [15]: df2 = df.groupby(['Peanut', 'Chocolate'])
In [16]: serie1 = df2.size()
In [17]: serie1
Out[17]:
Peanut  Chocolate
N       N            4
        Y            1
Y       Y            5
dtype: int64
Now, serie1 is of type Series. I can access the values of serie1, but I cannot get the values of 'Peanut' and 'Chocolate'. How can I get the counts in serie1 and the values of 'Peanut' and 'Chocolate' at the same time?
You can use the index:
>>> serie1.index
MultiIndex(levels=[[u'N', u'Y'], [u'N', u'Y']],
           labels=[[0, 0, 1], [0, 1, 1]],
           names=[u'Peanut', u'Chocolate'])
From the index you can obtain the level names and the level values. Note that each entry in labels is a position into the corresponding levels array. So, for example, the first value of 'Peanut' is levels[0][labels[0][0]], which is 'N'. The last value of 'Chocolate' is levels[1][labels[1][2]], which is 'Y'.
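For example, checked interactively (labels is the older attribute name shown in the repr above; newer pandas versions expose the same data as MultiIndex.codes):

>>> serie1.index.levels[0][serie1.index.labels[0][0]]
'N'
>>> serie1.index.levels[1][serie1.index.labels[1][2]]
'Y'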
I created a small example which loops through the indexes and prints all data:
# loop over the rows
for i in range(len(serie1)):
    print "Row", i, "Value", serie1.iloc[i],
    # loop over the columns (the index levels)
    for j in range(len(serie1.index.names)):
        print "Column", serie1.index.names[j], "Value", serie1.index.levels[j][serie1.index.labels[j][i]],
    print
Which results in:
Row 0 Value 4 Column Peanut Value N Column Chocolate Value N
Row 1 Value 1 Column Peanut Value N Column Chocolate Value Y
Row 2 Value 5 Column Peanut Value Y Column Chocolate Value Y
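On a recent pandas version there is a simpler alternative (my sketch, not part of the original answer): iterate the Series directly, since each MultiIndex entry arrives as a tuple:

# each index entry is a (Peanut, Chocolate) tuple; the value is the group size
for (peanut, chocolate), count in serie1.items():
    print("Peanut:", peanut, "Chocolate:", chocolate, "Count:", count)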
I have a dataframe that looks like this:
   a  b  c
0  x  x  x
1  y  y  y
2  z  z  z
I would like to apply a function to each row of the dataframe. That function creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe should look something like this:
   a  b  c
0  x  x  x
1  x  x  x
2  y  y  y
3  y  y  y
4  y  y  y
5  z  z  z
6  z  z  z
So I did:
df_expanded = df.apply(my_func, axis=1)
I set breakpoints inside the function, and for each row the dataframe created by my_func is correct. However, when the last row returns, I get this error:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply were trying to return a Series rather than the collection of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
which just creates single-row groups and applies the same function. This, on the other hand, works.
Why?
The short answer: DataFrame.apply with axis=1 expects each call to produce a scalar or a Series that it can stitch back into one result, so returning DataFrames of varying lengths breaks that reassembly, while GroupBy.apply is designed to concatenate whatever DataFrames the function returns. As an alternative, perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
   a  b  c
0  1  1  4
1  2  2  4
2  3  3  4
Doing:
new_df = (df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
   a  b  c
0  1  1  4
1  1  1  4
2  1  1  4
3  2  2  4
4  2  2  4
5  3  3  4
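A possibly simpler variant (my sketch, assuming the same integer columns) repeats each row label directly with index.repeat, which avoids building Python lists row by row:

# repeat each row (c - a) times by repeating its index label
new_df = df.loc[df.index.repeat(df.c - df.a)].reset_index(drop=True)
print(new_df)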
I want to count the number of occurrences of one specific value (string) in one column and write it down in another column cumulatively.
For example, counting the cumulative number of Y values here:
col_1  new_col
Y      1
Y      2
N      2
Y      3
N      3
I wrote this code, but it gives me the final count instead of the cumulative counts:
df['new_col'] = 0
df['new_col'] = df.loc[df.col_1 == 'Y'].count()
To count both values cumulatively you can use:
df['new_col'] = (df
    .groupby('col_1')
    .cumcount().add(1)
    .cummax()
)
If you want to focus on 'Y':
df['new_col'] = (df
    .groupby('col_1')
    .cumcount().add(1)
    .where(df['col_1'].eq('Y'))
    .ffill()
    .fillna(0, downcast='infer')
)
Output:
  col_1  new_col
0     Y        1
1     Y        2
2     N        2
3     Y        3
4     N        3
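If only the cumulative count of 'Y' matters, a boolean cumsum is an even shorter sketch (True sums as 1, so the running sum is the running count of 'Y'):

df['new_col'] = df['col_1'].eq('Y').cumsum()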
I have this dataframe:
player_id  scout_occ  round scout
   812842          2      1     X
   812842          4      1     Y
   812842          1      1     Z
   812842          1      2     X
   812842          2      2     Y
   812842          2      2     Z
And I need to pivot the 'scout' values into columns, using the number of occurrences as the values of these new columns, ending up with:
player_id  round  X  Y  Z
   812842      1  2  4  1
   812842      2  1  2  2
How do I achieve this?
Use pivot_table. For example:
df = df.pivot_table(values='scout_occ', index=['player_id', 'round'], columns='scout')
Then, if you don't want the columns' name (scout):
df.columns.name = None
Also, if you want player_id and round as columns rather than as the index, assign the result back (reset_index does not modify in place):
df = df.reset_index()
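Putting it all together, a minimal end-to-end sketch with the data from the question (note that pivot_table aggregates with mean by default, so the counts may come back as floats):

import pandas as pd

df = pd.DataFrame({
    'player_id': [812842] * 6,
    'scout_occ': [2, 4, 1, 1, 2, 2],
    'round':     [1, 1, 1, 2, 2, 2],
    'scout':     ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
})

out = df.pivot_table(values='scout_occ', index=['player_id', 'round'], columns='scout')
out.columns.name = None   # drop the 'scout' columns name
out = out.reset_index()   # make player_id and round regular columns
print(out)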
I have this dataframe df:
   col
0  a,b
1    b
2    c
3  b,c
Goal: map a→x, b→y, c→z, so that df becomes:
   col
0  x,y
1    y
2    z
3  y,z
I tried:
df['col'] = df['col'].replace({'a':'x','b':'y','c':'z'})
but it only works when a cell holds a single value; cells with multiple values such as a,b are left unchanged.
Or you could try using this piece of code:
>>> df['col'].str.split(',', expand=True).fillna('').replace({'a':'x','b':'y','c':'z'}).apply(','.join, axis=1).str.rstrip(',')
0    x,y
1      y
2      z
3    y,z
dtype: object
Add the parameter regex=True for substring replacement:
df['col'] = df['col'].replace({'a':'x','b':'y','c':'z'}, regex=True)
print(df)

   col
0  x,y
1    y
2    z
3  y,z
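One caveat worth adding: with regex=True the dictionary keys are interpreted as regular expressions, so keys containing metacharacters should be escaped first, for example:

import re

mapping = {'a.b': 'x'}  # hypothetical key containing a regex metacharacter
safe = {re.escape(k): v for k, v in mapping.items()}
df['col'] = df['col'].replace(safe, regex=True)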
Another idea: use dict.get to replace each of the split values, falling back to the original value when there is no match, then join back with ,:
d = {'a':'x','b':'y','c':'z'}
df['col'] = df['col'].apply(lambda x: ','.join(d.get(y, y) for y in x.split(',')))
print(df)

   col
0  x,y
1    y
2    z
3  y,z
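Yet another option (my sketch, assuming pandas >= 1.2 for the explicit regex=True) is a callable replacement with str.replace, which pandas forwards to re.sub:

d = {'a': 'x', 'b': 'y', 'c': 'z'}
# replace each mapped character in one regex pass per cell
df['col'] = df['col'].str.replace(r'[abc]', lambda m: d[m.group()], regex=True)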
I have the following DataFrame with a MultiIndex of Date and ID:
              V1 V2
Date       ID
01.01.2010 1   x  y
01.01.2010 2   x  y
02.01.2010 1   x  y
...
I was able to select a date range with:
df.loc[slice(start, end)]
But I also need to filter the data based on a list of IDs, for example:
allowed = [1, 5]
df.loc[allowed]
How is this possible?
Use index.isin to check for membership along the index. With a MultiIndex, explicitly supply the level name or level position on which the check should be performed.
df.loc[df.index.isin(allowed, level='ID')] # level name is specified
Or:
df.loc[df.index.isin(allowed, level=1)] # level position is specified
You can also use .query() method:
In [62]: df
Out[62]:
              V1 V2
Date       ID
2010-01-01 1   x  y
           2   x  y
2010-02-01 1   x  y
In [63]: allowed = [1, 5]
In [64]: df.query("ID in @allowed")
Out[64]:
              V1 V2
Date       ID
2010-01-01 1   x  y
2010-02-01 1   x  y
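To satisfy both requirements from the question at once, here is a sketch with pd.IndexSlice (assuming hypothetical start and end dates, and sorting the index first so that slicing works):

import pandas as pd

idx = pd.IndexSlice
df = df.sort_index()  # slicing a MultiIndex requires a sorted index
result = df.loc[idx['2010-01-01':'2010-02-01', allowed], :]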