calculate the mean of one row according to its label - pandas

calculate the mean of the values in one row according to its label:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to apply it here. Thanks

This is a simple groupby problem:
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
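If you prefer not to reassign Result, the same computation can be written as a single expression against the original (untransposed) frame from the question; this is a minor style variation, not part of the answer above:
Result.T.groupby(1)[0].mean()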

Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index were unique, you wouldn't have to do any groupby; just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note the duplicate rows (which happen to have the same mean because I lazily reused the same row contents); in general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.

You're making it difficult for yourself by constructing the dataframe so that the values you want to take the mean of and the values you want as labels end up in different rows.
Option 1
groupby
This deals with the data as presented in the dataframe Result:
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill, using np.bincount. Because your grouping values are 0 and 1 this stays simple; there would still be a solution if they weren't. I wanted to use the raw lists A and B:
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
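To spell out why this works (a small aside, using the same raw lists): np.bincount(B, weights=A) gives the per-label sums and np.bincount(B) gives the per-label counts, so their ratio is the per-label mean.
import numpy as np
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
sums = np.bincount(B, weights=A)   # per-label sums:   [15., 40.]
counts = np.bincount(B)            # per-label counts: [5, 5]
print(sums / counts)               # [3. 8.]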
Option 3
Construct a Series instead of a DataFrame, again using the raw lists A and B:
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64
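An aside not from the original answer: pd.Series(A, B) uses B as the index, and the level= keyword of Series.mean() was deprecated in pandas 1.3 and removed in 2.0, so a forward-compatible spelling is an explicit groupby on the index:
pd.Series(A, index=B).groupby(level=0).mean()
# 0    3.0
# 1    8.0
# dtype: float64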

Why use to_frame before reset_index?

Using a data set like this one
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because Series.reset_index includes a name parameter and returns a DataFrame, while DataFrame.reset_index does not include the name parameter.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in this commit on the 27th of January 2012.
Series.to_frame was added in this commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(). Both approaches can be used to achieve the same result, and it is common in pandas to have multiple ways of solving a problem. The only advantage I can think of is that, for larger sets of data, it may be more convenient to have a dataframe view first before resetting the index. If we take your dataframe as an example, you will find that to_frame() displays a dataframe view that may be useful for understanding the data as a neat dataframe table rather than a count series. Also, using to_frame() makes the intent clearer to a new user who looks at your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a DataFrame. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
And now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We will use the same count() series but name it test2 to differentiate between the two approaches; in other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent; in the second approach we just had to use reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
The second case has less code but is less readable to new eyes, and I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".
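For completeness, a third spelling (a sketch, not from either answer) keeps the grouping key as a column from the start via as_index=False, so no reset_index is needed at all and only the rename remains:
testdf3 = (df.groupby('user_id', as_index=False)['module_id']
             .count()
             .rename(columns={'module_id': 'count'}))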

apply() function to generate new value in a new column

I am new to python 3 and pandas. I tried to add a new column into a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
    new = pd.read_csv(df)
print(new)
y = new.copy()
y.loc[:, "d"] = 0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"] = y["d"].apply(lambda x: y["a"] - y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
You need to call apply on the DataFrame y itself, using DataFrame.apply with axis=1 to process row by row:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For better debugging, it is possible to create a custom function:
def f(x):
    print(x)
    a = x["a"] - x["b"]
    return a

y["d"] = y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1
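As a small style variation (an aside, not part of the answer above), the same vectorized subtraction can be written with assign, which returns a new dataframe instead of mutating y in place:
y = y.assign(d=y["a"] - y["b"])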

Combining two dataframes along Index in Pandas

I have two Series pin1 and pin2 of different sizes with different indexes, i.e. pin1 has index values ['2', '4'] and pin2 has ['0', '1', '7']. I would like to combine both along the index to form ['0','1','2','4','7']. I tried merging using an 'outer' join, but it changes the index values to ['0','1','2','3','4'].
In [1]: pin1= pd.Series(np.random.randn(2), index=['2', '4'])
In [2]: pin2= pd.Series(np.random.randn(3), index=['0', '1', '7'])
In [3]: pin3=pd.merge(pin1,pin2,how='outer')
In [4]: pin3
Out [4]:
0 0.2941
1 0.2869
2 1.7098
3 -0.2126
4 0.2696
expected output:
Out [4]:
0 0.2941
1 0.2869
2 1.7098
4 -0.2126
7 0.2696
If the sets of indices are disjoint, you can use pd.concat:
pd.concat([pin1, pin2]).sort_index()
Using combine_first
In [3732]: pin1.combine_first(pin2)
Out[3732]:
0 -0.820341
1 0.492719
2 -0.785723
4 -1.815021
7 2.027267
dtype: float64
Or, append
In [3734]: pin1.append(pin2).sort_index()
Out[3734]:
0 -0.820341
1 0.492719
2 -0.785723
4 -1.815021
7 2.027267
dtype: float64
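A compatibility caveat (not part of the original answer): Series.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the concat spelling is the one to prefer:
pd.concat([pin1, pin2]).sort_index()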
Details
In [3735]: pin1
Out[3735]:
2 -0.785723
4 -1.815021
dtype: float64
In [3736]: pin2
Out[3736]:
0 -0.820341
1 0.492719
7 2.027267
dtype: float64
Or using align
pin1.align(pin2,join='outer')[0].fillna(pin1.align(pin2,join='outer')[1])
Out[991]:
0 -0.278627
1 0.009388
2 -0.655377
4 0.564739
7 0.793576
dtype: float64
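The align expression above calls align twice; a slightly clearer variant (same behaviour, just a style suggestion) unpacks the aligned pair first:
left, right = pin1.align(pin2, join='outer')
left.fillna(right)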

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether each column in a dataframe is an integer or not, and if it is an integer, multiply it by 10.
import numpy as np
import pandas as pd
df = pd.DataFrame(....)
#function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could supply your own specific list of dtypes to check via include=[..., np.int64, ..., etc.].
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See the NumPy documentation for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6
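A related spelling (a sketch that combines this with the select_dtypes answer above, and assumes the demo df) uses the 'integer' alias so you never compare dtype objects yourself, while still skipping the float and object columns:
int_cols = df.select_dtypes(include='integer').columns
df[int_cols] = df[int_cols] * 10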

How to unite the results of describe() on several dataframe columns into one dataframe?

I am applying describe() to several columns of my dataframe, for example:
raw_data.groupby("user_id").size().describe()
raw_data.groupby("business_id").size().describe()
And several more, because I want to find out how many data points there are per user on average/median/etc.
My question is, each of those calls returns what seems to be unstructured output. Is there an easy way to combine them all into a single new dataframe whose columns will be [count, mean, std, min, 25%, 50%, 75%, max] and whose index will be the various columns described?
Thanks!
I might simply build a new DataFrame manually. If you have
>>> raw_data
user_id business_id data
0 10 1 5
1 20 10 6
2 20 100 7
3 30 100 8
Then the results of groupby(smth).size().describe() are just another Series:
>>> raw_data.groupby("user_id").size().describe()
count 3.000000
mean 1.333333
std 0.577350
min 1.000000
25% 1.000000
50% 1.000000
75% 1.500000
max 2.000000
dtype: float64
>>> type(_)
<class 'pandas.core.series.Series'>
and so:
>>> descrs = ((col, raw_data.groupby(col).size().describe()) for col in raw_data)
>>> pd.DataFrame.from_items(descrs).T
count mean std min 25% 50% 75% max
user_id 3 1.333333 0.57735 1 1 1 1.5 2
business_id 3 1.333333 0.57735 1 1 1 1.5 2
data 4 1.000000 0.00000 1 1 1 1.0 1
Instead of from_items I could have passed a dictionary, e.g.
pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T, but this way the column order is preserved without having to think about it.
If you don't want all the columns, instead of for col in raw_data, you could define columns_to_describe = ["user_id", "business_id"] etc and use for col in columns_to_describe, or use for col in raw_data if col.endswith("_id"), or whatever you like.
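A compatibility note (not part of the original answer): DataFrame.from_items was deprecated in pandas 0.23 and removed in 1.0. Since Python 3.7 plain dicts preserve insertion order, so the dictionary version mentioned above keeps the column order too and works on current pandas:
pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T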