how to create a new column after groupby count? - pandas

I am trying to create a new column with groupby and count.
However, it throws the error "incompatible index of inserted column with frame index":
import pandas as pd
# read the tab-separated file
df1 = pd.read_csv('hi.txt', sep='\t')
a = df1.groupby(['c', 't']).count()
df1['difference'] = a
print(df1)
Input:
coun id cat
A 12 90
U 13 91

Use:
new_df = df.groupby(['category', 'country'], sort=False).country.count().to_frame('count').reset_index()
print(new_df)
category country count
0 9910 AUS 2
1 7310 NZL 1
2 9910 NZL 1
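A sketch of why the error appears and one way around it: groupby(...).count() is indexed by the group keys, so it cannot be inserted straight into df1; transform('count') instead returns a result aligned with df1's original row index. The column values below are assumed stand-ins for the question's hi.txt, keeping the 'c' and 't' names from its groupby call:

```python
import pandas as pd

# hypothetical stand-in for the question's hi.txt
df1 = pd.DataFrame({"c": ["A", "A", "U"], "t": [1, 1, 2], "id": [12, 13, 14]})

# transform('count') is aligned with df1's original index,
# so it can be assigned directly without an index clash
df1["difference"] = df1.groupby(["c", "t"])["id"].transform("count")
print(df1)
```

Each row receives the size of its own (c, t) group, which is usually what "a new column after groupby count" is after.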

Related

find value of column based on other column's condition

I would like to find a value in column A based on column B's condition. For example, find the ts of the first 'B' in the value column:
import pandas as pd
data = [[10,"A"],[20,"A"],[30,"B"],[40,"B"]]
df = pd.DataFrame(data,columns=['ts','value'])
print(df)
ts value
0 10 A
1 20 A
2 30 B
3 40 B
I would like to print out 30 for this example!
You can do that with a boolean slice:
df.loc[df['value'].eq('B'),'ts'].iloc[0]
Out[163]: 30
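As an alternative sketch, idxmax on the boolean mask gives the label of the first matching row:

```python
import pandas as pd

data = [[10, "A"], [20, "A"], [30, "B"], [40, "B"]]
df = pd.DataFrame(data, columns=["ts", "value"])

# idxmax returns the index label of the first True in the mask
# (caveat: if no 'B' exists it silently returns the first label,
# so guard with .any() when that case is possible)
first_b_ts = df.loc[df["value"].eq("B").idxmax(), "ts"]
print(first_b_ts)  # 30
```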

How to make pandas work for cross multiplication

I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want the output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column will be computed as (pseudocode, not runnable code):
df1["w_ave"][1] = df3["w_ave"]["v"] + df1["a"][1]*df2["a"]["q"] + df1["b"][1]*df2["b"]["q"] + df1["c"][1]*df2["c"]["q"]
so output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3), where
df3["w_ave"]["v"] = 2
df1["a"][1] = 1, df2["a"]["q"] = 1;
df1["b"][1] = 5, df2["b"]["q"] = 2;
df1["c"][1] = 1, df2["c"]["q"] = 3;
Which means:
- a new column will be added to df1 for each column name of df3.
- for each row of df1, the values of a, b, c will be multiplied by the same-named values in df2's "q" row and summed together with the corresponding value from df3.
- only the columns of df1 whose names match a column of df2 are multiplied; non-matching columns, like df1["k"], are left out.
- however, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain as well. My attempts are very silly. I know this attempt will not work, but I have added it:
import pandas as pd
import numpy as np

data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 = pd.read_csv(folder + data1)
df2 = pd.read_csv(folder + data2)
df3 = pd.read_csv(folder + data3)
df1 = df2 * df1
OK, so this will in no way resemble your desired output, but here is a vectorization of the formula you provided:
df2 = df2.set_index("name")
df3 = df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"] + df1["a"].mul(df2.loc["q", "a"]) + df1["b"].mul(df2.loc["q", "b"]) + df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
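A fuller sketch that reproduces the desired output, under two assumptions read from the question: df2's "q" row and df3's "v" row are the intended operands (as in the worked formula), and the zero rule applies only to w_ave (which is what the desired output implies, since vac and yak stay nonzero when a is 0):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "k": [2, 3, 6, 1, 1],
                    "a": [1, 0, 1, 0, 1], "b": [5, 1, 1, 5, 5],
                    "c": [1, 0, 0, 0, 0]})
df2 = pd.DataFrame({"name": ["p", "q"], "a": [4, 1],
                    "b": [6, 2], "c": [8, 3]}).set_index("name")
df3 = pd.DataFrame({"type": ["n", "v"], "w_ave": [3, 2],
                    "vac": [5, 1], "yak": [6, 4]}).set_index("type")

# Row-wise dot product of (a, b, c) with df2's "q" row; mul aligns
# the Series on column names, then sum(axis=1) collapses each row
weighted = df1[["a", "b", "c"]].mul(df2.loc["q"]).sum(axis=1)
for col in df3.columns:
    df1[col] = df3.loc["v", col] + weighted
# zero out only w_ave where df1["a"] is 0, per the desired output
df1.loc[df1["a"].eq(0), "w_ave"] = 0
print(df1)
```

This avoids per-column repetition for the a/b/c sum, and each df3 column differs only by its additive offset.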

add "all" row to pandas group by

This is my code (using pandas 0.19.2):
from io import StringIO
import pandas as pd
data = StringIO("""category,region,sales
fruits,east,12
vegatables,east,3
fruits,west,5
vegatables,west,7
""")
df = pd.read_csv(data)
print(df.groupby('category', as_index=False).agg({'sales': sum}))
This is the output:
category sales
0 fruits 17
1 vegatables 10
My question is: how do I add an 'all' row so the output would look like this:
category sales
0 fruits 17
1 vegatables 10
all 27
You can try pivot_table and then alter the new data:
new_df = df.pivot_table(columns='category',index='region', values='sales')
new_df['all'] = new_df.sum(1)
Output:
category fruits vegatables all
region
east 12 3 15
west 5 7 12
And if you want your original data:
new_df.stack().to_frame(name='Sales').reset_index()
Output:
region category Sales
0 east fruits 12
1 east vegatables 3
2 east all 15
3 west fruits 5
4 west vegatables 7
5 west all 12
Here is what I ended up doing:
from io import StringIO
import pandas as pd
data = StringIO("""category,region,sales
fruits,east,12
vegatables,east,3
fruits,west,5
vegatables,west,7
""")
df = pd.read_csv(data)
body = df.groupby('category', as_index=False).agg({'sales': sum})
head = df.groupby(lambda x: True, as_index=False)  # advanced pandas trickery
head = head.agg({'sales': sum})
head.insert(0, 'category', '*all*')
print(body.append(head))
Basically, create another dataframe with the 'all' row and concatenate it on.
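A minimal sketch of that concat approach; note that DataFrame.append was removed in pandas 2.0, so building the one-row "all" frame and concatenating is the durable spelling:

```python
from io import StringIO
import pandas as pd

data = StringIO("""category,region,sales
fruits,east,12
vegatables,east,3
fruits,west,5
vegatables,west,7
""")
df = pd.read_csv(data)

body = df.groupby("category", as_index=False)["sales"].sum()
# one-row frame holding the grand total, then stitch it underneath
total = pd.DataFrame({"category": ["all"], "sales": [df["sales"].sum()]})
out = pd.concat([body, total], ignore_index=True)
print(out)
```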

Issue looping through dataframes in Pandas

I have a dict 'd' set up which maps names to dataframes. E.g.:
d["DataFrame1"]
Will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
    df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.
Iterating over a dict yields its keys (strings), which is why df['Names'] fails with a TypeError. For example, say you have df as follows:
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
And we create a dict
d=dict()
d['df']=df
d['df1']=df1
Then preset an empty dataframe:
yourdf=pd.DataFrame()
Using items() with a for loop:
for key, val in d.items():
    yourdf[key] = val['Name']
yields:
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1
You can use reduce and concatenate all of the columns named 'Name' in your dictionary of dataframes.
Sample Data
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=1),d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or on top of one another using axis=0:
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=0),d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess
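As another sketch of the same idea, pd.concat also accepts a dict of Series directly, with the dict keys becoming the column names, so neither reduce nor a manual loop is needed:

```python
import pandas as pd

d = {"df1": pd.DataFrame({"ID": [0, 1, 2], "Name": ["John", "Sam", "Andy"]}),
     "df2": pd.DataFrame({"ID": [3, 4, 5], "Name": ["Jen", "Cara", "Jess"]})}

# dict keys become the column labels of the combined frame
wide = pd.concat({key: frame["Name"] for key, frame in d.items()}, axis=1)
print(wide)
```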

How to check dependency of one column to another in a pandas dataframe

I have the following dataframe:
import pandas as pd
df = pd.DataFrame([[1,11,'a'],[1,12,'a'],[1,11,'a'],
                   [1,12,'a'],[1,7,'a'],[1,12,'a']])
df.columns = ['id','code','name']
df
id code name
0 1 11 a
1 1 12 a
2 1 11 a
3 1 12 a
4 1 7 a
5 1 12 a
As shown in the above dataframe, the value of column "id" is directly related to the value of column "name". If I have, say, a million records, how can I know that a column is totally dependent on another column in a dataframe?
If they are totally dependent, then their factorizations will be the same:
(df.id.factorize()[0] == df.name.factorize()[0]).all()
True
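One caveat worth sketching: the factorize comparison tests a two-way (1:1) correspondence between the columns. If you only need the one-way dependency "id determines name", a groupby-based check that each id maps to exactly one distinct name is an alternative:

```python
import pandas as pd

df = pd.DataFrame([[1, 11, "a"], [1, 12, "a"], [1, 11, "a"],
                   [1, 12, "a"], [1, 7, "a"], [1, 12, "a"]],
                  columns=["id", "code", "name"])

# one-way functional dependency: every id has exactly one name
one_way = df.groupby("id")["name"].nunique().eq(1).all()
print(one_way)  # True
```

On the question's data both checks agree; they differ when two ids share one name, which passes the one-way check but fails the factorize comparison.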