Why is the column name missing from the pandas groupby result? - pandas

Update
If I use to_frame(), the column name is not printed on the same row as the index name:
重量
型号
HG-R2075 2040
HG220 680
This is my code. It groups by "型号" (model), sums "重量" (weight), and keeps only the rows where the column "是否已经发送" (already sent) is empty.
import pandas as pd
import numpy as np
import sys
import os
script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir ) # change to the path that you already know
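try:
    ClientName = sys.argv[1]
except IndexError:
    # message: "No client name given, or the client name is wrong!"
    print(u'没有输入或者错误的客户名称!')
    sys.exit(1)  # stop here, otherwise ClientName is undefined below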
df = pd.read_excel("Summary.xlsm")
df = df[df['客户'].str.contains(ClientName)][pd.isnull(df[u"是否已经发送"])].groupby([ u'型号'])[u'重量'].sum()
print('[CQ:face,id=21] ' + '*' * 10 + u'以下是' + ClientName + u'未发送的重量' + '*' * 10 + '[CQ:face,id=21]')
print(str(df))
The output is:
[CQ:face,id=21] **********以下是KATUN未发送的重量**********[CQ:face,id=21]
型号 (****the column name is missing here*****)
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64
I don't know why the column name is missing.
This is the output I want. How can I get it?
型号 重量
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64

The result df of your groupby operation is actually a Series, not a DataFrame. That's why it is printed with a different format.
print(df.to_frame()) should do the trick.
EDIT: Actually, in such a DataFrame the index name and the column name are not printed on the same row. For cleaner output, use reset_index to get two proper columns:
print(df.reset_index().to_string(index=False))
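A minimal sketch of the difference, using made-up data (the English column names model and weight are stand-ins for the question's columns, not taken from the real spreadsheet):

```python
import pandas as pd

# Hypothetical data; 'model' and 'weight' stand in for the question's columns.
df = pd.DataFrame({
    "model": ["HG-R2075", "HG220", "HG-R2075"],
    "weight": [1000, 680, 1040],
})

# Selecting a single column after groupby yields a Series:
# 'model' becomes the index name and 'weight' the Series name.
totals = df.groupby("model")["weight"].sum()
print(type(totals).__name__)

# reset_index() turns the index back into an ordinary column,
# so both headers end up on the same row when printed.
table = totals.reset_index()
print(table.to_string(index=False))
```

The Series only carries its name in the trailing "Name: ..." line, which is why no header appears above the values in the question's output.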

First, combine the two conditions into a single boolean mask with &.
If you need a two-column DataFrame, add as_index=False or call Series.reset_index:
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
Or:
df = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
For a one-column DataFrame use Series.to_frame; the group keys stay in the index:
df = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
Sample:
np.random.seed(345)
N = 10
df = pd.DataFrame({'客户':np.random.choice(list('abc'), size=N),
u"是否已经发送":np.random.choice([np.nan,0], size=N),
u'型号':np.random.randint(2, size=N),
u'重量':np.random.randint(10, size=N)})
print (df)
型号 客户 是否已经发送 重量
0 0 a 0.0 4
1 0 a 0.0 0
2 1 b NaN 8
3 1 b NaN 5
4 1 c 0.0 6
5 1 a NaN 3
6 1 a NaN 3
7 1 b 0.0 4
8 0 a NaN 2
9 1 c NaN 8
ClientName = 'a'
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df1 = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
print(df1)
型号 重量
0 0 2
1 1 6
df1 = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
print(df1)
型号 重量
0 0 2
1 1 6
df2 = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
print (df2)
重量
型号
0 2
1 6

Related

Pandas - 'Series' object has no attribute 'Columns' [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add an extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Another possible solution, though I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need a new column, use:
df3['new'] = df3.index
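To illustrate why the bare df3.reset_index() call in the question had no effect, here is a small sketch with made-up hourly data (the index name time and the column value are assumptions). reset_index returns a new DataFrame and leaves the original untouched unless the result is assigned:

```python
import pandas as pd

# Hypothetical hourly data with a DatetimeIndex named 'time'.
idx = pd.DatetimeIndex(
    ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"],
    name="time",
)
df3 = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)

df3.reset_index()                 # result is discarded, df3 is unchanged
print("time" in df3.columns)      # False

flat = df3.reset_index()          # assign the result to keep it
print(list(flat.columns))         # ['time', 'value']

# Alternatively, keep the index and add a copy of it as a column.
df3["Time"] = df3.index
print(list(df3.columns))          # ['value', 'Time']
```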
I think you can simplify the read_csv call:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #parse_dates needs a list, not a string; if it doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the Index or MultiIndex comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can access the index directly and plot it. For example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

df.diff(): how to compare each row of A with the previous row of B

If we have two columns A and B,
how can we compare each row of A with the previous row of B?
A B diff
0 0.904560 0.208318 0
1 0.679290 0.747496 0
2 0.069841 0.165834 0
3 0.045818 0.907888 0
4 0.485712 0.593785 0
5 0.771665 0.800182 0
6 0.485041 0.024829 0
7 0.897172 0.584406 0
8 0.561953 0.626699 0
9 0.412803 0.900643 0
You can use Pandas' shift function to create a 'lagged' version of column B. Then it's a simple difference between columns.
from io import StringIO
import pandas as pd
raw = '''
A,B
0.904560,0.208318
0.679290,0.747496
0.069841,0.165834
0.045818,0.907888
0.485712,0.593785
0.771665,0.800182
0.485041,0.024829
0.897172,0.584406
0.561953,0.626699
0.412803,0.900643
'''.strip()
df = pd.read_csv(StringIO(raw))
df['B_lag'] = df.B.shift(1)
df['diff'] = df.A - df.B_lag
print(df)
The output looks like:
A B B_lag diff
0 0.904560 0.208318 NaN NaN
1 0.679290 0.747496 0.208318 0.470972
2 0.069841 0.165834 0.747496 -0.677655
3 0.045818 0.907888 0.165834 -0.120016
4 0.485712 0.593785 0.907888 -0.422176
5 0.771665 0.800182 0.593785 0.177880
6 0.485041 0.024829 0.800182 -0.315141
7 0.897172 0.584406 0.024829 0.872343
8 0.561953 0.626699 0.584406 -0.022453
9 0.412803 0.900643 0.626699 -0.213896
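The helper column is optional. A compact variant, sketched here on a three-row excerpt of the same data, subtracts the shifted column directly. Note that df.diff() on its own only compares a column with its own previous row, so it does not cover this cross-column case.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.904560, 0.679290, 0.069841],
    "B": [0.208318, 0.747496, 0.165834],
})

# shift() moves B down by one row, so row i of A is compared with row i-1 of B.
df["diff"] = df["A"] - df["B"].shift()

# The first row has no previous B value, so its diff is NaN.
print(df.round(6))
```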

Replace cell values in df based on complex condition

Hello friends,
I would like to iterate through all the numeric columns in the df (in a generic way).
For each unique df["Type"] group in each numeric column:
replace all values that are greater than the column mean + 2 standard deviations with NaN.
d = {'PRODUCT': ['A', 'B', 'C', 'A', 'B', 'C']}  # reconstructed from the sample df below
df = pd.DataFrame(data=d)
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
Sample df:
PRODUCT Test1 Test2 Type
A 7 99 Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C 90 87 X
Expected output:
PRODUCT Test1 Test2 Type
A 7 nan Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C nan nan X
Logically, it can go like this:
test_cols = ['Test1', 'Test2']
# calculate mean and std with groupby
groups = df.groupby('Type')
test_mean = groups[test_cols].transform('mean')
test_std = groups[test_cols].transform('std')
# threshold
thresh = test_mean + 2 * test_std
# thresholding
df[test_cols] = np.where(df[test_cols]>thresh, np.nan, df[test_cols])
However, from your sample data set, thresh is:
Test1 Test2
0 10.443434 141.707912
1 133.195890 123.898159
2 133.195890 123.898159
3 10.443434 141.707912
4 10.443434 141.707912
5 133.195890 123.898159
So, it wouldn't change anything.
You can get this through a groupby and transform:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['Product'] = ['A', 'B', 'C', 'A', 'B', 'C']
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
df = df.set_index('Product')
def nan_out_values(type_df):
    # keep values at or below the group's mean + 2*std; everything else becomes NaN
    return type_df.where(type_df <= type_df.mean() + 2*type_df.std())
df[['Test1', 'Test2']] = df.groupby('Type').transform(nan_out_values)
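With only three rows per group, as in the sample, no value can lie more than two sample standard deviations above its group mean (the largest possible z-score for n points is (n-1)/sqrt(n), about 1.15 for n = 3), which is why nothing gets replaced there. The sketch below uses made-up data with ten rows per group so that the rule actually triggers; the values are purely illustrative.

```python
import pandas as pd

# Hypothetical data: ten rows per Type, each group with one obvious outlier.
df = pd.DataFrame({
    "Type": ["X"] * 10 + ["Y"] * 10,
    "Test1": [1] * 9 + [100] + [5] * 9 + [500],
})

# Generic: pick up every numeric column instead of naming them.
num_cols = df.select_dtypes("number").columns.tolist()

# Per-group mean and std, broadcast back to the original row order.
grouped = df.groupby("Type")[num_cols]
thresh = grouped.transform("mean") + 2 * grouped.transform("std")

# mask() replaces values with NaN wherever the condition is True.
df[num_cols] = df[num_cols].mask(df[num_cols] > thresh)
print(df[df["Test1"].isna()])
```

Using mask on the whole block avoids mutating the group inside a transform callback, and select_dtypes keeps the column list generic.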

How to add a new row to pandas dataframe with non-unique multi-index

df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,2,1,2]], columns=list('xyz'))
where df looks like:
Now I add a new row by:
df.loc['new',:]=[0,0,0]
Then df becomes:
Now I want to do the same but with a different df that has non-unique multi-index:
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,1,2,2]], columns=list('xyz'))
which looks like:
and call
df.loc['new',:]=[0,0,0]
The result is "Exception: cannot handle a non-unique multi-index!"
How could I achieve the goal?
Use concat with a helper DataFrame (DataFrame.append also worked in older versions, but it was removed in pandas 2.0):
df1 = pd.DataFrame([[0,0,0]],
                   columns=df.columns,
                   index=pd.MultiIndex.from_arrays([['new'], ['']]))
#df2 = df.append(df1)  #pandas < 2.0 only
df2 = pd.concat([df, df1])
print (df2)
x y z
a 1 0 1 2
1 3 4 5
b 2 6 7 8
2 9 10 11
new 0 0 0
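A self-contained version of the concat approach, as a sketch. The empty string used for the second index level of the new row is just one possible placeholder, and the name new_row is only for illustration.

```python
import numpy as np
import pandas as pd

# Frame with a non-unique MultiIndex, as in the question.
df = pd.DataFrame(
    np.arange(4 * 3).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 1, 2, 2]],
    columns=list("xyz"),
)

# One-row helper frame whose index has the same number of levels.
new_row = pd.DataFrame(
    [[0, 0, 0]],
    columns=df.columns,
    index=pd.MultiIndex.from_arrays([["new"], [""]]),
)

# concat stacks the rows and does not require unique index entries.
df2 = pd.concat([df, new_row])
print(df2)
```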