Does anyone know how to fix the error "'Index' object has no attribute 'labels'" when the DataFrame is generated from pivot?
Please find below two examples:
the first one (pivot DataFrame) raises 'Index' object has no attribute 'labels',
the second one (groupby DataFrame) works.
Example for 'Pivot':
import pandas as pd
import numpy as np
raw_data = {'A': ['A1', 'A1', 'A2', 'A2'],
            'B': ['B1', 'B2', 'B2', 'B1'],
            'QTY': [1., 2., 3., 4.]}
df = pd.DataFrame(raw_data)
df_pivot = df.pivot(index='A', columns='B', values='QTY')
print "-------Result---------"
print df_pivot
print "------View Index------"
print df_pivot.index
print "------View Label------"
print df_pivot.index.labels
Output:
-------Result---------
B B1 B2
A
A1 1.0 2.0
A2 4.0 3.0
------View Index------
Index(['A1', 'A2'], dtype='object', name='A')
------View Label------
Traceback (most recent call last):
  File "test6.py", line 15, in <module>
    print(df_pivot.index.labels)
AttributeError: 'Index' object has no attribute 'labels'
Example for 'Groupby':
import pandas as pd
import numpy as np
raw_data = {'A': ['A1', 'A1', 'A2', 'A2'],
            'B': ['B1', 'B2', 'B2', 'B1'],
            'QTY': [1., 2., 3., 4.]}
df = pd.DataFrame(raw_data)
#df_pivot = df.pivot(index='A', columns='B', values='QTY')
df_gb = df.groupby(['A','B']).agg({'QTY':'sum'})
print "-------Result---------"
print df_gb
print "------View Index------"
print df_gb.index
print "------View Label------"
print df_gb.index.labels
Output:
-------Result---------
QTY
A B
A1 B1 1.0
B2 2.0
A2 B1 4.0
B2 3.0
------View Index------
MultiIndex(levels=[['A1', 'A2'], ['B1', 'B2']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['A', 'B'])
------View Label------
[[0, 0, 1, 1], [0, 1, 0, 1]]
It seems a plain Index has no labels attribute, so I have to add an isinstance(index, pd.MultiIndex) check to apply different logic. (Checking isinstance(index, pd.Index) would not help, since a MultiIndex is itself an Index.)
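A minimal sketch of that branching (index_labels is a hypothetical helper; note that pandas 0.24 renamed MultiIndex.labels to MultiIndex.codes):
import pandas as pd

def index_labels(index):
    # A MultiIndex carries integer label arrays; a flat Index does not.
    if isinstance(index, pd.MultiIndex):
        return index.codes  # called .labels in pandas < 0.24
    return None  # flat Index: nothing to return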
I want to group my data and assign a new column.
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6]})
df['col3'] = df[['col1', 'col2']].groupby('col1').rolling(2).mean().reset_index()
Expected output: pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6], 'col3': [np.nan, 1.5, 2.5, np.nan, 4.5, 5.5]})
However, this does not work. Is there a straightforward way to do it?
A combination of groupby, apply and assign:
(df.groupby('col1', as_index=False)
   .apply(lambda g: g.assign(col3=g['col2'].rolling(2).mean()))
   .reset_index(drop=True))
output:
col1 col2 col3
0 x1 1 NaN
1 x1 2 1.5
2 x1 3 2.5
3 x2 4 NaN
4 x2 5 4.5
5 x2 6 5.5
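If you only need the new column, a shorter variant (a sketch using groupby().transform, which keeps the result aligned with the original index) gives the same values:
df['col3'] = df.groupby('col1')['col2'].transform(lambda s: s.rolling(2).mean())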
I have two data frames: one made up of a single column of NumPy array lists, and another with two columns. I am trying to match the elements in the first dataframe (df) against the index of df2 to pull two columns, o1 and o2. I was wondering if I could get some input. Please note that the string 'A1' in column o1 appears twice in df2; as you can see in my desired output dataframe, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # ragged lists need dtype=object in newer NumPy
#dataframe 1
df = pd.DataFrame({'A': array_1})
#dataframe 2
df2 = pd.DataFrame({'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'],
                    'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({'A': array_1,
                          'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1', 'A1', 'C1', 'F1']],
                          'o2': [[15, 18, 19], [19, 20, 8], [17, 18, 19, 8]]})
# note: index 0 of df contains 0 and 2, which both map to 'A1' in df2; the desired output keeps only a single 'A1' after removing the duplicate.
I believe you can explode df and use the result to extract information from df2, then finally join back to df:
s = df['A'].explode()  # one row per list element; the original index repeats
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
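One caveat: set() does not preserve order, which is why the lists above come back shuffled. If order matters, an order-preserving dedupe (a sketch relying on dict insertion order, Python 3.7+) would be:
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(dict.fromkeys(x))))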
How can I use pd.cut() in Dask?
Because of the large dataset, I am not able to put the whole dataset into memory before finishing the pd.cut().
Current code that works in pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Group by name and add columns sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount']
        .agg(['sum', 'count'])
        .reset_index()
        .sort_values(by='name', ascending=True))
print(df.head(15))
#Group by bins and recompute sum and count over the grouped rows
df = (df.groupby(pd.cut(df['name'],
                        bins=[0, 4, 8, 100],
                        labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']]
        .sum()
        .reset_index())
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
                                  df['name'],
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
The reason you're seeing this error is that map_partitions passes each partition as the first positional argument, so pd.cut() receives the partition as x, df['name'] as bins, and the bins= keyword then collides with that positional value (see the docs).
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
Output:
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
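Equivalently (a sketch, not required), you can skip the named helper and pass a lambda to map_partitions, since the partition is then the only positional argument:
df = df.groupby(df.map_partitions(lambda part: pd.cut(part['name'],
                                                      bins=[0, 4, 8, 100],
                                                      labels=['namebin1', 'namebin2', 'namebin3'])))[['sum', 'count']].sum().reset_index()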
I am just starting out with pandas and have the following code:
import pandas as pd
d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'bird-mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'cat', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'hops', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
I want to print everything that the animal cat does; here, that will be 'walks' and 'hops'.
I can filter to just the cat cross-section using
df2 = df.xs('cat', level=1)
But from here, how do I access the level 'locomotion'?
You can use get_level_values:
df.xs('cat', level=1).index.get_level_values(1)
Out[181]: Index(['walks', 'hops'], dtype='object', name='locomotion')
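Since the index levels are named, the same lookup can also be written by name (a sketch; arguably clearer than positional level numbers):
df.xs('cat', level='animal').index.get_level_values('locomotion')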
Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array.
SalaryGender.csv sample data
Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1
Use DataFrame.groupby.
This will create a list where each element is a NumPy array of one column (note: groupby(..., axis=1) is deprecated in newer pandas):
[group.values for i, group in df.groupby(level=0, axis=1)]
If you aren't looking for a list, then use:
for i, group in df.groupby(level=0, axis=1):
    print(group.values)
.....
You can also use DataFrame.iteritems (renamed to DataFrame.items in pandas 2.0):
for i, col in df.iteritems():  # use df.items() on pandas >= 2.0
    print(col.to_numpy())
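If you want the arrays keyed by column name, a small variation (a sketch) collects them into a dict in one line:
arrays = {name: col.to_numpy() for name, col in df.items()}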
In [199]: txt = """Salary,Gender,Age,PhD
...: 140,1,47,1
...: 30,0,65,1
...: 35.1,0,56,0
...: 30,1,23,0
...: 80,0,53,1"""
We can load your sample as a structured array:
In [203]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None, names=True)
In [204]: data
Out[204]:
array([(140. , 1, 47, 1), ( 30. , 0, 65, 1), ( 35.1, 0, 56, 0),
( 30. , 1, 23, 0), ( 80. , 0, 53, 1)],
dtype=[('Salary', '<f8'), ('Gender', '<i8'), ('Age', '<i8'), ('PhD', '<i8')])
Each element of the array is a row of the file; the field names come from the header line, and the field dtypes are deduced from the data.
Fields can be accessed by name:
In [205]: data['Salary']
Out[205]: array([140. , 30. , 35.1, 30. , 80. ])
In [206]: data['Gender']
Out[206]: array([1, 0, 0, 1, 0])
They can be accessed that way, or assigned to a variable:
salary = data['Salary']
You can also use unpack:
In [213]: a, b, c, d = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None,
     ...:                            skip_header=1, unpack=True)
In [214]: a
Out[214]: array([140. , 30. , 35.1, 30. , 80. ])
In [215]: b
Out[215]: array([1., 0., 0., 1., 0.])
In [216]: c
Out[216]: array([47., 65., 56., 23., 53.])
In [217]: d
Out[217]: array([1., 1., 0., 0., 1.])
Sometimes it's simpler to load the file one (or selected) column at a time:
In [218]: b = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None,
     ...:                   skip_header=1, usecols=[1])
In [219]: b
Out[219]: array([1., 0., 0., 1., 0.])
Please try this:
SG[SG.columns].values
where SG is your DataFrame (loaded from the file). The code above gives you all column values as a single 2-D array in one go.
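For completeness, a pandas sketch (assuming the file is named SalaryGender.csv, as in the question) that reads the CSV and stores each column in its own NumPy array:
import pandas as pd

df = pd.read_csv('SalaryGender.csv')
salary = df['Salary'].to_numpy()  # one NumPy array per column
gender = df['Gender'].to_numpy()
age = df['Age'].to_numpy()
phd = df['PhD'].to_numpy()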