How to turn Pandas' DataFrame.groupby() result into MultiIndex - pandas

Suppose I have a set of measurements that were obtained by varying two parameters, knob_b and knob_2 (in practice there are a lot more):
data = np.empty((6,3), dtype=np.float)
data[:,0] = [3,4,5,3,4,5]
data[:,1] = [1,1,1,2,2,2]
data[:,2] = np.random.random(6)
df = pd.DataFrame(data, columns=['knob_1', 'knob_2', 'signal'])
i.e., df is
knob_1 knob_2 signal
0 3 1 0.076571
1 4 1 0.488965
2 5 1 0.506059
3 3 2 0.415414
4 4 2 0.771212
5 5 2 0.502188
Now, considering each parameter on its own, I want to find the minimum value that was measured for each setting of this parameter (ignoring the settings of all other parameters). The pedestrian way of doing this is:
new_index = []
new_data = []
for param in df.columns:
if param == 'signal':
continue
group = df.groupby(param)['signal'].min()
for (k,v) in group.items():
new_index.append((param, k))
new_data.append(v)
new_index = pd.MultiIndex.from_tuples(new_index,
names=('parameter', 'value'))
df2 = pd.Series(index=new_index, data=new_data)
resulting df2 being:
parameter value
knob_1 3 0.495674
4 0.277030
5 0.398806
knob_2 1 0.485933
2 0.277030
dtype: float64
Is there a better way to do this, in particular to get rid of the inner loop?
It seems to me that the result of the df.groupby operation already has everything I need - if only there was a way to somehow create a MultiIndex from it without going through the list of tuples.

Use the keys argument of pd.concat():
pd.concat([df.groupby('knob_1')['signal'].min(),
df.groupby('knob_2')['signal'].min()],
keys=['knob_1', 'knob_2'],
names=['parameter', 'value'])

Related

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns and their range goes either from 1 to 5 or 1 to 10
I want to create two lists of these columns names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['ip', 'size']
I want to use a function that gets a dataframe and perform the above transformation only on columns with numbers within those ranges.
I know that if the column 'max()' is 5 than it's 1to5 same idea when max() is 10
What I already did:
def test(df):
list_1to5 = []
list_1to10 = []
for col in df:
if df[col].max() == 5:
list_1to5.append(col)
else:
list_1to10.append(col)
return list_1to5, list_1to10
I tried the above but it's returning the following error msg:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object' maybe this is the reason. If this is the reason, how can I fix the function without the need to cast these columns to float as there are several, sometimes hundreds of these columns and if I run:
df['column'].max() I get 10 or 5
What's the best way to create this this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""
temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]
def test(df):
list_1to5 = []
list_1to10 = []
for col in df.columns:
if df[col].dtype!='O':
if df[col].max() == 5:
list_1to5.append(col)
else:
list_1to10.append(col)
return list_1to5, list_1to10
df = pd.DataFrame(data, columns = cols, dtype=float)
Output:
(['track', 'batch'], ['IP', 'size'])

Replace values inside list by generic numbers to group and reference for statistical computing

I usually use "${:,.2f}". format(prices) to round numbers before commas, but what I'm looking for is different, I want to change values numbers to group them and reference them by mode:
Let say I have this list:
0 34,123.45
1 34,456.78
2 34,567.89
3 33,222.22
4 30,123.45
And the replace function will turn the list to:
0 34,500.00
1 34,500.00
2 34,500.00
3 33,200.00
4 30,100.00
Like this when I use stats.mode(prices_rounded) it will show as a result:
Mode Value = 34500.00
Mode Count = 3
Is there a conversion function already available that does the job? I did search for days without luck...
EDIT - WORKING CODE:
#create list
df3 = df_array
print('########## df3: ',df3)
#convert to float
df4 = df3.astype(float)
print('########## df4: ',df4)
#convert list to string
#df5 = ''.join(map(str, df4))
#print('########## df5: ',df5)
#round values
df6 = np.round(df4 /100) * 100
print('######df6',df6)
#get mode stats
df7 = stats.mode(df6)
print('######df7',df7)
#get mode value
df8 = df7[0][0]
print('######df8',df8)
#convert to integer
df9 = int(df8)
print('######df9',df9)
This is exactly what I wanted, thanks!
You can use:
>>> sr
0 34123.45 # <- why 34500.00?
1 34456.78
2 34567.89 # <- why 34500.00?
3 33222.22
4 30123.45
dtype: float64
>>> np.round(sr / 100) * 100
0 34100.0
1 34500.0
2 34600.0
3 33200.0
4 30100.0
dtype: float64

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a panda series using a user defined function and to write an output into a new column. I figured out individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x='d7e'
y='8e0d'
s=0
for i in y:
b=str(i)
if b not in x:
s+=0
else:
s+=1
print(s)
the right result for these particular strings is 2
Note, when I do def func(x,y): something happens to s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
lambda x: np.NAN
if pd.isna(x['temp'])
else sum(i in str(x['temp']) for i in str(x['Code'])),
1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0

Pandas - Trying to create a list or Series in a data frame cell

I have the following data frame
df = pd.DataFrame({'A':[74.75, 91.71, 145.66], 'B':[4, 3, 3], 'C':[25.34, 33.52, 54.70]})
A B C
0 74.75 4 25.34
1 91.71 3 33.52
2 145.66 3 54.70
I would like to create another column df['D'] that would be a list or series from the first 3 columns suitable for use in another column with the np.irr function that would look like this
D
0 [ -74.75, 2.34, 25.34, 25.34, 25.34]
1 [ -91.71, 33.52, 33.52, 33.52]
2 [-145.66, 54.70, 54.70, 54.70]
so I could ultimately do something like this
df['E'] = np.irr(df['D'])
I did get as far as this
[-df.A[0]]+[df.C[0]]*df.B[0]
but it is not quite there.
Do you really need the column 'D'?
By the way you can easily add it as:
df['D'] = [[-df.A[i]]+[df.C[i]]*df.B[i] for i in xrange(len(df))]
df['E'] = df['D'].map(np.irr)
if you don't need it, you can directly set E
df['E'] = [np.irr([-df.A[i]]+[df.C[i]]*df.B[i]) for i in xrange(len(df))]
or:
df['E'] = df.apply(lambda x: np.irr([-x.A] + [x.C] * x.B), axis=1)

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
A B
one 1 -2.028736 -0.466668
2 -1.877478 0.179211
3 0.886038 0.679528
two 1 1.101735 0.169177
2 0.756676 -1.043739
3 1.189944 1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
A B
one 1 -2.712137 -0.131805
2 -0.390227 -1.333230
3 0.047128 0.438284
two 1 0.055254 -1.434262
2 2.392265 -1.474072
3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
A B
one -0.171549 -0.447473
two 0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for reference who searches and finds this page, the easier solution I think is the level keyword in mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(z
ip(*arrays)),columns=['A','B'])
In [6]: df
Out[6]:
A B
one 1 -0.472890 2.297778
2 -2.002773 -0.114489
3 -1.337794 -1.464213
two 1 1.964838 -0.623666
2 0.838388 0.229361
3 1.735198 0.170260
In [7]: df.mean(level=0)
Out[7]:
A B
one -1.271152 0.239692
two 1.512808 -0.074682
In this case it means that level 0 is kept over axis 0 (the rows, default value for mean)
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
# Compute grouped mean over only those indices.
df.ix[idxs].mean(level=0)