How to pivot a dataframe using multiple columns? - pandas

How can I use pandas.pivot_table or any other method to split the following data frame into two?
This is my input data frame:
Method N Mean Max Min Median Mode Meduim
0 A 5 0.40 0.55 0.25 0.39 N/A m1
2 A 10 0.26 0.47 0.10 0.25 N/A m2
1 B 5 0.48 0.62 0.33 0.50 N/A m1
3 B 7 0.41 0.47 0.36 0.42 0.36 m2
And I want to output the following two data frames:
A m1 m2
N 5 10
Mean 0.40 0.26
Max 0.55 0.47
Min 0.25 0.10
Median 0.39 0.25
Mode N/A N/A
and
B m1 m2
N 5 7
Mean 0.48 0.41
Max 0.62 0.47
Min 0.33 0.36
Median 0.50 0.42
Mode N/A 0.36
Thank you.

Is it pivot?
df.set_index(['Method','Meduim']).T
gives:
Method A B
Meduim m1 m2 m1 m2
N 5.00 10.00 5.00 7.00
Mean 0.40 0.26 0.48 0.41
Max 0.55 0.47 0.62 0.47
Min 0.25 0.10 0.33 0.36
Median 0.39 0.25 0.50 0.42
Mode NaN NaN NaN 0.36
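If you actually need two separate frames (one per Method), you can select each method from the first level of the resulting column MultiIndex. A minimal sketch, assuming the input frame is rebuilt from the sample data in the question:
import pandas as pd

# reconstruct the sample input from the question
df = pd.DataFrame({
    'Method': ['A', 'A', 'B', 'B'],
    'N': [5, 10, 5, 7],
    'Mean': [0.40, 0.26, 0.48, 0.41],
    'Max': [0.55, 0.47, 0.62, 0.47],
    'Min': [0.25, 0.10, 0.33, 0.36],
    'Median': [0.39, 0.25, 0.50, 0.42],
    'Mode': [None, None, None, 0.36],
    'Meduim': ['m1', 'm2', 'm1', 'm2'],
})

wide = df.set_index(['Method', 'Meduim']).T

# pick each Method out of the first level of the column MultiIndex
df_a = wide['A']   # columns m1, m2; rows N, Mean, Max, Min, Median, Mode
df_b = wide['B']   # same layout for Method B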

Related

How to use NTILE function to create two groups for two columns where one is a nested NTILE of the other?

I have financial data that I want to categorise (using NTILE()) by two columns that contain percentage values (risk, debt_to_assets). My end goal is that I want to aggregate some other column (profit) by the two categories, but that's not related to my issue at hand. The data looks something like this:
profit   risk   debt_to_assets
7000     0.10   0.20
1000     0.40   0.70
3000     0.15   0.50
4000     0.30   0.30
2000     0.20   0.60
The issue I'm trying to solve is that I want the categories to be nested so that, in addition to the population being distributed uniformly across cells, the inner categories are defined by consistent ranges across the outer categories (i.e. I want the min and max values for the inner-category cells (x0, y0), (x1, y0), (x2, y0), ... to all be the same, or as close as possible, where the x's are the outer category and the y's are the inner category).
Ideally if I were to aggregate the columns used for the NTILE() function (using 3 inner categories and 3 outer categories for example) I'd want a table that resembles the following:
risk_cat   dta_cat   min_risk   max_risk   min_dta   max_dta   count
1          1         0.00       0.33       0.00      0.33      100
1          2         0.00       0.33       0.34      0.67      100
1          3         0.00       0.33       0.68      1.00      100
2          1         0.34       0.67       0.00      0.33      100
2          2         0.34       0.67       0.34      0.67      100
2          3         0.34       0.67       0.68      1.00      100
3          1         0.68       1.00       0.00      0.33      100
3          2         0.68       1.00       0.34      0.67      100
3          3         0.68       1.00       0.68      1.00      100
These are the solutions I've tried but they only solve part of the issue, not the whole thing:
SELECT *,
NTILE(3) OVER (
ORDER BY risk
) AS risk_cat,
NTILE(3) OVER (
ORDER BY debt_to_assets
) AS dta_cat
FROM my_table
This would result in an aggregated table like this:
risk_cat   dta_cat   min_risk   max_risk   min_dta   max_dta   count
1          1         0.00       0.33       0.00      0.33      10
1          2         0.00       0.33       0.34      0.67      55
1          3         0.00       0.33       0.68      1.00      180
2          1         0.34       0.67       0.00      0.33      135
2          2         0.34       0.67       0.34      0.67      140
2          3         0.34       0.67       0.68      1.00      100
3          1         0.68       1.00       0.00      0.33      130
3          2         0.68       1.00       0.34      0.67      110
3          3         0.68       1.00       0.68      1.00      40
The problem is that the count across the two categories isn't uniform.
WITH outer_cat AS (
SELECT *,
NTILE(3) OVER (
ORDER BY risk
) AS risk_cat
FROM my_table
)
SELECT *,
NTILE(3) OVER(
PARTITION BY risk_cat
ORDER BY debt_to_assets
) AS dta_cat
FROM outer_cat
The aggregated table for this might resemble the following:
risk_cat   dta_cat   min_risk   max_risk   min_dta   max_dta   count
1          1         0.00       0.33       0.10      0.70      100
1          2         0.00       0.33       0.71      0.90      100
1          3         0.00       0.33       0.91      1.00      100
2          1         0.34       0.67       0.05      0.35      100
2          2         0.34       0.67       0.36      0.60      100
2          3         0.34       0.67       0.61      0.90      100
3          1         0.68       1.00       0.00      0.25      100
3          2         0.68       1.00       0.26      0.50      100
3          3         0.68       1.00       0.51      0.80      100
The problem this time is that the min and max values for the inner category vary too much across the outer category.
SELECT *,
NTILE(9) OVER(
ORDER BY risk, debt_to_assets
) AS dual_cat
FROM my_table
The aggregated table for this looks something like the following:
dual_cat   min_risk   max_risk   min_dta   max_dta   count
1          0.00       0.11       0.55      1.00      100
2          0.12       0.22       0.35      1.00      100
3          0.23       0.33       0.15      1.00      100
4          0.34       0.44       0.40      1.00      100
5          0.45       0.55       0.10      1.00      100
6          0.56       0.66       0.10      0.95      100
7          0.67       0.77       0.05      1.00      100
8          0.78       0.88       0.20      1.00      100
9          0.89       1.00       0.00      1.00      100
This was just a last attempt at a solution after the previous two didn't work. This attempt didn't capture any of the behaviour that I was looking for.
Is there a solution to my problem that I'm not seeing?
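Not a direct answer, but the tension in the requirements can be illustrated with pandas: quantile bucketing (pd.qcut, the analogue of NTILE) gives equal counts with data-dependent edges, while fixed-edge bucketing (pd.cut) gives consistent ranges with unequal counts; in general you cannot have both exactly unless risk and debt_to_assets are roughly independent and uniformly distributed. A minimal sketch with made-up data:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'risk': rng.random(900),
    'debt_to_assets': rng.random(900),
})

# equal counts per bucket, edges depend on the data (what NTILE does)
df['risk_cat'] = pd.qcut(df['risk'], 3, labels=[1, 2, 3])

# fixed edges, counts follow the data (consistent ranges across cells)
df['dta_cat'] = pd.cut(df['debt_to_assets'], bins=[0, 1/3, 2/3, 1],
                       labels=[1, 2, 3], include_lowest=True)

print(df.groupby(['risk_cat', 'dta_cat'], observed=True).size())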

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df = pd.read_csv(file_path, header=j+3, engine='python', nrows=77,
                                 sep=r'\s{2,}', skip_blank_lines=False, index_col=False)
wind_rass_table_df = wind_rass_table_df.shift(periods=1, axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by one column, but it does more than that; it does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted it into the 7th column, shifted the 7th into the 8th, repeated the 8th, shifted the 9th over, and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
You can use iloc and create another dataframe (note the .values, so the data is relabeled positionally rather than realigned by column name):
df = pd.DataFrame(data=df.iloc[:, :-1].values, columns=df.columns[1:], index=df.index)
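Another sketch of the same idea, keeping the values in place and only realigning the labels (assuming, as above, that the unwanted label is the leading '#' and the trailing data column should be dropped):
# drop the last data column and reuse the original labels minus '#'
shifted = df.iloc[:, :-1].set_axis(df.columns[1:], axis=1)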

Shifting columns in dataframe

I have a pandas dataframe as:
Date normPwr_0 normPwr_1 tempNorm_1 tempNorm_2 tempNorm_3 tempNorm_0
6/15/2019 0.89 0.94 0.83 0.88 0.92 0.82
6/16/2019 0.97 0.89 0.82 0.83 0.88 0.97
6/17/2019 0.97 0.97 0.97 0.82 0.83 2,188.18
I want to shift the column values for only tempNorm columns. My desired output is:
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
6/15/2019 0.89 0.94 0.83 0.88 0.82
6/16/2019 0.97 0.89 0.82 0.83 0.97
6/17/2019 0.97 0.97 0.97 0.82 2,188.18
The tricky part is that the column names for tempNorm vary: sometimes I have [tempNorm_1 tempNorm_2 tempNorm_3 tempNorm_0] and other times I have
[tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0]
When the columns are [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0], the desired columns in the output dataframe are [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7].
Basically I am trying to shift the values among the columns whose names contain tempNorm, so that the values from tempNorm_0 get pushed into the next higher-named column, each column's values move up in turn, and the highest-named column gets dropped off.
I am not sure how to approach this in a clean pythonic way.
EDIT:
For [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0], the values from tempNorm_0 are moved into tempNorm_4; the values from tempNorm_4 are moved into tempNorm_5, and so forth. tempNorm_7's data gets dropped and is replaced by the data from tempNorm_6.
sorted, filter, rename
# sort the tempNorm columns by their numeric suffix
a, b, *c = sorted(df.filter(like='tempNorm'), key=lambda col: int(col.rsplit('_', 1)[1]))
df.drop(columns=b).rename(columns={a: b})
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
0 6/15/2019 0.89 0.94 0.88 0.92 0.82
1 6/16/2019 0.97 0.89 0.83 0.88 0.97
2 6/17/2019 0.97 0.97 0.82 0.83 2,188.18
IIUC, you want to roll the columns with name tempNorm_ and drop the last:
import numpy as np

# get all the tempNorm columns
tmp_cols = np.array([col for col in df.columns if 'tempNorm' in col])
# roll and rename:
df.rename(columns={col: new_col for col, new_col in zip(tmp_cols, np.roll(tmp_cols, -1))},
          inplace=True)
# drop the last tempNorm
df.drop(tmp_cols[-1], axis=1, inplace=True)
Output:
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
0 6/15/2019 0.89 0.94 0.83 0.88 0.82
1 6/16/2019 0.97 0.89 0.82 0.83 0.97
2 6/17/2019 0.97 0.97 0.97 0.82 2,188.18
You can also do something like:
m = df.filter(like='tempNorm').sort_index(axis=1)
n = m[m.columns[::-1]].T.shift(-1, axis=0).T.dropna(how='all', axis=1)
pd.concat([df[df.columns.difference(m.columns)], n], axis=1)
Date normPwr_0 normPwr_1 tempNorm_3 tempNorm_2 tempNorm_1
0 6/15/2019 0.89 0.94 0.88 0.83 0.82
1 6/16/2019 0.97 0.89 0.83 0.82 0.97
2 6/17/2019 0.97 0.97 0.82 0.97 2,188.18
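Both approaches can be wrapped in a small reusable helper that works for either column layout; a sketch only, where the prefix and the drop-the-highest rule follow the EDIT in the question:
def roll_temp_norm(frame, prefix='tempNorm_'):
    # order the matching columns by their numeric suffix (e.g. 0, 4, 5, 6, 7)
    cols = sorted((c for c in frame.columns if c.startswith(prefix)),
                  key=lambda c: int(c.rsplit('_', 1)[1]))
    # drop the highest-named column, then bump every other name up one slot,
    # e.g. tempNorm_0 -> tempNorm_4, tempNorm_4 -> tempNorm_5, ...
    mapping = dict(zip(cols[:-1], cols[1:]))
    return frame.drop(columns=cols[-1]).rename(columns=mapping)
Calling roll_temp_norm(df) on the sample frame above should give the desired output for both the [tempNorm_1 ... tempNorm_0] and the [tempNorm_4 ... tempNorm_0] layouts.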

Top high IO consuming process

I would like to find the top IO-consuming processes. I used 'sar' for this, but it does not display the PIDs of the processes. Please suggest an efficient way to do this.
From previously asked questions I understand there is a utility called iotop, but unfortunately I can't install it on our systems without organizational approval. Please suggest alternative ways.
sar > file_sar
cat file_sar|sort -nr -k7,7|head
12:10:01 AM all 15.06 0.00 6.59 0.53 0.10 77.72
01:50:02 AM all 16.67 0.00 6.30 0.20 0.08 76.74
12:50:01 AM all 11.09 0.00 2.64 0.18 0.08 86.01
12:30:02 AM all 12.44 0.00 1.68 0.17 0.08 85.63
01:10:02 AM all 13.43 0.00 1.83 0.16 0.11 84.47
01:30:02 AM all 13.21 0.00 5.86 0.13 0.07 80.73
Average: all 13.20 0.00 3.94 0.15 0.08 82.63
12:20:01 AM all 10.94 0.00 1.53 0.07 0.08 87.38
12:40:02 AM all 8.17 0.00 1.28 0.06 0.07 90.42
01:20:01 AM all 15.35 0.00 6.14 0.06 0.09 78.36

Pandas - ufunc 'subtract' did not contain a loop with signature matching type

I have this piece of code:
self.value = 0.8
for col in df.ix[:, 'value1':'value3']:
    df = df.iloc[abs(df[col] - self.value).argsort()]
which works perfectly as part of the main() function. At return, it prints:
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
6 Radiohead Codex 0.00 1.00 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
However, when I import this function as part of a module, even though I'm passing and printing the same 0.8 self.value, I get the following error:
df = df.iloc[(df[col] - self.flavor).argsort()]
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 721, in wrapper
result = wrap_results(safe_na_op(lvalues, rvalues))
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 682, in safe_na_op
return na_op(lvalues, rvalues)
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 668, in na_op
result[mask] = op(x[mask], y)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
Why is it so? What is going on?
pd.DataFrame.ix has been deprecated. You should stop using it.
Your use of 'value1':'value3' is dangerous as it can include columns you didn't expect if your columns aren't positioned in the order you thought.
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 1, 2, 3]],
    columns='artist track v1 v2 v3'.split()
)
list(df.loc[:, 'v1':'v3'])
['v1', 'v2', 'v3']
But rearrange the columns and
list(df.loc[:, ['v1', 'v2', 'artist', 'v3', 'track']].loc[:, 'v1':'v3'])
['v1', 'v2', 'artist', 'v3']
You got 'artist' in the list. And column 'artist' is of type string and can't be subtracted from or by an integer or float.
df['artist'] - df['v1']
> TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
Setup
Shuffle df
df = df.sample(frac=1)
df
artist track pos neg neu
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
Solution
Use np.lexsort
import numpy as np

value = 0.8
v = df[['pos', 'neg', 'neu']].values
df.iloc[np.lexsort(np.abs(v - value).T)]
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
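As a follow-up, the original loop itself also works once the columns are named explicitly instead of sliced with .ix, so a string column like 'artist' can never sneak in. A sketch using the column names from the printed frame above; with a stable sort, successive passes end up ordered primarily by the last column, which matches the np.lexsort result:
value = 0.8
for col in ['pos', 'neg', 'neu']:   # explicit numeric columns only
    df = df.iloc[(df[col] - value).abs().argsort(kind='stable')]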