Using awk to print lines matching two looped patterns

Here is an example of the output:
_Ibaseebna1_1 | .7890247 .4173177 -0.45 0.654 .2798256 2.224814
_Ibaseebna1_2 | .8838311 .2739327 -0.40 0.690 .4814483 1.622516
_Ibaseebna1_3 | .7762056 .3459759 -0.57 0.570 .3240211 1.859432
gen | 1.072875 .3515304 0.21 0.830 .5644823 2.039142
iagrv | .9696135 .0108311 -2.76 0.006 .9486157 .991076
_Istudysite_1 | .60195 .1877754 -1.63 0.104 .3266121 1.109401
baseebna1_cat | .9311375 .1199702 -0.55 0.580 .7233404 1.198629
gen | 1.050177 .3304 0.16 0.876 .5668433 1.945638
iagrv | .9701813 .0106488 -2.76 0.006 .949533 .9912787
_Ibaseebna1_1 | 1.852936 1.067927 1.07 0.285 .5987925 5.733826
_Ibaseebna1_2 | 1.894772 1.157542 1.05 0.295 .5721982 6.274333
_Ibaseebna1_3 | 3.209055 1.914574 1.95 0.051 .996636 10.33279
gen | 2.397269 .867482 2.42 0.016 1.179502 4.872308
iagrv | .9593829 .013887 -2.86 0.004 .9325473 .9869908
baseebna1_cat | 1.39457 .1997098 2.32 0.020 1.05328 1.846447
I want to extract the results, and the final output should look like this:
_Ibaseebna1_1 0.79 0.28 2.22 0.65
_Ibaseebna1_2 0.88 0.48 1.62 0.69
_Ibaseebna1_3 0.78 0.32 1.86 0.57
_baseebna1_cat 0.93 0.72 1.20 0.58
_Ibaseebna1_1 1.85 0.60 5.73 0.29
_Ibaseebna1_2 1.89 0.57 6.27 0.30
_Ibaseebna1_3 3.21 1.00 10.33 0.05
_baseebna1_cat 1.39 1.05 1.85 0.02
As I have many variables that follow the same pattern, I use the for loop below to try to extract the results:
FILE=t.txt
for i in baseebna1 ebna1 baseLDL LDL
do
for ((n=1;n<=3;n++))
do
awk -v pattern=_I"$i"_"$n" '$0 ~ pattern && /\|/ {printf "%15s %7.2f %7.2f %7.2f %7.2f\n",$1,$3,$7,$8,$6}' $FILE >> 1.txt
awk -v p1="$i"_cat '$0 ~ p1 && /\|/ {printf "%15s %7.2f %7.2f %7.2f %7.2f\n",$1,$3,$7,$8,$6}' $FILE >> 1.txt
done
done
But the result is not what I expected. Could someone help me improve it?
Thanks

$ awk '!/^ /{printf "%-15s %.2f %.2f %.2f %.2f\n",($1~/^_/?"":"_")$1,$3,$7,$8,$6}' file
_Ibaseebna1_1 0.79 0.28 2.22 0.65
_Ibaseebna1_2 0.88 0.48 1.62 0.69
_Ibaseebna1_3 0.78 0.32 1.86 0.57
_Istudysite_1 0.60 0.33 1.11 0.10
_baseebna1_cat 0.93 0.72 1.20 0.58
_Ibaseebna1_1 1.85 0.60 5.73 0.28
_Ibaseebna1_2 1.89 0.57 6.27 0.29
_Ibaseebna1_3 3.21 1.00 10.33 0.05
_baseebna1_cat 1.39 1.05 1.85 0.02
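The loop in the question runs awk once per pattern, so rows come out grouped by pattern rather than in file order, and the _cat line is printed once per pass of the inner loop. For comparison, here is a minimal single-pass sketch in Python. It assumes the column layout shown in the question (name, "|", estimate, standard error, z, p, lower bound, upper bound) and uses the variable stems from the question's loop; SAMPLE and extract are illustrative names, not part of the original.

```python
import re

# A few lines copied from the question's sample output.
SAMPLE = """\
_Ibaseebna1_1 | .7890247 .4173177 -0.45 0.654 .2798256 2.224814
gen | 1.072875 .3515304 0.21 0.830 .5644823 2.039142
baseebna1_cat | .9311375 .1199702 -0.55 0.580 .7233404 1.198629
"""

# Variable stems taken from the question's for loop.
STEMS = ["baseebna1", "ebna1", "baseLDL", "LDL"]
_alt = "|".join(re.escape(s) for s in STEMS)
WANTED = re.compile(r"^(?:_I(?:%s)_\d+|(?:%s)_cat)$" % (_alt, _alt))


def extract(lines):
    """Return formatted rows for the wanted variables, in file order."""
    rows = []
    for line in lines:
        fields = line.split()
        # Keep only table rows whose first field is one of the wanted names.
        if len(fields) < 8 or fields[1] != "|" or not WANTED.match(fields[0]):
            continue
        name = fields[0] if fields[0].startswith("_") else "_" + fields[0]
        # Same columns as awk's $3, $7, $8, $6: estimate, lower, upper, p.
        est, low, high, p = (float(fields[i]) for i in (2, 6, 7, 5))
        rows.append("%-15s %.2f %.2f %.2f %.2f" % (name, est, low, high, p))
    return rows


if __name__ == "__main__":
    print("\n".join(extract(SAMPLE.splitlines())))
```

Because the file is read once, the output keeps the order of the input, which is what the expected result in the question shows.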

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column:
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by 1 column, but it does more than that. It produces this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted into the 7th column, shifted the 7th into the 8th and repeated the 8th, shifting the 9th over and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
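You can use iloc and create another dataframe. Pass the underlying array (.values) rather than the sliced dataframe; otherwise the constructor selects columns by their old labels instead of relabeling them:
df = pd.DataFrame(data=df.iloc[:, :-1].values, columns=df.columns[1:], index=df.index)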
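Some older pandas versions applied shift(axis=1) separately within each dtype block, which may explain the scrambled result in the question. The sketch below shows the relabeling idea in a form that keeps each column's dtype: take every column except the last, then assign the header names from the second one onward. The toy frame is made up for illustration and only mimics the question's layout, where a stray "#" header pushes every name one place to the right of its data.

```python
import pandas as pd

# Toy data: the values under "#" are really z, those under "z" are really
# speed, and so on; the last column holds a leftover value to discard.
raw = pd.DataFrame(
    {
        "#": [40, 50],
        "z": [2.83, 2.41],
        "speed": [181.0, 184.8],
        "dir": [11, 11],
    }
)

# Drop the last column's data, then relabel with the names minus the first.
fixed = raw.iloc[:, :-1].copy()
fixed.columns = raw.columns[1:]

print(fixed)
```

Assigning to .columns only relabels, so the integer column stays integer and no NaN values are introduced.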

How to pivot a dataframe using multiple columns?

How can I use pandas.pivot_table or any other method to split the following data frame into two?
This is my input data frame:
Method N Mean Max Min Median Mode Meduim
0 A 5 0.40 0.55 0.25 0.39 N/A m1
2 A 10 0.26 0.47 0.10 0.25 N/A m2
1 B 5 0.48 0.62 0.33 0.50 N/A m1
3 B 7 0.41 0.47 0.36 0.42 0.36 m2
And I want to output the two following data frames
A m1 m2
N 5 10
Mean 0.40 0.26
Max 0.55 0.47
Min 0.25 0.10
Median 0.39 0.25
Mode N/A N/A
and
B m1 m2
N 5 7
Mean 0.48 0.41
Max 0.62 0.47
Min 0.33 0.36
Median 0.50 0.42
Mode N/A 0.36
Thank you.
Do you need a pivot at all? Setting the index and transposing:
df.set_index(['Method','Meduim']).T
gives:
Method A B
Meduim m1 m2 m1 m2
N 5.00 10.00 5.00 7.00
Mean 0.40 0.26 0.48 0.41
Max 0.55 0.47 0.62 0.47
Min 0.25 0.10 0.33 0.36
Median 0.39 0.25 0.50 0.42
Mode NaN NaN NaN 0.36
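To get the two separate frames the question asks for, the transposed result can be split on the top column level. This is a minimal sketch on a reduced copy of the question's data (only N, Mean and Mode are kept, and N/A is represented as NaN); wide and frames are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Method": ["A", "A", "B", "B"],
        "N": [5, 10, 5, 7],
        "Mean": [0.40, 0.26, 0.48, 0.41],
        "Mode": [np.nan, np.nan, np.nan, 0.36],
        "Meduim": ["m1", "m2", "m1", "m2"],  # spelling as in the question
    }
)

# Rows become statistics; columns become a (Method, Meduim) MultiIndex.
wide = df.set_index(["Method", "Meduim"]).T

# Selecting one top-level key returns the sub-frame with m1/m2 columns.
methods = wide.columns.get_level_values("Method").unique()
frames = {method: wide[method] for method in methods}

print(frames["A"])
print(frames["B"])
```

Note that transposing puts all statistics in shared columns, so the integer N values come back as floats.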

Shifting columns in dataframe

I have a pandas dataframe as:
Date normPwr_0 normPwr_1 tempNorm_1 tempNorm_2 tempNorm_3 tempNorm_0
6/15/2019 0.89 0.94 0.83 0.88 0.92 0.82
6/16/2019 0.97 0.89 0.82 0.83 0.88 0.97
6/17/2019 0.97 0.97 0.97 0.82 0.83 2,188.18
I want to shift the column values for only tempNorm columns. My desired output is:
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
6/15/2019 0.89 0.94 0.83 0.88 0.82
6/16/2019 0.97 0.89 0.82 0.83 0.97
6/17/2019 0.97 0.97 0.97 0.82 2,188.18
The tricky part is that the tempNorm column names vary: sometimes I have [tempNorm_1 tempNorm_2 tempNorm_3 tempNorm_0] and other times I have
[tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0].
When the columns are [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0], the desired columns in the output dataframe are [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7].
Basically, I am trying to shift the values across the columns whose names contain tempNorm: the values from tempNorm_0 move into the next-higher-named column, and so on, and the values of the highest-named column are dropped.
I am not sure how to approach this in a clean, Pythonic way.
EDIT:
For [tempNorm_4 tempNorm_5 tempNorm_6 tempNorm_7 tempNorm_0], the values from tempNorm_0 move into tempNorm_4, the values from tempNorm_4 move into tempNorm_5, and so forth. The tempNorm_7 data is dropped and replaced by the data from tempNorm_6.
sorted, filter, rename
a, b, *c = sorted(df.filter(like='tempNorm'), key=lambda c: int(c.rsplit('_', 1)[1]))
df.drop(columns=b).rename(columns={a: b})
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
0 6/15/2019 0.89 0.94 0.88 0.92 0.82
1 6/16/2019 0.97 0.89 0.83 0.88 0.97
2 6/17/2019 0.97 0.97 0.82 0.83 2,188.18
IIUC, you want to roll the columns with name tempNorm_ and drop the last:
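import numpy as np
# get all the tempNorm columns (this relies on tempNorm_0 being the last of them)
tmp_cols = np.array([col for col in df.columns if 'tempNorm' in col])
# roll the names one place to the left and rename:
df.rename(columns={col: new_col for col, new_col in zip(tmp_cols, np.roll(tmp_cols, -1))},
          inplace=True)
# drop the column that now carries the last name; it holds the highest-numbered column's old values
df.drop(tmp_cols[-1], axis=1, inplace=True)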
Output:
Date normPwr_0 normPwr_1 tempNorm_2 tempNorm_3 tempNorm_1
0 6/15/2019 0.89 0.94 0.83 0.88 0.82
1 6/16/2019 0.97 0.89 0.82 0.83 0.97
2 6/17/2019 0.97 0.97 0.97 0.82 2,188.18
You can also do something like:
m=df.filter(like='tempNorm').sort_index(axis=1)
n=m[m.columns[::-1]].T.shift(-1,axis=0).T.dropna(how='all',axis=1)
pd.concat([df[df.columns.difference(m.columns)],n],axis=1)
Date normPwr_0 normPwr_1 tempNorm_3 tempNorm_2 tempNorm_1
0 6/15/2019 0.89 0.94 0.88 0.83 0.82
1 6/16/2019 0.97 0.89 0.83 0.82 0.97
2 6/17/2019 0.97 0.97 0.82 0.97 2,188.18
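Since the suffixes vary between frames, one way to avoid depending on column order is to sort the tempNorm names by their numeric suffix and rename each one to its successor. The following is a sketch under that assumption; shift_prefixed is a hypothetical helper name, and the data is the first two rows of the question's table.

```python
import pandas as pd


def shift_prefixed(df, prefix="tempNorm"):
    """Move each prefixed column's values into the next-higher-numbered name.

    The highest-numbered column's values are discarded and the lowest
    name (e.g. tempNorm_0) disappears from the result.
    """
    cols = sorted(
        (c for c in df.columns if c.startswith(prefix + "_")),
        key=lambda c: int(c.rsplit("_", 1)[1]),
    )
    # Drop the highest-numbered column, then rename every remaining
    # column to the next name up. rename applies the mapping in one
    # pass, so the renames do not chain.
    mapping = dict(zip(cols[:-1], cols[1:]))
    return df.drop(columns=cols[-1]).rename(columns=mapping)


df = pd.DataFrame(
    {
        "Date": ["6/15/2019", "6/16/2019"],
        "normPwr_0": [0.89, 0.97],
        "normPwr_1": [0.94, 0.89],
        "tempNorm_1": [0.83, 0.82],
        "tempNorm_2": [0.88, 0.83],
        "tempNorm_3": [0.92, 0.88],
        "tempNorm_0": [0.82, 0.97],
    }
)

out = shift_prefixed(df)
print(out)
```

The columns keep their original positions, so for this input the tempNorm columns come out in the order tempNorm_2, tempNorm_3, tempNorm_1, as in the desired output.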

Finding the top I/O-consuming processes

I would like to find the processes that consume the most I/O. I used 'sar' for this, but it does not display the PIDs of the processes. Please suggest an efficient way to do this.
From previously asked questions I understand there is a utility called iotop. Unfortunately, I can't install it on our systems without organizational approval. Please suggest alternatives.
sar > file_sar
cat file_sar|sort -nr -k7,7|head
12:10:01 AM all 15.06 0.00 6.59 0.53 0.10 77.72
01:50:02 AM all 16.67 0.00 6.30 0.20 0.08 76.74
12:50:01 AM all 11.09 0.00 2.64 0.18 0.08 86.01
12:30:02 AM all 12.44 0.00 1.68 0.17 0.08 85.63
01:10:02 AM all 13.43 0.00 1.83 0.16 0.11 84.47
01:30:02 AM all 13.21 0.00 5.86 0.13 0.07 80.73
Average: all 13.20 0.00 3.94 0.15 0.08 82.63
12:20:01 AM all 10.94 0.00 1.53 0.07 0.08 87.38
12:40:02 AM all 8.17 0.00 1.28 0.06 0.07 90.42
01:20:01 AM all 15.35 0.00 6.14 0.06 0.09 78.36
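One option that needs no extra packages on Linux is to read the per-process counters in /proc/<pid>/io directly (pidstat -d may also be worth checking, if the installed sysstat package ships it). The sketch below assumes a kernel with task I/O accounting enabled; reading other users' processes usually requires root, and the counters are cumulative since each process started, so they are totals rather than rates. parse_proc_io and top_io are illustrative names.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = int(value)
    return stats


def top_io(limit=10, proc="/proc"):
    """Return up to `limit` (total_bytes, pid, name) tuples, largest first."""
    if not os.path.isdir(proc):
        return []  # not a Linux-style /proc; nothing to report
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "io")) as fh:
                stats = parse_proc_io(fh.read())
            with open(os.path.join(proc, entry, "comm")) as fh:
                name = fh.read().strip()
        except OSError:
            continue  # process exited, or not readable without privileges
        total = stats.get("read_bytes", 0) + stats.get("write_bytes", 0)
        rows.append((total, int(entry), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for total, pid, name in top_io():
        print("%12d %7d %s" % (total, pid, name))
```

To approximate a rate, the same snapshot can be taken twice a few seconds apart and the totals subtracted per PID.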

Pandas - ufunc 'subtract' did not contain a loop with signature matching type

I have this piece of code:
self.value=0.8
for col in df.ix[:,'value1':'value3']:
df = df.iloc[abs(df[col] - self.value).argsort()]
This works perfectly as part of the main() function. On return, it prints:
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
6 Radiohead Codex 0.00 1.00 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
However, when I import this function as part of a module, even though I'm passing and printing the same self.value of 0.8, I get the following error:
df = df.iloc[(df[col] - self.flavor).argsort()]
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 721, in wrapper
result = wrap_results(safe_na_op(lvalues, rvalues))
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 682, in safe_na_op
return na_op(lvalues, rvalues)
File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/ops.py", line 668, in na_op
result[mask] = op(x[mask], y)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
Why is this so? What is going on?
pd.DataFrame.ix has been deprecated. You should stop using it.
Your use of 'value1':'value3' is dangerous as it can include columns you didn't expect if your columns aren't positioned in the order you thought.
df = pd.DataFrame(
[['a', 'b', 1, 2, 3]],
columns='artist track v1 v2 v3'.split()
)
list(df.loc[:, 'v1':'v3'])
['v1', 'v2', 'v3']
But rearrange the columns and
list(df.loc[:, ['v1', 'v2', 'artist', 'v3', 'track']].loc[:, 'v1':'v3'])
['v1', 'v2', 'artist', 'v3']
You got 'artist' in the list. Column 'artist' holds strings, which can't be subtracted from or by an integer or float.
df['artist'] - df['v1']
> TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
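A defensive variant of the question's loop picks the columns by dtype instead of by a label slice, so a string column can never end up in the subtraction. This is a minimal sketch on made-up rows in the style of the question's table; it assumes every numeric column should take part in the sort.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "artist": ["Aphex Twin", "Jeff Buckley", "Radiohead"],
        "pos": [0.00, 0.39, 0.08],
        "neg": [1.00, 0.61, 0.92],
    }
)
value = 0.8

# Only numeric columns are candidates, whatever order the columns are in.
numeric_cols = df.select_dtypes(include="number").columns

for col in numeric_cols:
    # Positional order of rows by distance from `value`; a stable sort
    # keeps the order from earlier columns as the tie-breaker.
    order = (df[col] - value).abs().to_numpy().argsort(kind="stable")
    df = df.iloc[order]

print(df)
```

As in the original loop, the last column processed dominates the final order.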
Setup
Shuffle df
df = df.sample(frac=1)
df
artist track pos neg neu
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
Solution
Use np.lexsort
import numpy as np
value = 0.8
v = df[['pos', 'neg', 'neu']].values
# np.lexsort treats the last key (here 'neu') as the primary sort key
df.iloc[np.lexsort(np.abs(v - value).T)]
artist track pos neg neu
4 Sufjan Stevens Casimir Pulaski Day 0.09 0.91 0.0
9 Sufjan Stevens The Only Thing 0.09 0.91 0.0
5 Radiohead Desert Island Disk 0.08 0.92 0.0
0 Sufjan Stevens Should Have Known Better 0.07 0.93 0.0
8 Radiohead Daydreaming 0.05 0.95 0.0
1 Sufjan Stevens To Be Alone With You 0.05 0.95 0.0
11 Elliott Smith Between the Bars 0.03 0.97 0.0
3 Sufjan Stevens Death with Dignity 0.03 0.97 0.0
2 Jeff Buckley Hallelujah 0.39 0.61 0.0
7 Aphex Twin Avril 14th 0.00 1.00 0.0
6 Radiohead Codex 0.00 1.00 0.0
10 Radiohead You And Whose Army? 0.00 1.00 0.0
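One detail of np.lexsort that is easy to miss: the last key in the sequence is the primary sort key, and earlier keys only break ties. The sketch below illustrates this on a small made-up frame and shows how reversing the keys makes the first column primary instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "track": ["a", "b", "c"],
        "pos": [0.1, 0.7, 0.7],
        "neg": [0.9, 0.2, 0.3],
    }
)
value = 0.8

# One row of distances per column after the transpose: [pos, neg].
dist = np.abs(df[["pos", "neg"]].to_numpy() - value).T

# Last key ('neg') is primary: a is closest on neg, then c, then b.
by_neg = np.lexsort(dist)

# Reversed keys make 'pos' primary; b and c tie on pos, and 'neg'
# breaks the tie in favour of c.
by_pos = np.lexsort(dist[::-1])

print(df.iloc[by_neg])
print(df.iloc[by_pos])
```

If the first listed column should matter most, reverse the key order before calling np.lexsort.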