Combine columns from multiple files using awk - awk

I have a set of files in a folder like this:
hsa-miR-106a-5p.filtered.txt hsa-miR-182-5p.filtered.txt hsa-miR-2467-5p.filtered.txt hsa-miR-421.filtered.txt hsa-miR-592.filtered.txt
hsa-miR-106b-3p.filtered.txt hsa-miR-183-3p.filtered.txt hsa-miR-25-3p.filtered.txt hsa-miR-424-3p.filtered.txt hsa-miR-615-3p.filtered.txt
hsa-miR-106b-5p.filtered.txt hsa-miR-183-5p.filtered.txt hsa-miR-25-5p.filtered.txt hsa-miR-424-5p.filtered.txt hsa-miR-625-3p.filtered.txt
hsa-miR-1180-3p.filtered.txt hsa-miR-188-5p.filtered.txt hsa-miR-27a-3p.filtered.txt hsa-miR-431-5p.filtered.txt hsa-miR-625-5p.filtered.txt
hsa-miR-1246.filtered.txt hsa-miR-18a-3p.filtered.txt hsa-miR-27a-5p.filtered.txt
Each file looks like this:
ENSG00000224531.4 SMIM13 ENST00000416247.2 9606 hsa-miR-135b-5p 3 132 139 -0.701 99 -0.701 99
ENSG00000112357.8 PEX7 ENST00000541292.1 9606 hsa-miR-135b-5p 3 428 435 -0.683 99 -0.640 99
ENSG00000138279.11 ANXA7 ENST00000372921.5 9606 hsa-miR-135b-5p 3 205 212 -0.631 99 -0.631 99
ENSG00000135248.11 FAM71F1 ENST00000315184.5 9606 hsa-miR-135b-5p 3 488 495 -0.581 99 -0.581 99
ENSG00000087302.4 C14orf166 ENST00000556760.1 9606 hsa-miR-135b-5p 3 34 41 -0.566 99 -0.566 99
ENSG00000104722.9 NEFM ENST00000433454.2 9606 hsa-miR-135b-5p 3 25 32 -0.565 99 -0.565 99
ENSG00000132485.8 ZRANB2 ENST00000254821.6 9606 hsa-miR-135b-5p 3 284 291 -0.566 99 -0.565 99
ENSG00000185127.5 C6orf120 ENST00000332290.2 9606 hsa-miR-135b-5p 3 125 132 -0.564 99 -0.553 99
I would like to combine the rows from all files so that I get one output file that contains all the rows.
I have been playing around with R, but decided I wanted to try awk:
for f in *.filtered.txt
do
....
done

Maybe this can help you (R solution):
files <- dir(pattern="\\.txt$")        # Get the file list and put it in a vector
df <- NULL                             # Initialize the variable that will hold the data frame
for(f in files){
  if(is.null(df)) {                    # If the data frame is still empty ...
    df <- read.table(f)                # ... read the first file in the list
  } else {                             # ... otherwise ...
    df <- rbind(df, read.table(f))     # ... bind the data frame with the rows
                                       #     from the next file
  }
}
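If you ever want the same thing without awk or R, here is a minimal Python/pandas sketch of the same read-and-stack idea (it assumes all the *.filtered.txt files are whitespace-separated, have the same columns and no header row; the output file name is just a placeholder):
import glob
import pandas as pd

# Read every *.filtered.txt in the current directory into a list of data frames
frames = [pd.read_csv(f, sep=r"\s+", header=None) for f in glob.glob("*.filtered.txt")]
# Stack all rows into one table and write it out tab-separated
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("all_filtered_combined.txt", sep="\t", header=False, index=False)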

Related

Is there another way to solve this with pandas set_option?

I'm analyzing a data frame and want to inspect it in more detail, but even though I searched for solutions on Google, I don't understand why the result does not change. What is the problem?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Import data
df = pd.read_csv(r"C:\Users\Administrator\Desktop\medical.txt")
pd.set_option("display.max_rows", 50)
pd.set_option('display.max_columns', 15)
print(df)
id age gender height weight ap_hi ap_lo cholesterol gluc
0 0 18393 2 168 62.0 110 80 1 1
1 1 20228 1 156 85.0 140 90 3 1
2 2 18857 1 165 64.0 130 70 3 1
3 3 17623 2 169 82.0 150 100 1 1
4 4 17474 1 156 56.0 100 60 1 1
... ... ... ... ... ... ... ... ...
69995 99993 19240 2 168 76.0 120 80 1 1
69996 99995 22601 1 158 126.0 140 90 2 2
69997 99996 19066 2 183 105.0 180 90 3 1
69998 99998 22431 1 163 72.0 135 80 1 2
69999 99999 20540 1 170 72.0 120 80 2 1
Look at https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html, in the "Frequently used options" chapter.
You can see that if "max_rows" is lower than the total number of rows in your dataframe, then the output is truncated like in your results.
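For example (assuming a recent pandas version, where the display.min_rows option exists), once the frame is longer than max_rows the truncated view only shows display.min_rows rows (10 by default), which is why raising max_rows to 50 alone does not change the output; a small sketch:
import numpy as np
import pandas as pd

# Hypothetical frame longer than display.max_rows, standing in for medical.txt
df = pd.DataFrame({"id": np.arange(70000),
                   "age": np.random.randint(10000, 25000, 70000)})

pd.set_option("display.max_rows", 50)   # the frame is still longer than this ...
pd.set_option("display.min_rows", 50)   # ... so this controls how many rows the truncated repr shows
print(df)                               # 50 rows are shown instead of the default 10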
Below is a copy-paste of the interesting part of the link I gave you:
If you need a way to display enough columns:
pd.set_option('display.width', 1000)
or
pd.set_option('display.width', None)
But for rows, maybe you only need
df.head(50)
or
df.tail(50)
or, to display all rows:
pd.set_option("display.max_rows", None)
Why that setting seems useless:
The second parameter is not the maximum number of rows that can be viewed, but an internal template parameter.
The code is as follows:
set_option = CallableDynamicDoc(_set_option, _set_option_tmpl)
CallableDynamicDoc:
class CallableDynamicDoc:
    def __init__(self, func, doc_tmpl):
        self.__doc_tmpl__ = doc_tmpl
        self.__func__ = func

    def __call__(self, *args, **kwds):
        return self.__func__(*args, **kwds)

    @property
    def __doc__(self):
        opts_desc = _describe_option("all", _print_desc=False)
        opts_list = pp_options_list(list(_registered_options.keys()))
        return self.__doc_tmpl__.format(opts_desc=opts_desc, opts_list=opts_list)

How to read a file in python?

I have a file in the .lammpstrj format which has 15 million rows of data points with the following pattern:
id mol type x y z
100000
ITEM: NUMBER OF ATOMS
3000
ITEM: BOX BOUNDS pp pp pp
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
ITEM: ATOMS id mol type x y z
2325 775 1 4.5602 4.35401 4.16348
718 240 2 4.91829 3.19545 7.30041
2065 689 2 6.51189 1.25778 5.11324
639 213 1 6.84357 5.10011 0.530398
720 240 1 5.46433 3.36715 6.48044
694 232 2 0.107046 3.3119 7.42581
1855 619 2 6.17236 4.57208 5.02607
1856 619 1 6.65988 5.13298 5.69518
I want to store all this data in a pandas DataFrame. How can I do this?
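A minimal sketch of one way to do this with plain Python plus pandas, assuming every frame carries an "ITEM: ATOMS id mol type x y z" header line as in the snippet above (the file name below is a placeholder):
import pandas as pd

columns = ["id", "mol", "type", "x", "y", "z"]
rows = []
with open("dump.lammpstrj") as fh:       # placeholder file name
    in_atoms = False
    for line in fh:
        if line.startswith("ITEM:"):
            # Atom records follow an "ITEM: ATOMS ..." header; any other
            # "ITEM:" header (TIMESTEP, NUMBER OF ATOMS, BOX BOUNDS) ends a block
            in_atoms = line.startswith("ITEM: ATOMS")
            continue
        if in_atoms:
            rows.append(line.split())

df = pd.DataFrame(rows, columns=columns).astype(
    {"id": int, "mol": int, "type": int, "x": float, "y": float, "z": float})
print(df.head())
For 15 million rows this will be slow but workable; reading the file in chunks or applying numpy.loadtxt to the atom blocks would be faster, at the cost of a slightly longer script.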

How to combine two groupby into one

I have two GroupBy results:
The first one:
ser2 = ser.groupby(pd.cut(ser, 10)).sum()
(-2620.137, 476638.7] 12393813
(476638.7, 951152.4] 9479666
(951152.4, 1425666.1] 14381033
(1425666.1, 1900179.8] 5113056
(1900179.8, 2374693.5] 4114429
(2374693.5, 2849207.2] 4929537
(2849207.2, 3323720.9] 0
(3323720.9, 3798234.6] 0
(3798234.6, 4272748.3] 3978230
(4272748.3, 4747262.0] 4747262
And the second:
ser1= pd.cut(ser, 10)
print(ser1.value_counts())
(-2620.137, 476638.7] 110
(476638.7, 951152.4] 15
(951152.4, 1425666.1] 12
(1425666.1, 1900179.8] 3
(2374693.5, 2849207.2] 2
(1900179.8, 2374693.5] 2
(4272748.3, 4747262.0] 1
(3798234.6, 4272748.3] 1
(3323720.9, 3798234.6] 0
(2849207.2, 3323720.9] 0
Question: Is there a way to combine these operations into one piece of code and get both calculations in the same table?
Use GroupBy.agg and, instead of value_counts, use GroupBy.size:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(['sum','size'])
print (df)
sum size
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
If you need custom column names:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg([('col1','sum'),('col2','size')])
print (df)
col1 col2
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10

Select rows after certain condition

Select the rows after the minimum value found.
Input file
22 101 5
23 102 5
24 103 5
25 104 23
26 105 25
27 106 21
28 107 20
29 108 8
30 109 6
31 110 7
To work on my problem, I tried computing the difference of consecutive values in column 3 (as a new column 4) and printing the lines starting from the minimum value found in column 4; in this case, after row 7.
awk '{$4 = $3 - prev3; prev3 = $3; print $0}' file
22 101 5 5
23 102 5 0
24 103 5 0
25 104 23 18
26 105 25 2
27 106 21 -4
28 107 20 -1
29 108 8 -12
30 109 6 -2
31 110 7 1
Desired Output
29 108 8
30 109 6
31 110 7
I believe there is a better and easier way to get the same output.
Thanks in advance.
You need to process the same file twice:
Find out the line number of the min value
Print the line and the lines after it
Like this:
awk 'NR==FNR{v=$3-prev3;prev3=$3;if(NR==2||v<m){m=v;ln=NR};next}FNR>=ln' file file
Explanation:
# This condition is true as long as we process the file the first time
NR==FNR {
# Your calculation
v=$3-prev3
prev3=$3
# If NR==2, i.e. in row 2, initialize m and ln.
# Otherwise check whether v is the new minimum and, if so, set m and ln.
if(NR==2 || v<m){
# Set m and ln when v is the new minimum
m=v
ln=NR
}
next # Skip the conditional below
}
# This condition will only be evaluated when we parse the file
# the second time (because of the "next" statement above).
# When the line number is greater than or equal to "ln", print it.
# (print is the default action)
FNR>=ln

How to insert two lines for every data frame using awk?

I have repeating data as follows
....
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
5 4 3 16 22 247 0 40168 40911 40944 40205 40000 40562
6 4 4 17 154 93 309 0 40930 40919 40903 40917 40852 40000 40419
7 3 2 233 311 0 40936 40932 40874 40000 40807
....
This data is made up of 115 data blocks, and each data block has 4000 lines in that format.
Here, I would like to put two new lines (the number of lines per data block, 4000, and an empty line) at the beginning of each data block, so it looks like this:
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
3999 2 2 4079 4081 0 40873 40873 40746 40000 40634
4000 1 1 4080 0 40873 40923 40000 40345
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
Can I do this with awk or any other unix command?
My solution is more general, since the blocks can be of unequal length, as long as the first-field counter restarts to denote the beginning of a new block:
% cat mark_blocks
$1<count { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
# executed for each line
{ l[$1] = $0; count=$1}
END { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
% awk -f mark_blocks your_data > marked_data
%
The way it works is simple: awk accumulates lines in memory and prints the header lines plus the accumulated data whenever it reaches a new block or EOF.
The (modest) trick is that the output action must take place before the usual per-line work.
A simple awk one-liner can serve the purpose:
awk 'NR%4000==1{print "4000\n"} {print$0}' file
What it does:
print $0 prints every line.
NR%4000==1 selects the first line of every 4000-line block (lines 1, 4001, 8001, ...). When it matches, it prints 4000 followed by a newline \n, i.e. two new lines.
NR is the number of records, which is effectively the number of lines read so far.
A simple test that inserts 4000 every 5 lines:
awk 'NR%5==1{print "4000\n"} {print$0}'
output:
4000
1
2
3
4
5
4000
6
7
8
9
10
4000
11
12
13
14
15
4000
16
17
18
19
20
4000
You can do it all in bash:
cat $FILE | ( let countmax=4000; let count=countmax; while read lin ; do if [ $count == $countmax ]; then let count=0; echo -e "$countmax\n" ; fi ; echo $lin ; let count=count+1 ; done )
Here we assume you are reading this data from $FILE. Then all we are doing is reading from the file and piping it into our little bash script.
The bash script reads lines one by one (with the while read lin), and increments the counter count for each line. When starting, or when the counter count reaches the value countmax (set to 4000), it prints out the two lines you asked for.