summarizing the contents of a text file to an other one using awk - awk

I have a big text file with 2 tab separated fields. as you see in the small example every 2 lines have a number in common. I want to summarize my text file in this way.
1- look for the lines that have the number in common and sum up the second column of those lines.
small example:
ENST00000054666.6 2
ENST00000054666.6_2 15
ENST00000054668.5 4
ENST00000054668.5_2 10
ENST00000054950.3 0
ENST00000054950.3_2 4
expected output:
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
as you see the difference is in both columns. in the 1st column there is only one repeat of each common and without "_2" and in the 2nd column the values is sum up of both lines (which have common number in input file).
I tried this code but does not return what I want:
awk -F '\t' '{ col2 = $2, $2=col2; print }' OFS='\t' input.txt > output.txt
do you know how to fix it?

Solution 1st: Following awk may help you on same.
awk '{sub(/_.*/,"",$1)} {a[$1]+=$NF} END{for(i in a){print i,a[i]}}' Input_file
Solution 2nd: In case your Input_file is sorted by 1st field then following may help you.
awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(val){print prev,val}}' Input_file
Use > output.txt at the end of the above codes in case you need the output in a output file too.

If order is not a concern, below may also help :
awk -v FS="\t|_" '{count[$1]+=$NF}
END{for(i in count){printf "%s\t%s%s",i,count[i],ORS;}}' file
ENST00000054668.5 14
ENST00000054950.3 4
ENST00000054666.6 17
Edit :
If the order of the output does matter, below approach using a flag helps :
$ awk -v FS="\t|_" '{count[$1]+=$NF;++i;
if(i==2){printf "%s\t%s%s",$1,count[$1],ORS;i=0}}' file
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4

Related

Problems with awk substr

I am trying to split a file column using the substr awk command. So the input is as follows (it consists of 4 lines, one blank line):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line by the pattern "GATC" but keeping it on the right sub-string like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I want that the last line have the same length as the splitted one and regenerate the file like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
For split the last colum I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN
{FS="\t"; OFS="\t"}\ {gsub("GATC","/tGATC", $2); {split ($2, a, "\t")};\ for
(i in a) print substr($4, length(a[i-1])+1,
length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Being the second and third line longer that expected.
I check the calculated length that are passed to the substr command and are correct:
1 30
31 70
41 45
Using these length the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But as I showed it is not the case.
Any suggestions?
I guess you're looking something like this, but your question formatting is really confusing
$ awk -v OFS='\t' 'NR==1 {next}
NR==2 {n=index($0,"GATC")}
/^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Replace value in particular columns in csv file

I would like to replace the values which are > than 20 in columns 5 and 7 to AAA
input file
9179,22.4,-0.1,22.4,2.6,0.1,2.6,39179
9179,98.1,-1.7,98.11,1.9,1.7,2.55,39179
9179,-48.8,0.5,48.8,-1.2,-0.5,1.3,39179
6121,25,0,25,50,0,50,36121
6123,50,0,50,50,0,50,36123
6125,75,0,75,50,0,50,36125
output desired
9179,22.4,-0.1,22.4,2.6,0.1,2.6,39179
9179,98.1,-1.7,98.11,1.9,1.7,2.55,39179
9179,-48.8,0.5,48.8,-1.2,-0.5,1.3,39179
6121,25,0,25,AAA,0,AAA,36121
6123,50,0,50,AAA,0,AAA,36123
6125,75,0,75,AAA,0,AAA,36125
I tried
With this command I replace the values in column 5, how to do it for column 7 too.
awk -F ',' -v OFS=',' '$1 { if ($5>20) $5="AAA"; print}' file
Thanks in advance
here is another take for making the columns set configurable
$ awk -v cols="5,7" 'BEGIN {FS=OFS=","; split(cols,a)}
{for(i in a) if($a[i]>20) $a[i]="AAA"}1' file
9179,22.4,-0.1,22.4,2.6,0.1,2.6,39179
9179,98.1,-1.7,98.11,1.9,1.7,2.55,39179
9179,-48.8,0.5,48.8,-1.2,-0.5,1.3,39179
6121,25,0,25,AAA,0,AAA,36121
6123,50,0,50,AAA,0,AAA,36123
6125,75,0,75,AAA,0,AAA,36125
awk 'BEGIN{FS=OFS=","} $5>20{$5="AAA"} $7>20{$7="AAA"}1' file
9179,22.4,-0.1,22.4,2.6,0.1,2.6,39179
9179,98.1,-1.7,98.11,1.9,1.7,2.55,39179
9179,-48.8,0.5,48.8,-1.2,-0.5,1.3,39179
6121,25,0,25,AAA,0,AAA,36121
6123,50,0,50,AAA,0,AAA,36123
6125,75,0,75,AAA,0,AAA,36125
You can use two {..} for multiple checks and action

Using awk pattern to file filter data

I have the folling file(named /tmp/test99) which containd the rows:
"0","15","wall15"
123132,09808098,"0","15"
I am trying to filter the rows that contains "0" in the 3rd place, and "15" in 4th place (like in the second row)
I tried running:
cat /tmp/test99 | awk '/"0","15"/{print>"/tmp/0_15_file.out"} '
but instead of getting only the second row, I get also the first row starting with "0","15".
Could you please help with the pattern ?
Thanks:)
You may check if Fields 3 and 4 are equal to some hardcoded value using
awk -F, '$3=="\"0\"" && $4=="\"15\""'
Set the field separator to a comma and then, if Field 3 is "0" and Field 4 is "15" print the line, else discard.
See the online demo:
s='"0","15","wall15"
123132,09808098,"0","15"'
awk -F, '$3=="\"0\"" && $4=="\"15\""' <<< "$s"
# => 123132,09808098,"0","15"
Could you please try following.(comment on your effort, you need NOT to use cat with awk it could read Input_file by itself)
awk -F, '$3!~/\"0\"/ && $4!~/\"15\"/' Input_file

remove all fields from 2nd col which is not 5 consecutive numerical digits

Record | RegistrationID
41-1|10551
1-105|5569
4-7|10043
78-3|2176
3-1|19826
12-1|1981
Output file has to
Record | RegistrationID
1-1|10551
3-1|19826
5-7|10043
My file is a Pipe delimited
any number in the 2nd col which is less than or more than 5lenght must be removed i.e only records that have 5 consecutive numbers must remain.I'm with google since an hour to fix this out any advice given would be highly appreciable. thanks in advance
tried this grep -E ' [0-9]{5}$|$' filename - > not getting any results ,tx to cyrus
If this doesn't do what you want:
$ awk '(NR==1) || ($NF~/^[0-9]{5}$/)' file
Acno | Zip
high | 12345
tyty | 19812
then your real input file simply does not match the format that you provided in your example and you'd have to follow up on that yourself to figure out the difference and post more truly representative sample input if you want more help.
Given your updated input file with no spaces around the |s:
$ awk -F'|' '(NR==1) || ($NF~/^[0-9]{5}$/)' file
Acno | Zip
45775-1|10551
2734455-7|10043
167115-1|19826
If you REALLY have leading white space in your input that you want to remove from the output that's easily done but I'm going to assume for now that you actually don't really have that situation and it's just more mistakes in your posted sample input file.
With gawk 3.1.7 as the OP has (see comments below):
awk --re-interval -F'|' '(NR==1) || ($NF~/^[0-9]{5}$/)' file
If your columns (fields) are |-separated, may contain spaces, and the filtering criteria is exactly 5 digits in the second field, then try this:
awk -F'|' '$2 ~ /^[ ]*[0-9]{5}[ ]*$/' file
Additionally, to pass-through the header (first) line in addition:
awk -F'|' 'NR==1 || $2 ~ /^[ ]*[0-9]{5}[ ]*$/' file
Add --re-interval option to support the interval expression in the regular expression.
gawk --re-interval -F'|' '$NF~/^[0-9]{4,5}$/' file

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line sed '$d'.
This seems for me a bit of duplicate work, surely it is possible to do the above only with awk - without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of contents of the line to p if the first field is greater than 3.