awk merging 2 columns and adding an extra column to txt file [duplicate] - awk

This question already has answers here:
Why does my tool output overwrite itself and how do I fix it?
(3 answers)
Closed 3 years ago.
I did this in the past without problems, but I can't this time and I don't understand why.....
My original files is
1002 10214
1002 10220
1002 10222
1002 10248
1002 10256
I need to make a new file where the 2 columns above are merged and add a second column with value 1
Desired output should look like this
100210214 1
100210220 1
100210222 1
100210248 1
100210256 1
I tried the below awk commands to first print the 2 columns into 1 into a tmp file, then adding the extra column with "1"
cat input.txt | awk '{ print ($1$2)}' > tmp1.txt
cat tmp1.txt | awk ' {print $0, (1) }' > output.txt
While the first command seems working ok, the second does not
tmp1.txt (OK)
100210214
100210220
100210222
100210248
100210256
output.txt (not OK)
10210214
10210220
10210222
10210248
10210256
The "1"comes in the front of the first column, not sure why, even replacing the first 2 characters. Is it because the original input file is different (may be "space" was used instead of tab)?

Could you please try following.
awk 'BEGIN{OFS="\t"} {sub(/\r$/,"");print $1 $2,"1"}' Input_file

This happens when input file has Windows line endings (i.e. \r\n). You can fix it using this command:
dos2unix file
and then get the desired output with this one:
awk '{$1=$1$2;$2=1}1' file

Related

How to obtain the difference between 2 columns with awk in a tsv

I have a kind of the following tsv file:
Hi 10 6
hello 7 1
Hi 6 2
Hence, the related output should be:
Hi 10 6 4
hello 7 1 6
Hi 6 2 4
I'd like to create a 4th column with the difference between the 2 columns $2 and $3, but in an absolute value.
I am trying with the following line, but with no right result:
awk -F\t '{print $0 OFS $2-$3}' file
How can I do?
Marco
When you write -F\t without quotes you're inviting the shell to interpret the string and so the shell will read the backslash and leave you with -Ft. In some awks -Ft means "use a tab as the separator" while in others it means "use the character t as the separator". I assume since you say your file is a TSV that you want it to use tabs so just write your code to not invite the shell to interpret it, i.e. -F'\t' instead of -F\t.
After that to get the absolute value is obvious and well-covered in many postings:
$ awk 'BEGIN{FS=OFS="\t"} {print $0, ($2>$3 ? $2-$3 : $3-$2)}' file
Hi 10 6 4
hello 7 1 6
Hi 6 2 4
I changed from using -F'\t' to FS="\t" because both FS and OFS need to be set to "\t" and so setting them both together to the same character instead of separately avoids duplication.

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
with using awk (or others)
I want to extract rows with only SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk 'match($0, /SRMAPQ=([0-9]+)/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) } v > 60' file
I would use GNU AWK following way let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: added SOMETHING to test if it would work when SRMAPQ is not last. Explantion: I set FS to SRMAPQ= thus what is before that becomes first field ($1) and what is behind becomes second field ($2). In 2nd line this is 67;SOMETHING=2 with which GNU AWK copes by converting its' longmost prefix which constitute number in this case 67, other lines have just numbers. Disclaimer: this solution assumes that all but last field have trailing ;, if this does not hold true please test my solution fully before usage.
(tested in gawk 4.2.1)

summarizing the contents of a text file to an other one using awk

I have a big text file with 2 tab separated fields. as you see in the small example every 2 lines have a number in common. I want to summarize my text file in this way.
1- look for the lines that have the number in common and sum up the second column of those lines.
small example:
ENST00000054666.6 2
ENST00000054666.6_2 15
ENST00000054668.5 4
ENST00000054668.5_2 10
ENST00000054950.3 0
ENST00000054950.3_2 4
expected output:
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
as you see the difference is in both columns. in the 1st column there is only one repeat of each common and without "_2" and in the 2nd column the values is sum up of both lines (which have common number in input file).
I tried this code but does not return what I want:
awk -F '\t' '{ col2 = $2, $2=col2; print }' OFS='\t' input.txt > output.txt
do you know how to fix it?
Solution 1st: Following awk may help you on same.
awk '{sub(/_.*/,"",$1)} {a[$1]+=$NF} END{for(i in a){print i,a[i]}}' Input_file
Solution 2nd: In case your Input_file is sorted by 1st field then following may help you.
awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(val){print prev,val}}' Input_file
Use > output.txt at the end of the above codes in case you need the output in a output file too.
If order is not a concern, below may also help :
awk -v FS="\t|_" '{count[$1]+=$NF}
END{for(i in count){printf "%s\t%s%s",i,count[i],ORS;}}' file
ENST00000054668.5 14
ENST00000054950.3 4
ENST00000054666.6 17
Edit :
If the order of the output does matter, below approach using a flag helps :
$ awk -v FS="\t|_" '{count[$1]+=$NF;++i;
if(i==2){printf "%s\t%s%s",$1,count[$1],ORS;i=0}}' file
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4

awk: Search missing value in file

awk newbie here! I am asking for help to solve a simple specific task.
Here is file.txt
1
2
3
5
6
7
8
9
As you can see a single number (the number 4) is missing. I would like to print on the console the number 4 that is missing. My idea was to compare the current line number with the entry and whenever they don't match I would print the line number and exit. I tried
cat file.txt | awk '{ if ($NR != $1) {print $NR; exit 1} }'
But it prints only a newline.
I am trying to learn awk via this small exercice. I am therefore mainly interested in solutions using awk. I also welcome an explanation for why my code does not do what I would expect.
Try this -
awk '{ if (NR != $1) {print NR; exit 1} }' file.txt
4
since you have a solution already, here is another approach, comparing with previous values.
awk '$1!=p+1{print p+1} {p=$1}' file
you positional comparison won't work if you have more than one missing value.
Maybe this will help:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
4
Since I am learning awk as well
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
`awk` explanation read each number from `$1` into array `i` and increment that number list line by line with `i++`, if the number is not sequential, then print it.
cat file
1
2
3
5
6
7
8
9
11
12
13
15
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
10
14

Extracting block of data from a file

I have a problem, which surely can be solved with an awk one-liner.
I want to split an existing data file, which consists of blocks of data into separate files.
The datafile has the following form:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
And i want to store every single block of data in a separate file, named - for example - "1.dat", ".dat", "3.dat",...
The problem is, that each block doesn't have a specific line number, they are just delimited by two "new lines".
Thanks in advance,
Jürgen
This should get you started:
awk '{ print > ++i ".dat" }' RS= file.txt
If by two "new lines" you mean, two newline characters:
awk '{ print > ++i ".dat" }' RS="\n\n" file.txt
See how the results differ? Setting a null RS (i.e. the first example) is probably what you're looking for.
Another approach:
awk 'NF != 0 {print > $1 ".dat"}' file.txt