Extract multiple columns and add null character in between - awk

I have a file with the following format:
TRINITY_DN119001_c0_g1_i1 4 * 0 0 * * 0 0 GAGCCTCCCTCATGAATGTACCAGCATTTACCTCATAAAGAGCT * XO:Z:NM
TRINITY_DN119037_c0_g1_i1 4 * 0 0 * * 0 0 TAAGATTAGGTTGTATTCCAG * XO:Z:NM
TRINITY_DN119099_c0_g1_i1 4 * 0 0 * * 0 0 AGGCAGGCGCTAAACGATTTGCATTTCTCTAATGATTACGCCAG * XO:Z:NM
I am trying to extract the 1st and 10th columns and store them in the following format (output file):
>TRINITY_DN119001_c0_g1_i1
GAGCCTCCCTCATGAATGTACCAGCATTTACCTCATAAAGAGCT
>TRINITY_DN119037_c0_g1_i1
TAAGATTAGGTTGTATTCCAG
>TRINITY_DN119099_c0_g1_i1
AGGCAGGCGCTAAACGATTTGCATTTCTCTAATGATTACGCCAG
I am using the following commands for now:
cut -d " " -f1,10 in.txt > out.txt
sed 's/^/>/' out.txt
but I am unable to figure out how to get the above output.

You may use awk:
awk '{printf ">%s\n%s\n", $1, $10}' file
>TRINITY_DN119001_c0_g1_i1
GAGCCTCCCTCATGAATGTACCAGCATTTACCTCATAAAGAGCT
>TRINITY_DN119037_c0_g1_i1
TAAGATTAGGTTGTATTCCAG
>TRINITY_DN119099_c0_g1_i1
AGGCAGGCGCTAAACGATTTGCATTTCTCTAATGATTACGCCAG
Note that in your sample input the sequence is the 10th field, not the 9th.
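For comparison, your own cut/sed attempt can also be completed by splitting each extracted line onto two lines. A minimal sketch, assuming GNU sed (for the \n in the replacement) and single-space-separated fields:
cut -d ' ' -f1,10 in.txt | sed 's/^/>/; s/ /\n/' > out.txt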

If your data is in a file named d, try GNU sed:
sed -E 's/^(TRINITY_DN\S+).*\s([ACGT]+).*/>\1\n\2/' d

Related

Filter a file removing lines just with all 0

I need to remove rows from a file with all "0" in the different columns.
Example:
        seq_1  seq_2  seq_3
data_0      0      0      1
data_1      0      1      4
data_2      0      0      0
data_3      6      0      2
From the example, I need a new file with just the data_2 row, because it has all "0" values.
I tried using grep and awk, but I don't know how to filter just on columns $2 to $4.
$ awk 'FNR>1{for(i=2;i<=NF;i++)if($i!=0)next}1' file
Explained:
$ awk 'FNR>1 {                 # process all data records
    for(i=2;i<=NF;i++)         # loop all data fields
        if($i!=0)              # once a non-0 field is found
            next               # on to the next record
}1' file                       # output the header and all-0 records
The output is formatted poorly because the sample data is shown in some kind of table format, which it probably is not in real life:
seq_1 seq_2 seq_3
data_2 0 0 0
With awk you can rely on field string representation:
$ awk 'NR>1 && $2$3$4=="000"' test.txt > result.txt
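Run against the sample data in test.txt, only the all-zero row should survive (whitespace as in the input), e.g.:
$ awk 'NR>1 && $2$3$4=="000"' test.txt
data_2 0 0 0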
Using sed, find lines matching a pattern of one or more spaces followed by a 0 (3 times) and if found print the line.
sed -nr '/\s+0\s+0\s+0/'p file.txt > new_file.txt
Or with awk, if columns 2, 3 and 4 are equal to a 0, print the line.
awk '{if ($2=="0" && $3=="0" && $4=="0"){print $0}}' file.txt > new_file.txt
EDIT: I ran the time command on these a bunch of times and the awk version is generally faster. Could add up if you are searching a large file. Of course your mileage may vary!
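A rough way to reproduce that comparison (just a sketch; big.txt stands for a hypothetical large input file, and results will vary by system):
time sed -nr '/\s+0\s+0\s+0/p' big.txt > /dev/null
time awk '{if ($2=="0" && $3=="0" && $4=="0"){print $0}}' big.txt > /dev/null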

How to swap the last two column pairs with awk?

I am trying this
awk '{B=$(NF-1);A=$NF; $NF=$(NF-2); $(NF-1) = $(NF-3); $(NF-2)=A; $(NF-3) = B; print;}' input_text.txt
but I get the error:
awk: cmd. line:1: (FILENAME=cazzo.txt FNR=2) fatal: attempt to access field -1
Sample input:
$ cat input_text.txt
1 7 9 11 0 5 2
The same happens if I replace the spaces with tabs in the input_text.txt file.
Expected output:
1 7 9 5 2 11 0
I am running with Cygwin on Windows 10.
You can try this awk for swapping values:
awk 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
1 7 9 5 2 11 0
If there are DOS line breaks then use:
awk -v RS='\r?\n' 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
If you have gnu awk then you can use this regex based approach:
awk -v RS='\r?\n' 'NF > 3 {
$0 = gensub(/(\S+\s+\S+)(\s+)(\S+\s+\S+)$/, "\\3\\2\\1", "1")} 1' file
1 7 9 5 2 11 0
To swap the last n fields with the n fields before them:
$ awk -v n=2 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 7 9 5 2 11 0
$ awk -v n=3 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 0 5 2 7 9 11
With your shown samples, please try the following code. It is generic: there are two awk variables, fromFields and toFields. For example, to swap the 4th field with the 6th and the 5th field with the 7th, set fromFields="4,5" and toFields="6,7". It is assumed that the values given make sense for the Input_file.
awk -v fromFields="4,5" -v toFields="6,7" '
BEGIN{
  num1=split(fromFields,arr1,",")
  num2=split(toFields,arr2,",")
}
{
  tmp=""
  for(i=1;i<=num1;i++){
    tmp=$arr1[i]
    $arr1[i]=$arr2[i]
    $arr2[i]=tmp
  }
}
1
' Input_file
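For example, with the sample line 1 7 9 11 0 5 2 from the question and the values fromFields="4,5" and toFields="6,7" shown above, the script should print:
1 7 9 5 2 11 0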

add conditionally a column to a file awk

I have a file that looks like this
--- 1:14567 14567 A C 0 0 1
--- 1:32424 34424 A C 0 0 1
rsid124 35245 C G 0 0 1
What I want is to add a first field (---) only in those rows that don't already have the ---
The desired output will be like this
--- 1:14567 14567 A C 0 0 1
--- 1:32424 34424 A C 0 0 1
--- rsid124 35245 C G 0 0 1
If it is easier, a workaround would be to remove the --- with vim or another text editor, but I would like to know how to add a column depending on the row.
Paraphrased:
What I want is to add a first field --- only in those rows that don't already have it.
This is quite easy to do with awk, just use:
awk '$1 != "---" { $0 = "--- " $0 } { print }' inputFile > outputFile
This will modify all lines where the first column isn't as expected, prefixing them with the desired string. All other lines will be printed as is.
If you don't mind using sed, the following may work:
$ sed -i 's/^\([^-].*\)/--- \1/' file.txt
Otherwise, an awk version would be:
awk '{ if ( $1 == "---" ) { print $0 } else { print "---", $0 } }' file.txt
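A simpler sed sketch that only touches lines not already starting with --- (same result, just another way to express the condition):
sed '/^---/!s/^/--- /' file.txt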

Print every second consequtive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields on lines 6 and 7. This can be done using NR. The tricky part is the following: every second field should go in the second column, and an E should be added before the sign, so that the output file will look like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you can see that I want to keep only the first 10 characters of $6 (length($6)=10).
How is it possible to do this in awk?
You can do it all in awk, but it is perhaps easier with the Unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed or embed something similar in the awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
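For the embedded-gsub variant mentioned above, here is a sketch that does the conversion inside awk, assuming (as in the sample) that no mantissa starts with a minus sign:
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {pair=$i FS $(i+1); gsub(/[+-]/,"E&",pair); print pair}}' file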
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there are various ways to do it (the simplest being gsub(/ /,"",$<field number>) on the relevant fields; see the sketch below), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
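Picking up the note above about stripping the leading blanks, here is a sketch of the same FIELDWIDTHS script with gsub applied to the mantissa fields (saved here as tst2.awk, mirroring tst.awk above; m1 and m2 are just local variable names introduced for the sketch):
$ cat tst2.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        m1=$i; m2=$(i+2)                    # copy the mantissa fields
        gsub(/ /,"",m1); gsub(/ /,"",m2)    # drop the leading blanks
        print m1 "E" $(i+1), m2 "E" $(i+3)
    }
}
$ awk -f tst2.awk file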
I combined @karafka's answer with substr, so the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

awk math: check result before printing

My input is:
cat input
1
4
-2
I want to subtract some value (e.g., 2) from column 1. If the result is > 0, print it; otherwise print 0. In other words, I don't want any negative numbers.
My try is:
awk '{NUMBER=$1-2} if (NUMBER > 0) print NUMBER; else print 0'
But I am probably making some syntax mistake.
Wanted output is:
0
2
0
This can be an option:
$ awk '{$1=$1-2>0?$1-2:0; print $1}' file
0
2
0
$1=$1-2>0?$1-2:0 uses the ternary operator and is read like this:
Set $1 to:
  if ($1-2 > 0)  ==>  $1-2
  else           ==>  0
and then it is printed out with print $1.
With this:
awk '{NUMBER=$1-2} if (NUMBER > 0) print NUMBER; else print 0'
you're putting the if... statement in the condition (pattern) part of the awk rule instead of the action part because it's not enclosed in curly brackets. This is the correct syntax for what you wrote:
awk '{NUMBER=$1-2; if (NUMBER > 0) print NUMBER; else print 0}'
but in reality I'd write it as:
awk '{NUMBER=$1-2; print (NUMBER > 0 ? NUMBER : 0)}'
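An equivalent sketch using a small user-defined helper function (the name max is just for illustration):
awk 'function max(a,b){return a>b?a:b} {print max($1-2, 0)}' input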