Extract two different types of values from a file and print them to an output file - awk

I have a file where the data looks like:
sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcription-activation domain
A=9 C=2 D=3 E=4 F=2 G=15 I=3 K=3 L=9 M=3 N=5 P=2 Q=11 R=8 S=12 T=6 V=8 W=1 Y=5
Amino acid alphabet = 19
Sequence length = 115
sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusion sbmate
A=10 C=2 D=4 E=4 F=2 G=15 H=1 I=2 K=4 L=7 M=2 N=5 P=3 Q=6 R=4 S=18 T=7 V=10 W=5 Y=10
Amino acid alphabet = 20
Sequence length = 126
sp_0021_LgLlamabotulinum neurotoxin BoNT serotype F
A=14 C=2 D=4 E=5 F=4 G=15 I=2 K=3 L=6 M=2 N=6 P=4 Q=7 R=8 S=13 T=10 V=8 W=3 Y=10
Amino acid alphabet = 19
Sequence length = 131
I want to extract the values of 'Amino acid alphabet' and 'Sequence length' into an output file, and it should look like:
19 115
20 126
19 131
As I am new to bash, all I could try so far is:
grep -i "Amino acid alphabet = $i" test.txt >>out.txt
But, I don't want the word "Amino acid alphabet" in the output. I only want the values of "Amino acid alphabet" and "Sequence length" as two columns.
Can I get any help with how to do that? Thanks in advance.

$ awk -v RS= '{print $(NF-4), $NF}' file
19 115
20 126
19 131
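A variant of the same paragraph-mode idea that keys on the line layout instead of counting fields from the end (a sketch assuming each record is always those four lines in that order):
awk -v RS= -F'\n' '{
    split($3, a, "= ")    # "Amino acid alphabet = 19"
    split($4, b, "= ")    # "Sequence length = 115"
    print a[2], b[2]
}' file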

Assuming both fields exist for all your records:
awk '/^Amino acid alphabet/{printf $NF FS} /^Sequence length/{print $NF}' file
19 115
20 126
19 131
You may also want to read an introduction to awk on the awk wiki.

Your command, grep -i "Amino acid alphabet = $i" test.txt >>out.txt, includes the shell expansion of $i. If you have not given i a value, the search pattern resolves to Amino acid alphabet = and will therefore find every line that contains that text. If $i had a value, it would narrow the search pattern accordingly.
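For illustration, a throwaway sketch (not part of the solution) of how the expansion changes the pattern:
# With i unset (or empty), the pattern is just the literal prefix,
# so every "Amino acid alphabet = ..." line matches:
unset i
grep -i "Amino acid alphabet = $i" test.txt
# With i set, only lines with that exact value match:
i=19
grep -i "Amino acid alphabet = $i" test.txt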
There are many ways to get what you want in bash. One is to use grep with PCRE (Perl-style) regexes enabled:
grep -Po "(?<=Amino acid alphapbet = )\d+" test.txt >> out.text
#yields:
19
20
19
(?<=string) tells grep that for the rest to match, it must have been preceded by string, but string is not part of the match. -Po combines two options: -P enables PCRE (Perl-style) regexes and -o prints only the match, rather than the whole line in which there was a match.
Note that the output redirect >> appends to the file if it already contains lines; > will overwrite an existing file (without asking for confirmation!).
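If you want both numbers as two columns with this grep approach, one sketch (assuming every record contains both lines, in the same order) is to run the lookbehind twice and paste the results side by side:
paste -d' ' \
  <(grep -Po '(?<=Amino acid alphabet = )\d+' test.txt) \
  <(grep -Po '(?<=Sequence length = )\d+' test.txt) > out.txt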

sed can do this too.
sed -En '/^Amino acid alphabet =/h; /^Sequence length =/{ H; x; s/[^0-9]+/ /g; s/^ //; p; }' infile > outfile
/^Amino acid alphabet =/h stores the first line in the save buffer.
/^Sequence length =/{ triggers all the steps inside the curlies.
H adds the current line to the save buffer.
x swaps the save buffer back to the workspace.
s/[^0-9]+/ /g; changes every sequence of non-digits to a single space.
This includes the newline.
s/^ //; removes the leading space.
p prints the output line for this data set.

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk, checking that the row number is greater than 3 and then checking for an even row number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You do not need cat when using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should consider whether you need that part at all: your sample input does not contain a pipe character, so the substitution would do nothing. You can drop that part if the lines holding the values you want to print never contain that character.
The output is empty because NR%2==4 never holds: the remainder of division by x is always smaller than x (in the particular case of %2, only two values are possible: 0 and 1).
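A throwaway check makes this easy to see, since NR%2 only ever yields 0 or 1:
$ seq 6 | awk '{print NR, NR%2}'
1 1
2 0
3 1
4 0
5 1
6 0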
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce backslashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. \2 holds the last value captured by that back reference, which, in conjunction with the {3}, is the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

awk command or sed command

000Bxxxxx111118064085vxas - header
10000000001000000000053009-000000000053009-
10000000005000000000000000+000000000000000+
10000000030000000004025404-000000004025404-
10000000039000000000004930-000000000004930-
10000000088000005417665901-000005417665901-
90000060883328364801913 - trailer
In the above file we have a header and a trailer, and the records that start with 1 are the detail records.
In the detail records, I want to sum the values starting from position 28 up to 44, including the sign, using an awk/sed command.
Here is sed, with help from bc to do the arithmetic:
sed -rn '
/header|trailer/! {
s/[[:digit:]]*[+-]([[:digit:]]+)([+-])$/\2\1/
H
}
$ {
x
s/\n//gp
}
' file | bc
I assume the +/- sign follows the number.
Using awk, we can solve this problem by making use of substr:
substr(s, m[, n ]):
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
This lets us take the substring that represents the number. Here I assumed that the signs before and after the number are the same and thus give the sign of the number:
$ echo "10000000001000000000053009-000000000053009-" \
| awk '{print length($0); print substr($0,27,43-27)}'
43
-000000000053009
Since awk implicitly converts strings to numbers when you do numeric operations on them, we can write the following awk code to achieve the requested result:
$ awk '/header|trailer/{next}
{s+=substr($0,27,43-27)}
END{print s}' file.dat
-5421749244
Or in a single line:
$ awk '/header|trailer/{next}{s+=substr($0,27,43-27)} END{print s}' file.dat
-5421749244
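As an aside, a quick throwaway check of the implicit string-to-number conversion mentioned above:
$ echo '-000000000053009' | awk '{print $1 + 0}'
-53009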
The above examples work on the example file given by the OP. However, if you have a file containing multiple blocks with headers and trailers and you only want to use the text inside these blocks (excluding everything outside of them), then you should handle it a bit differently:
$ awk '/header/{s=0;c=1;next}
/trailer/{S+=s;c=0;next}
c{s+=substr($0,27,43-27)}
END{print S}' file.dat
Here we do the following:
If a line with header is found, reset the block sum s to ZERO and set c=1 indicating that we take the next lines into account
If a line with trailer is found, add the block sum s to the overall sum S and set c=0 indicating to ignore the lines.
If c is non-zero, add the line's value to the block sum s
At the END, print the total sum S
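A small variation on that sketch (my own addition, not from the original answer) also prints each block's subtotal when its trailer is reached, which can help when checking individual blocks:
awk '/header/ {s=0; c=1; next}
     /trailer/{print "block sum:", s; S+=s; c=0; next}
     c        {s+=substr($0,27,43-27)}
     END      {print "total:", S}' file.dat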

Comparing two files based on 1st column, printing the unique part of one file

I have two files looking like this:
file1:
RYR2 29 70 0.376583106063 4.77084855376
MUC16 51 94 0.481067457376 3.9233164551
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
file2:
A26B2
RYR2
MUC16
ACTL9
I need to compare them based on first column and print only those lines of first file that are not in second, so the output should be:
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
I tried with grep:
grep -vFxf file2 file1
with awk:
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file2 file1
comm:
comm -23 <(sort file1) <(sort file2)
nothing works
You can use
grep -vFf file2 file1
Also, grep -vf file2 file1 will work too, but if the file2 strings contain * or [ that should be read as literal characters, you might get into trouble since they would need to be escaped. -F makes grep treat those strings as fixed strings.
NOTES
-v: Invert match.
-f file: Take regexes from a file.
-F: Interpret the pattern as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
So it reads the patterns from file2 and applies them to file1, and once it finds a match, that line is not output because of the inverted search. This is enough here because the file2 strings can only match the first column: only the first column contains such alphanumeric names, the remaining columns are purely numeric.
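If you want to be strict about matching on the first column only (rather than relying on the shape of the data), here is a hedged sketch along the lines of your second attempt:
# Read file2 first and remember its (only) column, then print
# file1 lines whose first column was not seen in file2:
awk 'NR==FNR {keep[$1]; next} !($1 in keep)' file2 file1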
Why your command did not work
The -x option (short for --line-regexp) means "Select only those matches that exactly match the whole line", so your whole-line patterns from file2 could never match the longer file1 lines.
Also, see more about grep options in grep documentation.

Print rows that have numbers in them

This is my data - I have more than 1000 rows. How do I get only the records with numbers in them?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to Linux programming; any help would be highly useful to me. Thanks all.
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/'
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file
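For comparison, a plain grep sketch of the same idea as the [- 0-9] filter above (only digits, spaces and hyphens allowed after the |):
grep -E '\|[- 0-9]+$' file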

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that starts with XF:Z:. This will give me output 1.
The file names are htseqoutput*.sam.sam, where * varies. I am not sure about the awk function to use, and whether the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part), but be careful with your quotes: straight quotes (") should be used, not curly quotes (“ ”).
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
You can try this, if the column with "XF:Z:" is always at the end:
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
You get:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Or, if this column is at a variable position in each file:
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
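Putting one of these one-liners back into your loop would then look something like this (a sketch assuming the XF:Z: field is always last, as in your sample, and using straight quotes as noted above):
for f in htseqoutput*.sam.sam
do
    # column 1 plus the value after the last ":" of the final field
    awk 'BEGIN{OFS="\t"} {n=split($NF, a, ":"); print $1, a[n]}' "$f" > "out${f#htseqoutput}"
done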