How do I print every nth entry of the mth column, starting from a particular line of a file? - awk

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.

You can use awk to check that the row number is greater than 3 and then check for an even row number with NR%2==0.
Note that you don't have to use cat, since awk can read the file itself:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25

Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25

You do not need cat when using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should also consider whether you need that part at all: your sample input contains no pipe character, so there is nothing for it to replace (note too that in GNU sed's basic regular expressions a literal pipe is written |, while \| is the alternation operator). You can drop that part if the lines holding the values you want to print do not contain that character.
The output is empty because NR%2==4 never holds: the remainder of division by x is always smaller than x (for %2 only two values are possible: 0 and 1).
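To generalize the point about the remainder, here is a minimal sketch. The function name every_nth and its argument order are made up for illustration. It subtracts the start line before taking the remainder, so the test also works when the start line is not a multiple of the step.

```shell
# Hypothetical helper: print column <col> of every <n>-th line, starting at line <start>.
# Usage: every_nth START N COL [FILE...]   (reads stdin when no file is given)
every_nth() {
    start=$1; n=$2; col=$3
    shift 3
    # (NR - start) % n is 0 exactly on lines start, start+n, start+2n, ...
    awk -v start="$start" -v n="$n" -v col="$col" \
        'NR >= start && (NR - start) % n == 0 { print $col }' "$@"
}

# Sample data modelled on the question, including the three "$" header lines.
printf '%s\n' '$' '$' '$' \
    'FORCE 10 30 40' '* 1 5 4' \
    'FORCE 11 20 22' '* 2 3 0' \
    'FORCE 19 25 10' '* 16 12 8' | every_nth 4 2 3
```

With the sample data this should print 30, 20 and 25, one per line. Called as every_nth 5 2 2 it would pick the second column of the lines starting with *, which the plain NR%2==0 test cannot do without changing the remainder.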

This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. The \2 represents the last inhabitant of that back reference which in conjunction with the {3} means the above.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

Related

Count b or B in even lines

I need to count the number of times the letter 'b' or 'B' appears in the even lines of file.txt, e.g. for a file.txt like:
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
I got the first part but I need to count these b or B letters and I do not know how to count them together
awk '!(NR%2)' file.txt
$ awk '!(NR%2){print gsub(/[bB]/,"")}' file
4
Could you please try the following, one more approach with awk (written on mobile and not tested yet, but it should work).
awk -F'[bB]' 'NR%2 == 0{print (NF ? NF - 1 : 0)}' Input_file
Thanks to @Ed for solving, in the comments, the problem of lines with zero matches.
In a single awk:
awk '!(NR%2){gsub(/[^Bb]/,"");print length}' file.txt
gsub(/[^Bb]/,"") deletes every character in the line except for B and b.
print length prints the length of the resulting string.
awk '!(NR%2)' file.txt | tr -cd 'Bb' | wc -c
Explanation:
awk '!(NR%2)' file.txt : keep only even lines from file.txt
tr -cd 'Bb' : keep only B and b characters
wc -c : count characters
Example:
With the file below, the result is 4.
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
Here is another way
$ sed -n '2~2s/[^bB]//gp' file | tr -d '\n' | wc -c
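If a single total over all even lines is wanted from awk alone, the per-line gsub counts can be accumulated and printed at the end. A small sketch, where count_b is a hypothetical wrapper name:

```shell
# Hypothetical helper: total number of b/B characters on even-numbered lines.
# Usage: count_b [FILE...]   (reads stdin when no file is given)
count_b() {
    # gsub returns the number of replacements, i.e. the number of b/B on the line;
    # total + 0 makes awk print 0 rather than an empty string when nothing matched.
    awk '!(NR % 2) { total += gsub(/[bB]/, "") } END { print total + 0 }' "$@"
}

printf '%s\n' 'everyB or gbnBra' \
    'uitiakB and kanapB bodddB' \
    'Kanbalis astroBominus' | count_b
```

For the three-line sample from the question this prints 4, matching the other answers.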

Delete "0" or "1" from the end of each line, except the first line

the input file looks like
Kick-off team 68 0
Ball safe 69 1
Attack 77 8
Attack 81 4
Throw-in 83 0
Ball possession 86 3
Goal kick 100 10
Ball possession 101 1
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 135 1
Ball safe 137 2
and at the end it should look like this:
Kick-off team 68 0
Attack 77 8
Attack 81 4
Ball possession 86 3
Goal kick 100 10
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 137 2
my solution is
awk '{print $NF}' test.txt | sed -re '2,${/(^0$|^1$)/d}'
How can I change the file directly, e.g. with sed -i?
sed -i '2,$ {/[^0-9][01]$/d}' test.txt
2,$ lines to act upon, this one says 2nd line to end of file
{/[^0-9][01]$/d} from filtered lines, delete those ending with 0 or 1
'2,$ {/ [01]$/d}' can be also used if character before last column is always a space
With awk which is better suited for column manipulations:
awk 'NR==1 || ($NF!=1 && $NF!=0)' test.txt > tmp && mv tmp test.txt
NR==1 first line
($NF!=1 && $NF!=0) last column shouldn't be 0 or 1
can also use $NF>1 if last column only have non-negative numbers
> tmp && mv tmp test.txt save output to temporary file and then move it back as original file
With GNU awk, there is inplace option awk -i inplace 'NR==1 || ($NF!=1 && $NF!=0)' test.txt
Here's my take on this.
sed -i.bak -e '1p;/[^0-9][01]$/d' file.txt
The sed script prints the first line, then deletes all subsequent lines that match the pattern you described. This assumes that your first line would be a candidate for deletion; if it contains something other than 0 or 1 in the last field, this script will print it twice. And the -i option is what tells sed to edit in-place (with a backup file).
POSIX awk doesn't have an equivalent option for editing files in-place (GNU awk 4.1 and later provides -i inplace). If you want that kind of functionality with other awks, you need to implement it in a shell wrapper around your awk script, as @sundeep suggested.
Note that I'm not using GNU sed, but this command should work equally well with it.
awk to the rescue!
$ awk 'NR==1 || $NF && $NF!=1' file
or more cryptic
$ awk 'NR==1 || $NF*($NF-1)' file
This might work for you (GNU sed):
sed -i '1b;/\s[01]$/d' file
Other than the first line, delete any line ending in 0 or 1.
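For awks without an in-place option, the temporary-file approach mentioned above can be wrapped in a small function. This is only a sketch: the name drop_01_rows is invented, and it assumes mktemp is available.

```shell
# Hypothetical helper: rewrite FILE, dropping every line after the first
# whose last field is 0 or 1.
drop_01_rows() {
    file=$1
    tmp=$(mktemp) || return 1
    if awk 'NR == 1 || ($NF != 0 && $NF != 1)' "$file" > "$tmp"; then
        # Copy the result back instead of using mv, so the original file
        # keeps its permissions and ownership.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # Leave the original untouched if awk failed.
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as drop_01_rows test.txt. Unlike sed -i.bak it keeps no backup, so copy the file first if that matters.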

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields starting at line 6. Selecting the lines can be done using NR. The tricky part is the following: every second field should go in the same column, and an E should be added before the sign, so that the output file will look like this
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you see that I want to keep in $6 only length($6)=10 characters.
How is it possible to do it in awk?
can do all in awk but perhaps easier with the unix toolset
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps shorter version,
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
to convert format to standard scientific notation, you can pipe the result to
sed or embed something similar in awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there's various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields) but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer with substr, so the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
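As a variation on the idea of embedding the substitution in the awk script, the E can be inserted with sub() anchored at the end of each value. A sketch under the assumptions of the sample file (data on lines 6 and 7, six values per line, the sixth fused with trailing digits); pairs_with_e is a made-up name:

```shell
# Hypothetical helper: print lines 6-7 as value pairs with an explicit E exponent.
# Usage: pairs_with_e [FILE...]   (reads stdin when no file is given)
pairs_with_e() {
    awk 'NR >= 6 && NR <= 7 {
        $6 = substr($6, 1, 10)                # cut the trailing digits fused onto field 6
        for (i = 1; i <= 6; i += 2) {
            a = $i; b = $(i + 1)
            # Only the sign that starts the trailing exponent gets an E,
            # so a leading minus on the mantissa is left alone.
            sub(/[+-][0-9]+$/, "E&", a)
            sub(/[+-][0-9]+$/, "E&", b)
            print a, b
        }
    }' "$@"
}
```

Running pairs_with_e file on the sample should give the same six lines as above. Anchoring at the end of the value is a design choice: a value such as -8.072430+6 becomes -8.072430E+6, whereas a global s/[+-]/E&/g would also put an E in front of the leading minus.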

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that starts with XF:Z:. This will give me output 1.
The file names are htseqoutput*.sam.sam where * varies. I am not sure which awk function to use, and whether the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part) but be careful with your quotes. " should be used, not “/”.
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
You can try, if column with "XF:Z:" is always at the end
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
you get,
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if the position of this column varies from file to file
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
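Putting the pieces together with the loop from the question, one possible sketch looks for the XF:Z: field on every line rather than only on the first line of each file. It assumes tab-separated input as stated in the question; extract_xf is a hypothetical name.

```shell
# Hypothetical helper: print column 1 and the value of the XF:Z: tag, tab-separated.
# Usage: extract_xf [FILE...]   (reads stdin when no file is given)
extract_xf() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^XF:Z:/) {
                print $1, substr($i, 6)   # "XF:Z:" is 5 characters, keep the rest
                break
            }
    }' "$@"
}

# Loop from the question, with plain ASCII double quotes.
for f in htseqoutput*.sam.sam; do
    [ -e "$f" ] || continue               # skip if the pattern matched nothing
    extract_xf "$f" > "out${f#htseqoutput}"
done
```

Lines without an XF:Z: field produce no output in this version; whether that is what you want depends on your data.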

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line sed '$d'.
This seems for me a bit of duplicate work, surely it is possible to do the above only with awk - without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of the line to p if the first field is greater than 3.
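To see the deferred printing on whole lines, here is a small sketch with a made-up wrapper name, filter_skip_last. It tests p != "" instead of plain p, which is a slight variation on the command above and behaves the same for these inputs.

```shell
# Hypothetical helper: print lines whose first field is > 3, never printing the last input line.
# Usage: filter_skip_last [FILE...]   (reads stdin when no file is given)
filter_skip_last() {
    # A matching line is held in p and only printed once a following line arrives,
    # so a match on the very last line is never emitted.
    awk 'p != "" { print p; p = "" } $1 > 3 { p = $0 }' "$@"
}

printf '%s\n' '5 a' '2 b' '7 c' '9 d' | filter_skip_last
```

Here the output should be "5 a" and "7 c": "2 b" fails the condition and "9 d" is the last line, so it is dropped even though it matches.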