awk: need to print everything (all remaining fields) except $1 and $2

I have the following file and I need to print everything except $1 and $2 with awk.
File:
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
...
The desired output:
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
...

Well, given your data, cut should be sufficient:
cut -d' ' -f3- infile

Although it leaves extra spaces at the beginning of each line (the separators of the two emptied fields) compared to yael's expected output, here is a shorter and simpler awk-based solution than the previously suggested ones:
awk '{$1=$2=""; print}'
or even:
awk '{$1=$2=""}1'
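A quick sketch of that leading-space caveat, with the sample line inlined via printf; the sub() call is one way to strip the separators left behind by the emptied fields:

```shell
# Emptying $1 and $2 leaves their field separators behind, so the
# rebuilt record starts with two spaces.
printf 'INFORMATION DATA 12 33 55 33 66 43\n' |
awk '{$1=$2=""; print}'                   # -> "  12 33 55 33 66 43"

# Stripping the leading spaces after emptying the fields:
printf 'INFORMATION DATA 12 33 55 33 66 43\n' |
awk '{$1=$2=""; sub(/^ +/, ""); print}'   # -> "12 33 55 33 66 43"
```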

$ cat t
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
$ awk '{for (i = 3; i <= NF; i++) printf $i " "; print ""}' t
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22

danben's answer leaves a trailing space at the end of each resulting line, so a corrected way to do it would be:
awk '{for (i=3; i<NF; i++) printf $i " "; print $NF}' filename

If the first two words don't change, probably the simplest thing would be:
awk -F 'INFORMATION DATA ' '{print $2}' t

Here's another awk solution, that's more flexible than the cut one and is shorter than the other awk ones. Assuming your separators are single spaces (modify the regex as necessary if they are not):
awk --posix '{sub(/([^ ]* ){2}/, ""); print}'
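The same idea can be spelled out without the {2} interval, for awks (e.g. older mawk builds) that lack interval-expression support; this still assumes single-space separators:

```shell
# Delete the first two space-delimited fields and their trailing spaces.
printf 'INFORMATION DATA 12 33 55 33 66 43\n' |
awk '{sub(/^[^ ]* [^ ]* /, ""); print}'   # -> "12 33 55 33 66 43"
```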

If Perl is an option:
perl -lane 'splice @F,0,2; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print it
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,2 cleanly removes columns 0 and 1 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
Variation for csv input files:
perl -F, -lane 'splice @F,0,2; print join " ",@F' file
This uses the -F field separator option with a comma

Related

Accessing two fields of a line before a matched line

Given the following in a file, file.txt:
Line with asdfgh output1 45 output2 80
Special Header
Line with output1 38 output2 99
Special Header
I would like to print to a file:
45 80
38 99
i.e., from the line immediately preceding each line whose first column is Special, extract the numbers (which may be floats with decimals) after output1 and output2.
I tried:
awk '($1 == "Special" ) {printf f; printf ("\n");} {f=$0}' file.txt > output.txt
This captures the entirety of the previous line and I get output.txt which looks like this:
Line with output1 45 output2 80
Line with output1 38 output2 99
Now, within the captured variable f, how can I access the specific values after output1 and output2?
Like this:
$ awk '{for (i=1; i<NF; i++)
if ($i == "output1") {arr[NR]=$(i+1)}
else if ($i == "output2") {arr[NR]=arr[NR]" "$(i+1)}}
($1 == "Special") {print arr[NR-1]}
' file
Output
45 80
38 99
With GNU awk:
$ awk '$1=="Special"{print m[1], m[2]}
{match($0, /output1\s+(\S+).*output2\s+(\S+)/, m)}' ip.txt
45 80
38 99
With perl:
$ perl -ne 'print "@d\n" if /^Special/; @d = /output[12]\s+(\S+)/g' ip.txt
45 80
38 99
$ awk '$1=="Special"{print x, y} {x=$(NF-2); y=$NF}' file
45 80
38 99
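That last one-liner works only because output1's value is always the 3rd-from-last field and output2's value the last; a self-contained run on the question's sample:

```shell
# Remember the two values from every line; print them when "Special"
# appears as the first field of the following line.
awk '$1=="Special"{print x, y} {x=$(NF-2); y=$NF}' <<'EOF'
Line with asdfgh output1 45 output2 80
Special Header
Line with output1 38 output2 99
Special Header
EOF
# prints:
# 45 80
# 38 99
```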

How do I output data on new lines with AWK? TSHARK BATCH SCRIPT

I'm trying to live-capture packets and write each packet's ASCII data on a new line in a text file. I want to be able to read this file while it's still being written to. If that isn't possible, I would like to be able to rerun the batch script and have it continue on a new line instead of overwriting the file. My tools are awk, tshark, and a batch script, though I'm open to other options. So I guess my questions are:
How can I output to a text file that I can still read while batch script is still running?
If this isn't possible.
Can I capture one packet at time and rerun a batch script? If so how can I prevent it from overwriting/deleting the previous info from the text file and to continue on a new line.
How can I output ASCII data all on one line and start a new line for each packet?
Here is a sample tshark output with this command. Each packet is separated with two newlines. I can also change this with -S
tshark -i 1 -f "CaptureFilter" -x
0000 00 fc 31 55 24 47 a4 72 4d cf 12 f4 06 02 44 00 ..b1...c].....d.
0010 01 23 x5 dt 42 30 63 04 d3 20 c5 24 28 ed 1a 00 .6..#.f... . ...
0020 23 54 cd 32 45 52 .3.2..
etc...
0000 00 fc 31 55 24 47 a4 72 4d cf 12 f4 06 02 44 00 ..b1...c].....d.
0010 01 23 x5 dt 42 30 63 04 d3 20 c5 24 28 ed 1a 00 .6..#.f... . ...
0020 23 54 cd 32 45 52 .3.2..
etc...
Here is another command I'm using.
Some of the ASCII data contains spaces, so parts of it were being skipped until I added more fields.
tshark -i 1 -f "CaptureFilter" -x | awk "{print $18, $19, $20}" > "test.txt"
Example of output
..b1...c].....d.
.6..#.f... . ...
.3.2..
..b1...c].....d.
.6..#.f... . ...
.3.2..
This command prints each packet's ASCII data on a single line but continues without ever starting a new line:
tshark -i 1 -f "CaptureFilter" -x | awk "{printf $18, $19, $20}" > "test.txt"
Output continues where it left off
..a1...c].....d..6..#.f... . ....3.2....a1...c].....d..6..#.f... . ....3.2....a1...c].....d..6..#.f... . ....3.2..
The output I'm looking for is something like this
..a1...c].....d..6..#.f........3.2..
..a1...c].....d..6..#.f........3.2..
..a1...c].....d..6..#.f........3.2..
With your shown samples, please try the following awk code, written and tested with GNU awk:
tshark -i 1 -f "CaptureFilter" -x |
awk '
val && !NF{
print val
val=""
next
}
match($0,/(\.+[^.]+\.*)+/){
val=(val?val OFS:"") substr($0,RSTART,RLENGTH)
}
END{
if(val){
print val
}
}
'
Explanation: a detailed, annotated version of the above code.
awk ' ##Starting awk program from here.
val && !NF{ ##Checking if val is NOT NULL and line is empty then do following.
print val ##Printing val here.
val="" ##Nullifying val here.
next ##next will skip all further statements from here.
}
match($0,/(\.+[^.]+\.*)+/){ ##Using match to match regex (\.+[^.]+\.*)+ here.
val=(val?val OFS:"") substr($0,RSTART,RLENGTH) ##Creating val which has its sub string value and keep appending in it.
}
END{ ##Starting END block of this awk block here.
if(val){ ##Checking if val is NOT NULL then do following.
print val ##Printing val here.
}
}
'
This awk should work for you:
tshark -i 1 -f "CaptureFilter" -x | awk -f parse.awk
Where parse.awk is:
{
sub(/^[0-9]{4}[ \t]+[0-9a-z \t]+/, "")
s = (s == "" ? "" : s " ") $0
}
s && !NF {
print s
s = ""
}
END {print s}
Output:
..a1...c].....d. .6..#.f... . ... .3.2..
..a1...c].....d. .6..#.f... . ... .3.2..
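A lightly reordered sketch of the same buffering idea that can be tried without tshark (a canned two-packet dump is inlined; the {4} interval is spelled out for awks without interval support, and blank lines are skipped before appending so no trailing space is left on the merged line):

```shell
awk '
s && !NF { print s; s = ""; next }       # blank line: flush the packet
NF {
  # strip the offset column and the hex bytes, keep the ASCII tail
  sub(/^[0-9][0-9][0-9][0-9][ \t]+[0-9a-z \t]+/, "")
  s = (s == "" ? "" : s " ") $0
}
END { if (s) print s }                   # flush the final packet
' <<'EOF'
0000 00 fc 31 55 24 47 ..b1..
0010 23 54 cd 32 45 52 .3.2..

0000 00 fc 31 55 24 47 ..b1..
0010 23 54 cd 32 45 52 .3.2..
EOF
# prints:
# ..b1.. .3.2..
# ..b1.. .3.2..
```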

Extract info from a column based on a range within another file

I've tried looking around, but what I found was about looking things up within the same file or combining columns on exact matches, whereas I don't have exact matches, and combining those two approaches is currently above my skill level. Basically, I need to add an extra column containing the gene name, based on whether the chromosome position falls within a gene's range given in another file. I know awk is my best bet, possibly with FNR==NR.
File1 looks like this, where $1 is chromosome, $2 is position, the rest of the columns are sample coverage across that position:
chr1H 49525 47 41 60 74 93 34 117
chr1H 49526 48 41 62 74 94 34 118
chr1H 53978 48 40 61 73 94 33 117
chr1H 53979 48 40 62 72 94 33 116
File2 looks like this, where $1 is the chromosome, $2 is the start of the gene $3 is the end of the gene and $4 is the gene name:
chr1H 49525 49772 gene1
chr1H 50194 50649 gene2
chr1H 53978 54323 gene3
chr1H 76743 77373 gene4
Either over writing or making a new file to end up with a file that looks like this:
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Right now my code looks like this, but I'm not sure how to specify the files (I've written file1 and file2 so you can see my thinking): the chromosomes should match in both files, the position in the coverage file should fall within the start/end range in the second file, and then I want to print the entire line from file1 plus the gene name from file2:
awk '{ if (file1$1 == file2$1 && file1$2 >= file2$2 && file1$2 <= file2$3) print file1$0, file2$4 }' file1 file2 > file3
Thanks for any help!
You can do this fairly simply in awk by reading the range values from file2 into arrays indexed by gene name. That gives you a range by gene name to compare against the 2nd field in file1. You can do:
awk '
NR == FNR { # reading file2
b[$4] = $2 # store b[] (begin) indexed by name
e[$4] = $3 # store e[] (end) indexed by name
next # get next record
}
{ # for all file1 records
for(i in b) { # loop over values by gene name
if ($2 >= b[i] && $2 <= e[i]) { # if in range b[] to e[]
printf "%s %s\n", $0, i # output with gene name at end
next # get next record
}
}
}
' file2 file1
Example Use/Output
With the values shown in file1 and file2 you would have:
$ awk '
> NR == FNR { # reading file2
> b[$4] = $2 # store b[] (begin) indexed by name
> e[$4] = $3 # store e[] (end) indexed by name
> next # get next record
> }
> { # for all file1 records
> for(i in b) { # loop over values by gene name
> if ($2 >= b[i] && $2 <= e[i]) { # if in range b[] to e[]
> printf "%s %s\n", $0, i # output with gene name at end
> next # get next record
> }
> }
> }
> ' file2 file1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
You can try this solution, although I would suggest using e.g. bedtools for tasks like this: you can encounter situations where intuition fails, and it's good to have stress-tested tools when that happens.
$ awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR }
NR!=FNR{ for(i=1;i<=x;i++){ if($1==chr[i] && $2>=st[i] && $2<=en[i]){
print $0,gene[i]; break } } }' f2 f1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Btw, this doesn't handle multiple matches. If you want to allow these you can remove the break statement and change the print to a printf.
You can use join to get half way there:
sort -k 2,2 file-1 > f1-sorted
sort -k 2,2 file-2 > f2-sorted
join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.4 f1-sorted f2-sorted
#gives
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
join joins files on a column, but it must be sorted
-1 2 -2 2 means join on the second column of file 1 and file 2
-o specifies output format for the columns (1.1 is file 1, column 1, 2.4 is file 2, column 4)
But you have an unusual requirement of matching the next immediate position number with the previous gene name. For that I would use this awk:
awk '
FNR==NR {name[$2]=$4}
FNR!=NR && name[$2] {n = name[$2]}
FNR!=NR && name[$2-1] {n = name[$2-1]}
FNR!=NR {print $0,n}' file-2 file-1
This gives your expected output exactly. You can also use FNR!=NR && n {print $0,n}, to only print a record if there's a match for the position number (column 2) in both files.
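A self-contained run of that position-or-previous-position lookup on a slice of the sample data (temporary files stand in for file-1 and file-2):

```shell
f2=$(mktemp); f1=$(mktemp)
printf 'chr1H 49525 49772 gene1\nchr1H 53978 54323 gene3\n' > "$f2"
printf 'chr1H 49525 47 41 60 74 93 34 117\nchr1H 49526 48 41 62 74 94 34 118\n' > "$f1"
awk '
FNR==NR {name[$2]=$4}                  # file-2: gene name by start position
FNR!=NR && name[$2]   {n = name[$2]}   # exact start-position match
FNR!=NR && name[$2-1] {n = name[$2-1]} # next immediate position
FNR!=NR {print $0, n}' "$f2" "$f1"
rm -f "$f1" "$f2"
# prints:
# chr1H 49525 47 41 60 74 93 34 117 gene1
# chr1H 49526 48 41 62 74 94 34 118 gene1
```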

Awk script displaying incorrect output

I'm facing an issue in awk script - I need to generate a report containing the lowest, highest and average score for each assignment in the data file. The name of the assignment is located in column 3.
Input data is:
Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5
awk script :
BEGIN {
FS=" *\\, *"
}
FNR>1 {
min[$3]=(!($3 in min) || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
Expected sample output:
Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80
When I run the script, I get a "Low" of 0 for all assignments which is not correct. Where am I going wrong? Please guide.
You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:
$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5
This says: for each group with the same value in the 3rd column (-g3, plus -s to sort the input, a requirement of the tool) of simple CSV input (-t,) with a header (--header-in), display the minimum, maximum, and mean of the 4th column. It's all given a new header and piped to tr to turn the commas into tabs.
Your code works as-is with GNU awk. However, running it with gawk's -t (--lint-old) option to warn about non-portable constructs gives:
awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'
And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:
BEGIN {
FS=","
}
FNR>1 {
min[$3]=(cnt[$3] == 0 || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
PROCINFO["sorted_in"] = "#ind_str_asc" # gawk-ism for pretty output; ignored on other awks
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
and it works as expected on that other awk too.
The changes:
Using a simple comma as the field separator instead of a regex.
Changing the min conditional so it takes the current value the first time an assignment is seen, by checking whether cnt[$3] is 0 (which it will be the first time, since that value is incremented on a later line), or whether the current min is greater than this value.
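The first-seen check can be verified on a small hand-made slice (three scores for a single assignment; any awk flavor should now agree):

```shell
# Three Q01 scores: 90, 70, 80 -> low 70, high 90, average 80.0
printf 'Student,Category,Assignment,Score,Possible\nA,Quiz,Q01,90,100\nB,Quiz,Q01,70,100\nC,Quiz,Q01,80,100\n' |
awk -F, 'FNR>1 {
  min[$3] = (cnt[$3] == 0 || min[$3] > $4) ? $4 : min[$3]
  max[$3] = (max[$3] > $4) ? max[$3] : $4
  cnt[$3]++; sum[$3] += $4
}
END { for (i in cnt) printf "%s %d %d %.1f\n", i, min[i], max[i], sum[i]/cnt[i] }'
# prints "Q01 70 90 80.0"
```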
another similar approach
$ awk -F, 'NR==1 {print "name","low","high","average"; next}
{k=$3; sum[k]+=$4; count[k]++}
!(k in min) {min[k]=max[k]=$4}
min[k]>$4 {min[k]=$4}
max[k]<$4 {max[k]=$4}
END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file |
column -t
name low high average
Q06 77 95 86.8
L05 0 96 46.6
WS 5 5 5
Q07 58 100 78.8
L06 0 99 61
L07 0 86 64.8
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2

Changing the field separator of awk to newline

The -F option lets you specify the field separator for awk, but using '\n' doesn't work as I'd hoped: it doesn't make $1 the first line of the input, $2 the second line, and so on.
I suspect that this is because awk looks for the field separator within each line. Is there a way to get around this with awk, or some other Linux command? Basically, I want to separate my input by newline characters and put them into an Excel file.
I'm still warming up to Linux and shell scripts, which is the reason for my lack of creativity with this problem.
Thank you!
You may need to override the input record separator (RS), whose default is a newline.
See my example below,
$ cat test.txt
a
b
c
d
$ awk 'BEGIN{ RS = "" ; FS = "\n" }{print $1,$2,$3,$4}' test.txt
a b c d
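Worth knowing: setting RS to the empty string puts awk in "paragraph mode", so a blank line in the input starts a new record rather than becoming an empty field. A small sketch:

```shell
# Blank lines separate records in paragraph mode; FS="\n" makes each
# line of a record a field.
printf 'a\nb\n\nc\nd\n' |
awk 'BEGIN{RS=""; FS="\n"} {print NR ": " $1 " " $2}'
# prints:
# 1: a b
# 2: c d
```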
Note that you can change both the input and output record separator so you can do something like this to achieve a similar result to the accepted answer.
$ cat test.txt
a
b
c
d
$ awk -v ORS=" " '{print $1}' test.txt
a b c d
One can simplify it to just the following, with a minor caveat: an extra trailing space and no trailing newline:
% echo "a\nb\nc\nd"
a
b
c
d
% echo "a\nb\nc\nd" | mawk 8 ORS=' '
a b c d %
To rectify that, plus handle the edge case of input without a trailing newline, one can modify it to:
% echo -n "a\nb\nc\nd" | mawk 'NF-=_==$NF' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010
% echo "a\nb\nc\nd" | mawk 'NF -= (_==$NF)' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010