How can I use sed or awk to delete lines matching certain field criteria? - scripting

I have the following kind of data:
1 abc xyz - - 2 mno
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
I would like to delete the lines that satisfy the following two conditions:
The 4th and 5th columns are blank
The row number of the corresponding line is not contained in the 6th column of any other line
In this case, line 1 should be deleted. How could I do this in sed/awk, or whatever scripting language is most suitable for this case?

Maybe something like this could work:
awk 'NR==FNR{a[$6];next}
($4 ~ /[- ]/ && $5 ~ /[- ]/) && !($1 in a){next}1' file file
Condition:
If Column 4 and Column 5 are blank AND the row number is not present in Column 6 of any line, we skip that line; everything else is printed.
Explanation:
We use the NR and FNR built-in variables and pass the same file twice. In the first pass, we scan through the file and store Column 6 in an array. next is used to prevent the second pattern{action} statement from running while the first file is being read. Once the first file is completely read, we test the same file against your condition. If Column 4 and Column 5 are blank, we look at the index, and if it is not in the array then we skip the line using next; otherwise we print it.
Test:
[jaypal:~/Temp] cat file
1 abc xyz - - 2 mno
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
[jaypal:~/Temp] awk 'NR==FNR{a[$6];next} ($4 ~ /[- ]/ && $5 ~ /[- ]/) && !($1 in a){next}1' file file
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
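If you'd rather not pass the file twice, a one-pass variant is possible by buffering the lines (a sketch, assuming - marks the blank columns as in the sample, and that the file fits in memory):
awk '{ lines[NR] = $0; seen[$6] }          # buffer each line; note every column-6 value
END {
  for (i = 1; i <= NR; i++) {
    split(lines[i], f, " ")
    if (!(f[4] == "-" && f[5] == "-") || (f[1] in seen))
      print lines[i]
  }
}' file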

A possible solution using perl:
Content of script.pl:
use warnings;
use strict;
## Accept one argument, the input file.
@ARGV == 1 or die qq[Usage: perl $0 input-file\n];
my ($lines, %hash);
## Process file.
while ( <> ) {
## Remove leading and trailing spaces for each line.
s/^\s*//;
s/\s*$//;
## Get both indexes.
my ($idx1, $idx2) = (split)[0,5];
## Save line and index1.
push @{$lines}, [$_, $idx1];
## Save index2.
$hash{ $idx2 } = 1;
}
## Process file for second time.
for ( @{$lines} ) {
## Get fields of the line.
my @f = split /\s+/, $_->[0];
## If the fourth and fifth fields are empty (-) and the first index does not
## exist as a second index anywhere, go to next line without printing.
if ( $f[3] eq qq[-] && $f[4] eq qq[-] && ! exists $hash{ $_->[1] } ) {
next;
}
## Print line.
printf qq[%s\n], $_->[0];
}
Run the script (infile has the data to process):
perl script.pl infile
And results:
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw

This might work for you:
sed -rn 's/^.*(\S+)\s+\S+$/\1/;H;${x;s/^|\n/:/gp}' file |
sed -r '1{h;d};/^(\s*\S*){3}\s*-\s*-/{G;/^\s*(\S*).*:\1:/!d;s/\n.*//}' - file
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
Explanation:
Read the file and build a look-up table from column 6, delimited by :.
Read the table (the first line) into the hold space (HS) and then read the file again.
When columns 4 and 5 contain - only:
Append the look-up table to the pattern space (PS).
Do a look-up using the first column as the key and, if it fails, delete that line.
For all remaining lines, remove the look-up table.
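For reference, the first sed command on its own emits the look-up table that the second invocation then reads as its first line:
$ sed -rn 's/^.*(\S+)\s+\S+$/\1/;H;${x;s/^|\n/:/gp}' file
:2:3:4:3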

Related

awk with empty field in columns

Here is my file.dat:
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
Running awk '{print $2}' file.dat gives:
A
2
4
7
U
But I would like to keep the empty fields:
A

4

U
How to do it?
I should add that:
between columns 1 and 2 the field separator is 3 whitespaces
between columns 2 and 3, and between columns 3 and 4, the field separator is one whitespace
So in column 2 there are 2 fields missing (lines 2 and 4), and in column 4 there are also 2 fields missing (lines 3 and 5).
If this isn't all you need:
$ awk -F'[ ]' '{print $4}' file
A

4

U
then edit your question to provide a more truly representative example and clearer requirements.
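For what it's worth, the reason -F'[ ]' works here is that a single-character class as FS makes every individual space a separator, so runs of spaces yield empty fields instead of being collapsed. A quick illustration on line 2 of the file:
$ printf '2     2 4\n' | awk -F'[ ]' '{print NF; print "<" $4 ">"}'
7
<>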
If the input is fixed-width columns, you can use substr to extract the slice you want. I have assumed that you want a single character at index 5:
awk '{ print(substr($0,5,1)) }' file
Your awk code is missing field separators.
Your example file doesn't clearly show what that field separator is.
From observation your file appears to have 5 columns.
You need to determine what your field separator is first.
This example code expects \t which means <TAB> as the field separator.
awk -F'\t' '{print $3}' OFS='\t' file.dat
This outputs the 3rd column from the file. This is the 'read in' field separator -F'\t' and OFS='\t' is the 'read out'.
A

4

U
For GNU awk. It processes the file twice. On the first pass it examines all records to find which character positions hold only spaces, treating each continuous run of such space positions as a separator string and building up the FIELDWIDTHS variable from them. On the second pass it uses that for fixed-width processing of the data.
The a[i]s get values 0/1, and h (header) with this input will be 100010101, which leads to FIELDWIDTHS="4 2 2 1":
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
|   | | |
100010101   - while(match(h,/10*/))
\  /|/|/|
 4  2 2 1
Script:
$ awk '
NR==FNR {
for(i=1;i<=length;i++) # all record chars
a[i]=((a[i]!~/^(0|)$/) || substr($0,i,1)!=" ") # keep track of all space places
if(--i>m)
m=i # max record length...
next
}
BEGINFILE {
if(NR!=0) { # only do this once
for(i=1;i<=m;i++) # ... used here
h=h a[i] # h=100010101
while(match(h,/10*/)) { # build FIELDWIDTHS
FIELDWIDTHS=FIELDWIDTHS " " RLENGTH # qnd
h=substr(h,RSTART+RLENGTH)
}
}
}
{
print $2 # and output
}' file file
And output:
A

4

U
You need to trim off the space from the fields, though.
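For example, the final print block could do the trimming before printing (a sketch):
{
  f = $2
  gsub(/^ +| +$/, "", f)   # strip the fixed-width padding
  print f                  # and output
}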

Bash awk print matched delimiter

Is there a way to print the currently matched delimiter with awk?
For example:
awk -F '["RESTART" | "FAILURE" | "WARNING" | [:blank:]]{2}' 'FNR > 4 { for (i=1; i<=NF; i++) print $i;}' file
Example Input
XX      XXXX   RESTART 6666  XX X
XXXX    XXXX   WARNING 8888  YYY YYY
XXX     XXXX   INFORM  7777  XXXX XX
Example Output (must)
XX
XXXX
RESTART
6666
XX X
XXXX
XXXX
WARNING
8888
YYY YYY
XXX
XXXX
INFORM
7777
XXXX XX
Example Output (now)
XX
XXXX
6666
XX X
XXXX
XXXX
8888
YYY YYY
XXX
XXXX
INFORM
7777
XXXX XX
I use two or more whitespaces as the column delimiter, but there are some cases (RESTART 6666, WARNING 8888) where two columns are separated by only one whitespace; that is why I have to use the content (RESTART, WARNING) as a delimiter. But if I use the content as a delimiter, it doesn't get displayed, so I want to display/print the used delimiter (in the cases where it is content and not whitespace).
The main problem is to differentiate between a single whitespace used as a column separator and a single whitespace used as a word separator within one column. I cannot change the file I have to deal with.
awk:
awk '{gsub(/  +|\t/,"\n")} {print}' file | awk '/RESTART|WARNING|FAILURE/{gsub(/ /,"\n")} {print}'
gsub(/  +|\t/,"\n"): replaces "2 or more spaces OR \t" with a newline \n.
This converts the file into multiple lines, where each line can consist of multiple words separated by single spaces only.
/RESTART|WARNING|FAILURE/{gsub(/ /,"\n")}: if a line contains one of these 3 words, replace each space with \n.
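The two invocations can also be combined into one (a sketch doing the same split-then-refine in a single pass):
awk '{
  n = split($0, piece, /  +|\t/)                 # split on 2+ spaces or tabs
  for (i = 1; i <= n; i++) {
    if (piece[i] ~ /RESTART|WARNING|FAILURE/)    # keyword glued to its number
      gsub(/ /, "\n", piece[i])
    print piece[i]
  }
}' file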
You can also use sed :
sed "s/\s\s\+/\n/g; s/\(RESTART\|WARNING\|FAILURE\) /\1\n/g" file
For older sed versions (mostly on macOS), \+ may not be supported, so use * instead:
sed "s/\s\s\s*/\n/g; s/\(RESTART\|WARNING\|FAILURE\) /\1\n/g" file
s/\s\s\+/\n/g : replaces 2 or more spaces with a single \n
s/\(RESTART\|WARNING\|FAILURE\) /\1\n/g : replaces the space after your three exceptions with \n
Input:
line one  hello hello  RESTART 6666  XX X
line two  hello hello  WARNING 8888  YYY YYY
line three  hello hello  INFORM  7777  XXXX XX
Output:
line one
hello hello
RESTART
6666
XX X
line two
hello hello
WARNING
8888
YYY YYY
line three
hello hello
INFORM
7777
XXXX XX
Here's a fixed-width fields approach that will work with any awk (except, of course, the old broken awk, /bin/awk, on Solaris, where you should use /usr/xpg4/bin/awk instead):
$ cat tst.awk
{
# identify the fields:
nf = 0
f[++nf] = substr($0,1,8)
f[++nf] = substr($0,9,7)
f[++nf] = substr($0,16,8)
f[++nf] = substr($0,24,6)
f[++nf] = substr($0,30)
# remove leading/trailing white space from each field:
for (i in f) {
sub(/^[[:space:]]+/,"",f[i])
sub(/[[:space:]]+$/,"",f[i])
}
# print the fields:
for (i=1; i<=nf; i++) {
print NR, i, "<" f[i] ">"
}
print "---"
}
$ awk -f tst.awk file
1 1 <XX>
1 2 <XXXX>
1 3 <RESTART>
1 4 <6666>
1 5 <XX X>
---
2 1 <XXXX>
2 2 <XXXX>
2 3 <WARNING>
2 4 <8888>
2 5 <YYY YYY>
---
3 1 <XXX>
3 2 <XXXX>
3 3 <INFORM>
3 4 <7777>
3 5 <XXXX XX>
---
If you used nawk on Solaris then you'd have to replace [[:space:]] with [ \t] since it predates POSIX character classes - but just don't use nawk; use /usr/xpg4/bin/awk instead.
It can be modified to use a loop instead of 5 explicit substr() calls if this approach works for you.
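For instance, here's one way that loop version might look (a sketch; the widths are taken from the substr() calls above):
awk 'BEGIN { n = split("8 7 8 6", w, " ") }      # widths of the first four fields
{
  nf = 0
  pos = 1
  for (j = 1; j <= n; j++) {                     # fixed-width slices
    f[++nf] = substr($0, pos, w[j])
    pos += w[j]
  }
  f[++nf] = substr($0, pos)                      # the remainder of the line
  for (i = 1; i <= nf; i++) {
    sub(/^[[:space:]]+/, "", f[i])
    sub(/[[:space:]]+$/, "", f[i])
    print NR, i, "<" f[i] ">"
  }
  print "---"
}' file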
Maybe you could use GNU awk's split with seps. https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html tells:
split(string, array [, fieldsep [, seps ] ])
seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1].
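A sketch of that (hedged; it assumes the real delimiters are the three keywords or runs of two or more blanks, and that the example input above has had its double spaces collapsed by formatting):
gawk '{
  n = split($0, part, /RESTART|FAILURE|WARNING|[[:blank:]]{2,}/, seps)
  for (i = 1; i <= n; i++) {
    gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", part[i])        # trim stray blanks
    if (part[i] != "") print part[i]
    if (seps[i] ~ /RESTART|FAILURE|WARNING/) print seps[i]  # print keyword delimiters too
  }
}' file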

AWK - sum particular fields after match

I have a txt file that is tens to hundreds of lines long, and I need to sum a particular field in each line (and output it) if a preceding field matches.
Here is an example datset:
Sample4;6a0f64d2;size=1;,Sample4;f1cb4733a;size=6;,Sample3;aa44410feb29210c1156;size=2;
Sample2;5b91bef2329bd87f4c7;size=2;,Sample1;909cd4e2940f328b3;size=2;
The structure is
<sample ID>;<random id>;size=<numeric>;, then the next entry. There could be hundreds of entries in a line (this is just a small example)
Basically, I want to sum the "size" numbers for each entry across a line (entries separated by ','), but only those that match a particular sample identifier (e.g. Sample4).
So, if we want to match just the 'Sample4's, the script would produce this:
awk '{some-code for sample4}' example.txt
7
0
Because the entries with 'Sample4' add up to 7 in line 1, but there are no matching Sample4 entries in line 2.
This could be done for each sample ID or, ideally, for all sample IDs provided in a list (perhaps a simple file, 1 line per sample ID), which would then output the counts for each row, with each sample ID having its own column - e.g. for the example file above, the results of the script would be:
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
Any ideas on how to get started?
Thanks!
another awk
awk -F';' '{for(i=1;i<NF-1;i+=3) {
                split($(i+2),e,"=");
                sub(/,/,"",$i);
                header[$i];
                a[$i,NR]+=e[2]
            }}
     END   {for(h in header) printf "%s", h OFS;
            print "";
            for(i=1;i<=NR;i++) {
                for(h in header) printf "%s", a[h,i]+0 OFS;
                print ""
            }}' file | column -t
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
ps. the order of columns is not guaranteed.
Explanation
To simplify parsing I used ; as the delimiter and got rid of the , before the names. Given that structure, the script assigns name=sum of values for each line using the multi-dimensional array a, and separately keeps track of all names in the header array. Once the lines are consumed, the END block prints the header and, for each line, the value for the corresponding name (or 0 if missing). Pretty-print with column -t.
If I am understanding this correctly, you can do:
$ awk '{split($0,samp,/,/)
for (i=1; i in samp; i++){
sub(/;$/, "", samp[i])
split(samp[i], fields, /;/)
split(fields[3], ns, /=/)
data[fields[1]]+=ns[2]
}
printf "For line %s:\n", NR
for (e in data)
print e, data[e]
split("", data)
}' file
Prints:
For line 1:
Sample3 2
Sample4 7
For line 2:
Sample1 2
Sample2 2
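For the multi-column table driven by a list of sample IDs (as the question suggested), here's a hedged sketch along the lines of the first answer, assuming a hypothetical ids.txt with one sample ID per line (the column order follows the list):
awk -F';' '
  NR==FNR { ids[++n] = $1; next }               # ids.txt: one sample ID per line
  {
    for (i = 1; i < NF-1; i += 3) {
      name = $i
      sub(/^,/, "", name)                       # strip the leading comma
      split($(i+2), e, "=")                     # e[2] holds the size number
      sum[name, FNR] += e[2]
    }
    rows = FNR
  }
  END {
    for (j = 1; j <= n; j++) printf "%s%s", ids[j], (j < n ? OFS : ORS)
    for (r = 1; r <= rows; r++)
      for (j = 1; j <= n; j++) printf "%s%s", sum[ids[j], r] + 0, (j < n ? OFS : ORS)
  }' ids.txt example.txt | column -t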

awk: how to interpret/read the below command (how this command works)

Could someone please help understand how to interpret/read this awk command?
awk '/foo/{if (a ~ /abc/) print a; print} {a=$0}' file
For a file with lines:
abc 0
def
abc 1
foo 1
ghi
jkl
foo 2
foo 3
mno
abc 2
foo 4
foo 5
Observed that the command prints the output as:
abc 1
foo 1
foo 2
foo 3
abc 2
foo 4
foo 5
Could someone please help understand how to interpret/read this awk
command?
awk '/foo/{if (a ~ /abc/) print a; print} {a=$0}' file
In short, the above command searches for lines which contain foo; if such a line is found, it checks whether the previously read line (saved in variable a) contains abc; if true, it prints the previous line (the contents of variable a; print a) and then prints the current line (the line which contains "foo"; print).
Explanation as follows:
awk ' # call awk
/foo/{ # if line/record/row contains "foo" then
if (a ~ /abc/) # if variable a contains "abc" then
print a; # print contents of variable a
print # print current record/row/line
}
{
a=$0 # save current record/line/row in variable a
}
' file # here you read file

In AWK, is it possible to specify "ranges" of fields?

In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
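The same idea with the bounds passed in from the shell via -v (and the tab OFS from the question):
awk -v s=32 -v e=57 'BEGIN{OFS="\t"} { for (i=s; i<=e; i++) printf("%s%s", $i, i<e ? OFS : "\n") }' foo > bar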
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for later reference using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields, and it will only work if your FS is a single character.
I'm late, but this is quick and to the point so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. A quick and dirty example; since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{gsub(/^(\w\t){31}/,"");print}'
example of removing 4 fields because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^(\w\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops.
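Since \w matches only a single word character, real multi-character fields need [^\t]+ instead, and keeping only fields 32-57 also means dropping the trailing fields. A hedged sketch, assuming exactly 100 tab-separated fields per line as in the question and an awk that supports regex intervals:
awk '{
  gsub(/^([^\t]+\t){31}/, "")   # drop fields 1-31
  gsub(/(\t[^\t]+){43}$/, "")   # drop fields 58-100
  print
}' foo > bar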
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start=$start_field -v end=$end_field 'BEGIN{OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s" $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}'
This looks a bit hacky, however:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought that you can always find a character that is not in the input: use \n.
Unfortunately I don't seem to have access to my account anymore, and I also don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is that you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and basically cuts off everything before the first comma by showing everything from the second field on, thus resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l, s) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
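For example, called from an action (note that, as written, it joins the fields with single spaces rather than the input's tabs):
awk 'function subby(f,l, s) {
       s = $f
       for(i=f+1;i<=l;i++)
         s = sprintf("%s %s",s,$i)
       return s
     }
     { print subby(32,57) }' foo > bar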
(I know the OP requested "in AWK", but ...)
Using bash expansion on the command line to generate the arguments list:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
Explanation:
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
This is placed on a single line using semicolons, inside $() so that it evaluates/expands in place.