Removing entire line based on a condition - awk

I have a folder that contains 300 files. I would like to remove lines from the files if $2<=25. How can I remove the lines directly from the files (in-place editing)? My code and sample input are shown below.
awk '{ for (i=1; i<=NF; i++) { if ($2 <= 25) next } print }' < *
ads 54.5 18 15.3 39.2
bdy 18.5 21 1.5 17.0
cst 36.8 22 27.7 9.1
hst 40.2 25 16.2 24.0
ads 18.9 41 5.0 13.2
bdy 20.5 45 67.0 19.0

Your script is doing too much. The cleanest solution, to my mind, inverts the condition:
awk '{ if ($2 > 25) print }'
or even:
awk '$2 > 25'
If you don't want to invert the condition, then:
awk '{ if ($2 <= 25) next; print }'
There's no need to iterate over all the fields.
POSIX awk does not support 'in situ' file modification. You have to write the result to a temporary file, and then copy or move the temporary file back over the original. (Copying preserves hard links and permissions; moving breaks links and can change the owner and permissions. You need to decide whether that's a concern.)
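A minimal sketch of that workflow, assuming the condition from this question and an illustrative .tmp suffix (the copy-back via cat preserves hard links and permissions, as noted above):
for f in *; do
    awk '$2 > 25' "$f" > "$f.tmp" &&
    cat "$f.tmp" > "$f" &&
    rm -f "$f.tmp"
done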
Thanks to Ed Morton for pointing out that GNU Awk 4.x does have a mechanism to edit files 'in situ', in part because he campaigned to get it added.
The command-line help won't tell you that GNU Awk 4.x supports in-place modification of files, but the relevant part of the manual (Extension sample: inplace, which is mis-titled from my perspective, since it is a distributed extension rather than just a sample) describes an extension that makes GNU Awk overwrite the regular files specified on its command line.
gawk -i inplace '{ if ($2 > 25) print }' file1 …
or even:
gawk -i inplace '$2 > 25' file1 …
Note that experimentation shows that it is quite happy to modify read-only files in situ. This is consistent with sed (both the GNU and BSD (Mac OS X) variants): they also modify read-only files in situ without warning, preserving the permissions on the file but breaking any hard links.
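A quick hedged illustration of the read-only behaviour (the file name is illustrative):
$ printf 'ads 54.5 18\nbdy 18.5 21\n' > data.txt
$ chmod a-w data.txt
$ gawk -i inplace '$2 > 25' data.txt    # succeeds despite the read-only permission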
Your script uses awk '…' < *; that is a peculiar way of ignoring the first file in your directory unless it is the only file there (the shell wires the first file up as standard input and passes the remaining names as arguments to awk; since awk then has named files to process, standard input is never read). You need to use just *, not < *.
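A small demonstration of the difference, with explicit names a and b standing in for the glob expansion:
$ printf '1\n' > a; printf '2\n' > b
$ awk '{print FILENAME, $0}' < a b
b 2
$ awk '{print FILENAME, $0}' a b
a 1
b 2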

Related

I need to sum all the values in a column across multiple files

I have a directory with multiple csv text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt
This correctly outputs 975.
However if I try to pass it a 3rd file, rather than adding field 4 from all 3 files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how I can modify this awk statement to accept more than two files or, ideally (because there are thousands of files to sum up), a * to output the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.
A couple of issues with the current code:
NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array
Since you're only looking for one global sum, simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined, so print sum would display a blank line; sum+0 ensures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files file{1..3}.txt can be replaced with whatever pattern will match on the desired set of files, eg, file*.txt, *.txt, etc
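If the matching files run into the thousands, a single glob can exceed the shell's argument-length limit; a hedged workaround is to have find feed one awk process through a pipe (the file*.txt pattern is illustrative):
$ find . -maxdepth 1 -name 'file*.txt' -exec cat {} + | awk -F',' '{sum+=$4} END {print sum+0}'
1647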
Here we go (no need to test NR==FNR in a concatenation):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or the same thing without wasting a pipe:
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

AWK fixed record files

Is there a way of using awk to deal with files without LF/CR to mark the EOL - that is, fixed-size "record/line" files, or files whose first 4 bytes indicate the size of the record?
Is there a way of assigning $1, $2, ... to fixed "columns/fields" (without any separator)?
I tried but didn't find any solution using just awk - the only solution I found was to use another program that reads each record and then "pipes" the "line/record" to awk.
Thanks
Is there a way of assigning $1, $2, ... to fixed "columns/fields" (without any separator)?
In GNU AWK you might use FIELDWIDTHS to work with fixed-width columns, consider following simple example, let file.txt content be
01120
10150
11180
and imagine it has three columns - one character wide, one character wide, and three characters wide - then you might do
awk 'BEGIN{FIELDWIDTHS="1 1 3"}{print $1, $2, $3}' file.txt
to get output
0 1 120
1 0 150
1 1 180
(tested in gawk 4.2.1)
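For the other half of the question (records with no LF/CR at all), here is an untested sketch that leans on two further GNU AWK features: a multi-character RS is treated as a regular expression, and RT holds the text that matched it, so an RS that matches exactly one fixed-size record leaves each record in RT:
gawk 'BEGIN{ RS=".{5}"; FIELDWIDTHS="1 1 3" }
      RT{ $0=RT; print $1, $2, $3 }' records.dat
Here records.dat is a hypothetical file of back-to-back 5-byte records; assigning RT to $0 re-splits it according to FIELDWIDTHS.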

What does this Awk expression mean

I am working with a bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it does depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
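For example, GNU awk documents FS="" as splitting the line into one character per field, so there the command (minus the xargs) prints the 3rd character of each matching line:
$ printf 'xabc\n' | gawk -F '' '/abc/{print $3}'
b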
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However the line is split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line, so you could get 1 line of all-fields output if there aren't many fields being output, or several lines of multi-field output if there are.
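For example:
$ printf 'a\nb\nc\n' | xargs
a b c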
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Looks for patterns across different lines

I have a file like this (test.txt):
abc
12
34
def
56
abc
ghi
78
def
90
And I would like to search for the 78 which is enclosed by "abc\nghi" and "def". Currently, I know I can do this with:
cat test.txt | awk '/abc/,/def/' | awk '/ghi/,/def/'
Is there any better way?
One way is to use flags
$ awk '/ghi/ && p~/abc/{f=1} f; /def/{f=0} {p=$0}' test.txt
ghi
78
def
{p=$0} this will save input line for future use
/ghi/ && p~/abc/{f=1} set flag if current line contains ghi and previous line contains abc
f; print input record as long as flag is set
/def/{f=0} clear the flag if line contains def
If you only want the lines between these two boundaries
$ awk '/ghi/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' ip.txt
78
$ awk '/12/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' ip.txt
34
See also How to select lines between two patterns?
This is not really clean, but you can redefine your record separator as a regular expression to be abc\nghi\n|\ndef. This however creates multiple records, and you need to keep track of which ones lie between the correct separators. With GNU awk you can check which RS was matched using RT.
awk 'BEGIN{RS="abc\nghi\n|\ndef"}
(RT~/abc/){s=1; next}
(s==1)&&(RT~/def/){print $0}
{s=0}' file
This does:
set RS to abc\nghi\n or \ndef.
check which separator was matched: if RT contains abc you found the opening one, so set the flag and move on to the next record.
if the flag is set and the following record's RT contains def, then print that record.
grep alternative
$ grep -Pazo '(?s)(?<=abc\nghi)(.*)(?=def)' file
but I think awk will be better
You could do this with sed. It's not ideal in that it doesn't actually understand records, but it might work for you...
sed -Ene 'H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}' input.txt
Here's what's basically going on:
H - appends the current line to sed's "hold space"
${ - specifies the start of a series of commands that will be run once we come to the end of the file
x - swaps the hold space with the pattern space, so that future substitutions will work on what was stored using H
s/../../ - analyses the pattern space (which is now multi-line), capturing the data specified in your question, replacing the entire pattern space with the bracketed expression...
p - prints the result.
One important factor here is that the regular expression is ERE, so the -E option is important. If your version of sed uses some other option to enable support for ERE, then use that option instead.
Another consideration is that the regex above assumes Unix-style line endings. If you try to process a text file that was generated on DOS or Windows, the regex may need to be a little different.
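For example, for CRLF input you might strip the carriage returns before accumulating each line (a hedged sketch; GNU sed's \r escape is assumed):
sed -Ene 's/\r$//;H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}' input.txt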
awk solution:
awk '/ghi/ && r=="abc"{ f=1; n=NR+1 }f && NR==n{ v=$0 }v && NR==n+1{ print v }{ r=$0 }' file
The output:
78
Bonus GNU awk approach:
awk -v RS= 'match($0,/\nabc\nghi\n(.+)\ndef/,a){ print a[1] }' file

awk to output matches to separate files

I am trying to group lines whose gene name in $2 is the same and output them to separate files, with the match being the name of the new file. Since the actual files are quite large, I open each file, then close it, to save on speed and memory. My attempt is below. Thank you :).
awk '{printf "%s\n", $2==$2".txt"; close($2".txt")}' input.txt '{ print $2 > "$2.txt" }'
input.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for TGFB1.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for FANC1.txt
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
EDIT:
awk -F '[ :]' '{f = $3 ".txt"; close($3 ".txt")} print > f}' BMF_unix_loop_genes_IonXpress_008_150902_loop_genes_average_IonXpress_008_150902.bed > /home/cmccabe/Desktop/panels/BMF /"$f".txt;
bash: /home/cmccabe/Desktop/panels/BMF: Is a directory
You can just redefine your field separator to include a colon and then the file name would be in $3
awk -F '[ :]' '{f = $3 ".txt"; print > f}' input.txt
I've encountered problems with some awks where constructing the filename to the right of the redirection is problematic, which is why I'm using a variable. However the Friday afternoon beer cart has been around and I can't recall specific details :/
I wouldn't bother closing the files unless you're expecting hundreds or thousands of new files to be produced.
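That said, if thousands of distinct names are expected, a hedged variant that keeps at most one file open is to append and close after every write (append mode matters, since re-opening with > after close() would truncate what was already written):
awk -F '[ :]' '{f = $3 ".txt"; print >> f; close(f)}' input.txt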
You need to split the second field to get the desired file name. This should do (appending before each close, so that re-opening doesn't truncate what was already written):
$ awk '{split($2,f,":"); p=f[1]".txt"; print $0 >> p; close(p)}' file
Note that it won't produce your output exactly since you have a typo in one of the fields (FANC1 vs FANCI)
$ ls *.txt
FANC1.txt FANCI.txt TGFB1.txt