Eliminating all values that occur in all files in a folder with awk

I have a folder with several files of which I want to eliminate all of the terms that they have in common using awk.
Here is the script that I have been using:
awk '
FNR==1 {
    if (seen[FILENAME]++) {
        firstPass = 0
        outfile = FILENAME "_new"
    }
    else {
        firstPass = 1
        numFiles++
        ARGV[ARGC++] = FILENAME
    }
}
firstPass { count[$2]++; next }
count[$2] != numFiles { print > outfile }
' *
An example of the information in the files would be:
File1
3 coffee
4 and
8 milk
File2
4 dog
2 and
9 cat
The output should be:
File1_new
3 coffee
8 milk
File2_new
4 dog
9 cat
It works when I use a small number of files (e.g. 10), but when I start to increase that number, I get the following error message:
awk: file20_new makes too many open files input record number 27, file file20_new source line number 14
Where is the error coming from when I use larger numbers of files?
My main goal is to run this script over all of the files in a folder to generate new files that do not contain any words that occur in all of the files in the folder.

When you use >, the file is opened for writing (and truncated) and stays open until you close it, so every distinct output file counts against the per-process open-file limit. As suggested in the comments, you need to close your files as you go along. Try something like this:
awk '
FNR==1 {
    if (seen[FILENAME]++) {
        firstPass = 0
        if (outfile) close(outfile)   # <-- close the previous output file
        outfile = FILENAME "_new"
    }
    else {
        firstPass = 1
        numFiles++
        ARGV[ARGC++] = FILENAME
    }
}
firstPass { count[$2]++; next }
count[$2] != numFiles { print > outfile }
' *
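As a side note, the ceiling you are hitting is the per-process limit on open file descriptors. You can inspect it, and often raise it for the current shell session, before running the script; a quick check, assuming a POSIX-like shell:
$ ulimit -n          # print the current per-process open-file limit
$ ulimit -n 4096     # raise it for this session (subject to the hard limit)
Closing files as you go, as above, is still the more robust fix.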

Related

How to update a variable from inside an AWK function

I run this script from a loop inside another script and I want to:
a) print errors into a file, keeping track of the line number, the name of the file, and the error.
b) print into another file the unique names of the files in which an error has been found, given that a single file can have more than one error and I don't want repetitions.
I know I can sort | uniq the file at the end from the calling script, but... is there another technique?
Something like:
if (tempVar != FILENAME)
{
    print FILENAME >> uniqueFiles;
}
tempVar = FILENAME;
Here's my script:
awk '
function errorHandler(error1)
{
    print FILENAME >> uniqueFiles;
    print FILENAME ";" NR ";" error1 >> errorListing;
    tempVar = FILENAME;   # remember the last file seen
}
BEGIN {
    uniqueFiles = "files.txt";
    errorListing = "errorList.txt";
    error1 = "Error code 1"
}
{
    if (NR > 1)
    {
        if (length($1) != 10)
        {
            errorHandler(error1);
        }
    }
}
END {}' $1
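One technique that avoids both the sort | uniq post-processing and the tempVar bookkeeping is an associative array: log every error, but print the file name only on the first error seen in that file. A minimal sketch under the same assumptions as the script above (header row skipped, 10-character check on $1, output file names from the BEGIN block):
awk '
NR > 1 && length($1) != 10 {
    print FILENAME ";" NR ";" "Error code 1" >> "errorList.txt"
    if (!(FILENAME in seen)) {      # first error in this file?
        print FILENAME >> "files.txt"
        seen[FILENAME] = 1
    }
}' "$1"
The seen array only persists within one awk invocation, but since the calling loop runs awk once per file, that is exactly the scope where the duplicates arise.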

How to detect the last line in awk before END?

I'm trying to sum up consecutive String values and print the totals, but if the last lines are Strings there is no change of type, so the final total never gets printed:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
    t = $1; val = $2;
    # if string, concatenate its value
    if (t == "String") {
        tot += val;
        nx = 1;
    } else {
        nx = 0;
    }
    # if type change, add tot to res
    if (t != "String" && ant_t == "String") {
        res = res tot;
        tot = 0;
    }
    ant_t = t;
    # if string, go next
    if (nx == 1) {
        next;
    }
    res = res "\n" val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect if awk is reading last line, so if there won't be change of type it will check if it is the last line?
awk reads input line by line, so while processing a line it cannot tell whether it is the last one. The END block is the place to perform actions once the end of the input has been reached.
To produce what you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}'
will produce this output:
3
5
2
6
What it does:
/String/ selects lines that match String, and /Number/ likewise selects Number lines.
sum+=$2 accumulates the values of the String lines. When a Number line occurs, the running sum (if non-zero) is printed and reset to zero, followed by the Number's own value; the END block prints any sum still pending when input ends.
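If you really do need to act on the last line as such, rather than settle things in END, a common idiom is to delay processing by one line: handle the previous line when the next one arrives, and handle the buffered final line in END. A generic sketch of the idiom, not specific to this question's data (handle and handle_last are placeholder functions):
awk '
NR > 1 { handle(prev) }             # once line N+1 arrives, line N is known not to be last
{ prev = $0 }                       # buffer the current line
END { if (NR) handle_last(prev) }   # prev holds the final line here
function handle(line)      { print "not last:", line }
function handle_last(line) { print "last:", line }
' input.txt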
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
In case that is unclear: I am pre-calculating the number of lines (using wc) and passing it into awk as a variable called lines.
Just change the last line to:
END { print res; print tot; }'
(Note this prints a stray 0 if the input ends with a Number line, since tot has been reset by then.)
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean: at END I check whether the last pattern was a String and, if so, print the sum.
You could actually use x itself as the boolean, as nu11p01n73R does, which is smarter.
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

How can I subtract from each column its mean using awk?

I have a file such as the following (but with thousands of rows and hundreds of columns)
1 2 1
1 2 2
3 2 3
3 2 6
How can I subtract from each column/field its mean using awk, in order to obtain something like this?
-1 0 -2
-1 0 -1
1 0 0
1 0 3
Thank you very much for your help.
The closest solution I found (http://www.unix.com/shell-programming-scripting/102293-normalize-dataset-awk.html) does not seem to do the job "element by element". It performs a different operation, but the generic concept is the same: perform an operation on each column using a value calculated from that column.
With awk in two passes:
awk '
NR==FNR {
    for (i=1; i<=NF; i++) {
        a[i] += $i
    }
    next
}
{
    for (y=1; y<=NF; y++) {
        printf "%2d ", $y -= (a[y] / (NR-FNR))
    }
    print ""
}' file file
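For the sample above, the column sums are 8, 8 and 12 over 4 rows, so the second pass subtracts the means 2, 2 and 3 (note the file is passed twice, once per pass; during the second pass NR-FNR is the row count of the first). This should reproduce the expected output, up to the %2d padding:
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[i]+=$i;next}
       {for(y=1;y<=NF;y++)printf "%2d ",$y-=(a[y]/(NR-FNR));print ""}' file file
-1  0 -2
-1  0 -1
 1  0  0
 1  0  3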
With awk in one pass:
awk '{
    for (i=1; i<=NF; i++) {
        a[i] += $i;
        b[NR,i] = $i
    }
}
END {
    for (i=1; i<=NR; i++) {
        for (j=1; j<=NF; j++) {
            printf "%2d ", b[i,j] -= (a[j] / NR)
        }
        print ""
    }
}' file
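One caveat that applies to both versions: %2d truncates to an integer, which only looks right here because every column mean happens to be whole. If the means can be fractional, a floating-point format is safer, e.g. replacing the printf with something like:
printf "%8.3f ", b[i,j] -= (a[j] / NR)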
A Python/numpy alternative:
import sys, numpy as np
a = np.array([i.strip().split() for i in open(sys.argv[1])], dtype=float)
for i in a - np.mean(a, axis=0): print(' '.join(map(str, i)))
Usage: python script.py inputFile

Awk merge the results of processing two files into a single file

I use awk to extract and calculate information from two different files and I want to merge the results into a single file, in columns (for example, the output for the first file in columns 1 and 2 and the output for the second one in columns 3 and 4).
The input files contain:
file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
To parse the first file I do this:
awk '
{
    s = NF
    center = $1
}
{
    printf "%s\t %d\n", center, s
}
' file1
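Run on the two physical lines of file1, this prints each line's first field and its field count (calling the script parse1.awk here just for convenience):
$ awk -f parse1.awk file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120	 4
ST695_116193610:4:2206:10596:165949	 3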
To parse the second file I do this:
awk '
/^>/ {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
    count = 0
    seq_id = $0
    next
}
NF {
    long = length($0)
    count = count + long
}
END {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
}
' file2
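Run on file2, this prints each FASTA header with its total sequence length, 60+60+60+20 = 200 here (calling the script parse2.awk, to match the naming above):
$ awk -f parse2.awk file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120	 200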
My provisional solution is to create a temporary file in the first step and overwrite it in the second. Is there a more "elegant" way to get this output?
I am not fully clear on the requirement; if you can update the question, maybe we can improve the answer. From what I have gathered, you would like to summarize the output from both files. I have assumed that the content of both files is in the same sequential order; if that is not the case, we will have to add additional checks while printing the summary.
Content of script.awk (re-using most of your existing code):
NR==FNR {
    s[NR] = NF
    center[NR] = $1
    next
}
/^>/ {
    seq_id[++y] = $0
    ++i
    next
}
NF {
    long[i] += length($0)
}
END {
    for (x=1; x<=length(s); x++) {   # note: length() on an array is a gawk extension
        printf "%s\t %d\t %d\n", center[x], s[x], long[x]
    }
}
Test:
$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 4 200
ST695_116193610:4:2206:10596:165949 3 0
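If you would rather keep your two per-file scripts exactly as they are, another option is to avoid the temporary file entirely with process substitution (a bash feature; parse1.awk and parse2.awk again stand in for the two scripts above):
$ paste <(awk -f parse1.awk file1) <(awk -f parse2.awk file2)
This pastes the two outputs side by side, which also assumes they line up row for row.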

How can I delete one line before and two after a string?

Say you have records in a text file which look like this:
header
data1
data2
data3
I would like to delete the whole record if data1 matches a given string. I presume this needs awk, which I do not know.
Awk can handle these multiline records by setting the record separator to the empty string (paragraph mode). Note that in this mode newlines also act as field separators, so $2 below is the data1 line, as long as it is a single word:
BEGIN { RS = ""; ORS = "\n\n" }
$2 == "some string" { next } # skip this record
{ print } # print (non-skipped) record
You can save this in a file (e.g. remove.awk) and execute it with awk -f remove.awk data.txt > newdata.txt
This assumes your data is of the format:
header
data
....
header
data
...
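A quick check of the paragraph-mode script, assuming it is saved as remove.awk and the literal "some string" is replaced by the single word foo:
$ cat data.txt
header
foo
data2
data3

header
bar
data2
data3
$ awk -f remove.awk data.txt
header
bar
data2
data3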
If there are no blank lines between the records, you need to manually split the records (this is with 4 lines per record):
{ a[++i] = $0 }
i == 2 && a[i] == "some string" { skip = 1 }
i == 4 && ! skip { for (i = 1; i <= 4; i++) print a[i] }
i == 4 { skip = 0; i = 0 }
Without knowing what output you desire, and with insufficient sample input, a guess:
awk 'BEGIN{RS=""}!/data1/' file