Delete every occurrence in a file - awk

Let's say I have a file with content in this format:
6 8
6 9
12 20
6
8
9
12
20
35
I want to delete every line whose number (from the 1st or 2nd column) also appears on a later line, whether in the first or second column, including the line where the number first appears.
So I should have this as a result:
35
I've tried using
awk '{for(i=1;i<=NF;i++){if($i in a){next};a[$i]}} 1'
Is there a simple way to do this?

This line should help:
awk 'NR==FNR{a[$0]++;next}{for(i=1;i<=NF;i++)if(a[$i]>1)next}7' RS=" |\n" file RS="\n" file
With this input file:
100 200 300
6 8
6 9
12 20
6
8
9
12
20
35
The one-liner above will output:
100 200 300
35
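The trick is in the pair of RS assignments on the command line: on the first pass RS=" |\n" (a regex record separator, so this assumes GNU awk) makes every number its own record, so a[$0]++ counts individual numbers; the second pass re-reads the file line by line with RS="\n" and skips any line containing a number seen more than once. A quick way to visualize the first-pass records (a sketch):
awk '{ print NR": "$0 }' RS=" |\n" file
which prints 100, 200, 300, 6, 8, 6, 9, ... one numbered record per line.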

You want to find numbers that occur only once. Let me add "42" to your file on the same line as "35":
$ cat file
6 8
6 9
12 20
6
8
9
12
20
35 42
To find all the numbers that appear only once:
$ tr -s "[:blank:]" "\n" < file | sort -n | uniq -u
35
42
But that doesn't show the lines of the file where they appear, so:
$ grep -n -Fw -f <(tr -s "[:blank:]" "\n" < file | sort -n | uniq -u) file
9:35 42
I added the -n option just for effect, to show that we have line 9 from the file.
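If you'd rather stay in awk for that last filtering step, a rough equivalent of the grep (a sketch, reusing the same tr | sort | uniq pipeline and bash's process substitution):
awk 'NR==FNR { once[$1]; next }                       # load the unique numbers
     { for (i=1; i<=NF; i++)
           if ($i in once) { print FNR": "$0; next } }' \
    <(tr -s "[:blank:]" "\n" < file | sort -n | uniq -u) file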

The code you originally wrote works under this rule: print the current line if and only if none of its numbers appeared on a previous line. However, you asked about the following lines, not the previous ones.
The simplest solution for you would be to tac the input, apply your filter, and tac it back:
$ tac <file> | awk '{for(i=1;i<=NF;i++){if($i in a) next;a[$i]}}1' | tac
If you want to use a single awk, then you have to store the lines first and process them afterwards. It is also a bit more cumbersome because:
you need to count how many times a number appears
and you need a double-pass algorithm (either by reading the file twice or by storing it in memory)
This gives you the following:
double pass:
$ awk '(NR==FNR){ for(i=1;i<=NF;i++) a[$i]++; next }
       { b=0; for(i=1;i<=NF;i++) b+=--a[$i] }
       !b' <file> <file>
The first pass counts every number; the second pass decrements each count as the occurrence is seen again, so b==0 exactly when no number on the current line occurs on any later line.
memory:
$ awk '{ m[NR]=$0; for(i=1;i<=NF;i++) a[$i]++ }
       END { for (n=1;n<=NR;++n) {
                 b=0; $0=m[n]
                 for(i=1;i<=NF;i++) b+=--a[$i]
                 if (!b) print
             }
       }' <file>
Both versions output the expected:
6
8
9
12
20
35
If you want to remove all lines which contain a number that appears more than once in the file, then you simply do:
$ awk '(NR==FNR) { for(i=1;i<=NF;i++) a[$i]++; next }
{ b=1; for(i=1;i<=NF;i++) b = b && (a[$i]==1) }
b' <file> <file>
Note: from your input you might also be suffering from the Windows \r problem, so you might want to do gsub(/\r/,"") before everything.
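For example, prepending the scrub to the double-pass version above (a sketch):
$ awk '{ gsub(/\r/,"") }
       (NR==FNR){ for(i=1;i<=NF;i++) a[$i]++; next }
       { b=0; for(i=1;i<=NF;i++) b+=--a[$i] }
       !b' <file> <file>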

Another one in awk, processing the data in a single pass:
awk '
{
    for(i=1;i<=NF;i++) {            # process the numbers in the record
        r[$i]=r[$i] NR OFS          # append NR to the record list for this number
        if(split(r[$i],t)>1) {      # if the number occurs in more than one record
            for(j in t)             # delete the records on the list
                delete a[t[j]]
            f=1                     # flag up
        }
    }
    if(f=="")                       # if flag is down
        a[NR]=$0                    # store current record
    f=""                            # reset flag
}
END {
    for(i=1;i<=NR;i++)              # in the end
        if(i in a)                  # print all stored records
            print a[i]
}' file
35
A revised version:
awk '
{
    for(i=1;i<=NF;i++) {
        if($i in r) {               # just store the NR of the last duplicate
            delete a[r[$i]]
            f=1
        }
        r[$i]=NR
    }
    if(f=="")
        a[NR]=$0
    f=""
}
END {
    for(i=1;i<=NR;i++)
        if(i in a)
            print a[i]
}' file
35
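As a quick check, with the revised program saved as unique.awk (a hypothetical file name) and run against the earlier sample that starts with 100 200 300, only the lines whose numbers never repeat survive:
$ awk -f unique.awk file   # unique.awk: the program above, saved to a file
100 200 300
35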

You could try something like this:
cat yourFile | tr ' ' '\n' | sort | uniq -u
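Same idea without the cat, sorted numerically (a minor variant):
tr ' ' '\n' < yourFile | sort -n | uniq -u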

How to swap the last two column pairs with awk?

I am trying this:
awk '{B=$(NF-1);A=$NF; $NF=$(NF-2); $(NF-1) = $(NF-3); $(NF-2)=A; $(NF-3) = B; print;}' input_text.txt
but I get the error:
awk: cmd. line:1: (FILENAME=cazzo.txt FNR=2) fatal: attempt to access field -1
Sample input:
$ cat input_text.txt
1 7 9 11 0 5 2
The same happens if I replace the spaces with tabs in the input_text.txt file.
Expected output:
1 7 9 5 2 11 0
I am running with Cygwin on Windows 10.
You can try this awk for swapping the values; the NF > 3 guard skips records with fewer than four fields (on an empty line NF is 0, so $(NF-1) refers to field -1, which is likely the fatal error you saw at line 2):
awk 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
1 7 9 5 2 11 0
If there are DOS line breaks then use:
awk -v RS='\r?\n' 'NF > 3 {a=$NF; b=$(NF-1); $NF=$(NF-2); $(NF-1)=$(NF-3); $(NF-3)=b; $(NF-2)=a} 1' file
If you have GNU awk then you can use this regex-based approach:
awk -v RS='\r?\n' 'NF > 3 {
$0 = gensub(/(\S+\s+\S+)(\s+)(\S+\s+\S+)$/, "\\3\\2\\1", "1")} 1' file
1 7 9 5 2 11 0
To swap the last n fields with the n fields before them:
$ awk -v n=2 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 7 9 5 2 11 0
$ awk -v n=3 'NF>=(2*n){ for (i=NF-(n-1); i<=NF; i++) {t=$i; $i=$(i-n); $(i-n)=t} } 1' file
1 0 5 2 7 9 11
With your shown samples, please try the following code. This is generic code, with two awk variables named fromFields and toFields. You give their values like this: say you want to swap the 4th field's value with the 6th field and the 5th field's value with the 7th; then set fromFields="4,5" and toFields="6,7". I am assuming the given values are feasible with respect to the Input_file.
awk -v fromFields="4,5" -v toFields="6,7" '
BEGIN{
  num1=split(fromFields,arr1,",")
  num2=split(toFields,arr2,",")
}
{
  for(i=1;i<=num1;i++){
    tmp=$arr1[i]
    $arr1[i]=$arr2[i]
    $arr2[i]=tmp
  }
}
1
' Input_file
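For example, against the sample input_text.txt from this question, swapping fields 4,5 with 6,7 reproduces the expected output (assuming the program above is saved as swap.awk, a hypothetical name):
$ awk -v fromFields="4,5" -v toFields="6,7" -f swap.awk input_text.txt   # swap.awk: the code above
1 7 9 5 2 11 0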

Bash/AWK conditionals using two files

First of all, thank you for your help. I have a problem trying to use bash conditionals using two files. I have the file letters.txt
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
And I want to use conditionals so that if a letter in letters.txt is also in number.txt, it generates the files a.txt, b.txt, and c.txt, which will look like this:
a.txt
A 12
A 15
b.txt
B 10
B 14
c.txt
C 18
I know I can do it using this code:
cat number.txt | awk '{if($1=="A")print $0}' > a.txt
But I want to do it using two files.
The efficient way to approach this type of problem is to sort the input on the key field(s) first, so you don't need to have multiple output files open simultaneously (which has limits and/or can slow processing down) or to open and close output files with every line read (which is always slow).
Using GNU sort's -s (stable sort) to retain the input order of the non-key fields, and having only one output file open at a time, kept open for the whole time it's being populated:
$ sort -k1,1 -s number.txt |
awk '
NR==FNR { lets[$1]; next }
!($1 in lets) { next }
$1 != prev { close(out); out=tolower($1) ".txt"; prev=$1 }
{ print > out }
' letters.txt -
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18
If you don't have GNU sort with -s to retain the input order of the lines for each key field, you can replace it with an awk | sort | cut decoration: awk prepends the line number, sort uses it as a numeric tie-breaker, and cut removes it again. Both give the same result:
$ sort -k1,1 -s number.txt
A 15
A 12
B 10
B 14
C 18
D 20
E 23
$ awk '{print NR, $0}' number.txt | sort -k2,2 -k1,1n | cut -d' ' -f2-
A 15
A 12
B 10
B 14
C 18
D 20
E 23
Note the change in the order of the 2nd fields for A compared to the input order when the above isn't done: by default, sort doesn't guarantee to retain the relative line order for each key it sorts on:
$ sort -k1,1 number.txt
A 12
A 15
B 10
B 14
C 18
D 20
E 23
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$0]
  next
}
($1 in arr){
  outputFile=(tolower($1)".txt")
  print >> (outputFile)
  close(outputFile)
}
' letters.txt number.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when letters.txt is being read.
arr[$0] ##Creating array arr with index of current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if 1st field is present in arr.
outputFile=(tolower($1)".txt") ##Creating outputFile to print output.
print >> (outputFile) ##Printing current line into output file.
close(outputFile) ##Closing output file in backend.
}
' letters.txt number.txt ##Mentioning Input_file names here.
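With the sample number.txt from the question, this produces, for example:
$ cat a.txt
A 15
A 12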

awk to add one to column based on condition

Trying to add an awk condition that will add one to $2 if $3 - $2 is greater than one. The below has a syntax error on the if line. Thank you :).
file
2 21002880 21002881
17 3476163 3476186
11 108292759 108292760
2 218661210 218661265
2 21002865 21002866
desired
2 21002880 21002881
17 3476164 3476186
11 108292759 108292760
2 218661211 218661265
2 21002865 21002866
awk
awk 'BEGIN {FS=OFS="\t"} {sum+=$3-$2} # define FS, OFS, and sum
if((sum > 1)) { # condition check
print $1,$2+1,$3 # print desired output
next # goto next line
}
}1' file.txt
Another short one:
awk -v OFS="\t" '{$2+=($3-$2>1)}1' file
Output:
2 21002880 21002881
17 3476164 3476186
11 108292759 108292760
2 218661211 218661265
2 21002865 21002866
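The trick here is that a comparison in awk evaluates to 1 (true) or 0 (false), so the condition's value is added directly to $2. For example:
$ echo '2 5' | awk '{ print ($2-$1 > 1) }'
1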
I missed a { and needed sum=$3-$2:
awk 'BEGIN {FS=OFS="\t"} {sum=$3-$2} # define FS, OFS, and sum
{
  if (sum > 1) {       # condition check
    print $1,$2+1,$3   # print desired output
    next               # go to the next line
  }
}1' file.txt

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields of lines 6 and 7. This can be done using NR. The tricky part is the following: every second field should go into the second column, with an E added before the exponent sign, so that the output file will look like this:
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output you can see that I want to keep only the first 10 characters of $6 (length($6)=10).
How is it possible to do it in awk?
You can do it all in awk, but it's perhaps easier with the Unix toolset: sed picks lines 6 and 7, cut keeps characters 2 through 66, tr puts each field on its own line, and pr -2ats' ' pastes them back together as two columns.
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed or embed something similar in the awk script (using gsub). The simple substitution below works here because lines 6 and 7 contain no negative mantissas; a leading minus sign would also get an E prefix.
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks, there are various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
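For instance, stripping the blanks inside the same script might look like this (a sketch; it gives up the alignment just described):
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
    for (i=1;i<NF;i+=4) {
        gsub(/ /,"",$i); gsub(/ /,"",$(i+2))   # drop the leading blanks in both mantissas
        print $i "E" $(i+1), $(i+2) "E" $(i+3)
    }
}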
If you don't have GNU awk, get it, as you're missing a LOT of extremely useful functionality.
I tried combining @karakfa's answer with substr, and the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B records; I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Does anyone have an idea of how to do this with awk?
Thanks
#!/bin/bash
# csplit reports the byte count of each piece, not the line count,
# so take the line counts with wc -l instead (-s silences csplit).
# This assumes file2 has all the A lines before the B lines.
csplit -s file2 '/^B /'
acount=$(wc -l < xx00)
bcount=$(wc -l < xx01)
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
    arr[$1]++          # how many A lines and B lines to pick, from file1
    next
}
! setup {
    setup = 1
    srand()
    while (arandcount < arr["A"]) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    while (brandcount < arr["B"]) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file at the first line matching the regex; since csplit reports byte counts rather than line counts, the line count of each piece is taken with wc -l, and those counts are passed into the AWK program.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks, for each type, as many distinct random line numbers as file1 has lines of that type, each between 1 and the line count of the corresponding piece. This block only gets executed once because of the flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and prints it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.
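Alternatively, if you have GNU coreutils, shuf can do the random selection directly (a sketch of the same idea):
grep '^A' file2 | shuf -n "$(grep -c '^A' file1)" > file3
grep '^B' file2 | shuf -n "$(grep -c '^B' file1)" >> file3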