Check which string in a certain column is repeated - awk

I want to see which strings in column 2 are repeated.
For example:
a apple
b peach
c grape
d peach
e peach
f apple
My desired output would be:
a apple
f apple
b peach
d peach
e peach
That is, show every whole line whose second-column value appears on more than one line.

If you do not want to store the whole file in memory, the best thing is to read the file twice:
$ awk 'FNR==NR {a[$2]++; next} a[$2]>1' file file
a apple
b peach
d peach
e peach
f apple
The first pass counts how many times each column-2 value appears; the second pass prints the rows whose second-column value was counted at least twice.
As Jonathan Leffler suggests, to reproduce the exact output you expect, just pipe to sort, sorting first by column 2 and then by column 1:
awk 'FNR==NR {a[$2]++; next} a[$2]>1' file file | sort -k2,2 -k1,1
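If you do have to do it in one pass, a sketch that should work in any POSIX awk is to accumulate the whole lines per key in ordinary arrays (group order in the output is unspecified, so you may still want the sort):
awk '{ lines[$2] = ($2 in cnt) ? lines[$2] ORS $0 : $0; cnt[$2]++ }
END { for (k in cnt) if (cnt[k] > 1) print lines[k] }' file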

A perl solution that doesn't read the file twice:
perl -lane 'push @{$s{$F[1]}},$_;
END{
do{print join "\n", @{$s{$_}} if scalar(@{$s{$_}})>1}for(keys %s)
}' file
This goes through the file and keeps each line in a hash whose key is the second field and whose values are lists of lines. Then, at the end, it prints the lists whose key was seen more than once.

With GNU awk for true 2D arrays:
gawk '
{ vals[$2][++cnt[$2]] = $0 }
END {
for (fruit in vals)
if (cnt[fruit] > 1)
for (i=1; i<=cnt[fruit]; i++)
print vals[fruit][i]
}
' file
a apple
f apple
b peach
d peach
e peach

Related

Awk - store line that matched range pattern start

I use awk to operate on lines within a range, but I need to use the line that matched the range pattern start in my action.
Now I am doing this:
awk '/BANANA/,/END/ {if ($0 ~ /BANANA/) line=$0; print line, $2}' infile.txt
Is there a more elegant way of doing this? A way that does not require me to store $0 at the beginning of the range? Does awk keep this line somewhere?
Thanks and best regards
EDIT (added samples):
infile.txt
few
r t y u i
few
BANANA
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
END
r t y u i
ewqf
few
r t y u i
few
r t y u i
f
expected output
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA
Never use range expressions: they make trivial tasks very slightly briefer, but then need a complete rewrite as soon as you need to do anything the slightest bit more interesting. Always use a flag instead. Instead of:
awk '/BANANA/,/END/ { do something }' infile.txt
you should write:
awk '/BANANA/{f=1} f{ do something } /END/{f=0} ' infile.txt
and then to enhance that to do what you want now is simply:
awk '/BANANA/{f=1; line=$0} f{ print line, $2 } /END/{f=0} ' infile.txt
and any other changes (e.g. skip first line, skip last line, etc.) are equally trivial.
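For instance, a sketch of one such variation that also skips the BANANA and END marker lines themselves:
awk '/BANANA/{f=1; line=$0; next} /END/{f=0; next} f{ print line, $2 }' infile.txt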
The only "trick" I can suggest in your case is "assignment in condition":
awk '/BANANA/ && (r=$0),/END/{ print r, $2 }' infile.txt
(r=$0) assigns the current record (i.e. the BANANA line) to the variable r just once, when the range starts, thereby avoiding the check if ($0 ~ /BANANA/) on every record within the range.
The output:
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA

awk remove mirrored duplicates from 2 columns

Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to be able to get the result of 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components but in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
Another bit of awk magic:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single-char fields. If you want to use it for longer strings, add FS between fields, like a[$1 FS $2] etc. (thanks @EdMorton).
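For example, a sketch of the same approach made safe for multi-character fields (same sample input):
$ awk '(($1 FS $2) in a){next}{a[$1 FS $2];a[$2 FS $1]}1' file
A B
C D
E F
C F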

Extract lines in File A based on two columns in File B

I have two files with the same number of columns (tab-delimited) that look like this:
File A:
12345 Fish Apple 7123
321 Chicken Apple 9912
661 Ant Apple 316
File B:
321 Duck Orange 9912
12345 Bird Orange 7123
661 Eagle Orange 34
Expected Output:
File A_edited:
661 Ant Apple 316
Based on the ID from column 1 and column 4 in File B, if both values appear in column 1 and column 4 of a line in File A, I want to remove that line from File A. I tried using grep to do this, but the two lists are very long, around 66GB each, so it's still running after a day. Is there any other, faster way besides grep that I can do it?
p/s: the number of columns is actually more than 4, shown here only four for simplicity.
awk '{print $1 "\t" $4}' B.txt >> B_edited.txt
# Extract the line number in A.txt containing lines where two IDs are present in B_edited.txt
cat B_edited.txt | while read ID1 ID2
do
grep -nE "$ID1.*$ID2" A.txt | cut -d: -f1 >> LineNumber.txt
done
# Remove duplicates of line numbers
sort -u LineNumber.txt >> LineNumberUnique.txt
# Output only lines from A.txt where line numbers are not in the list
awk 'FNR == NR { h[$1]; next } !(FNR in h)' LineNumberUnique.txt A.txt >> A_edited.txt
I would greatly appreciate any help!
Thanks,
Jen
$ awk '{k=$1FS$4} NR==FNR{keys[k];next} !(k in keys)' fileB fileA
661 Ant Apple 316
To overwrite fileA with the output, just add > tmp && mv tmp fileA or use -i inplace if you have GNU awk 4.*.
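Spelled out, the first of those would look like:
awk '{k=$1FS$4} NR==FNR{keys[k];next} !(k in keys)' fileB fileA > tmp && mv tmp fileA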

Addition of particular numbers in a file using awk or grep

I am looking for something like this:
FILE NAME : fruites.txt
Apple a day keeps doctor away
but people dont like it............... 23 peoples found.

Banana_A.1 keeps u fit
and its very tasty.................... 12 peoples found.

Banana_B.2 juices is very good to taste
and most people like them
as well as consumed the most.......... 15 peoples found.

Anar is difficult to eat
as well as its very costly............ 35 peoples found.

grapes are easy to eat
and people like it the most........... 10 peoples found.

fruites are very healthy and improves vitamins.

Apple : The apple tree is a deciduous tree in the rose family best known for its sweet, pomaceous
fruit, the apple.

Banana_A.1: A banana is an edible fruit, botanically a berry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.

Banana_B.2: A banana is an fruit, botanically a kerry, produced by several kinds of large
herbaceous flowering plants in the genus Musa.

Anar : The pomegranate, botanical name Punica granatum, is a fruit-bearing deciduous shrub or
small tree growing between 5 and 8 m tall.
I want the sum of all the "peoples found" counts except the bananas:
ANS : 68 ( 23+35+10 )
I am able to find the count separately, but unable to subtract them
I tried like this
grep -E ".found" fruites.txt | awk ' { sum+=$3 } END {print sum }'
ANS : 95 (68+27)
grep -E "Banana|.found" fruites.txt | grep -A1 "Banana" | grep -E ".found" | awk ' { sum+=$3 } END {print sum }'
ANS : 27 (only bananas)
Can anyone please help
awk '$1 != "Banana" {s+=$(NF-2)} END { print s}' RS= fruites.txt
The key here is the RS= assignment which makes awk treat each section of text delimited by blank lines as a separate record. Note that you may prefer to write RS="" fruites.txt for clarity, but that is not necessary. Be sure not to omit the space after the =, though, as the key is to have a blank string as the value of RS.
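To see what paragraph mode is doing with the sample file (assuming the stanzas are separated by blank lines as shown above), a quick check that prints each record number and its first field:
$ awk '{print NR, $1}' RS= fruites.txt
1 Apple
2 Banana_A.1
3 Banana_B.2
4 Anar
5 grapes
6 fruites
7 Apple
8 Banana_A.1:
9 Banana_B.2:
10 Anar
Note that some description records carry a trailing colon in $1, which is why the edited version below matches with match($1,"Banana") rather than testing equality.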
-- Edit --
Given the comments and the modified question, perhaps you want:
awk '! match($1,"Banana") && match($NF, "found") {
s += $(NF-2)} END { print s }' RS= fruites.txt
You could use the below awk command.
$ awk -v RS="\n\n" '!/Banana/ && /peoples found\.$/{s+=$(NF-2)} END { print s}' file
68
The above command sets a blank line (\n\n) as the record separator and checks each record for the absence of the string Banana and the presence of peoples found. at the end. Only when both conditions are satisfied is the third field from the end added to the running sum; s+=$(NF-2) is the same as s = s + $(NF-2). Printing the value of s at the end gives you the total. Note that a multi-character RS like "\n\n" requires GNU awk; RS="" (paragraph mode, as in the previous answer) is the portable spelling, and it also copes with runs of more than one blank line.

In AWK, is it possible to specify "ranges" of fields?

Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
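Wrapped into a complete command (a sketch, with s and e passed in via -v):
awk -v s=32 -v e=57 'BEGIN{OFS="\t"} { for (i=s; i<=e; i++) printf("%s%s", $i, i<e ? OFS : "\n") }' foo > bar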
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
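For example, a rough sketch of the match()/substr() route for the tab-separated, fields 32-57 case (this assumes your awk supports {n} interval expressions, which very old awks do not):
awk '{
    match($0, /^([^\t]*\t){31}/)           # skip past the first 31 fields
    rest = substr($0, RLENGTH+1)
    match(rest, /^([^\t]*\t){25}[^\t]*/)   # grab the next 26 fields
    print substr(rest, 1, RLENGTH)
}' foo > bar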
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields and will only work if your FS is a single character.
I'm late, but this is quick and to the point so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{gsub(/^(\w+\t){31}/,"");print}'
An example of removing 4 fields, because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^(\w+\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops.
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start=$start_field -v end=$end_field 'BEGIN{OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s" $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}'
This may look a bit hacky, but:
it properly delimits your output based on the specified OFS, and
it makes sure to print a newline at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought that you can always find a character that is not in the input: use \n.
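A sketch of that idea (assuming records contain no embedded newlines, which is true for normal awk input):
BEGIN { s = 32; e = 57; c = "\n"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a newline in front of the s field.
sub(".*"c, "") # Drop everything up to and including the newline.
print # Print the edited line.
}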
Unfortunately I don't seem to have access to my account anymore, and I don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and cuts off everything before the first comma by showing everything from the second field on, thus resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l,  s,i) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
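A hypothetical end-to-end use, with the function pasted into the program to print fields 32-57:
awk '
function subby(f,l,  s,i) {
    s = $f
    for (i = f+1; i <= l; i++)
        s = sprintf("%s %s", s, $i)
    return s
}
{ print subby(32,57) }
' foo > bar
Note the fields come out space-separated regardless of OFS.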
(I know OP requested "in AWK" but ... )
Using bash expansion on the command line to generate the argument list:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
Explanation:
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
This is placed on a single line using semicolons, inside $() so that it evaluates/expands in place.