Finding common elements among all lines of a text file - awk

I have a text file like:
a b c d e
b c e
d f g e h c
I am looking for a simple awk script that can output the common elements among all lines, ignoring their first element. The desired output is:
c e
or
e c

$ cat tst.awk
FNR==1 { for (i=1; i<=NF; i++) common[$i]; next }
{
    for (c in common) {
        present = 0
        for (i=1; i<=NF; i++) {
            if ($i == c) {
                present = 1
            }
        }
        if (!present) {
            delete common[c]
        }
    }
}
END {
    i=0
    for (c in common) {
        printf "%s%s", (++i>1?OFS:""), c
    }
    print ""
}
$ awk -f tst.awk file
c e
If you really want to skip the first field on each line, just change the two for (i=1; i<=NF; i++) loops to start at 2 instead of 1.
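For reference, here is the same script with that change applied (a sketch; only the two loop initialisations differ, so field 1 is ignored on every line):

FNR==1 { for (i=2; i<=NF; i++) common[$i]; next }
{
    for (c in common) {
        present = 0
        for (i=2; i<=NF; i++) {
            if ($i == c) {
                present = 1
            }
        }
        if (!present) {
            delete common[c]
        }
    }
}
END {
    i=0
    for (c in common) {
        printf "%s%s", (++i>1?OFS:""), c
    }
    print ""
}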
Although the above was accepted, I actually prefer @jaypal's approach (but not his choice of tool :-) ), so here's the awk equivalent:
$ cat tst.awk
{ delete seen; for (i=1; i<=NF; i++) if (!seen[$i]++) count[$i]++ }
END {
    i=0
    for (c in count)
        if (count[c] == NR)
            printf "%s%s", (++i>1?OFS:""), c
    print ""
}
$
$ awk -f tst.awk file
c e
If your awk doesn't support delete seen, change it to split("",seen).
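With that change, the first line of the script would read:

{ split("",seen); for (i=1; i<=NF; i++) if (!seen[$i]++) count[$i]++ }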

perl to the rescue:
perl -lane '
my %seen;
map { $total{$F[$_]}++ unless $seen{$F[$_]}++ } 1 .. $#F;
}{
print join " ", grep { $total{$_} == $. } keys %total
' file
e c
Keep a rolling %total hash which is incremented for each element at most once per line. %seen is a hash that keeps track of the elements already counted on the current line, hence the my declaration to reset it for every line. The range 1 .. $#F skips $F[0], i.e. the first element on each line.
In the END block (created here by the }{ trick) we just grep those elements whose count equals the total number of lines ($.), as that means they were seen on every line.
The command line options are:
-l: Chomps the newline and places it back during print.
-a: Splits the line on whitespace and loads the array @F with those values.
-n: Creates a while(<>) { .. } loop to process every line.
-e: Executes the code block that follows in quotes.

Another perl approach:
perl -lane '
    if ($. == 1) { %intersect = map {$_ => 1} @F; next }
    %intersect = map {$_ => 1} grep {$intersect{$_}} @F;
    END {print join " ", keys %intersect}
' file
Results will not be in any particular order.

Related

Sum up from line "A" to line "B" from a big file using awk

aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input (a sample from a CSV file with millions of lines).
I want to sum up duration based on date.
My concern is that I want to sum up only the first 1000000 lines.
the awk program i'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
I run my script as
$ awk -f test.awk 'file'
but the output doesn't respect my condition NR>1 && NR<=1000000.
Any suggestions, please?
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
    sub(/ .*/, "", $3)
    key = sprintf("%10s", $3)
    duration[key] += $5
}
END {
    printf "%-10s %16s\n", "dAccused", "Duration"
    for (i in duration) {
        printf "%-4s %16.2f\n", i, duration[i]
    }
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1.
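As a minimal illustration of that rule, these two commands produce identical output, because { print $0 } is the implicit action for a bare condition:

awk 'NR>1 && NR<=1000000' file
awk 'NR>1 && NR<=1000000 { print $0 }' file

and the classic awk 1 file is just a condition that is always true, so it prints every line.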
You didn't post any expected output and your duration field is always NULL, so it's still not clear what you really want as output, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
    sub(/ .*/,"",$(f["startDate"]))
    sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.
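For example, with the header line from the sample input and FS = "|", the f[] array built from the first record ends up holding:

f["aNumber"] = 1
f["bNumber"] = 2
f["startDate"] = 3
f["timeZone"] = 4
f["duration"] = 5
f["currencyType"] = 6
f["cost"] = 7

so $(f["duration"]) refers to the 5th field of the current record.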

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
The following attempt can separate the duplicates without any problem. However, the first occurrence of each duplicate will still be included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b" }
{
    # Keep count of the fields in fourth column
    count[$4]++;
    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1;
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;
    if (count[$4] == 1)   # if (count[$4] - count[$4] == 0) <= changing to this doesn't work
        print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt:
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
    NR==FNR {count[$2] = $1; next}
    FNR==1 {FS=","; next}
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
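On the sample input above (assuming the file is named input, as in the command), the process substitution feeds awk something like this, where the count becomes $1 and the SMILES string becomes $2 in the NR==FNR block (order may vary with locale):

      1 SMILES
      3 c1ccccc1
      1 c1ccccc1N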
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2-pass solution that's virtually identical to my example above:
awk -F, '
    NR==FNR {count[$NF]++; next}
    FNR==1 {next}
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }
    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
    function fout(    f, i) {
        f = (cnt > 1) ? "dups" : "nondups"
        for (i = 1; i <= cnt; ++i)
            print lines[i] > f
    }
    NR > 1 && $4 != lastkey { fout(); cnt = 0 }
    { lastkey = $4; lines[++cnt] = $0 }
    END { fout() }
' file
A little late; my version in awk:
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another one, built similarly to glenn jackman's but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file

Extract specific data from a file with awk

I have a text file as shown below. I would like to extract the PDB IDs and their corresponding chains. How is this possible with awk?
>4HSU:A|PDBID|CHAIN|SEQUENCE
PLGSRKCEKAGCTATCPVCFASASERCAKNGY
PKAFMADQQL
>4HSU:B|PDBID|CHAIN|SEQUENCE
PLGSPEFSERGSKSPLKRAQETE
>4HSU:C|PDBID|CHAIN|SEQUENCE
ARTMQTARKSTGGKAPRKQLATKAARKSAP
>4HT3:A|PDBID|CHAIN|SEQUENCE
MERYENLFAQLNDRREGAF
>4HT3:B|PDBID|CHAIN|SEQUENCE
MTTLLNPYFGEFGGMYVPQ
>4I0W:A|PDBID|CHAIN|SEQUENCE
MENKAKVGIDFINTIPKQILTSLIEQYSPNNGEIELVVLYGDNFLRFKNSVDVIGAKVEDLGYGFGILII
>4I0W:B|PDBID|CHAIN|SEQUENCE
AYDSNRASCIPSVWNNYNLTGEGILVGFLDT
>4I0W:D|PDBID|CHAIN|SEQUENCE
AYDSNRASCIPSVWNNYNLTGEGILVGFLLPLGDTITSGGWRIIVRKLNNYEGYFDIWLPIAEGLN
ERTRFLQPSVYNTLGIPATVEGVIS
Desired output:
4HSU A B C
4HT3 A B
4I0W A B D
kent$ awk -F'[>:|]' '/^>/{a[$2]=a[$2] OFS $3}END{for(x in a)print x,a[x]}' file
4I0W A B D
4HSU A B C
4HT3 A B
I am satisfied with my FS value: >:| looks like a cute face!
It looks as though you want the output of everything in the original order, so it takes some indirection to take care of this. Everything below works in POSIX AWK as requested (or at least gawk with LINT = 1) and has the additional feature of keeping track of what has been seen, to eliminate duplicates.
#! /usr/bin/awk -f
BEGIN {
    FS = "[>:|]"
    split("", t)     # table of output
    split("", r)     # row number in table for an ID
    split("", seen)  # keeps track of duplicates
    row = 0
}
/^>/ && !($2 SUBSEP $3 in seen) {
    if ($2 in r) {
        i = r[$2]
        t[i] = t[i] OFS $3
    } else {
        r[$2] = row
        t[row++] = $2 OFS $3
    }
    seen[$2, $3] = 1
}
END {
    for (i=0; i<row; i++)
        print t[i]
}

awk | Rearrange fields of CSV file on the basis of column value

I need your help in writing an awk script for the problem below. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not in any particular order in the source file
In the output file, fields are organised in a specific order, for example: all a values are in the 2nd column, then b, then c
The value c appears multiple times in the second line, so in the output its values are merged with a PIPE symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val)   # or delete(val) in gawk
    for (i=1; i<NF; i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1; i in order; i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
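For example, to have the e values appear at the end of each output line, that BEGIN line would become:

BEGIN{ FS="[,:]"; split("session,a,b,c,e",order) }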
Try with:
awk '
    BEGIN {
        FS = "[,:]"
        OFS = ","
    }
    {
        for ( i = 1; i <= NF; i += 2 ) {
            if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
            hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
        }
        asorti( hash, hash_orig )
        for ( i = 1; i <= length(hash); i++ ) {
            printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
        }
        printf "\n"
        delete hash
        delete hash_orig
    }
' infile
It splits each line on any comma or colon and traverses all odd fields, saving each field name and its values in a hash that is printed at the end of every record. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9

Awk merge the results of processing two files into a single file

I use awk to extract and calculate information from two different files and I want to merge the results into a single file in columns (for example, the output of the first file in columns 1 and 2 and the output of the second one in columns 3 and 4).
The input files contain:
file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
To parse the first file I do this:
awk '
    {
        s = NF
        center = $1
    }
    {
        printf "%s\t %d\n", center, s
    }
' file1
To parse the second file I do this:
awk '
    /^>/ {
        if (count != "")
            printf "%s\t %d\n", seq_id, count
        count = 0
        seq_id = $0
        next
    }
    NF {
        long = length($0)
        count = count + long
    }
    END {
        if (count != "")
            printf "%s\t %d\n", seq_id, count
    }
' file2
My provisional solution is to create a temporary file in the first step and overwrite it in the second. Is there a more "elegant" way to get this output?
I am not fully clear on the requirement; if you can update the question, maybe we can improve the answer. However, what I have gathered is that you would like to summarize the output from both files. I have made the assumption that the content in both files is in the same sequential order. If that is not the case, we will have to add additional checks while printing the summary.
Content of script.awk (re-using most of your existing code):
NR==FNR {
    s[NR] = NF
    center[NR] = $1
    next
}
/^>/ {
    seq_id[++y] = $0
    ++i
    next
}
NF {
    long[i] += length($0)
}
END {
    for (x=1; x<=length(s); x++) {
        printf "%s\t %d\t %d\n", center[x], s[x], long[x]
    }
}
Test:
$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 4 200
ST695_116193610:4:2206:10596:165949 3 0