How do I count the number of occurrences of a string in a column in each file, and output the filename and count with awk

I have these 2 files:
>cat file.csv
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
> cat fild.csv
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
How do I get this output? (Basically, count the number of occurrences of a string (e.g. "col1") in a column in each file.)
file.csv,5
fild.csv,5
Below are my attempts, for reference:
Output column/field1
> awk -F, '$1 =="col1" {print $1}' file.csv
col1
col1
col1
col1
col1
Output filename plus column/field1; how do I add a comma as a separator?
> awk -F, '$1 =="col1" {print FILENAME $1}' file.csv
file.csvcol1
file.csvcol1
file.csvcol1
file.csvcol1
file.csvcol1
Output I'd like:
file.csv,5
Attempt working on 2 files:
> awk -F, '$1 =="col1" {print FILENAME $1}' fil*.csv
fild.csvcol1
fild.csvcol1
fild.csvcol1
fild.csvcol1
fild.csvcol1
file.csvcol1
file.csvcol1
file.csvcol1
file.csvcol1
file.csvcol1
But the output I'd like is this:
file.csv,5
fild.csv,5
Answer

If you're using GNU awk, another potential solution is to use the ENDFILE special pattern, e.g. using @markp-fuso's example data:
cat filb.csv # empty
cat filc.csv
col1
cat fild.csv
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
cat file.csv
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
awk 'BEGIN{FS=OFS=","} $1 == "col1" {cnt++} ENDFILE{print FILENAME, (cnt>0&&cnt?cnt:"0"); cnt=0}' fil*.csv
filb.csv,0
filc.csv,1
fild.csv,5
file.csv,5
# the 'cnt>0&&cnt?cnt:"0"' is there to handle empty files:
# basically, if any lines matched, print cnt; otherwise
# print "0"
Edit
As commented by @EdMorton, cnt+0 can be used instead of cnt>0&&cnt?cnt:"0" to handle empty files (much easier to remember!), e.g.
awk 'BEGIN{FS=OFS=","} $1 == "col1" {cnt++} ENDFILE{print FILENAME, cnt+0; cnt=0}' fil*.csv
filb.csv,0
filc.csv,1
fild.csv,5
file.csv,5
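ENDFILE is a GNU awk extension. As a hedged, portable sketch for a POSIX awk, the same idea can be approximated by flushing the previous file's count whenever FNR resets to 1; note this variant silently skips completely empty files, since awk never reads a record from them (the file names f1.csv/f2.csv below are made up for illustration):

```shell
# Create two tiny sample files (invented data)
printf 'col1,a\ncol1,b\n' > f1.csv
printf 'col2,a\ncol1,b\n' > f2.csv

# FNR==1 && NR>1 fires on the first record of every file after the first,
# at which point prev still names the previous file, so we print its count.
awk 'BEGIN{FS=OFS=","}
     FNR==1 && NR>1 {print prev, cnt+0; cnt=0}
     {prev=FILENAME}
     $1=="col1" {cnt++}
     END{if (prev != "") print prev, cnt+0}' f1.csv f2.csv
```

This prints f1.csv,2 followed by f2.csv,1.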

Adding a couple more files to the mix:
$ cat filb.csv # empty
$ cat filc.csv
col2
One awk approach:
awk -v str='col1' '               # pass in the string to search for
BEGIN { FS=OFS=","
        for (i=1; i<ARGC; i++)
            count[ARGV[i]]=0      # initialize a counter for all files; addresses the case where a file has no matches or is empty (ie, count==0)
}
{       for (i=1; i<=NF; i++)     # loop through the fields looking for a match and ...
            if ($i == str)        # if found then ...
                count[FILENAME]++ # increment our counter
}
END {   for (fname in count)
            print fname, count[fname]
}
' fil?.csv
This generates:
file.csv,5
filb.csv,0
fild.csv,5
filc.csv,0
NOTES:
- $i==str assumes we're looking for an exact match on the field's value (as opposed to a substring of it)
- assumes we need to match str in any field/column of the file; otherwise we'd need an additional input variable to designate which column(s) to search
- output ordering is not guaranteed; OP can pipe the results to sort, or add some code to have awk sort the output before printing to stdout
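A sketch of the sort option from the last note: since for (fname in count) iterates in an undefined order, the filename,count pairs can simply be piped through sort keyed on the first comma-separated field (the sample lines below mirror the output above):

```shell
# sort -t, sets the field separator; -k1,1 restricts the key to field 1
printf 'file.csv,5\nfilb.csv,0\nfild.csv,5\nfilc.csv,0\n' | sort -t, -k1,1
```

This prints the pairs in filb, filc, fild, file order.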
An alternative grep|tr idea (note that grep's -c counts matching lines rather than individual occurrences, which gives the same result on this data):
$ grep -oc 'col1' fil?.csv | tr ':' ','
filb.csv,0
filc.csv,0
fild.csv,5
file.csv,5

Related

Retrieve all rows from 2 columns matching from 2 different files

I need to retrieve all rows from a file starting from some column matching from another file.
My first file is:
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,AEV67733.1
2BTO,NP_006073.2,XP_037953971.1
2BTO,XP_037953971.1,XP_037953971.1
The second one is:
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,WP_002465021.1,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
I want to retrieve all rows from the second file, where its file2_col1=file1_col3 and file2_col2=file1_col1
I tried this, but it doesn't print everything:
awk -F"," 'FILENAME=="file1"{A[$3$2]=$3$2}
FILENAME=="file2"{if(A[$1$2]){print $0}}' file1 file2 > test
You may use this two-pass awk solution:
awk -F, 'FNR == NR {seen[$3,$1]; next} FNR == 1 || ($1,$2) in seen' file1 file2
col1,col2,col3,col4,col5
BAA13425.1,2BTO,32.056,287,185
BAA13425.1,2BTO,12.410,641,123
Where input files are:
cat file1
col1,col2,col3
1TF4,WP_110462952.1,AEV67733.1
1TF4,EGD45884.1,BAA13425.1
2BTO,NP_006073.2,BAA13425.1
2BTO,XP_037953971.1,BAA13425.1
cat file2
col1,col2,col3,col4,col5
BAA13425.1,SDD02770.1,38.176,296,175
BAA13425.1,2BTO,32.056,287,185
BBE42932.1,AEG17356.1,40.909,110,64
BBE42932.1,WP_048124638.1,40.367,109,64
BAA13425.1,2BTO,12.410,641,123
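For reference, the ($1,$2) in seen test above works because awk joins the parts of a multi-part subscript with SUBSEP into a single string key; a minimal sketch:

```shell
awk 'BEGIN {
    seen["x","y"]                          # stored under the key "x" SUBSEP "y"
    if (("x","y") in seen)        print "found via (a,b)"
    if (("x" SUBSEP "y") in seen) print "found via SUBSEP"
}'
```

Both tests succeed, so both lines print.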

how to get the common rows according to the first column in awk

I have two ',' separated files as follow:
file1:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
file2:
A,inf
B,inf
C,0.313559
D,189.5
E,38.6735
I want to compare the 2 files and get the common rows based on the 1st column. So, for the mentioned files, the output would look like this:
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
I am trying to do that in awk and tried this:
awk ' NR == FNR {val[$1]=$2; next} $1 in val {print $1, val[$1], $2}' file1 file2
This code returns these results:
A,inf
B,inf
C,0.135802
D,72.6111
E,42.1613
which is not what I want. Do you know how I can improve it?
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2}' file1 file2
A,inf,inf
B,inf,inf
C,0.135802,0.313559
D,72.6111,189.5
E,42.1613,38.6735
Explained:
$ awk '
BEGIN { FS=OFS="," }   # set separators
NR==FNR {              # first file
    a[$1]=$0           # hash to a, $1 as index
    next               # next record
}
$1 in a {              # second file, if $1 in a
    print a[$1],$2     # print indexed record from a with $2
}' file1 file2
Your awk code basically works; you are just not telling awk to use , as the field delimiter. You can do that by adding BEGIN{FS=OFS=","} at the beginning of the script.
But given that the files are sorted as in the examples in your question, you can simply use the join command:
join -t, file1 file2
This will join the files based on the first column. -t, tells join that columns are separated by commas.
If the files are not sorted, you can sort them on the fly like this:
join -t, <(sort file1) <(sort file2)

awk Compare 2 files, print match and print just 2 columns of the second file

I am a novice and I am sure it is a silly question, but I searched and didn't find an answer.
I want to select just 2 columns of my file2. I know how to select one column ($1) and all columns ($0). But is it possible to show just columns 2, 3, ... from file2 in my file3?
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
or
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $0; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
I just want $1 and $2 from file2; this code doesn't work. I obtain one column with data from $1 and $2:
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1$2; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
Any solution?
awk -v RS='\r\n' '           # call awk and set the row separator
BEGIN {
    FS=OFS=";"               # set input and output field separators
}
# Here you are reading the first argument, that is File2
FNR==NR {
    # Save column2 and column3, separated by OFS (that is ;),
    # from File2 (the first argument) in array a,
    # whose index/key is the second field/column of File2
    a[$2] = $2 OFS $3
    # Stop processing, go to the next line of File2
    next
}
# From here on you are reading the second argument, that is File1
{
    # Global substitution:
    # replace _ with hyphen - in field2/column2
    gsub(/_/,"-",$2)
    # Uppercase field2/column2
    $2 = toupper($2)
    # If field2 of the current file (File1) exists in array a,
    # which was built above from File2, then print the array value,
    # that is field2 and field3 of File2; else print "NA",
    # and then the output field separator and the
    # entire line/record of the current file
    print ($2 in a ? a[$2] : "NA"), $0
}' $File2 $File1 > file3
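Since the question asks specifically for $1 and $2 of File2, a hedged variant of the script above stores those two fields instead; the sample data and file names below are invented, and the RS='\r\n' CRLF handling is dropped on the assumption of plain LF input:

```shell
printf 'id1;KEY-A\nid2;KEY-B\n' > file2      # File2: the $1 and $2 we want to keep
printf 'x;key_a;foo\ny;key_c;bar\n' > file1  # File1: $2 needs _ -> - and uppercasing

awk 'BEGIN {FS=OFS=";"}
     FNR==NR {a[$2] = $1 OFS $2; next}       # store File2 $1 and $2, keyed on $2
     {gsub(/_/,"-",$2); $2=toupper($2)
      print ($2 in a ? a[$2] : "NA"), $0}' file2 file1
```

This prints id1;KEY-A;x;KEY-A;foo for the matched row and NA;y;KEY-C;bar for the unmatched one.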

Delete a line that contain an occurence in the first or second column

I would like to delete a line that contains the occurrence in the first or second column (separator \t). For example:
line 1 uni:1 uni:2 blabla blabla
line 2 uni:3 EBI:1 blbla blabla
I want to delete line 2. The "blabla" text can contain the occurrence (EBI), but I don't want to select by the rest of the text, just by the first two columns.
I tried: awk -F "\t" '{print $1 $2}' file1 | grep -v EBI > file2
but this keeps just the first and second columns, not the entire line.
I also tried: awk -F "\t" '{print $1 $2}' file1 | grep -n EBI
and sed "numberOfLined" file1 > file2
But I have a lot of occurrences, so I don't want to write all the line numbers by hand.
You can use an if statement and regex matching via ~:
awk -F '\t' '{if (! (($1 ~ ".*EBI.*") || ($2 ~ ".*EBI.*"))) {print $0} }'
And thanks to the comments, it can look even better (keeping -F '\t' so only the first two tab-separated columns are tested):
awk -F '\t' '!($1~/EBI/ || $2~/EBI/)'

awk to split variable length record and add unique number on each group of records

I have a file which has variable-length columns:
x|y|XREC|DELIMITER|ab|cd|ef|IREC|DELIMITER|j|a|CREC|
p|q|IREC|DELIMITER|ww|xx|ZREC|
What I would like is:
1|x|y|XREC|
1|ab|cd|ef|IREC|
1|j|a|CREC|
2|p|q|IREC|
2|ww|xx|ZREC|
So far I just managed to get a sequence number at the beginning:
awk '{printf "%d|%s\n", NR, $0}' oldfile > with_seq.txt
Any help?
You could set the delimiter to DELIMITER:
$ awk -F 'DELIMITER[|]' '{for (i=1;i<=NF;i++)print NR"|"$i}' file
1|x|y|XREC|
1|ab|cd|ef|IREC|
1|j|a|CREC|
2|p|q|IREC|
2|ww|xx|ZREC|
Using awk:
awk -F "DELIMITER" '{for(i=1;i<=NF;i++) print NR "|" $i}' file | sed 's/||/|/g'
1|x|y|XREC|
1|ab|cd|ef|IREC|
1|j|a|CREC|
2|p|q|IREC|
2|ww|xx|ZREC|
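The trailing sed pass can also be folded into awk: splitting on plain DELIMITER leaves a leading | on every field after the first, which a sub() can strip; a sketch on the question's data:

```shell
printf 'x|y|XREC|DELIMITER|ab|cd|ef|IREC|DELIMITER|j|a|CREC|\np|q|IREC|DELIMITER|ww|xx|ZREC|\n' > file

awk -F 'DELIMITER' '{
    for (i=1; i<=NF; i++) {
        f = $i
        sub(/^\|/, "", f)   # drop the leading "|" left over from the split
        print NR "|" f
    }
}' file
```

This prints the five desired records, 1|x|y|XREC| through 2|ww|xx|ZREC|.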