I am looking for a way to count how many times the value in a field appears across that field in a CSV file, much like COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 holds the range of values and column 7 should hold the number of times that value appears in column 6, as per below:
>awk -F, '{print $0}' file3
>awk -F, '{print $6}' file3
What i want is:
#adds field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous, similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
very similar question to this one
similar python related Q, for my ref
I would use GNU AWK for this task in the following way. With the input stored in file.txt, running
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done for the 1st line. During the first pass, i.e. where the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of each value in the 6th field, store the counts in array arr, and again go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from array arr.
(tested in GNU Awk 5.0.1)
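To make the two passes concrete, here is a minimal sketch that runs the same command on made-up data. The original file3 is not shown, so the rows, the column names and the temporary file are assumptions for illustration only.

```shell
# Hypothetical sample data: the real file3 is not shown, so these rows are made up.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
id,a,b,c,d,colour
1,x,x,x,x,red
2,x,x,x,x,blue
3,x,x,x,x,red
EOF

# Pass 1 (FNR==NR) counts the values of field 6;
# pass 2 (FNR>1) appends the stored count to every data row.
result=$(awk 'BEGIN{FS=OFS=","}
NR==1{print $0,"count";next}
FNR==NR{arr[$6]++;next}
FNR>1{print $0,arr[$6]}' "$tmp" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this input the header gains a count column, the two red rows end in 2, and the blue row ends in 1.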
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:
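awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}' file3
You can also output the header with the new field name. Note that the loop starts at i=1 so that its body first runs for line 2; l[1] is never stored, and starting at i=0 would print a stray "," line for it:
awk -F, -v OFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}' file3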
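As a small illustration of why the index must be s[6] rather than s[1], the sketch below shows what split() produces for one line. The input line is hypothetical.

```shell
# Hypothetical line with six comma-separated fields.
result=$(echo 'a,b,c,d,e,red' | awk '{
  n = split($0, s, ",")   # n is the number of pieces; s[1]..s[n] hold them
  print n, s[1], s[6]     # s[1] is the first field, s[6] the sixth
}')
printf '%s\n' "$result"
```

This prints 6 a red: the counted key sits in s[6], so c[s[1]] would look up a key that was never counted.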
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0 # save every line, header included
if (NR==1) next # nothing to count on the header line
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++ # count occurrences of this field 6 value
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
I need to find the lines of the second file whose first column does not match the first column of the first file even though the second column does match, and display them with awk on Linux.
In other words, I want awk to compare both the first and second columns of the first file against the second file and report the changes.
First file:
sdsdjs ./file.txt
sdsksp ./example.txt
jsdjsk ./number.txt
dfkdfk ./ok.txt
Second file:
sdsdks ./file.txt <-- different
sdsksd ./example.txt <-- different
jsdjsk ./number.txt <-- same
dfkdfa ./ok.txt <-- different
Expected output:
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Notice that the second file may be missing lines, so it may not line up with the first.
As seen above, how can awk display only the lines of the second file where the second column matches the first file but the first column does not?
Something like this might work for you:
awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' file1.txt file2.txt
FNR == NR { } # Run on first file as FNR is record number for the file, while NR is the global record number
f[FNR"_"$2] = $1; # Store the first column under a key made of FNR, an underscore, and the second column
next # read next record and redo
f[FNR"_"$2] && f[FNR"_"$2] != $1 # If the first column doesn't match while the second does, then print the line
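As a quick check of the walkthrough above, this sketch runs the command on the sample data from the question (without the "<-- ..." annotations). The temporary directory is only scaffolding for the example.

```shell
dir=$(mktemp -d)
# Sample data from the question, without the "<-- ..." annotations.
printf '%s\n' 'sdsdjs ./file.txt' 'sdsksp ./example.txt' \
              'jsdjsk ./number.txt' 'dfkdfk ./ok.txt' > "$dir/file1.txt"
printf '%s\n' 'sdsdks ./file.txt' 'sdsksd ./example.txt' \
              'jsdjsk ./number.txt' 'dfkdfa ./ok.txt' > "$dir/file2.txt"

# Key = line number + "_" + second column; print when the stored first column differs.
result=$(awk 'FNR == NR { f[FNR"_"$2] = $1; next }
f[FNR"_"$2] && f[FNR"_"$2] != $1' "$dir/file1.txt" "$dir/file2.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

Because the key contains the line number, this only matches records that sit at the same position in both files. If the second file is missing a line, every later record shifts and stops matching.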
A simpler approach which will ignore the second column is:
awk 'FNR == NR { f[FNR"_"$1] = 1; next }
!f[FNR"_"$1]' file1.txt file2.txt
If the records do not have to be at the same position in both files, i.e. we compare records whose second-column strings match, this should be enough:
$ awk '{if($2 in a){if($1!=a[$2])print $2}else a[$2]=$1}' file1 file2
In pretty print:
$ awk '{
if($2 in a) { # if $2 match processing
if($1!=a[$2]) # and $1 does not
print $2 # output
} else # else
a[$2]=$1 # store
}' file1 file2
$ awk '{if($2 in a){if($1!=a[$2])print $1,$2}else a[$2]=$1}' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Basically changed the print $2 to print $1,$2.
The way your question is worded is very confusing but after reading it several times and looking at your posted expected output I THINK you're just trying to say you want the lines from file2 that don't appear in file1. If so that's just:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If your real data has more fields than shown in your sample input but you only want the first 2 fields considered for the comparison then fix your question to show a more truly representative example but the solution would be:
$ awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
If that's not it, then please edit your question to clarify what it is you're trying to do and include an example where the above doesn't produce the expected output.
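To show the difference between the whole-line test and the two-field key, here is a minimal sketch. The third column is hypothetical (the question's data has only two fields), as are the temporary file names.

```shell
dir=$(mktemp -d)
# Hypothetical data: a third column that should not take part in the comparison.
printf '%s\n' 'sdsdjs ./file.txt 10' 'jsdjsk ./number.txt 20' > "$dir/file1"
printf '%s\n' 'sdsdks ./file.txt 10' 'jsdjsk ./number.txt 99' > "$dir/file2"

# Whole-line comparison: the changed third column makes the second line look new.
whole=$(awk 'NR==FNR{a[$0];next} !($0 in a)' "$dir/file1" "$dir/file2")
# Two-field key: only $1 and $2 are compared (awk joins them with SUBSEP).
pair=$(awk 'NR==FNR{a[$1,$2];next} !(($1,$2) in a)' "$dir/file1" "$dir/file2")
printf 'whole line:\n%s\nfirst two fields:\n%s\n' "$whole" "$pair"
rm -rf "$dir"
```

Here the whole-line version reports both lines of file2, while the two-field version reports only the ./file.txt line.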
I understand the original problem in the following way:
Two files, file1 and file2 contain a set of key-value pairs.
The key is the filename, the value is the string in the first column
If a matching key is found between file1 and file2 but the value is different, print the matching line of file2
You do not really need advanced awk for this task, it can easily be achieved with a simple pipeline of awk and grep.
$ awk '{print $NF}' file2.txt | grep -wFf - file1.txt | grep -vwFf - file2.txt
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
Here, the first grep will select the lines from file1.txt which do have the same key (filename). The second grep will try to search the full matching lines from file1 in file2, but it will print the failures. Be aware that in this case, the lines need to be completely identical.
If you just want to use awk, then the above logic is achieved with the solution presented by Ed Morton. No need to repeat it here.
I think this is what you're looking for
$ awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' file1 file2
sdsdks ./file.txt
sdsksd ./example.txt
dfkdfa ./ok.txt
This prints the records from file2 where the field1 values differ for the same field2 value. The script assumes that field2 values are unique within each file, so that they can be used as keys. Since the content looks like file paths, this is a valid assumption. Otherwise, you need to match the records some other way, perhaps by the corresponding line numbers.
In case you are looking for a more straightforward line-based diff that reports the lines whose first field differs, use:
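One behaviour worth knowing: a key that exists only in file2 is also printed, because the missing array entry compares as an empty string. The sketch below shows this with a hypothetical extra entry (./new.txt) and a guarded variant using ($2 in a); the guard is a suggestion of mine, not part of the answer above.

```shell
dir=$(mktemp -d)
printf '%s\n' 'sdsdjs ./file.txt' 'jsdjsk ./number.txt' > "$dir/file1"
# Hypothetical extra entry ./new.txt that has no counterpart in file1.
printf '%s\n' 'sdsdks ./file.txt' 'jsdjsk ./number.txt' 'abcdef ./new.txt' > "$dir/file2"

# As posted: an unknown key compares "" against $1, so the line is printed.
loose=$(awk 'NR==FNR{a[$2]=$1; next} a[$2]!=$1' "$dir/file1" "$dir/file2")
# Guarded: print only when the key exists in file1 and the first field differs.
strict=$(awk 'NR==FNR{a[$2]=$1; next} ($2 in a) && a[$2]!=$1' "$dir/file1" "$dir/file2")
printf 'as posted:\n%s\nguarded:\n%s\n' "$loose" "$strict"
rm -rf "$dir"
```

Which behaviour is wanted depends on whether new paths in file2 should count as changes.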
awk 'NR==FNR { a[NR] = $1; next } a[FNR]!=$1' file1 file2
I have a big text file with 2 tab-separated fields. As you can see in the small example, every 2 lines have a number in common. I want to summarize my text file in this way:
1- Look for the lines that have the number in common and sum up the second column of those lines.
small example:
ENST00000054666.6 2
ENST00000054666.6_2 15
ENST00000054668.5 4
ENST00000054668.5_2 10
ENST00000054950.3 0
ENST00000054950.3_2 4
expected output:
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
As you can see, both columns change. In the 1st column each common identifier appears only once, without the "_2" suffix, and in the 2nd column the value is the sum of both lines that share that identifier in the input file.
I tried this code but it does not return what I want:
awk -F '\t' '{ col2 = $2, $2=col2; print }' OFS='\t' input.txt > output.txt
Do you know how to fix it?
Solution 1st: The following awk may help you.
awk '{sub(/_.*/,"",$1)} {a[$1]+=$NF} END{for(i in a){print i,a[i]}}' Input_file
Solution 2nd: In case your Input_file is sorted by the 1st field, the following may help you. The END block tests prev rather than val so that a final group whose sum is 0 is still printed.
awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(prev!=""){print prev,val}}' Input_file
Append > output.txt to either command if you also need the output in a file.
If order is not a concern, the following may also help:
awk -v FS="\t|_" '{count[$1]+=$NF}
END{for(i in count){printf "%s\t%s%s",i,count[i],ORS;}}' file
ENST00000054668.5 14
ENST00000054950.3 4
ENST00000054666.6 17
Edit:
If the order of the output does matter, the approach below helps. It uses a counter as a flag and assumes every identifier occupies exactly two consecutive lines:
$ awk -v FS="\t|_" '{count[$1]+=$NF;++i;
if(i==2){printf "%s\t%s%s",$1,count[$1],ORS;i=0}}' file
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
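If the groups are not guaranteed to be exactly two consecutive lines, a different technique is to remember the order in which keys first appear. The following is a minimal sketch of that idea, not one of the answers above. The names key, sum, order and n are my own, and the temporary file simply holds the sample data from the question.

```shell
tmp=$(mktemp)
# Sample data from the question, tab-separated.
printf '%s\t%s\n' \
  ENST00000054666.6 2 ENST00000054666.6_2 15 \
  ENST00000054668.5 4 ENST00000054668.5_2 10 \
  ENST00000054950.3 0 ENST00000054950.3_2 4 > "$tmp"

result=$(awk -F'\t' -v OFS='\t' '{
  key = $1
  sub(/_.*/, "", key)                    # strip the "_2" suffix
  if (!(key in sum)) order[++n] = key    # remember first-seen order
  sum[key] += $2                         # accumulate the second column
}
END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

This keeps the input order, writes tab-separated output, and works for any number of lines per identifier, at the cost of holding one entry per identifier in memory.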