count, groupby with sed, or awk

I want to perform two different sort-and-count passes over a file, depending on each line's content.
1. I need to take the first column of a .tsv file.
For every line whose first column starts with three digits, I would like to group by those first three digits only; for everything else, I want to sort and count the occurrences of the whole first-column value.
Sample data:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe fkdjald
890897 34213
6878853 834
32fasd 53891
abcdee 8794371
abd 873
result:
687 2
890 3
01a 1
1b 1
32fasd 1
abd 1
dfeqfe 1
abcdee 2
I would also appreciate a solution that handles sample input like:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe 545
890897 34213
6878853 834
(632)fasd 53891
(88)abcdee 8794371
abd 873
So the first column may contain (, ), #, ', and all kinds of other characters.
The output will have two columns: the first with the extracted values and the second with their counts in the source file.
Again, the preferred output format is TSV.
So I need to extract all values that start with ^\d\d\d and, for those first three digits, sort and count unique values.
In a second pass, I need to do the same for each line that does not start with three digits, but this time keep the whole column value and sort and count by it.
What I have tried:
| sort | uniq -c | sort -nr for the lines that start with ^\d\d\d, and the same for those that do not match that regex. Is there a more elegant way using either sed or awk?

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
for (key in cnt) {
print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
}
}
$ awk -f tst.awk file | sort -k1,2n | cut -f3-
687 1
890 2
abcdee 1
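As a cross-check of the grouping rule in the question, here is a minimal sketch run against the first sample. It assumes the columns are tab-separated (the delimiter is not visible in the post), and it uses [0-9][0-9][0-9] rather than {3} so it does not rely on interval-expression support in the awk at hand. The names input, tab and result are illustrative only.

```shell
# Sample rows from the question; a tab delimiter is an assumption.
export LC_ALL=C            # byte-wise ordering so the sort is reproducible
input=$(mktemp)
printf '%s\t%s\n' \
    '687/878' 9 890987 4 01a 55 1b 8743917 890a 34 abcdee 987 \
    dfeqfe fkdjald 890897 34213 6878853 834 32fasd 53891 \
    abcdee 8794371 abd 873 > "$input"

tab=$(printf '\t')
result=$(awk '
BEGIN { FS = OFS = "\t" }
{
    # Three leading digits: group under them; otherwise keep the whole field.
    key = ($1 ~ /^[0-9][0-9][0-9]/) ? substr($1, 1, 3) : $1
    cnt[key]++
}
END { for (key in cnt) print key, cnt[key] }
' "$input" | sort -t "$tab" -k2,2nr -k1,1)

printf '%s\n' "$result"
rm -f "$input"
```

The trailing sort orders rows by count (descending) and then by key, because awk's for-in loop visits array elements in an unspecified order.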

You can try Perl
$ cat nefijaka.txt
687 878 9
890987 4
890a 34
abcdee 987
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687 1
890 2
abcdee 1
$
You can pipe it to sort to get the values sorted by count:
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890 2
abcdee 1
687 1
EDIT1:
$ cat nefijaka.txt2
687 878 9
890987 4
890a 34
abcdee 987
a word and then 23
$ perl -lne ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687 1
890 2
a word and then 1
abcdee 1
$

Related

How to compare 2 files having multiple occurrences of a number and output the additional occurrence?

Currently I am using an awk script to compare 2 files containing numbers in non-sequential order.
It works perfectly, but there is one further condition I would like to fulfill.
Current awk function:
awk '
{
$0=$0+0
}
FNR==NR{
a[$0]
next
}
($0 in a){
b[$0]
next
}
{ print }
END{
for(j in a){
if(!(j in b)){ print j }
}
}
' compare1.txt compare2.txt
What does the function accomplish currently?
It outputs a list of all the numbers that are present in compare1 but not in compare2, and vice versa.
If a number has leading zeros, they are ignored when comparing (the numeric value must differ for it to count as a mismatch). For example, 3 should be considered matching with 003, 014 with 14, 008 with 8, etc.
As required, it also considers a number matched even if the two occurrences are not on the same line in both files.
Required additional condition
In its current form, the function works in such a way that if one file has multiple occurrences of a number and the other file has even one occurrence of that number, it considers the number matched for all repetitions.
I need the awk function to be edited to output any additional occurrence of a number.
cat compare1.txt
57
11
13
3
889
014
91
775
cat compare2.txt
003
889
13
14
57
12
90
775
775
Expected output
12
90
11
91
**775**
The number marked at the end is currently not shown in the output of my present awk function (2 occurrences - 1 occurrence).

As mentioned at https://stackoverflow.com/a/62499047/1745001, this is the job that comm exists to do:
$ comm -3 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
11
12
775
90
91
and to get rid of the leading white space that comm adds to lines unique to the second file:
$ comm -3 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort) |
awk '{print $1}'
11
12
775
90
91

You just need to count the occurrences and account for them when matching:
$ awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
12
90
775
11
91
Note that, as in your original script, there is no absolute-value comparison; you can add one easily by changing how k is computed in the first line.
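A minimal sketch of that change, under the assumption that "absolute value" means dropping the sign after the numeric conversion. It reuses the sample files from the question; the array name seen and the variables f1, f2 and result are illustrative, and the final sort -n is only there to make the output order predictable.

```shell
f1=$(mktemp); f2=$(mktemp)
printf '%s\n' 57 11 13 3 889 014 91 775 > "$f1"
printf '%s\n' 003 889 13 14 57 12 90 775 775 > "$f2"

result=$(awk '
{ k = $0 + 0; if (k < 0) k = -k }        # numeric value, sign ignored
NR == FNR { seen[k]++; next }            # first file: count each number
!(k in seen && seen[k]-- > 0) { print k }  # second file: print unmatched extras
END { for (k in seen) while (seen[k]-- > 0) print k }  # leftovers from first file
' "$f1" "$f2" | sort -n)

printf '%s\n' "$result"
rm -f "$f1" "$f2"
```

Each match consumes one count, so the second 775 in the second file finds nothing left to pair with and is printed.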

Compare two files and append the values, leave the mismatches as such in the output file

I'm trying to match two files, file1.txt (50,000 lines) and file2.txt (55,000 lines). I want to compare file2 to file1, take the values of columns 2 and 3 from file1 for matching ids, and leave the mismatches as they are. The output file must contain all the ids from file2, i.e., it should have 55,000 lines. Note: not all the ids in file1 are present in file2, i.e., the actual matches could be fewer than 50,000.
file1.txt
ab1 12 345
ab2 9 456
gh67 6 987
file2.txt
ab2 0 0
ab1 0 345
nh7 0 0
gh67 6 987
Output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
This is what I tried, but it only prints the matches (so instead of 55,000 lines I have 49,000 lines in my output file):
awk "NR==FNR {f[$1]=$0;next}$1 in f{print f[$1],$0}" file1.txt file2.txt >output.txt

This awk script will work
NR == FNR {
a[$1] = $0
next
}
$1 in a {
split(a[$1], b)
print $1, (b[2] == $2 ? $2 : b[2]), (b[3] == $3 ? $3 : b[3])
}
!($1 in a)
If you save this as a.awk and run
awk -f a.awk file1.txt file2.txt
this will output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
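For comparison, a shorter sketch of the same idea: keep every line of the second file and substitute the stored first-file row whenever the id is known. It assumes whitespace-separated columns with the id in column 1, as in the sample; row, f1, f2 and result are illustrative names.

```shell
f1=$(mktemp); f2=$(mktemp)
printf '%s\n' 'ab1 12 345' 'ab2 9 456' 'gh67 6 987' > "$f1"
printf '%s\n' 'ab2 0 0' 'ab1 0 345' 'nh7 0 0' 'gh67 6 987' > "$f2"

result=$(awk '
NR == FNR { row[$1] = $0; next }          # first file: remember each row by id
{ print (($1 in row) ? row[$1] : $0) }    # second file: stored row if any, else as-is
' "$f1" "$f2")

printf '%s\n' "$result"
rm -f "$f1" "$f2"
```

Because the second block has no pattern, it runs for every line of the second file, so the output always has as many lines as that file.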

How to sum up every nth line in awk?

I want to output the sum of every N lines, for example, every 4 lines:
cat file
1
11
111
1111
2
22
222
2222
3
33
333
3333
The output should be:
6 #(1+2+3)
66 #(11+22+33)
666 #(111+222+333)
6666 #(1111+2222+3333)
How can I do this with awk?

Basically you can use the following awk command:
awk -vN=4 '{s[NR%N]+=$0}END{for(i=0;i<N;i++){print s[i]}}' input.txt
You can choose N like you wish.
Output:
6666
6
66
666
But you see, the output is rotated: NR starts at 1, so every Nth line has NR%N equal to 0 and its sum lands in s[0], which is printed first. You can fix this by shifting the line number by -1:
awk -vN=4 '{s[(NR-1)%N]+=$0}END{for(i=0;i<N;i++){print s[i]}}' input.txt
Output:
6
66
666
6666

compare between two columns and subtract them

My question: I have one file:
344 0
465 1
729 2
777 3
676 4
862 5
766 0
937 1
980 2
837 3
936 5
I need to compare each pair of rows that share the same value in column two (zero with zero, one with one, and so on). If a value in column two occurs twice, subtract the column-one values (766-344, 937-465, and so on); if it occurs only once, like 4, do nothing. The output:
422
472
251
060
074
I also need to add an index, for example:
1 422
2 472
3 251
4 060
5 074
Finally, I need to add this code as part of a Tcl script, or as a function of a Tcl program.
I have a Tcl script containing awk functions like this:
set awkCBR0 {
{
if ($1 == "r" && $6 == 280) {
print $2, i >> "cbr0.q";
i +=1 ;
}
}
}
exec rm -f cbr0.q
exec touch cbr0.q
exec awk $awkCBR0 cbr.trq
thanks

Try this:
awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file
Output
$ awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file
1 422
2 472
3 251
4 060
5 074
I will leave it to you to add it to the Tcl function.
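One caveat with testing a[$2] directly: a stored column-one value of 0 would read as "not seen yet". A hedged variant that tests membership with the in operator instead is sketched below, run on the sample from the question; first, n, input and result are illustrative names.

```shell
input=$(mktemp)
printf '%s\n' '344 0' '465 1' '729 2' '777 3' '676 4' '862 5' \
              '766 0' '937 1' '980 2' '837 3' '936 5' > "$input"

result=$(awk '
# Key already seen: print a running index and the zero-padded difference.
($2 in first) { printf "%d %03d\n", ++n, $1 - first[$2]; next }
# First time this key appears: remember its column-one value.
{ first[$2] = $1 }
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Key 4 appears only once, so it never reaches the printing rule. A third occurrence of a key would be subtracted from the first one again, which the question does not cover.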

Obtaining "consensus" results from two different files using awk

I have file1 as a result of a first operation, it has the following structure
201 12 0.298231 8.8942
206 13 -0.079795 0.6367
101 34 0.86348 0.7456
301 15 0.215355 4.6378
303 16 0.244734 5.9895
and file2 as a result of a different operation and has the same type of structure.
File 2 sample
204 60 -0.246038 6.0535
304 83 -0.246209 6.0619
101 34 -0.456629 6.0826
211 36 -0.247003 6.1011
305 83 -0.247134 6.1075
206 46 -0.247485 6.1249
210 39 -0.248066 6.1537
107 41 -0.248201 6.1603
102 20 -0.248542 6.1773
I would like to select the field 1 and 2 values of the rows in file1 whose field 3 value is higher than a threshold (0.8), and then, among those selected field 1 and 2 values, keep the ones whose field 3 value in file2 is higher in absolute value than another threshold (abs(x) = 0.4).
Note that although files 1 and 2 have the same structure, their field 1 and 2 values are not the same (nor is the number of lines, etc.).
Can you do this with awk?
Desired output:
101 34

If you combine awk with unix commands you can do the following
sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt
Sorting allows you to use join on the first field (which I assume is unique). After the join, field 3 of file1 is $3 and field 3 of file2 is $6. Using awk you can write the following:
join sorted1.txt sorted2.txt | awk 'function abs(value){return (value<0?-value:value);} $3 >=0.8 && abs($6) >=0.4 {print $1"\t"$2}'
In essence, the awk program first defines a function to deal with absolute values, then prints fields 1 and 2 of only those lines that meet the criteria you detailed, tested on $3 and $6 (formerly field 3 of file1 and file2 respectively).
Hope this helps...
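The same selection can also be sketched in awk alone, without sort and join. This version assumes a row is identified by fields 1 and 2 together and that "higher than" means strictly greater; t1, t2, keep, f1, f2 and result are illustrative names, and the data are the samples from the question.

```shell
f1=$(mktemp); f2=$(mktemp)
printf '%s\n' '201 12 0.298231 8.8942' '206 13 -0.079795 0.6367' \
    '101 34 0.86348 0.7456' '301 15 0.215355 4.6378' \
    '303 16 0.244734 5.9895' > "$f1"
printf '%s\n' '204 60 -0.246038 6.0535' '304 83 -0.246209 6.0619' \
    '101 34 -0.456629 6.0826' '211 36 -0.247003 6.1011' \
    '305 83 -0.247134 6.1075' '206 46 -0.247485 6.1249' \
    '210 39 -0.248066 6.1537' '107 41 -0.248201 6.1603' \
    '102 20 -0.248542 6.1773' > "$f2"

result=$(awk -v t1=0.8 -v t2=0.4 '
function abs(v) { return v < 0 ? -v : v }
# First file: remember (field 1, field 2) pairs that pass the first threshold.
NR == FNR { if ($3 > t1) keep[$1, $2]; next }
# Second file: print remembered pairs that also pass the second threshold.
(($1, $2) in keep) && abs($3) > t2 { print $1, $2 }
' "$f1" "$f2")

printf '%s\n' "$result"
rm -f "$f1" "$f2"
```

Only 101 34 passes both thresholds in the samples, which matches the desired output in the question.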
