Get unique string occurrence and display it - awk

I'm not really good with awk, so here is what I did to get the unique number occurrences in one row.
input.txt contains:
18 18 21 21 21 21 18 21
I just want to display the unique numbers that occur above. So, here is my code:
input="input.txt"
output=$(fmt -1 "$input" | sort | uniq | awk '{printf $1","}')
echo "$output"
The output:
18,21,
I got the correct result, but there is a trailing comma; how do I remove it? Also, is there a simpler or cleaner method without using fmt?
The expected output:
18,21
Edit: to remove the comma, I use this:
sed 's/,$//'
and it's working, but is there a simpler way to do this without using fmt?

Could you please try the following.
awk '
BEGIN{ OFS="," }
{
  for(i=1;i<=NF;i++){
    if(!arr[$i]++){
      val=(val?val OFS:"")$i
    }
  }
  print val
  val=""
}' Input_file
Explanation: Adding detailed explanation for above.
awk '                        ##Starting awk program from here.
BEGIN{ OFS="," }             ##Setting output field separator as comma here.
{
  for(i=1;i<=NF;i++){        ##Traversing through all fields of the current line here.
    if(!arr[$i]++){          ##Checking condition if arr does NOT already have the current field in it.
      val=(val?val OFS:"")$i ##Creating val and appending values to it, to print all values at the end.
    }
  }
  print val                  ##Printing val here.
  val=""                     ##Nullifying val here.
}' Input_file                ##Mentioning Input_file name here.
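For reference, the same logic as a one-liner against the sample input (the output follows from the logic explained above):
$ echo '18 18 21 21 21 21 18 21' | awk 'BEGIN{OFS=","} {for(i=1;i<=NF;i++) if(!arr[$i]++) val=(val?val OFS:"")$i; print val; val=""}'
18,21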

Here is an alternative way in GNU awk, setting RS so that the input is split on whitespace into records instead of fields:
awk -v RS='[[:blank:]]+' '!seen[$1]++{s=s (s!=""?",":"") $1} END{print s}' file.txt
18,21
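To illustrate the record-splitting idea (my gloss, not part of the original answer), each blank-separated token arrives as its own record in $1:
$ printf '18 18 21' | gawk -v RS='[[:blank:]]+' '{print NR, $1}'
1 18
2 18
3 21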

Newer versions of Perl have uniq in the standard library. Otherwise, you'll have to write the logic manually (How do I print unique elements in Perl array?) or use https://metacpan.org/pod/List::MoreUtils
perl -MList::Util=uniq -lane 'print join ",", uniq @F'
perl -lane 'print join ",", grep { !$seen{$_}++ } @F'
With Ruby (Array#uniq removes duplicates, and * with a string argument joins):
ruby -ane 'puts $F.uniq * ","'

I had a similar problem today and solved it using echo, tr, sort and sed. The code for your example is as below:
echo -n "18 18 21 21 21 21 18 21" | tr -s ' ' '\n' | sort -u | tr '\n' ',' | sed 's/,$//'
Output: 18,21

echo '18 18 21 21 21 21 18 21' |
mawk 'END { print "\n" } !___[$_]-- && $!NF= NF < ++__\
? ","$_ : $_' ORS= RS='[\t- ]+'
18,21
or if you don't mind chaining up two awks:
mawk '!__[$_]--' ORS=',' RS='[\t- ]+' | mawk 7 RS=',$'
18,21


Issues with OFS and sort working together in Bash

Given this type of input:
1,Name1,Type1,100,50
2,Name2,Type2,85,80
3,Name3,Type3,95,92
4,Name4,Type4,60,55
5,Name5,Type5,87,77
I want to calculate the average of the last 2 values and then sort them in decreasing order, so I wrote this bash code:
awk 'BEGIN{FS=","} {avg=($4+$5)/2;print $1,$3,avg}' | sort -k3 -nr
which gives me this output which is very close to my expected output:
3 Type3 93.5
2 Type2 82.5
5 Type5 82
1 Type1 75
4 Type4 57.5
The final thing I want is to separate the output with | (pipes), so I use the variable OFS like this:
awk 'BEGIN{FS=",";OFS="|"} {avg=($4+$5)/2;print $1,$3,avg}' | sort -k3 -nr
The output from this:
5|Type5|82
4|Type4|57.5
3|Type3|93.5
2|Type2|82.5
1|Type1|75
It seems like OFS is breaking the sort. Is this behaviour expected? Is there any workaround for this?
There are two issues in your code. First, the input file name is not passed to awk (could be a typo); second, you need to set the delimiter for sort with the -t'|' option. Without -t, sort only treats the transition from non-blank to blank as a field boundary; your pipe-separated lines contain no blanks, so each whole line is a single field, -k3 selects an empty key, and sort falls back to comparing whole lines (reversed by -r), which is why the output came out ordered by the leading number. So it will be like:
awk 'BEGIN{FS=",";OFS="|"} {avg=($4+$5)/2;print $1,$3,avg}' Input_file | sort -t'|' -k3 -nr
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
OR, in a non-one-liner form of the code, and removing the avg variable, since you can compute the average of the columns directly while printing (in case you are using the avg variable anywhere else in the program, then you could keep it):
awk '
BEGIN{
  FS=","
  OFS="|"
}
{
  print $1,$3,($4 + $5)/2
}' Input_file |
sort -t'|' -k3 -nr
From the man page of sort:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
Some more ways: actually, you can also pipe awk's print to sort from within awk:
$ cat testfile.csv
1,Name1,Type1,100,50
2,Name2,Type2,85,80
3,Name3,Type3,95,92
4,Name4,Type4,60,55
5,Name5,Type5,87,77
$ awk 'BEGIN{FS=",";OFS="|"}{print $1,$3,($4+$5)/2 | "sort -t \"|\" -nrk3"}' testfile.csv
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
Using GNU awk's controlling array traversal feature:
gawk '
BEGIN { FS = ","; SUBSEP = "|" }
{ avg = ($4+$5)/2; result[$1,$3,avg] = avg }
END {
  PROCINFO["sorted_in"] = "@val_num_desc"
  for (line in result) print line
}
' testfile.csv
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
SUBSEP is the variable that holds the join string for comma-separated array subscripts. Its default value is octal 034, the ASCII "FS" (file separator) control character.
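A minimal illustration of SUBSEP (my sketch, not from the original answer): referencing a["x","y"] creates an element whose actual key is "x" SUBSEP "y":
$ gawk 'BEGIN{ SUBSEP="|"; a["x","y"]; for (k in a) print k }'
x|y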

How to fetch the data of a column from a txt file without the column name in Unix?

This is the student.txt file:
RollNo|Name|Marks
123|Raghu|80
342|Maya|45
561|Gita|56
480|Mohan|71
I want to fetch data from the "Marks" column, so I used this awk command:
awk -F "|" '{print $3}' student.txt
and it gives output like:
Marks
80
45
56
71
This output includes the column name "Marks", but I only want to fetch the data and show the output like:
80
45
56
71
Add a condition to your awk script to print the third field only if the input record number is greater than 1:
awk -F'|' 'FNR>1{print $3}' student.txt
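On the sample student.txt this prints just the data:
$ awk -F'|' 'FNR>1{print $3}' student.txt
80
45
56
71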
Could you please try the following. It does NOT hard-code the field number for the Marks keyword: on the header line it looks for that string and then prints only the column that is under Marks, so even if your Marks column is in a different field position this should work fine. Written and tested at the https://ideone.com/Ufq5E2 link.
awk '
BEGIN{ FS="|" }
FNR==1{
  for(i=1;i<=NF;i++){
    if($i=="Marks"){ field=i }
  }
  next
}
{
  print $field
}
' Input_file
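One edge case worth hedging (my addition, not part of the original answer): if the header has no Marks column, field stays unset, and $field then means $0, so every whole line would be printed. A small guard avoids that:
awk '
BEGIN{ FS="|" }
FNR==1{
  for(i=1;i<=NF;i++) if($i=="Marks") field=i
  next
}
field{ print $field }
' Input_file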
To fetch data from the "Marks" column, using awk:
$ awk -F\| '
FNR==1 {
  for(i=1;i<=NF;i++)
    if($i=="Marks")
      next          # found it; the loop variable i keeps its value for later records
  exit              # no Marks column in the header
}
{
  print $i          # i still indexes the Marks column
}' file
80
45
56
71
Another example:
awk -F'|' 'NR != 1 {print $3}' input_file
A non-awk alternative:
$ sed 1d file | cut -d\| -f3
80
45
56
71

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Desired output:
A 3
B 2
C 1
Basically I need counts of unique values in the second column, grouped by the first column, all in one command with pipelines. As you can see, there can be duplicates like "A 1234". I had some ideas with awk and cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique second-column values for each value in the first column.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
For a complete awk solution, could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk '              ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"
}
!found[$0]++{      ##Checking condition: if the 1st and 2nd column pair is NOT present in the found array, then do the following.
  val[$1]++        ##Creating val with the 1st column as index and keep increasing its value here.
}
END{               ##Starting END block of this program from here.
  for(i in val){   ##Traversing through array val here.
    print i,val[i] ##Printing i and the value of val with index i here.
  }
}
' Input_file       ##Mentioning Input_file name here.
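On the sample data this prints the expected counts (note that the order of a for (i in val) loop is unspecified in awk, so the lines may come out in any order):
$ awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
A 3
B 2
C 1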
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{          # using GNU awk and tab as delimiter
  a[$1][$2]              # hash into a 2D array
}
END {
  for(i in a)            # for all values of the first field
    print i,length(a[i]) # output the value and the size of the related subarray
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
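Note that uniq -c prints the count first ("3 A" rather than the requested "A 3"); if the exact format matters, a small extra awk stage (my addition, not from the original answer) swaps the columns:
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1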
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)

Need to retrieve a value from an HL7 file using awk

In a Linux script program, I've got the following awk command for other purposes and to rename the file.
cat $edifile | awk -F\| '
{ OFS = "|"
print $0
} ' | tr -d "\012" > $newname.hl7
While this is happening, I'd like to grab the 5th field of the MSH segment and save it for later use in the script. Is this possible?
If not, how could I do it earlier or later in the script?
Example of the segment:
MSH|^~\&|business1|business2|/u/tmp/TR0049-GE-1.b64|routing|201811302126||ORU^R01|20181130212105810|D|2.3
What I want to do is retrieve the path and file name in MSH 5 and concatenate it to the end of the new file.
I've used the following to capture the data, but with no luck. If fpth is being set, there is no evidence of it, and I don't have the right syntax for an echo within the awk block.
cat $edifile | awk -F\| '
{ OFS = "|"
{fpth=$(5)}
print $0
} ' | tr -d "\012" > $newname.hl7
Any suggestions?
Thank you!
Try
filename=`awk -F'|' '{print $5}' $edifile | head -1`
You can skip the piping through head if the file is a single line
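With the sample segment from the question, this yields:
$ echo 'MSH|^~\&|business1|business2|/u/tmp/TR0049-GE-1.b64|routing|201811302126||ORU^R01|20181130212105810|D|2.3' | awk -F'|' '{print $5}'
/u/tmp/TR0049-GE-1.b64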
First of all, it must be mentioned that the awk invocation in your first piece of code has no effect:
$ cat $edifile | awk -F\| ' { OFS = "|"; print $0 }' | tr -d "\012" > $newname.hl7
This is totally equivalent to
$ cat $edifile | tr -d "\012" > $newname.hl7
because OFS is only used to redefine $0 if you redefine a field.
Example:
$ echo "a|b|c" | awk -F\| '{OFS="/"; print $0}'
a|b|c
$ echo "a|b|c" | awk -F\| '{OFS="/"; $1=$1; print $0}'
a/b/c
I understand that you have an HL7 file in which you have a single line starting with the string "MSH". From this line you want to store the 5th field; this is achieved in the following way:
fpth=$(awk -v outputfile="${newname}.hl7" '
  BEGIN{ FS="|"; ORS="" }
  ($1 == "MSH"){ print $5 }
  { print $0 > outputfile }' $edifile)
I have set ORS to the empty string, as that is equivalent to tr -d "\012". The above will work very nicely if you only have a single MSH in your file.
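If, as the question says, the goal is then to concatenate that path onto the end of the new file, a minimal sketch under the same assumptions (that $edifile and $newname are set as in the question):
fpth=$(awk -v outputfile="${newname}.hl7" '
  BEGIN{ FS="|"; ORS="" }
  ($1 == "MSH"){ print $5 }
  { print $0 > outputfile }' "$edifile")
printf '%s' "$fpth" >> "${newname}.hl7"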

What is the meaning of $0 = $0 in Awk?

While going through a piece of code I saw the below command:
grep "r" temp | awk '{FS=","; $0=$0} { print $1,$3}'
The temp file contains lines like:
r,1,5
r,4,5
...
I could not understand what the statement $0=$0 means in the awk command.
Can anyone explain what it means?
When you do $1=$1 (or any other assignment to a field) it causes record recompilation where $0 is rebuilt with every FS replaced with OFS but it does not change NF (unless there was no $1 previously and then NF would change from 0 to 1) or reevaluate the record in any other way.
When you do $0=$0 it causes field splitting where NF, $1, $2, etc. are repopulated based on the current value of FS but it does not change the FSs to OFSs or modify $0 in any other way.
Look:
$ echo 'a-b-c' |
awk -F'-+' -v OFS='-' '
function p() { printf "%d) %d: $0=%s, $2=%s\n", ++c,NF,$0,$2 }
{ p(); $2=""; p(); $1=$1; p(); $0=$0; p(); $1=$1; p() }
'
1) 3: $0=a-b-c, $2=b
2) 3: $0=a--c, $2=
3) 3: $0=a--c, $2=
4) 2: $0=a--c, $2=c
5) 2: $0=a-c, $2=c
Note in the above that even though setting $2 to null resulted in 2 consecutive -s and the FS of -+ means that 2 -s are a single separator, they are not treated as such until $0=$0 causes the record to be re-split into fields as shown in output step 4.
The code you have:
awk '{FS=","; $0=$0}'
is using $0=$0 as a kludge to work around the fact that it's not setting FS until AFTER the first record has been read and split into fields:
$ printf 'a,b\nc,d\n' | awk '{print NF, $1}'
1 a,b
1 c,d
$ printf 'a,b\nc,d\n' | awk '{FS=","; print NF, $1}'
1 a,b
2 c
$ printf 'a,b\nc,d\n' | awk '{FS=","; $0=$0; print NF, $1}'
2 a
2 c
The correct solution, of course, is instead to simply set FS BEFORE the first record is read:
$ printf 'a,b\nc,d\n' | awk -F, '{print NF, $1}'
2 a
2 c
To be clear: assigning any value to $0 causes field splitting but not record recompilation, while assigning any value to any field ($1, etc.) causes record recompilation but not field splitting:
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$2=$2}1'
a#b#c
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$0=$0}1'
a-b-c
$0 = $0 is most often used to rebuild the field splitting of a modified record. Ex: adding a field will change NF after $0 = $0, where it otherwise stays as it was when the line was read.
In this case, it changes the field separator to , on every line and (see @EdMorton's comment below for the struck-through correction) reparses the line with the current FS info, where awk -F',' '{ print $1 "," $3 }' is a lot better coding for the same idea, taking the field separator at the beginning for all lines (in this case the result could differ if the separator is modified during processing, depending for example on the previous line's content).
Example:
echo "foo;bar" | awk '{print NF}{FS=";"; print NF}{$0=$0;print NF}'
1
1
2
Based on @EdMorton's comment and the related post (What is the meaning of $0 = $0 in Awk):
echo "a-b-c" |\
awk ' BEGIN{ FS="-+"; OFS="-"}
function p(Ref) { printf "%12s) NF=%d $0=%s, $2=%s\n", Ref,NF,$0,$2 }
{
p("Org")
$2="-"; p( "S2=-")
$1=$1 ; p( "$1=$1")
$2=$2 ; p( "$2=$2")
$0=$0 ; p( "$0=$0")
$2=$2 ; p( "$2=$2")
$3=$3 ; p( "$3=$3")
$1=$1 ; p( "$1=$1")
} '
Org) NF=3 $0=a-b-c, $2=b
S2=-) NF=3 $0=a---c, $2=-
$1=$1) NF=3 $0=a---c, $2=-
$2=$2) NF=3 $0=a---c, $2=-
$0=$0) NF=2 $0=a---c, $2=c
$2=$2) NF=2 $0=a-c, $2=c
$3=$3) NF=3 $0=a-c-, $2=c
$1=$1) NF=3 $0=a-c-, $2=c
$0=$0 is for re-evaluating the fields.
For example
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{print $2}'
1:2
2|3
EOF
(this prints two empty lines: a new FS only takes effect for records read after it is set, so neither line here is re-split on its actual separator)
# Same with $0=$0, it will force awk to have $0 reevaluated
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{$0=$0;print $2}'
1:2
2|3
EOF
2
3
# NF - gives you the total number of fields in a record
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{print NF}'
1:2
2|3
EOF
1
1
# When we Force to re-evaluate the fields, we get correct 2 fields
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{$0=$0; print NF}'
1:2
2|3
EOF
2
2
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$2=$2}1'
a#b#c
This can be slightly simplified to
mawk 'BEGIN { FS="[-]+"; OFS = "#" } ($2=$2)'
The rationale is that the boolean test following the assignment evaluates to true upon the assignment, so that alone is sufficient to regenerate the fields with OFS and print the record.
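One caveat worth adding (mine, not part of the original answer): the value of the pattern ($2=$2) is the assigned value itself, so the test is false, and the line is silently dropped, whenever $2 is empty or 0:
$ echo 'a-0-c' | mawk 'BEGIN { FS="[-]+"; OFS="#" } ($2=$2)'
(no output)
The explicit-action form {$2=$2}1 does not have this problem, since the pattern 1 is always true.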