print unique lines based on field - awk

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the later duplicates.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...

You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
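For readers new to the idiom, here is a rough expansion of '!seen[$1]++' into explicit steps. It is only a sketch: the function name dedupe_by_first_field is made up for illustration, and the behaviour is meant to match the one-liner above.

```shell
# Expanded form of '!seen[$1]++' (hypothetical helper name).
dedupe_by_first_field() {
  awk -F, '
    {
      if (!($1 in seen)) {   # first time this key appears
        print                # keep the line
      }
      seen[$1] = 1           # remember the key for later lines
    }
  ' "$@"
}

# Usage with the sample data from the question:
printf '10,15-10-2014,abc\n20,12-10-2014,bcd\n10,09-10-2014,def\n40,06-10-2014,ghi\n10,15-10-2014,abc\n' |
  dedupe_by_first_field
```

In the one-liner the test sits in the pattern position, so awk's default action (print) runs whenever the key has not been counted yet.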

This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
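One caveat with the END-loop approach: for (i in a) visits keys in an unspecified order, so the lines may not come out in input order. A possible variant, sketched below, records the keys in arrival order in a second array. The names first_per_key_in_order, first and order are illustrative only.

```shell
# Keep the first line per key and print in input order (hypothetical helper name).
first_per_key_in_order() {
  awk -F, '
    !($1 in first) {          # key not stored yet
      first[$1] = $0          # remember the first line for this key
      order[++n] = $1         # remember when the key first appeared
    }
    END {
      for (i = 1; i <= n; i++) print first[order[i]]
    }
  ' "$@"
}

printf '10,15-10-2014,abc\n20,12-10-2014,bcd\n10,09-10-2014,def\n40,06-10-2014,ghi\n' |
  first_per_key_in_order
```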

Related

Grouping duplicated fields with awk

I have the following file:
ID|2018-04-29
ID|2018-04-29
ID|2018-04-29
ID1|2018-06-26
ID1|2018-06-26
ID1|2018-08-07
ID1|2018-08-22
and using awk, I want to add $3 that groups the duplicated IDs based on $1 and $2 so that the output would be
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
I tried the following code but it does not give me the desired output. Also, I am not sure if I can apply it to a column with date in it.
awk -F"|" '{print $0,"group"++seen[$1,$3]}' OFS="|"
Any hints on how to achieve it using awk (one-liner, if possible) would be highly appreciated.
With your shown samples, please try the following awk code.
awk -v OFS="|" '!arr[$0]++{count++} {print $0,"group"count}' Input_file
Explanation: a commented version of the above.
awk '                        ##Starting awk program from here.
BEGIN{                       ##Starting BEGIN section of this program from here.
  OFS="|"                    ##Setting OFS to | here.
}
!arr[$0]++{                  ##Checking if current line is NOT present in array then do following.
  count++                    ##Increasing count with 1 here.
}
{
  print $0,"group"count      ##Printing current line with group and count value here.
}
' Input_file                 ##Mentioning Input_file name here.
You wrote: "using awk, I want to add $3 that groups the duplicated IDs based on $1 and $2", so the key below is built from $1 and $2.
If the input file is sorted:
$ awk 'BEGIN{FS=OFS="|"}{print $0, "group" (!a[$1,$2]++?++c:c)}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
If the file is not sorted:
$ awk 'BEGIN{FS=OFS="|"}{k=$1 SUBSEP $2}!(k in a){a[k]=++c}{print $0, "group" a[k]}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
A more readable version:
awk 'BEGIN{
  FS=OFS="|"
}
{
  k=$1 SUBSEP $2
}
!(k in a){
  a[k]=++c
}
{
  print $0, "group" a[k]
}' file
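To illustrate why the lookup version matters, the sketch below runs both approaches on deliberately unsorted input. The function names group_sorted_only and group_any_order, and the three-line sample, are made up for this comparison.

```shell
# Sorted-input variant: bumps the counter the first time a key is seen,
# then always prints the current counter value.
group_sorted_only() {
  awk 'BEGIN { FS = OFS = "|" }
       { if (!seen[$1, $2]++) c++; print $0, "group" c }' "$@"
}

# Lookup variant: stores the group number assigned to each key.
group_any_order() {
  awk 'BEGIN { FS = OFS = "|" }
       { k = $1 SUBSEP $2; if (!(k in id)) id[k] = ++c; print $0, "group" id[k] }' "$@"
}

# The third line repeats the first key after a different key has appeared.
printf 'ID|2018-04-29\nID1|2018-06-26\nID|2018-04-29\n' | group_sorted_only
printf 'ID|2018-04-29\nID1|2018-06-26\nID|2018-04-29\n' | group_any_order
```

With this input the first function labels the third line group2, while the second keeps it in group1.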
BEGIN { OFS = FS = "|" }
{
    if ($0 != prev) {    # new item
        prev = $0
        print $1, $2, "group" ++g
    }
    else {
        print $1, $2, "group" g
    }
}
Note that the list has to be sorted (from your example, I assume it is).
This is my first time posting an answer here. I hope the code is readable and that it helps.

How to not remove the header while executing awk

I have a file named file, like this:
k_1_1
k_1_3
k_1_6
...
and a second file, file2:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_2,17,89,15,...
k_1_3,10,26,45,...
k_1_4,17,16,15,...
k_1_5,10,26,45,...
k_1_6,17,16,15,...
...
I want to print the lines of file2 that match an entry in file. The desired output is:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
I tried
awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2 > result
But the header line is gone in result like this :
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
How can I keep it? Thank you.
Always print the first line, unconditionally.
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1];next}
FNR==1 || $1 in a' file file2 > result
Notice also how { print $0 } is not necessary because it's the default action.
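As a small self-contained sketch of the same idea, the function below wraps the command so it can be tried on throwaway files. The name filter_keep_header is hypothetical, and it assumes the key file is non-empty (with an empty first file, NR==FNR would also hold for the second file).

```shell
# $1: file with one key per line, $2: CSV whose first line is a header.
filter_keep_header() {
  awk -F, '
    NR == FNR { keys[$1]; next }      # first file: collect keys
    FNR == 1 || ($1 in keys)          # second file: header, or a matching key
  ' "$1" "$2"
}

# Usage with temporary sample files:
tmpdir=$(mktemp -d)
printf 'k_1_1\nk_1_3\n' > "$tmpdir/file"
printf '0,1,2,3\nk_1_1,17,16,15\nk_1_2,17,89,15\nk_1_3,10,26,45\n' > "$tmpdir/file2"
filter_keep_header "$tmpdir/file" "$tmpdir/file2"
```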
A very ad-hoc solution to your problem could be to compose the output in a command group:
{ head -1 file2; awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2; } > result
Could you please try the following:
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1 && ++count==1{print;next} a[$1]' Input_file Input_file2
OR
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1{print;next} a[$1]' Input_file Input_file2

awk to print the fields based on condition:

I would like to compare the first and third fields of Input.csv with the second and third fields of Master.csv.
If the fruit name matches and the Amount in Input.csv is less than the Amount in Master.csv, print the matching lines joined from both files.
For example,
Fruits==Apple has Amount 20 in Master.csv, so the Apple rows in Input.csv with Amount < 20 should be selected:
Fruits,Region,Amount,Details
Apple,North,10,Abc
Apple,south,9,Abc
Input.csv
Fruits,Region,Amount,Details
Apple,North,10,Abc
Orange,East,115,Def
Apple,south,9,Abc
Apple,West,25,Abc
Orange,West,150,Def
Orange,North,200,Def
Mango,North,50,Ghi
Mango,East,75,Ghi
Master.csv
Details,Fruits,Amount
xxx,Apple,20
yyy,Mango,60
zzz,Cherry,80
Desired Output.csv
Fruits,Region,Amount,Details,Details,Fruits,Amount
Apple,North,10,Abc,xxx,Apple,20
Apple,south,9,Abc,xxx,Apple,20
Mango,North,50,Ghi,yyy,Mango,60
I tried the command below:
awk -F "," 'FNR==NR {a[$1]; b[$3]; next} $2 in a && $3 < b' Input.csv Master.csv > Output.csv
The following awk may help:
awk -F"," 'FNR==1 && FNR==NR{val=$0} FNR==NR{a[$2]=$3;b[$2]=$0;next} FNR==1{print $0,val;next} (($1 in a) && $3<a[$1]){print $0,b[$1]}' OFS=, Master.csv Input.csv
The same solution in non-one-liner form:
awk -F"," '
FNR==1 && FNR==NR{          # header line of Master.csv
  val=$0
}
FNR==NR{                    # every line of Master.csv
  a[$2]=$3
  b[$2]=$0
  next
}
FNR==1{                     # header line of Input.csv
  print $0,val
  next
}
(($1 in a) && $3<a[$1]){    # fruit is known and Amount is below the master Amount
  print $0,b[$1]
}
' OFS=, Master.csv Input.csv
The output is:
Fruits,Region,Amount,Details,Details,Fruits,Amount
Apple,North,10,Abc,xxx,Apple,20
Apple,south,9,Abc,xxx,Apple,20
Mango,North,50,Ghi,yyy,Mango,60
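For experimenting with the join, here is a sketch that wraps the logic in a function and forces a numeric comparison with + 0. The names join_below_limit, limit, row and master_header are assumptions made for this example, not part of the answer above.

```shell
# $1: master CSV (Details,Fruits,Amount)
# $2: input CSV  (Fruits,Region,Amount,Details)
join_below_limit() {
  awk -F, -v OFS=, '
    NR == FNR {                                  # first file: master
      if (FNR == 1) { master_header = $0; next }
      limit[$2] = $3                             # amount per fruit
      row[$2] = $0                               # full master line per fruit
      next
    }
    FNR == 1 { print $0, master_header; next }   # combined header
    ($1 in limit) && ($3 + 0 < limit[$1] + 0) {  # numeric comparison
      print $0, row[$1]
    }
  ' "$1" "$2"
}

# Usage with a cut-down version of the sample files:
tmpdir=$(mktemp -d)
printf 'Details,Fruits,Amount\nxxx,Apple,20\nyyy,Mango,60\n' > "$tmpdir/Master.csv"
printf 'Fruits,Region,Amount,Details\nApple,North,10,Abc\nApple,West,25,Abc\nMango,North,50,Ghi\n' > "$tmpdir/Input.csv"
join_below_limit "$tmpdir/Master.csv" "$tmpdir/Input.csv"
```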

awk command to split nth field

I am learning AWK and was trying some exercises on built-in string functions.
Here's my exercise:
I have a file with the following contents:
RecordType:83
1,2,3,a|x|y|z,4,5
And my desired output is as below:
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
I wrote an awk command for the above output.
awk -F',' '$1 ~ /RecordType:83/{print $0}
$1 == 1{
split($4,splt,"|")
for(i in splt)
{
if(i==1)
print $1,$2,$3,splt[i],$5,$6
else
print $1,0,0,splt[i],$5,$6
}
}' OFS=, file_name
The above command looks clumsy. Is there any way to shorten it?
Thanks in advance
The shortest one-liner I could manage (i is reset for each record so it also works with more than one data line):
awk -F, 'NR>1{n=split($4,a,"|");for(i=0;i++<n;){$4=a[i];print;$2=$3=0}}NR==1' OFS=, file
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
The much more readable script (recommended):
BEGIN {
    FS = OFS = ","                # Comma delimiter
}
NR == 1 {                         # If the first line in file
    print $0                      # Print the whole line
    next                          # Skip to next line
}
{
    n = split($4, a, "|")         # Split field four on |
    for (i = 1; i <= n; i++)      # For each sub-field
        print $1, (i == 1 ? $2 OFS $3 : "0" OFS "0"), a[i], $5, $6   # Print the output
}
Another short one-liner (i is reset to 0 for each record):
awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
With your example:
kent$ awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
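A sketch of the same technique as a reusable function, tried on a file with two data records to show the loop counter starting fresh on each line. The name split_fourth_field and the second sample record are made up for illustration.

```shell
# Expand field 4 on "|" into one output line per sub-field;
# fields 2 and 3 are zeroed after the first sub-field of each record.
split_fourth_field() {
  awk -F, -v OFS=, '
    NR == 1 { print; next }            # pass the RecordType line through
    {
      n = split($4, parts, "|")
      for (i = 1; i <= n; i++) {
        $4 = parts[i]                  # assigning a field rebuilds $0 with OFS
        print
        $2 = $3 = 0
      }
    }
  ' "$@"
}

printf 'RecordType:83\n1,2,3,a|x|y|z,4,5\n7,8,9,b|c,4,5\n' | split_fourth_field
```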

How to print out a specific field in AWK?

A very simple question, which I found no answer to. How do I print out a specific field in awk?
awk '/word1/' prints the whole line, when I need just word1. Or I need only a chain of patterns (word1 + word2) to be printed out from the text.
Well, if the pattern is a single word (which you want to print, and which can't contain FS, the input field separator), why not:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print MYPATTERN }' INPUTFILE
If your pattern is a regex (gensub requires GNU awk):
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print gensub(".*(" MYPATTERN ").*","\\1","1",$0) }' INPUTFILE
If your pattern must be checked in every single field:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN {
for (i=1;i<=NF;i++) {
if ($i ~ MYPATTERN) { print "Field " i " in " NR " row matches: " MYPATTERN }
}
}' INPUTFILE
Modify any of the above to your taste.
The fields in awk are represented by $1, $2, etc:
$ echo this is a string | awk '{ print $2 }'
is
$0 is the whole line, $1 is the first field, $2 is the next field (or blank),
$NF is the last field, $(NF - 1) is the second-to-last field, etc.
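A quick sketch of those field references on the same sample string. The helper name show_fields and the labels it prints are made up for this example.

```shell
# Print the first, last and second-to-last fields, plus the field count.
show_fields() {
  awk '{ print "first=" $1, "last=" $NF, "second_to_last=" $(NF - 1), "count=" NF }' "$@"
}

echo 'this is a string' | show_fields
# prints: first=this last=string second_to_last=a count=4
```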
EDIT (in response to comment).
You could try:
awk '/crazy/{ print substr( $0, match( $0, "crazy" ), RLENGTH )}'
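The match/substr approach also works with a regex passed in from the shell, since match() sets RSTART and RLENGTH for the first match on each line. The sketch below assumes a helper named print_first_match (hypothetical) that takes the pattern as its first argument.

```shell
# $1: extended regex; remaining arguments (or stdin) are the input.
print_first_match() {
  pat=$1
  shift
  awk -v pat="$pat" '
    match($0, pat) { print substr($0, RSTART, RLENGTH) }   # only the matched text
  ' "$@"
}

echo 'what a crazy idea' | print_first_match 'cr[a-z]+'
```

Lines without a match print nothing, and only the first match per line is reported.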
You can do this with awk, but an alternative would be sed:
sed -nr "s/.*(PATTERN_TO_MATCH).*/\1/p" file
or you can use grep -o.
Something like this perhaps:
awk '{split("bla1 bla2 bla3",a," "); print a[1], a[2], a[3]}'