I would like to print unique lines based on the first field, keeping only the occurrence with the latest date and time in the third field and removing the other duplicate occurrences.
The file has around 50 million rows and is not sorted.
Input.csv
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
Desired Output:
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
I have attempted partial commands, which remain incomplete because of the file's date and time format and its unsorted order:
awk -F, '!seen[$1,$3]++' Input.csv
Looking for your suggestions ...
This awk command will do it for you:
awk -F, -v OFS=',' '{sub(/[.]/," ",$3)
     cmd = "date -d\"" $3 "\" +%s"; cmd | getline d; close(cmd)}
     !($1 in b)||d>b[$1] {b[$1] = d; a[$1] = $0}
     END{for(x in a)print a[x]}' file
The first line transforms the original $3 into a valid date string and gets the seconds since 1970 via the date command, so that we can compare later (closing the command after each getline avoids leaking a file descriptor per distinct date on a big file).
The two arrays a and b hold the final rows and the latest date (in seconds) per key.
The END block prints all rows from a.
test with your example data:
kent$ cat f
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] { b[$1] =d;a[$1] = $0 }
END{for(x in a)print a[x]}' f
10 ab 25-SEP-14 08:09:26 abc xxx yyy zzz
20 ab 23-SEP-14 08:09:35 abc xxx yyy zzz
58 ab 22-JUL-14 05:07:07 abc xxx yyy zzz
62 ab 12-SEP-14 03:09:23 abc xxx yyy zzz
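At 50 million rows, forking date once per line will dominate the runtime. If GNU awk is available, mktime() does the conversion in-process; a sketch under those assumptions (gawk, the fixed DD-MON-YY.HH:MM:SS format, two-digit years meaning 20xx):
awk -F, '
BEGIN { n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", m, " ")
        for (i = 1; i <= n; i++) mon[m[i]] = i }      # month name -> number
{
    split($3, t, /[-.:]/)                             # "25-SEP-14.08:09:26" -> 25 SEP 14 08 09 26
    d = mktime("20" t[3] " " mon[t[2]] " " t[1] " " t[4] " " t[5] " " t[6])
    if (!($1 in b) || d > b[$1]) { b[$1] = d; a[$1] = $0 }
}
END { for (x in a) print a[x] }' Input.csv
Since no field is assigned to, $0 is left untouched and the output keeps its commas.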
This should do, with one caveat: a lexical sort on the DD-MON-YY format is not chronological across months (12-SEP-14 sorts before 22-JUL-14), though it happens to pick the right rows for this sample:
sort -t , -k 3 file | awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}'
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
file1
1 123 ab456 A G PASS AC=0.15;FB=1.5;BV=45; 0|0 0|0 0|1 0|0
4 789 ab123 C T PASS FB=90;AC=2.15;BV=12; 0|1 0|1 0|0 0|0
desired output
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
I used
awk '{print $1,$2,$3,$4,$5,$6,$7}' file1 > out1.txt
sed -i 's/;/\t/g' out1.txt
awk '{print $1,$2,$3,$4,$5,$6,$7,$8}' out1.txt
output generated
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS FB=90
I want to print first 6 columns along with value of AC=(*) from 7th column.
With your shown samples, please try the following awk code.
awk '
{
  val=""
  while(match($7,/AC=[^;]*/)){
    val=(val?val:"")substr($7,RSTART,RLENGTH)
    $7=substr($7,RSTART+RLENGTH)
  }
  print $1,$2,$3,$4,$5,$6,val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk '                                            ##Start the awk program.
{
  val=""                                         ##Empty val for every new line.
  while(match($7,/AC=[^;]*/)){                   ##Loop with match() over each AC=... occurrence (up to a semicolon) in the 7th field.
    val=(val?val:"")substr($7,RSTART,RLENGTH)    ##Append the matched value to val.
    $7=substr($7,RSTART+RLENGTH)                 ##Keep only the not-yet-scanned remainder of the 7th field.
  }
  print $1,$2,$3,$4,$5,$6,val                    ##Print the first six columns plus val.
}
' Input_file                                     ##Mention the Input_file name here.
$ awk '{
n=split($7,a,/;/) # split $7 on ;s
for(i=1;i<=n&&a[i]!~/^AC=/;i++); # just loop looking for AC
print $1,$2,$3,$4,$5,$6,a[i] # output
}' file
Output:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
If AC= is not found, an empty field is output instead.
Any time you have tag=value pairs in your data I find it best to first populate an array (f[] below) to hold those tag-value mappings so you can print/test/rearrange those values by their tags (names).
Using any awk in any shell on every Unix box:
$ cat tst.awk
{
    split("",f)                                # clear f[] so a tag missing on this line is not carried over
    n = split($7,tmp,/[=;]/)                   # tmp[1]="AC", tmp[2]="0.15", ...
    for (i=1; i<n; i+=2) {                     # step through the tag/value pairs
        f[tmp[i]] = tmp[i] "=" tmp[i+1]        # e.g. f["AC"] = "AC=0.15"
    }
    sub(/[[:space:]]*[^[:space:]]+;.*/,"")     # truncate $0 just before the tag field
    print $0, f["AC"]
}
$ awk -f tst.awk file
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
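Because f[] maps tags to values, variations are one-line edits; for instance, a hypothetical variant that pulls out FB by name as well:
$ awk '{ split("",f); n=split($7,tmp,/[=;]/)
         for (i=1; i<n; i+=2) f[tmp[i]] = tmp[i] "=" tmp[i+1]
         sub(/[[:space:]]*[^[:space:]]+;.*/,"")
         print $0, f["AC"], f["FB"] }' file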
This might work for you (GNU sed):
sed -nE 's/^((\S+\s){6})\S*;?(AC=[^;]*);.*/\1\3/p' file
Turn off implicit printing with -n and enable extended regexps with -E.
Match the first six fields together with their delimiters, then append the AC tag and its value extracted from the seventh.
With only GNU sed:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/' file1
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
Lines without an AC=... part in the 7th field will be printed without modification. If you prefer removing the 7th field and the end of the line in that case, use:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/;t;s/\S+;.*//' file1
I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values in the second column grouped by the first column, all in one command with pipelines. As you can see, there can be duplicates like "A 1234". I had some ideas with awk and cut, but neither of them works: they just print out all unique pairs, while I need the count of unique second-column values for each value in the first column.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
Could you please try the following complete awk solution:
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk '                      ##Start the awk program.
BEGIN{
  FS=OFS="\t"              ##Set the input and output field separators to tab.
}
!found[$0]++{              ##If the whole line (1st and 2nd column) has NOT been seen before, do the following.
  val[$1]++                ##Increment val for this 1st-column value.
}
END{                       ##Start the END block of this program.
  for(i in val){           ##Traverse the array val.
    print i,val[i]         ##Print each index i and its count.
  }
}
' Input_file               ##Mention the Input_file name here.
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
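uniq -c puts the count before the key; if the output must match the requested A 3 layout exactly, one more awk stage swaps the columns:
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1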
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options, as shown below.
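With -s, datamash sorts its input itself, so the unsorted case becomes:
datamash -s -g1 countunique 2 < input.txt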
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)
In the awk below I am trying to match each line's $2 in f1 and f2 and then $1 in f1 and f2; if both match, $3 is "MATCH", otherwise $3 is "MISMATCH". The awk below does not execute unless I remove one of the if statements. Thank you :).
f1
1234 aaa
5678 xxxx
1244 yyyy
2255 zzzz
f2
5678 xxxx
224 zzzz
1244 yyyy
1234 aaa
desired
1234 aaa MATCH
5678 xxxx MATCH
1244 yyyy MATCH
2255 zzzz MISMATCH
My awk attempt:
awk 'if($2==$2) && if($1==$1){print $3,"MATCH"} else {print $3,"MISMATCH"}}' f1 f2
AWK doesn't read input files simultaneously.
First you need to read f2 into an array, then you can use the array to determine what will be in $3 while processing f1.
awk 'NR==FNR{a[$1]=$2;next} {$3=(($1 in a&&a[$1]==$2)?"":"MIS")"MATCH"} 1' f2 f1
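The same logic spread out with comments (identical behavior, just formatted for readability):
awk '
NR==FNR { a[$1] = $2; next }                # first file (f2): remember $2 keyed by $1
{                                           # second file (f1): check both fields
  $3 = (($1 in a && a[$1] == $2) ? "" : "MIS") "MATCH"
}
1                                           # condition-only pattern: print the line
' f2 f1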
Could you please try the following (it prints NON-MATCH for values that are present in file1 but not in file2):
awk '
FNR==NR{
  a[$1]=$0
  next
}
{
  print $0,$1 in a?"MATCH":"NON-MATCH"
}' Input_file2 Input_file1
Additional solution, if you want to print non-matching lines from both files (the code above only flags lines of file1 that are missing from file2):
awk '
FNR==NR{
  a[$1]=$2                             # remember file2: value keyed by $1
  next
}
{
  print $0,$1 in a?"MATCH":"NON-MATCH-FILE1"
  b[$1 in a?$1:""]                     # mark the file2 keys that matched
}
END{
  for(i in a){
    if(!(i in b)){
      print i,a[i],"NON-MATCH-FILE2"   # print the key together with its value
    }
  }
}' Input_file2 Input_file1
I would like to know how to handle the situation below: the sample input is delimited by spaces, and I want to format it as comma-separated output.
All the text in a line up until the first field starting with a digit should be considered as a single field in the output. In the sample data, there are always 3 numeric fields at the end of a line; in the real data, there are 14 such fields.
Input.txt
mmm 4394850 4465411 2579770
xxx yyy 2155419 2178791 1516446
aaa bbb (incl. ccc) 14291585 14438704 6106341
U.U.(W) 6789781 6882021 5940226
nnn 7335050 7534302 2963345
Have tried the command below, but I know it is incomplete:
awk 'BEGIN {FS =" "; OFS = ","} {print $1,$2,$3,$4,$5,$6} ' Input.txt
Desired output:
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
With GNU awk for gensub():
$ awk '{match($0,/[0-9 ]+$/); print substr($0,1,RSTART-1) gensub(/ /,",","g",substr($0,RSTART,RLENGTH))}' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
with other awks, save the 2nd substr() output in a var and use gsub():
awk '{match($0,/[0-9 ]+$/); digs=substr($0,RSTART,RLENGTH); gsub(/ /,",",digs); print substr($0,1,RSTART-1) digs}' file
Assuming that it's the last 3 columns that are numerical (as in your example):
awk '{for(i=1;i<=NF;++i)printf "%s%s",$i,(i<NF-3?OFS:(i<NF?",":ORS))}' file
Basically print each field followed by a space, comma or newline depending on the field number.
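The question says the real data ends in 14 numeric fields rather than 3; the cutoff generalizes by passing the count in as a variable (a sketch, with n the number of trailing numeric fields):
awk -v n=14 '{for(i=1;i<=NF;++i)printf "%s%s",$i,(i<NF-n?OFS:(i<NF?",":ORS))}' file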
Another awk (GNU awk, for gensub(); this one assumes digits never appear in the text part):
awk '$0=gensub(/ ([0-9]+)/,",\\1","g")' file
mmm,4394850,4465411,2579770
xxx yyy,2155419,2178791,1516446
aaa bbb (incl. ccc),14291585,14438704,6106341
U.U.(W),6789781,6882021,5940226
nnn,7335050,7534302,2963345
I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: an awk program is a sequence of condition{action} pairs):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet, it becomes (note that the if statements must live inside an action block to be valid awk):
awk '{
  if (NR == FNR) { arr[NR] = $1 }
  if (NR != FNR && arr[FNR]) { print $0 }
}' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current line to STDOUT if the expression is true (non-zero).
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR differs from FNR on the second file)
arr[NR]=$1 : feeds the array arr, indexed by the current NR, with the first column
if NR!=FNR we are in the second file, and if the stored array value is 1 (truthy), the line is printed
Not as clean as an awk solution:
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines; to make just a single pass through the files, I'd resort to Python. Something like this, perhaps (Python 2.7):
from itertools import izip  # izip is lazy; zip() would build the whole pair list in memory

with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        if not l1.startswith('0'):
            print l2.strip()
awk '{
    getline value < "file2"    # read the line of file2 paired with this line of file1
    if ($1)                    # field 1 of file1 is 0 or 1
        print value
}' file1
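One caveat with this getline form: if file2 runs out of lines, getline returns 0 (or -1 on error) and value silently keeps its previous contents. A defensive sketch, arguably unnecessary here since the question guarantees equal line counts:
awk '{
    if ((getline value < "file2") <= 0)   # 0 = EOF, -1 = read error
        exit
    if ($1)
        print value
}' file1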