I am trying to remove duplicate lines from a file, including the original occurrences (i.e. keep only the lines that appear exactly once), but the following command that I am trying sorts the lines, and I want them to stay in the same order as in the input file.
awk '{++a[$0]}END{for(i in a) if (a[i]==1) print i}' test.txt
Input:
123
aaa
456
123
aaa
888
bbb
Output I want:
456
888
bbb
Simpler code if you are okay with reading input file twice:
$ awk 'NR==FNR{a[$0]++; next} a[$0]==1' ip.txt ip.txt
456
888
bbb
With single pass:
$ awk '{a[NR]=$0; b[$0]++} END{for(i=1;i<=NR;i++) if(b[a[i]]==1) print a[i]}' ip.txt
456
888
bbb
If you are not worried about the order and want to do this in awk only, then you could try the following:
awk '{a[$0]++};END{for(i in a){if(a[i]==1){print i}}}' Input_file
To get the unique values in the same order in which they occur in the Input_file, try the following:
awk '
!a[$0]++{
b[++count]=$0
}
{
c[$0]++
}
END{
for(i=1;i<=count;i++){
if(c[b[i]]==1){
print b[i]
}
}
}
' Input_file
Output will be as follows.
456
888
bbb
Explanation: a detailed explanation of the above code.
awk ' ##Start of the awk program.
!a[$0]++{ ##If the current line has NOT been seen before (i.e. this is its first occurrence), do the following.
b[++count]=$0 ##Store the current line in array b at index count, which is incremented by 1 for each new unique line.
}
{
c[$0]++ ##Array c, indexed by the current line, counts how many times each line occurs.
}
END{ ##Start of the END block of this awk program.
for(i=1;i<=count;i++){ ##Loop over the unique lines in order of first appearance.
if(c[b[i]]==1){ ##If the line stored in b[i] occurred exactly once in the input, do the following.
print b[i] ##Print the value of array b.
}
}
}
' Input_file ##Mention the Input_file name here.
awk '{ b[$0]++; a[n++]=$0 } END{ for (i=0; i<n; i++){ if(b[a[i]]==1) print a[i] }}' input
Lines are counted in array b, and the input order of the lines is kept in array a.
If, at the end, a line's count is 1, the line is printed.
Sorry, I misread the question at first; I have corrected the answer, and it is now almost the same as @Sundeep's ...
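A quick check of this approach on the question's sample input; the END loop walks the numeric indices 0..n-1, since `for (i in a)` iterates in an unspecified order and would not guarantee input order:

```shell
printf '123\naaa\n456\n123\naaa\n888\nbbb\n' |
awk '{ b[$0]++; a[n++]=$0 }                       # count lines in b, record order in a
     END { for (i = 0; i < n; i++)                # walk in input order
               if (b[a[i]] == 1) print a[i] }'    # only lines seen exactly once
# prints:
# 456
# 888
# bbb
```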
I am trying to understand how two-file processing works, so I created an example.
file1.txt
zzz pq Fruit Apple 10
zzz rs Fruit Car 50
zzz tu Study Book 60
file2.txt
aa bb Book 100
cc dd Car 200
hj kl XYZ 500
ee ff Apple 300
ff gh ABC 400
I want to compare the 4th column of file1 to the 3rd column of file2; if they match, print the 3rd, 4th and 5th columns of file1, followed by the 4th column of file2 and then the sum of the 5th column of file1 and the 4th column of file2.
Expected Output:
Fruit Apple 10 300 310
Fruit Car 50 200 250
Study Book 60 100 160
Here is what I have tried:
awk ' FNR==NR{ a[$4]=$5;next} ( $3 in a){ print $3, a[$4],$4}' file1.txt file2.txt
Code output;
Book 100
Car 200
Apple 300
I am facing a problem printing file1's columns, and I don't know how to store the other columns of file1 in array a. Please guide me.
Could you please try the following:
awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt
Output will be as follows.
Study Book 60 100 160
Fruit Car 50 200 250
Fruit Apple 10 300 310
Explanation: adding an explanation for the above code now.
awk ' ##Start of the awk program.
FNR==NR{ ##FNR==NR is TRUE only while the first Input_file, file1.txt, is being read.
a[$4]=$3 OFS $4 OFS $5 ##Array a, indexed by $4, holds the 3rd, 4th and 5th fields joined by OFS (by default, OFS is a space in awk).
b[$4]=$NF ##Array b, indexed by $4, holds $NF (the last field of the current line).
next ##The next statement skips all further statements for this line.
}
($3 in a){ ##If the 3rd field of the current line (from file2.txt) is present in array a, do the following.
print a[$3],$NF,b[$3]+$NF ##Print a[$3], the last field of the current line, and the SUM of b[$3] and the last field.
}
' file1.txt file2.txt ##Mention the Input_file names file1.txt and file2.txt.
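To reproduce the result, the sample files can be recreated and the one-liner run as-is (the temporary directory here is just for the demo):

```shell
cd "$(mktemp -d)"     # scratch directory for the demo files
printf 'zzz pq Fruit Apple 10\nzzz rs Fruit Car 50\nzzz tu Study Book 60\n' > file1.txt
printf 'aa bb Book 100\ncc dd Car 200\nhj kl XYZ 500\nee ff Apple 300\nff gh ABC 400\n' > file2.txt

awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt
# prints:
# Study Book 60 100 160
# Fruit Car 50 200 250
# Fruit Apple 10 300 310
```

The output order follows file2, since the matching happens while file2 is being read.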
File1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
File2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
Desired output:
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
I tried an awk NR==FNR command but it's displaying only the last match ...
It should read every line and check whether columns 1 and 3 of file1 exist in file2.
Using awk and sort
awk 'BEGIN{
# set input and output field separator
FS=OFS=","
}
# read first file f1
# index key field1 and field3 of file1 (f1)
{
k=$1 FS $3
}
# save 2nd and last field of file1 (f1) in array a, key being k
FNR==NR{
a[k]=(k in a ? a[k] RS:"") $2 OFS $NF;
# stop processing go to next line
next
}
# read 2nd file f2 from here
# 2nd and 3rd field of file2 (f2) used as key
{
k=$2 FS $3
}
# if key exists in array a
k in a{
# split array value by RS row separator, and put it in array t
split(a[k],t,RS);
# iterate array t, print and sort
for(i=1; i in t; i++)
print $0,t[i] | "sort -t, -nk5"
}
' f1 f2
Test Results:
$ cat f1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
$ cat f2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
$ awk 'BEGIN{FS=OFS=","}{k=$1 FS $3}FNR==NR{a[k]=(k in a ? a[k] RS:"") $2 OFS $NF; next}{k=$2 FS $3}k in a{split(a[k],t,RS); for(i=1; i in t; i++)print $0,t[i] | "sort -t, -nk5" }' f1 f2
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
The following awk may help you with the same.
awk -F, '
FNR==NR{
a[FNR]=$0;
next
}
{
for(i=1;i<=length(a);i++){
print a[i] FS $2 FS $NF
}
}' Input_file2 Input_file1
Adding an explanation for the code as well.
awk -F, ' ##Set the field separator to comma for all lines.
FNR==NR{ ##FNR==NR is TRUE only while the first Input_file, File2, is being read.
##FNR and NR both hold the current line number; the only difference is that FNR is RESET whenever a new file starts, while NR keeps increasing until all Input_files are read.
a[FNR]=$0; ##Array a, indexed by FNR (the current line number), holds the current line.
next ##The next statement skips all further statements for this line.
}
{
for(i=1;i<=length(a);i++){ ##Loop i from 1 to the number of elements in array a; this runs while the 2nd Input_file is being read.
print a[i] FS $2 FS $NF ##Print a[i], then the 2nd and last fields of the current line.
}
}' File2 File1 ##Mention the Input_file names here.
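One caveat: `length(a)` on an array is a GNU awk extension (POSIX awk only defines `length` for strings). A portable sketch of the same cross-join keeps its own element counter instead:

```shell
# Same technique, but with an explicit counter n instead of length(a).
awk -F, '
FNR==NR { a[++n] = $0; next }                         # slurp File2, counting lines in n
{ for (i = 1; i <= n; i++) print a[i] FS $2 FS $NF }  # pair every File2 line with this File1 line
' File2 File1
```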
Another one with join/awk:
$ join -t, -j99 file2 file1 |
awk -F, -v OFS=, '$3==$6 && $4==$8 {print $2,$3,$4,$5,$7,$9}'
I would like to print unique lines based on the first field AND the latest date & time in the third field:
keep the occurrence of each line with the latest date and time, and remove the other duplicate occurrences.
The file has around 50 million rows and is not sorted ...
Input.csv
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
Desired Output:
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
I have attempted partial, incomplete commands; the difficulty is the date-and-time format of the file and its unsorted order ...
awk -F, '!seen[$1,$3]++' Input.csv
Looking for your suggestions ...
This awk command will do it for you:
awk -F, -v OFS=',' '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] {b[$1] =d; a[$1] = $0}
END{for(x in a)print a[x]}' file
The first line transforms the original $3 into a valid date-format string and gets the seconds since 1970 via the date command, so that we can compare later.
Two arrays are used: a holds the final rows and b the latest date (in seconds) per key.
The END block prints all rows from a.
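Since `date` is forked once per line, the command above can be slow on 50 million rows. A sketch of an alternative (my own suggestion, not the answer's code) that builds a fixed-width sortable key in pure POSIX awk, assuming every two-digit year means 20xx:

```shell
awk -F, '
BEGIN {
    # Month-name lookup so DD-MON-YY can be turned into a sortable key.
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = i
}
{
    # $3 looks like 25-SEP-14.08:09:26 -> key "20140925080926"
    split($3, d, /[-.:]/)
    key = sprintf("20%02d%02d%02d%02d%02d%02d", d[3], mon[d[2]], d[1], d[4], d[5], d[6])
    if (!($1 in best) || key > best[$1]) {
        best[$1] = key    # latest timestamp key seen for this first field
        line[$1] = $0     # the row that carries it
    }
}
END { for (k in line) print line[k] }
' Input.csv
```

Like the original, it keeps one row per unique key in memory, but spawns no external processes; the END loop prints in unspecified order, and $0 is never rebuilt, so the commas are preserved.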
test with your example data:
kent$ cat f
10,ab,15-SEP-14.11:09:06,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
20,ab,23-SEP-14.07:09:35,abc,xxx,yyy,zzz
kent$ awk -F, '{sub(/[.]/," ",$3);"date -d\""$3"\" +%s"|getline d}
!($1 in b)||d>b[$1] { b[$1] =d;a[$1] = $0 }
END{for(x in a)print a[x]}' f
10 ab 25-SEP-14 08:09:26 abc xxx yyy zzz
20 ab 23-SEP-14 08:09:35 abc xxx yyy zzz
58 ab 22-JUL-14 05:07:07 abc xxx yyy zzz
62 ab 12-SEP-14 03:09:23 abc xxx yyy zzz
This should do (note that sorting the raw DD-MON-YY.HH:MM:SS string is not chronological in general, e.g. across different months, though it happens to work for this sample):
sort -t , -k 3 file | awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}'
62,ab,12-SEP-14.03:09:23,abc,xxx,yyy,zzz
58,ab,22-JUL-14.05:07:07,abc,xxx,yyy,zzz
10,ab,25-SEP-14.08:09:26,abc,xxx,yyy,zzz
20,ab,23-SEP-14.08:09:35,abc,xxx,yyy,zzz
I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition{action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet (note the ifs must live inside an action block to be valid awk):
awk '
{
    if (NR == FNR) { arr[NR] = $1 }
    if (NR != FNR && arr[FNR]) { print $0 }
}
' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current line to STDOUT if the expression is TRUE (non-zero or non-empty).
Explanations
NR is the number of the current record counted from the start of the input
FNR is the ordinal number of the current record within the current file (so NR differs from FNR while the second file is being read)
arr[NR]=$1 : fill array arr, at index NR, with the first column
if NR!=FNR we are in the next file, and if the array value for this line number is 1 (true), the line is printed
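The implicit-print rule is easy to see in isolation: a bare pattern with no {action} prints the current record whenever the expression is true (non-zero or non-empty):

```shell
# Lines whose first field is non-zero are printed; "0" is numerically false.
printf '0\n1\n2\n' | awk '$1'
# prints:
# 1
# 2
```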
Not as clean as the awk solution (and note that /0/ matches a 0 anywhere on the pasted line):
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines; in order to do just a single pass through the files, I'd resort to Python. Something like this, perhaps (Python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1
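For reference, this last variant reads file2 in lockstep with file1 via getline; it relies on the two files having equal length (as the question guarantees), because getline returns 0 at EOF and leaves value unchanged. A quick run on the sample files:

```shell
cd "$(mktemp -d)"     # scratch directory for the demo files
printf '0 xxx\n1 yyy\n1 zzz\n0 aaa\n' > file1
printf 'A bbb\nB ccc\nC ddd\nD eee\n' > file2

awk '{
    getline value <"file2"   # read the matching line of file2 in lockstep
    if ($1)                  # first field of the file1 line: 1 is true, 0 is false
        print value
}' file1
# prints:
# B ccc
# C ddd
```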