I have two csv files I need to compare against one column.
My member.csv file looks like:
ID|lastName|firstName
01|Lastname01|Firstname01
02|Lastname02|Firstname02
The second file check-ID.csv looks like:
Lastname01|Name01|pubID01|Hash01
Lastname02|Name02|pubID02|Hash02a
Lastname03|Name03|pubID03|Hash03
Lastname02|Name02|pubID02|Hash02b
Lastname01|Name01|pubID01|Hash01b
--> Lastname03 is not in my member.csv !
What I want is to check if the value of the first column of check-ID.csv is equal to value of second column in member.csv.
My attempt with script.awk is
NR==FNR{a[$1]=$1; b[$1]=$0; next}
$2==a[$1]{ delete b[$1]}
END{for (i in b ) print b[i]}
executing with
awk -f script.awk check-ID.csv member.csv
The problem is that the result is not filtered.
I like to get a filtered and sorted output so only members are listed like this:
Lastname01|Name01|pubID01|Hash01
Lastname01|Name01|pubID01|Hash01b
Lastname02|Name02|pubID02|Hash02a
Lastname02|Name02|pubID02|Hash02b
Any help appreciated!
Could you please try following. I think you were close only thing is you could change your Input_files reading sequence. Where I am reading members Input_file first and then check-ID.csv because later Input_file has all details in it which needs to be printed and we need to only check for 2nd field from members Input_file.
awk '
BEGIN{
FS="|"
}
FNR==NR{
a[$2]
next
}
($1 in a)
' members.csv check-ID.csv |
sort -t'|' -k1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting field separator as | here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when first Input_file named members.csv is being read.
a[$2] ##Creating array a with index 2nd field here.
next ##next will skip all further statements from here.
}
($1 in a) ##Checking condition if 1st field is preent in a then print that line.
' members.csv check-ID.csv | ##Mentioning Input_file names here and sending its output to sort command.
sort -t'|' -k1 ##Sorting output(which we got from awk command above) by setting separator as | and by first field.
Related
table1.csv:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
33629|EEE
33630|FFF
Aims:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Using command:
awk 'BEGIN{FS="|";OFS="|"} {if($2=="AAA" && $2=="BBB" && $2=="CCC" && $2=="DDD"){print $1,$2}}' table1.csv
However, trying to be more automatic, since the categories may increase.
list1.csv:
AAA BBB CCC DDD
list=`cat list1.csv`
awk -v list=$list 'BEGIN{FS="|";OFS="|"} {if($2==list){print $1,$2}}' table1.csv
Which means, can I stored $2=="AAA" && $2=="BBB" ....... into a variable by using list1.csv?
Expected output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
So, any suggestion on storing the multiple condition in one variable?
Thanks all!
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i];next}FNR==1{FS="|";$0=$0}($2 in a)' list table
Output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Explained:
$ awk '
NR==FNR { # process list
for(i=1;i<=NF;i++) # hash all items in file
a[$i]
next # possibility for multiple lines
}
FNR==1 { # changing FS in the beginning of table file
FS="|"
$0=$0
}
($2 in a)' list table
Almost same logic Like James Brown's nice answer, just adding here a small variant which is setting field separator in Input_file places itself.
awk 'FNR==NR{for(i=1;i<=NF;i++){arr[$i]};next} ($2 in arr)' list FS="|" table
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when list is being read.
for(i=1;i<=NF;i++){ ##Going through all fields here.
arr[$i] ##Creating arr with index of current column value here.
}
next ##next will skip all further statements from here.
}
($2 in arr) ##Checking condition if 2nd field is present in arr then print that line from table file.
' list FS="|" table ##mentioning Input_file(s) here and setting FS as | before table file.
Here are the input:
files1.csv
21|AAAAA|1023
21|BBBBB|1203
21|CCCCC|2533
22|DDDDD|1294
22|EEEEE|1249
22|FFFFF|4129
22A|GGGGG|4121
22A|HHHHH|1284
31B|IIIII|5403
31B|JJJJJ|1249
file2.csv
21|A800
22|B900
22A|C1000
31B|D1000
expect output:
files3.csv
21|A800|AAAAA|1023
21|A800|BBBBB|1203
21|A800|CCCCC|2533
22|B900|EEEEE|1249
22|B900|FFFFF|4129
22A|C1000|GGGGG|4121
22A|C1000|HHHHH|1284
31B|D1000|IIIII|5403
31B|D1000|JJJJJ|1249
currently tried using join,
join -a1 -t '|' -1 1 -2 1 -o 1.1,2.2,1.2,1.3 file1.csv file2.csv > file3.csv
But it found that some rows missed matching, so i turn my concept to use most likely vlookup functionality for this two files. Please help.
Thanks all
Could you please try following with awk, written and tested with GNU awk with shown samples.
awk '
BEGIN{
FS=OFS="|"
}
FNR==NR{
arr[$1]=$2
next
}
($1 in arr){
$1=($1 OFS arr[$1])
}
1
' file2.csv file1.csv
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here of this program.
FS=OFS="|" ##Setting | as field separator and output field separator.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file2.csv is being read.
arr[$1]=$2 ##Creating arr with index of 1st field and value of 2nd field.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if $1 is present in arr then do following.
$1=($1 OFS arr[$1]) ##Saving current $1 OFS and value of arr with index of $1 in $1.
}
1 ##1 will print the current line.
' file2.csv file1.csv ##Mentioning Input_file names here.
I tested the join command you provided and I think it produces the intended output on my machine (FreeBSD 12.2-RELEASE):
21|A800|AAAAA|1023
21|A800|BBBBB|1203
21|A800|CCCCC|2533
22|B900|DDDDD|1294
22|B900|EEEEE|1249
22|B900|FFFFF|4129
22A|C1000|GGGGG|4121
22A|C1000|HHHHH|1284
31B|D1000|IIIII|5403
31B|D1000|JJJJJ|1249
It's possible that you need to sort both files first on the columns (or in this case where you join the first columns the whole line should also work) that you intend to join, i.e. join -a1 -t '|' -1 1 -2 1 -o 1.1,2.2,1.2,1.3 <(sort file1.csv) <(sort file2.csv) > file3.csv
I have the following situation:
FILE_1 which contains a ddl and it could be like this:
column_A varchar(50)
column_B int
column_C varchar (100)
and FILE_2 which contains what is changed in FILE_1, for example
column_B varchar(50)
I would like to update FILE_1 changing from "column_B int" to "column_B varchar(50)"
and of course I would need also all other line unchanged
using awk I managed to identify the changed line
awk 'NR==FNR{a[$1]++;next} ($1 in a)' file_1 file_2
but I don't know how I can write it into file_1
how could I achieve it?
Could you please try following, written and tested with shown samples in GNU awk.
awk 'FNR==NR{a[$1]=$2;next} ($1 in a){$2=a[$1]} 1' file_2 file_1
In case above command is printing correct results for you on terminal then you could use following which will save output into Input_file itself.
awk 'NR==FNR{a[$1]++;next} ($1 in a)' file_1 file_2 > temp && mv temp file_1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be true when first Input_file named file_2 is being read.
a[$1]=$2 ##Creating array a with index of $1 and its value is $2 from current line.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if first field is present in array a then do following.
$2=a[$1] ##Assigning array a value with index of first field to 2nd field.
}
1 ##1 will print current line here.
' file_2 file_1 ##Mentioning Input_file names here.
I have two CSV files in the form of
file1
A,44
A,21
B,65
C,79
file2
A,7
B,4
C,11
I used awk as
awk -F, 'NR==FNR{a[$1]=$0;next} ($1 in a){print a[$1]","$2 }' file1.csv file2.csv
producing
A,44,7
A,21,7
B,65,4
C,79,11
a[$1] prints the entire line from file1. How can I omit the first columns in both files (the first column is only used to match the second columns) to produce:
44,7
21,7
65,4
79,11
In other words, how can I pass the columns from the first file to the print block, as $2 does for the second file?
Could you please try following, tested and written on shown samples only.
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="," ##Setting field and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file2 is being read.
a[$1]=$2 ##Creating array a with index $1 and value is $2 from current line.
next ##next will skip all further statement from here.
}
($1 in a){ ##Statements from here will be executed when file1 is being read and it's checking if $1 is present in array a then do following.
print $2,a[$1] ##Printing 2nd field and value of array a with index $1 here.
}
' file2 file1 ##Mentioning Input_file names here.
Output will be as follows for shown samples.
44,7
21,7
65,4
79,11
2nd solution: More Generic solution, where considering that your both Input_files could have duplicates in that case it will print 1st value of A in Input_file1 to first value of Input_file2 and so on.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]
b[$1,++c[$1]]=$2
next
}
($1 in a){
print $2,b[$1,++d[$1]]
}
' file2 file1
You can join them using the join command and chose which fields you want to have in the output:
kent$ join -t',' -o 1.2,2.2 file1 file2
44,7
21,7
65,4
79,11
I have a file which lines look like this:
chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117
I now want to edit the last column to remove duplicates, so that it would only be SGIP1;MIR3117.
If I only have the last column, I can use the following awk code to remove the duplicates.
a="SGIP1;SGIP1;SGIP1;SGIP1;MIR3117"
echo "$a" | awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
This returns SGIP1;MIR3117;
However, I can not figure out how I can use this to only affect my fifth column. If I just pipe in the whole line, I get SGIP1 two times, as awk then treats everything in front of the first semicolon as one column.
Is there an elegant way to do this?
Could you please try following.
awk '
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(!found[array[i]]++){
val=(val?val ";":"")array[i]
}
}
$NF=val
val=""
}
1
' Input_file
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
{
num=split($NF,array,";") ##Using split function of awk to split last field($NF) of current line into array named array with ; delimiter.
for(i=1;i<=num;i++){ ##Running a loop fro i=1 to till total number of elements of array here.
if(!found[array[i]]++){ ##Checking condition if any element of array is NOT present in found array then do following.
val=(val?val ";":"")array[i] ##Creaating variable val and keep adding value of array here(whoever satisfy above condition).
}
}
$NF=val ##Setting val value to last field of current line here.
val="" ##Nullifying variable val here.
}
1 ##1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
I don't consider it "elegant", and it works under a certain number of assumptions.
awk -F"+" '{printf("%s+ ",$1);split($2,a,";"); for(s in a){gsub(" ", "", a[s]); if(!c[a[s]]++) printf("%s;", a[s])}}' test.txt
Tested on your input, returns:
chr1 66999275 67216822 + SGIP1;MIR3117;