I have data in below format in a file
"123","XYZ","M","N","P,Q"
"345",
"987","MNO","A,B,C"
I always want to have 5 entries in the row , so if the count of fields in 2 then 3 extra ("") needs to be added.
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
I looked upto the solution on the page
Add Extra Strings Based on count of fields- Sed/Awk
which has very similar requirement but when I try it fails as I have comma (,) within the field also.
Thanks.
In GNU awk with your shown samples, please try following code.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program from here setting FPAT to csv file parsing here.
BEGIN{ OFS="," } ##Starting BEGIN section of this program setting OFS to comma here.
FNR==NR{ ##Checking condition FNR==NR here, which will be true for first time file reading.
nof=(NF>nof?NF:nof) ##Create nof to get highest NF value here.
next ##next will skip all further statements from here.
}
NF<nof{ ##checking if NF is lesser than nof then do following.
val="" ##Nullify val here.
i=($0~/,$/?NF:NF+1) ##Setting value of i as per condition here.
for(;i<=nof;i++){ ##Running loop till value of nof matches i here.
val=(val?val OFS:"")s1 s1 ##Creating val which has value of "" in it.
}
sub(/,$/,"") ##Removing ending , here.
$0=$0 OFS val ##Concatinate val here.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
EDIT: Adding this code here, where keeping a variable named nof where we can give our number of fields value which should be added minimum in all missing lines, in case any line is having more than minimum field values then it will take that value to add those many number of fields in missing field line.
awk -v s1="\"" -v nof="5" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Here is one for GNU awk using FPAT when [you] always want to have 5 entries in the row :
$ awk '
BEGIN {
FPAT="([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
NF=5 # set NF to limit too long records
for(i=1;i<=NF;i++) # iterate to NF and set empties to ""
if($i=="")
$i="\"\""
}1' file
Output:
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
Here is a an awk command that would work with any version of awk:
awk -v n=5 -v ef=',""' -F '","' '
{
sub(/,+$/, "")
for (i=NF; i<n; ++i)
$0 = $0 ef
} 1' file
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
With perl, assuming every field is double quoted:
$ perl -pe 's/,$//; s/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
# if the , at the end of line isn't present
$ perl -pe 's/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
s|","|$&|g will search for "," and replace it back. The return value is number of replacements, which is then used to determine how many fields have to be appended.
The e flag allows you to use Perl code in the replacement section.
q operator helps to use different delimiter for single quoted string.
Here's an alternate solution that creates an array and then adds empty fields if necessary.
perl -lne '#f = /"[^"]+"|[^,]+/g; print join ",", #f, qw("") x (4 - $#f)'
/"[^"]+"|[^,]+/g defines fields as double quoted strings (with no double quote inside, so escaped quotes won't work with this solution) or non , characters (at least one, so , at end of line will be ignored).
qw("") x (4 - $#f) determines the extra fields to be appended. qw("") creates an array with single element of value "" which is then multiplied using the x operator.
Another perl way using -a for autosplit and -F to set the separator:
perl -lanF'/"*,*"/' -e 'print join ",", map "\"$_\"", #F[1..5]'
-F'/"*,*"/' - this uses an autosplit separator of double quote optionally preceeded by commas and quotes
-a uses that separator to autosplit into #F
-l adds linebreaks to print and -n will process input in stream mode w/o printing unless explicitly told to
map "\"$_\"", #F[1..5] takes exactly 5 fields, even undefined ones, and adds double quotes
print join ",", map ... takes the results of the map above, joins into a string with commas, and prints
(Note: because each line starts with a field delimiter, I'm ignoring the empty $F[0] element)
This might work for you (GNU sed):
sed ':a;s/"[^"]*"/&/5;t;s/$/,""/;ta' file
If there are 5 fields, bail out.
Otherwise, append an empty field and repeat.
Here are the input:
files1.csv
21|AAAAA|1023
21|BBBBB|1203
21|CCCCC|2533
22|DDDDD|1294
22|EEEEE|1249
22|FFFFF|4129
22A|GGGGG|4121
22A|HHHHH|1284
31B|IIIII|5403
31B|JJJJJ|1249
file2.csv
21|A800
22|B900
22A|C1000
31B|D1000
expect output:
files3.csv
21|A800|AAAAA|1023
21|A800|BBBBB|1203
21|A800|CCCCC|2533
22|B900|EEEEE|1249
22|B900|FFFFF|4129
22A|C1000|GGGGG|4121
22A|C1000|HHHHH|1284
31B|D1000|IIIII|5403
31B|D1000|JJJJJ|1249
currently tried using join,
join -a1 -t '|' -1 1 -2 1 -o 1.1,2.2,1.2,1.3 file1.csv file2.csv > file3.csv
But it found that some rows missed matching, so i turn my concept to use most likely vlookup functionality for this two files. Please help.
Thanks all
Could you please try following with awk, written and tested with GNU awk with shown samples.
awk '
BEGIN{
FS=OFS="|"
}
FNR==NR{
arr[$1]=$2
next
}
($1 in arr){
$1=($1 OFS arr[$1])
}
1
' file2.csv file1.csv
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here of this program.
FS=OFS="|" ##Setting | as field separator and output field separator.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file2.csv is being read.
arr[$1]=$2 ##Creating arr with index of 1st field and value of 2nd field.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if $1 is present in arr then do following.
$1=($1 OFS arr[$1]) ##Saving current $1 OFS and value of arr with index of $1 in $1.
}
1 ##1 will print the current line.
' file2.csv file1.csv ##Mentioning Input_file names here.
I tested the join command you provided and I think it produces the intended output on my machine (FreeBSD 12.2-RELEASE):
21|A800|AAAAA|1023
21|A800|BBBBB|1203
21|A800|CCCCC|2533
22|B900|DDDDD|1294
22|B900|EEEEE|1249
22|B900|FFFFF|4129
22A|C1000|GGGGG|4121
22A|C1000|HHHHH|1284
31B|D1000|IIIII|5403
31B|D1000|JJJJJ|1249
It's possible that you need to sort both files first on the columns (or in this case where you join the first columns the whole line should also work) that you intend to join, i.e. join -a1 -t '|' -1 1 -2 1 -o 1.1,2.2,1.2,1.3 <(sort file1.csv) <(sort file2.csv) > file3.csv
I have two CSV files in the form of
file1
A,44
A,21
B,65
C,79
file2
A,7
B,4
C,11
I used awk as
awk -F, 'NR==FNR{a[$1]=$0;next} ($1 in a){print a[$1]","$2 }' file1.csv file2.csv
producing
A,44,7
A,21,7
B,65,4
C,79,11
a[$1] prints the entire line from file1. How can I omit the first columns in both files (the first column is only used to match the second columns) to produce:
44,7
21,7
65,4
79,11
In other words, how can I pass the columns from the first file to the print block, as $2 does for the second file?
Could you please try following, tested and written on shown samples only.
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="," ##Setting field and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file2 is being read.
a[$1]=$2 ##Creating array a with index $1 and value is $2 from current line.
next ##next will skip all further statement from here.
}
($1 in a){ ##Statements from here will be executed when file1 is being read and it's checking if $1 is present in array a then do following.
print $2,a[$1] ##Printing 2nd field and value of array a with index $1 here.
}
' file2 file1 ##Mentioning Input_file names here.
Output will be as follows for shown samples.
44,7
21,7
65,4
79,11
2nd solution: More Generic solution, where considering that your both Input_files could have duplicates in that case it will print 1st value of A in Input_file1 to first value of Input_file2 and so on.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]
b[$1,++c[$1]]=$2
next
}
($1 in a){
print $2,b[$1,++d[$1]]
}
' file2 file1
You can join them using the join command and chose which fields you want to have in the output:
kent$ join -t',' -o 1.2,2.2 file1 file2
44,7
21,7
65,4
79,11
I am trying to filter out content in CSV file based on condition on 2nd column.
Example :
myfile.csv:
A,2,Z
C,1,B
D,9,X
BB,3,NN
DD,8,PP
WA,10,QR
exclude.list
2
9
8
desired output file
C,1,B
BB,3,NN
WA,10,QR
If i wanted to exclude 2 , i could use : awk -F',' ' $2!="2" {print }' myfile.csv. I am trying to figure how to iterate over exclude.list file to exclude all values in the file.
1st Solution (preferred): Following awk may help you.
awk 'FNR==NR{a[$1];next} !($2 in a)' exclude.list FS="," myfile.csv
2nd Solution (Comprehensive): Adding one more awk by changing Input_file(s) sequence of reading, though first solution is more preferable I am adding this to cover all possibilities of solutions :)
awk '
FNR==NR{
a[$2]=$0;
if(!b[$2]++){ c[++i]=$2 };
next}
($1 in a) { delete a[$1]}
END{
for(j=1;j<=i;j++){
if(a[c[j]]){print a[c[j]]}
}}
' FS="," myfile.csv FS=" " exclude.list