Add additional fields based on field count - awk

I have data in the format below in a file:
"123","XYZ","M","N","P,Q"
"345",
"987","MNO","A,B,C"
I always want to have 5 entries in each row, so if a row has only 2 fields, then 3 extra empty fields ("") need to be added:
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
I looked at the solution on the page
Add Extra Strings Based on count of fields- Sed/Awk
which has a very similar requirement, but it fails when I try it because I also have commas (,) inside the fields.
Thanks.

In GNU awk, with your shown samples, please try the following code.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program from here setting FPAT to csv file parsing here.
BEGIN{ OFS="," } ##Starting BEGIN section of this program setting OFS to comma here.
FNR==NR{ ##Checking condition FNR==NR here, which will be true for first time file reading.
nof=(NF>nof?NF:nof) ##Create nof to get highest NF value here.
next ##next will skip all further statements from here.
}
NF<nof{ ##checking if NF is lesser than nof then do following.
val="" ##Nullify val here.
i=($0~/,$/?NF:NF+1) ##Setting value of i as per condition here.
for(;i<=nof;i++){ ##Running loop till value of nof matches i here.
val=(val?val OFS:"")s1 s1 ##Creating val which has value of "" in it.
}
sub(/,$/,"") ##Removing ending , here.
$0=$0 OFS val ##Concatenate val here.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
EDIT: Adding this version, which takes a variable named nof holding the minimum number of fields every line should have; lines with fewer fields are padded up to that number. If any line has more fields than this minimum, the larger count is used instead.
awk -v s1="\"" -v nof="5" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file

Here is one for GNU awk using FPAT, for when you "always want to have 5 entries in the row":
$ awk '
BEGIN {
FPAT="([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
NF=5 # set NF to limit too long records
for(i=1;i<=NF;i++) # iterate to NF and set empties to ""
if($i=="")
$i="\"\""
}1' file
Output:
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""

Here is an awk command that would work with any version of awk:
awk -v n=5 -v ef=',""' -F '","' '
{
sub(/,+$/, "")
for (i=NF; i<n; ++i)
$0 = $0 ef
} 1' file
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""

With perl, assuming every field is double quoted:
$ perl -pe 's/,$//; s/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
# if the , at the end of line isn't present
$ perl -pe 's/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
s|","|$&|g will search for "," and replace it back. The return value is number of replacements, which is then used to determine how many fields have to be appended.
The e flag allows you to use Perl code in the replacement section.
q operator helps to use different delimiter for single quoted string.
Here's an alternate solution that creates an array and then adds empty fields if necessary.
perl -lne '@f = /"[^"]+"|[^,]+/g; print join ",", @f, qw("") x (4 - $#f)'
/"[^"]+"|[^,]+/g defines fields as double quoted strings (with no double quote inside, so escaped quotes won't work with this solution) or non , characters (at least one, so , at end of line will be ignored).
qw("") x (4 - $#f) determines the extra fields to be appended. qw("") creates an array with single element of value "" which is then multiplied using the x operator.

Another perl way using -a for autosplit and -F to set the separator:
perl -lanF'/"*,*"/' -e 'print join ",", map "\"$_\"", @F[1..5]'
-F'/"*,*"/' - this uses an autosplit separator of a double quote optionally preceded by quotes and commas
-a uses that separator to autosplit into @F
-l adds linebreaks to print and -n will process input in stream mode w/o printing unless explicitly told to
map "\"$_\"", @F[1..5] takes exactly 5 fields, even undefined ones, and adds double quotes
print join ",", map ... takes the results of the map above, joins into a string with commas, and prints
(Note: because each line starts with a field delimiter, I'm ignoring the empty $F[0] element)

This might work for you (GNU sed):
sed ':a;s/"[^"]*"/&/5;t;s/$/,""/;ta' file
If there are 5 fields, bail out.
Otherwise, append an empty field and repeat.
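The same count-then-append loop can be sketched in portable awk. This is only an illustration, assuming every field is double-quoted and contains no embedded double quotes; pad_fields is a made-up helper name, not something taken from the answers above.

```shell
# pad_fields N: hypothetical helper that pads each line of stdin to N quoted fields
pad_fields() {
  awk -v n="$1" '
  {
    sub(/,+$/, "")                     # drop any trailing commas
    tmp = $0
    have = gsub(/"[^"]*"/, "", tmp)    # count the quoted fields on a copy of the line
    for (i = have; i < n; i++)         # append one empty quoted field per missing field
      $0 = $0 ",\"\""
    print
  }'
}

printf '%s\n' '"123","XYZ","M","N","P,Q"' '"345",' '"987","MNO","A,B,C"' | pad_fields 5
```

Counting with gsub on a copy avoids relying on field splitting at all, so commas inside quotes do not matter here.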

Related

Countif like function in AWK with field headers

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 has the range of values, and column 7 should show the number of times each value appears in column 6, as shown below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What i want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
#adds field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this, adapted from a previous similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
very similar question to this one
similar python related Q, for my ref
I would use GNU AWK for this task in the following way. Let the content of file.txt be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the 1st line. During the first pass, i.e. while the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of each value in the 6th field, store the counts in the array arr, and again go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from arr.
(tested in GNU Awk 5.0.1)
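To see how the FNR==NR test tells the two passes apart, here is a small sketch that only labels lines. show_passes and the temporary file are made up for the illustration, and it assumes the file is not empty.

```shell
# show_passes FILE: hypothetical helper that reads FILE twice and labels every line
show_passes() {
  awk '{
    pass = (FNR == NR ? "pass1" : "pass2")   # NR keeps growing, FNR restarts per file
    print pass, "FNR=" FNR, "NR=" NR
  }' "$1" "$1"
}

tmp=$(mktemp)                # throwaway two-line file for the demonstration
printf 'a\nb\n' > "$tmp"
show_passes "$tmp"
rm -f "$tmp"
```

On the second read FNR starts again at 1 while NR continues from 3, so FNR==NR is only true during the first pass.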
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

awk print sum of group of lines

I have a file with a column named (effect) which has rows separated by blank lines,
(effect)
1
1
1
(effect)
1
1
1
1
(effect)
1
1
I know how to print the sum of a column, like this:
awk '{sum+=$1;} END{print sum;}' file.txt
Using awk, how can I print the sum of each (effect) group, so that I get three lines (or more in other cases) like below?
sum=3
sum=4
sum=2
You can check if there is an (effect) part, and print the sum when encountering either the (effect) part or when in the END block.
awk '
$1 == "(effect)" { if(seen) print "sum="sum; seen = 1; sum = 0 }
/[0-9]/ { sum += $1 }
END { if (seen) print "sum="sum }
' file
Output
sum=3
sum=4
sum=2
With your shown samples, please try the following awk code, written and tested in GNU awk.
awk -v RS='(^|\n)?\\(effect\\)[^(]*' '
RT{
gsub(/\(effect\)\n|\n+[[:space:]]*$/,"",RT)
num=split(RT,arr,ORS)
print "sum="num
}
' Input_file
Explanation: This relies on GNU awk. RS is set to the regex (^|\n)?\\(effect\\)[^(]* so that each (effect) block is matched as a record separator and is available in RT. In the main program, if RT is not empty, gsub (global substitution) removes (effect)\n and \n+[[:space:]]*$ (newlines followed by whitespace at the end of the value) from RT. RT is then split into the array arr using ORS as the delimiter, the number of elements is saved in num, and sum= is printed together with num. Note that this counts the lines in each block rather than adding up their values, so it matches the expected output only because every value is 1.
With shown samples, output will be as follows:
sum=3
sum=4
sum=2
This should work in any version of awk:
awk '{sum += $1} $0=="(effect)" && NR>1 {print "sum=" sum; sum=0}
END{print "sum=" sum}' file
sum=3
sum=4
sum=2
Similar to @Ravinder's answer, but does not depend on the name of the header:
awk -v RS='' -v FS='\n' '{
sum = 0
for (i=2; i<=NF; i++) sum += $i
printf "sum=%d\n", sum
}' file
RS='' means that sequences of 2 or more newlines separate records.
The Field Separator is newline.
The for loop omits field #1, the header.
However that means that empty lines truly need to be empty: no spaces or tabs allowed. If your data might have blank lines that contain whitespace, you can set
-v RS='\n[[:space:]]*\n'
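A multi-character RS is not handled the same way by every awk, so another option is to normalise whitespace-only lines before awk sees them and keep the plain paragraph mode. The following is a sketch under that assumption; sum_paragraphs is a made-up helper name.

```shell
# sum_paragraphs: hypothetical helper; reads stdin, prints one sum per blank-line-separated block
sum_paragraphs() {
  sed 's/^[[:space:]]*$//' |           # turn whitespace-only lines into truly empty lines
  awk -v RS='' -v FS='\n' '{
    sum = 0
    for (i = 2; i <= NF; i++) sum += $i   # field 1 is the header line, so skip it
    printf "sum=%d\n", sum
  }'
}

printf '(effect)\n1\n1\n \n(effect)\n2\n3\n' | sum_paragraphs
```

The blank line in the sample input contains a single space, which sed strips so that awk sees a real paragraph break.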
$ awk -v RS='\\(effect\\)' 'NR>1{sum=0; for(i=1;i<=NF;i++) sum+=$i; print "sum="sum}' file
sum=3
sum=4
sum=2

multiple condition store in variable and use as if condition in awk

table1.csv:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
33629|EEE
33630|FFF
Aims:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Using command:
awk 'BEGIN{FS="|";OFS="|"} {if($2=="AAA" || $2=="BBB" || $2=="CCC" || $2=="DDD"){print $1,$2}}' table1.csv
However, I am trying to make this more automatic, since the categories may increase.
list1.csv:
AAA BBB CCC DDD
list=`cat list1.csv`
awk -v list=$list 'BEGIN{FS="|";OFS="|"} {if($2==list){print $1,$2}}' table1.csv
In other words, can I store $2=="AAA" || $2=="BBB" ....... in a variable by using list1.csv?
Expected output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
So, any suggestions on storing multiple conditions in one variable?
Thanks all!
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i];next}FNR==1{FS="|";$0=$0}($2 in a)' list table
Output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Explained:
$ awk '
NR==FNR { # process list
for(i=1;i<=NF;i++) # hash all items in file
a[$i]
next # possibility for multiple lines
}
FNR==1 { # changing FS in the beginning of table file
FS="|"
$0=$0
}
($2 in a)' list table
This uses almost the same logic as James Brown's nice answer; it is just a small variant that sets the field separator on the command line, right before the table file name.
awk 'FNR==NR{for(i=1;i<=NF;i++){arr[$i]};next} ($2 in arr)' list FS="|" table
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when list is being read.
for(i=1;i<=NF;i++){ ##Going through all fields here.
arr[$i] ##Creating arr with index of current column value here.
}
next ##next will skip all further statements from here.
}
($2 in arr) ##Checking condition if 2nd field is present in arr then print that line from table file.
' list FS="|" table ##mentioning Input_file(s) here and setting FS as | before table file.
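If the category list is already in a shell variable, as in the question's own attempt, the same lookup-array idea can be sketched by splitting that variable in a BEGIN block. filter_by_list and the temporary file are made up for this illustration, and the categories are assumed to contain no whitespace.

```shell
# filter_by_list "CAT1 CAT2 ..." FILE: hypothetical helper; keeps rows whose 2nd |-field is listed
filter_by_list() {
  awk -F'|' -v list="$1" '
  BEGIN {
    n = split(list, names, " ")          # a single space splits on runs of whitespace
    for (i = 1; i <= n; i++) keep[names[i]]   # referencing an element creates the key
  }
  $2 in keep' "$2"
}

tmp=$(mktemp)                # throwaway sample table
printf '%s\n' '33622|AAA' '33624|BBB' '33629|EEE' > "$tmp"
filter_by_list "AAA BBB CCC DDD" "$tmp"
rm -f "$tmp"
```

Note that the variable is quoted when passed with -v; left unquoted, the shell would split it into several arguments.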

Applying awk operation to a specific column

I have a file whose lines look like this:
chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117
I now want to edit the last column to remove duplicates, so that it would only be SGIP1;MIR3117.
If I only have the last column, I can use the following awk code to remove the duplicates.
a="SGIP1;SGIP1;SGIP1;SGIP1;MIR3117"
echo "$a" | awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
This returns SGIP1;MIR3117;
However, I can not figure out how I can use this to only affect my fifth column. If I just pipe in the whole line, I get SGIP1 two times, as awk then treats everything in front of the first semicolon as one column.
Is there an elegant way to do this?
Could you please try the following?
awk '
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(!found[array[i]]++){
val=(val?val ";":"")array[i]
}
}
$NF=val
val=""
delete found
}
1
' Input_file
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
{
num=split($NF,array,";") ##Using split function of awk to split last field($NF) of current line into array named array with ; delimiter.
for(i=1;i<=num;i++){ ##Running a loop from i=1 up to the total number of elements in array.
if(!found[array[i]]++){ ##Checking whether the current element has NOT been seen yet in found; if so, do the following.
val=(val?val ";":"")array[i] ##Appending the current element to val, separated by ; from any earlier elements.
}
}
$NF=val ##Assigning val to the last field of the current line.
val="" ##Resetting val for the next line.
delete found ##Clearing found so that duplicates are tracked per line rather than across the whole file.
}
1 ##1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
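The same approach can be sketched for an arbitrary column number instead of only the last field. dedupe_col is a made-up helper name; the sketch assumes single-space-separated columns, since assigning to a field rebuilds the line with the default output separator.

```shell
# dedupe_col N: hypothetical helper; removes repeated ;-separated items in column N of stdin
dedupe_col() {
  awk -v col="$1" '
  {
    n = split($col, parts, ";")
    out = ""
    split("", seen)                    # portable way to empty the lookup array for each line
    for (i = 1; i <= n; i++)
      if (!seen[parts[i]]++)           # true only the first time an item is met on this line
        out = (out == "" ? "" : out ";") parts[i]
    $col = out                         # rebuilds $0 using OFS (a single space by default)
    print
  }'
}

printf 'chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117\n' | dedupe_col 5
```

Because seen is emptied on every line, an item that already appeared on an earlier line is still kept on later lines.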
I don't consider it "elegant", and it works under a certain number of assumptions.
awk -F"+" '{printf("%s+ ",$1);split($2,a,";"); for(s in a){gsub(" ", "", a[s]); if(!c[a[s]]++) printf("%s;", a[s])}}' test.txt
Tested on your input, returns:
chr1 66999275 67216822 + SGIP1;MIR3117;

How can I replace all middle characters with '*'?

I would like to replace the middle of each word with ****.
For example :
ifbewofiwfib
wofhwifwbif
iwjfhwi
owfhewifewifewiwei
fejnwfu
fehiw
wfebnueiwbfiefi
Should become :
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
So far I managed to replace all but the first 2 chars with:
sed -e 's/./*/g3'
Or do it the long way:
grep -o '^..' file > start
cat file | sed 's:^..\(.*\)..$:\1:' | awk -F. '{for (i=1;i<=length($1);i++) a=a"*";$1=a;a=""}1' > stars
grep -o '..$' file > end
paste -d "" start stars > temp
paste -d "" temp end > final
I would use Awk for this, provided you have GNU Awk, which lets you set the field separator to an empty string (see How to set the field separator to an empty string?).
This way, every character becomes its own field, so you can loop through the characters and replace the desired ones with "*". In this case, replace from the 3rd to the 3rd-from-last:
$ awk 'BEGIN{FS=OFS=""}{for (i=3; i<=NF-2; i++) $i="*"} 1' file
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
If perl is okay:
$ perl -pe 's/..\K.*(?=..)/"*" x length($&)/e' ip.txt
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
..\K.*(?=..) to match characters other than first/last two characters
See regex lookarounds section for details
e modifier allows to use Perl code in replacement section
"*" x length($&) use length function and string repetition operator to get desired replacement string
You can do it with a repetitive substitution, e.g.:
sed -E ':a; s/^(..)([*]*)[^*](.*..)$/\1\2*\3/; ta'
Explanation
This works by repeating the substitution until no change happens, that is what the :a; ...; ta bit does. The substitution consists of 3 matched groups and a non-asterisk character:
(..) the start of the string.
([*]*) any already replaced characters.
[^*] the character to be replaced next.
(.*..) any remaining characters to replace and the end of the string.
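To watch the loop at work, the same substitution can be run with printing after every successful pass. This is a sketch that assumes a sed accepting -E; mask_trace and the label name show are made up for the illustration.

```shell
# mask_trace: hypothetical helper; prints the pattern space after each successful substitution
mask_trace() {
  sed -n -E '
    :a
    s/^(..)([*]*)[^*](.*..)$/\1\2*\3/
    t show
    b
    :show
    p
    b a
  '
}

printf 'abcdefg\n' | mask_trace
```

Each printed line has exactly one more asterisk than the previous one, and the loop stops once fewer than two characters would remain after the next candidate.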
Alternative GNU sed answer
You could also do this by using the hold space which might be simpler to read, e.g.:
h # save a copy to hold space
s/./*/g3 # replace all but 2 by *
G # append hold space to pattern space
s/^(..)([*]*)..\n.*(..)$/\1\2\3/ # reformat pattern space
Run it like this:
sed -Ef parse.sed input.txt
Output in both cases
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
The following awk may help you with this. It should work in any version of awk.
awk '{len=length($0);for(i=3;i<=(len-2);i++){val=val "*"};print substr($0,1,2) val substr($0,len-1);val=""}' Input_file
Adding a multi-line form of the solution too.
awk '
{
len=length($0);
for(i=3;i<=(len-2);i++){
val=val "*"};
print substr($0,1,2) val substr($0,len-1);
val=""
}
' Input_file
Explanation: Adding explanation now for above code too.
awk '
{
len=length($0); ##Creating variable named len whose value is length of current line.
for(i=3;i<=(len-2);i++){ ##Starting a for loop that runs from i=3 up to len-2 and does the following:
val=val "*"}; ##Appending one * to val for each middle character.
print substr($0,1,2) val substr($0,len-1);##Printing the first 2 chars, then val, then the last 2 chars of the current line.
val="" ##Resetting val so that it starts empty for the next line.
}
' Input_file ##Mentioning the Input_file name here.
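A loop-free variant of the same idea is sketched below: it builds the run of asterisks with sprintf and gsub, and leaves lines of four characters or fewer untouched. mask_middle is a made-up helper name, and the short-line rule is an assumption, since the question does not say what should happen to such words.

```shell
# mask_middle: hypothetical helper; masks all but the first and last two characters of each line
mask_middle() {
  awk '{
    n = length($0)
    if (n <= 4) { print; next }                  # assumption: nothing to mask in short lines
    stars = sprintf("%" (n - 4) "s", "")         # n-4 spaces ...
    gsub(/ /, "*", stars)                        # ... turned into n-4 asterisks
    print substr($0, 1, 2) stars substr($0, n - 1)
  }'
}

printf '%s\n' ifbewofiwfib fehiw abc | mask_middle
```

The guard also avoids printing overlapping substrings for words shorter than four characters.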