extract and print all occurrences of disk file (.img) from a configuration file - awk

I have VM configuration files from which I need to print all the disk images (26 alphanumeric characters followed by .img) that exist within each file.
Here is an extract of one of the files:
[root@~]# cat demo_vm.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000a17dfe12ac74818f.img,xvda,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000e66ace31dac64d98.img,xvdb,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000082fbb45a02e24096.img,xvdd,w']
I want to extract the following (all references of 26-alphanum .img in the file):
0004fb0000120000a17dfe12ac74818f.img
0004fb0000120000e66ace31dac64d98.img
0004fb000012000082fbb45a02e24096.img
Some files have 3 disks and some have only one. For those I usually run the following and get what I want, but in case of multiple occurrences I can only print the first one.
# awk -F [/,] '/disk/ { print $6}' demo_vm.cfg
0004fb0000120000a17dfe12ac74818f.img
Thanks in advance; I spent hours trying splits and regex patterns without a conclusive result.
This is my first question on Stack Overflow.
EDIT
Here are the 3 types of content, put in separate files (1 = one 26[alnum].img occurrence, 2 = two occurrences, 3 = three occurrences):
# cat demo_vm_1.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000065a82a4df5e7112b.img,xvda,w']
[root ~]# cat demo_vm_2.cfg
disk = ['file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb0000120000822cb8b0602ee042.img,xvda,w', 'file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb000012000073d5fd864a0ba6b1.img,xvdb,w']
# cat demo_vm_3.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000a17dfe12ac74818f.img,xvda,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000e66ace31dac64d98.img,xvdb,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000082fbb45a02e24096.img,xvdd,w']
Initial script
My initial script, which creates the remove commands for the .cfg files and for the images referenced inside each of them, had a problem when the cfg had more than one disk reference. I guess I can adapt it now to use grep -Eo instead of awk (see the sketch after the script below).
strings=(`find /vm_backup/VirtualMachines/*/vm.cfg`)
for i in "${strings[@]}"; do
echo "rm -f $i" >> drop_vm_final.sh
awk -F [/,] '/disk/ { print $6}' "$i" | awk '{print "rm -f /vm_backup/VirtualDisks/"$0}' >>drop_vm_bkp_final.sh
done
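A minimal sketch of the adapted loop (assuming the image names are 32 alphanumeric characters as in the samples above, so the full name survives the match and the rm paths stay correct):
strings=(`find /vm_backup/VirtualMachines/*/vm.cfg`)
for i in "${strings[@]}"; do
echo "rm -f $i" >> drop_vm_final.sh
# one rm line per image referenced in the cfg, however many disks it has
grep -Eo '[[:alnum:]]{32}\.img' "$i" | awk '{print "rm -f /vm_backup/VirtualDisks/"$0}' >> drop_vm_bkp_final.sh
done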

$ grep -Eo '[[:alnum:]]{26}\.img' file
0000120000a17dfe12ac74818f.img
0000120000e66ace31dac64d98.img
000012000082fbb45a02e24096.img
If that's not all you need then edit your question to provide more truly representative sample input/output that it doesn't work for.
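If the names are in fact 32 alphanumeric characters long, as in your samples, and you want the whole name rather than just the last 26 characters, bump the quantifier (an assumption based on your shown samples):
$ grep -Eo '[[:alnum:]]{32}\.img' file
0004fb0000120000a17dfe12ac74818f.img
0004fb0000120000e66ace31dac64d98.img
0004fb000012000082fbb45a02e24096.img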

Could you please try the following, based on your shown samples.
awk '
match($0,/[[:alnum:]]{26}\.img/){
print substr($0,RSTART,RLENGTH)
}
' Input_file
OR, to get all the matched values when a single line contains several, try the following.
awk '
{
while(match($0,/[[:alnum:]]{26}\.img/)){
print substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}' Input_file
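With the three-disk demo_vm.cfg shown above (all disks on one line), the first version prints only the first match per line, while the while-loop version should print all three; note that, like the grep answer above, the {26} match keeps only the last 26 characters of the 32-character names:
0000120000a17dfe12ac74818f.img
0000120000e66ace31dac64d98.img
000012000082fbb45a02e24096.img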
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
while(match($0,/[[:alnum:]]{26}\.img/)){ ##Running while loop to match alpha numerics 26 in number followed by .img if this match found then do following.
print substr($0,RSTART,RLENGTH) ##Printing matched sub string of that matched regex from current line.
$0=substr($0,RSTART+RLENGTH) ##Saving rest of the line(after matched string) to current line here.
}
}' Input_file ##mentioning Input_file name here.

Based on your code
awk -F [/,] '/disk/ { print $6}' demo_vm.cfg
you can complete the print by adding $14 and $22 (this assumes exactly three disks on the line):
awk -F [/,] '{ print $6,$14,$22}' OFS='\n' demo_vm.cfg
0004fb0000120000a17dfe12ac74818f.img
0004fb0000120000e66ace31dac64d98.img
0004fb000012000082fbb45a02e24096.img
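If the number of disks varies from file to file, a field-loop variant avoids hard-coding $6, $14 and $22; this is just a sketch based on your shown samples:
awk -F '[/,]' '{ for (i=1; i<=NF; i++) if ($i ~ /\.img$/) print $i }' demo_vm.cfg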

Related

awk to extract days from line

I have the following csv file
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
I'm trying to store each line in a specific file matching the date from each line, with the date being in the 4th column of each line, so the first line ("2020-10-01 00:40:36") should go in output-01.csv, the second line in output-03.csv, etc.
This awk command
awk -F";|-" -vOFS='\t' '{print > "output-"$7".csv"}' testing.csv
half works, but it fails on line 3 because of the - in the 3rd column (this produces output-10.csv), and on line 4 because of the space in the 3rd column.
Is there a way to run the awk command twice? Then I could extract the date using the ; separator and then split on -.
Using gawk (for the 3-argument match()) takes care of an unsorted file too:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){
file=sprintf("output-%s.csv",arr[3]);
if(!seen[file]++){
print >file;
next
}
}{
print >>file;
close(file);
}' infile
Explanation:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){ # match for regex
file=sprintf("output-%s.csv",arr[3]); # file variable using array arr value, 3rd index
if(!seen[file]++){ # if not seen file name before in array seen
print >file; # print content to file
next # go to next line
}
}{
print >>file; # append content to file
close(file); # close file
}' infile
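With the shown testing.csv this should leave one file per day, e.g.:
$ head output-*.csv
==> output-01.csv <==
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
==> output-03.csv <==
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
==> output-04.csv <==
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
==> output-08.csv <==
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640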
Try this:
$ awk -F';' -v OFS='\t' '{split($4,a,/[- ]/); file = "output-"a[3]".csv";
$1=$1; print > file; close(file)}' testing.csv
split($4,a,/[- ]/) this will split 4th field further based on space or - characters, saved in array a
file = "output-"a[3]".csv" output filename
$1=$1 since no other command changes the contents of the input line, this is needed to rebuild it, otherwise OFS will not be applied
print > file print input line to required file
close(file) calling close is useful to avoid the too-many-open-files error when there are many file names
You can also use file = "output-" substr($4,10,2) ".csv" instead of split if the 4th column is consistent as shown in the sample.
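That is, something like this sketch (assuming the 4th field always looks like "2020-10-01 00:40:36", so the day sits at offset 10 counting the opening quote):
awk -F';' -v OFS='\t' '{$1=$1; file = "output-" substr($4,10,2) ".csv"; print > file; close(file)}' testing.csv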
With your shown samples, please try the following, written and tested in GNU awk.
awk '
match($0,/[0-9]{4}(-[0-9]{2}){2}/){
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv"
print >> (outputFile)
close(outputFile)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(-[0-9]{2}){2}/){ ##using match function to match yyyy-mm-dd here in line.
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv" ##Building the output file name from the day part of the matched date.
print >> (outputFile) ##Printing current line into outputFile here.
close(outputFile) ##Closing output file to avoid too many files opened error.
}
' Input_file ##Mentioning Input_file name here.
To do this efficiently you should sort on the key field first:
awk -F';' '{print $4, NR, $0}' file |
sort -k1,1 -k3,3n |
awk '
{ curr=$1; sub(/([^ ]+ ){3}/,"") }
curr != prev { close(out); out="output-" (++c) ".csv"; prev=curr }
{ print > out }
'
$ head output*.csv
==> output-1.csv <==
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
==> output-2.csv <==
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
==> output-3.csv <==
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
==> output-4.csv <==
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
The above will work using any awk+sort in any shell on every Unix box. See the many similar examples on this site for an explanation.

Concatenate the sequence to the ID in fasta file

Here is my input file
>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC
The expected output file is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
I've tried the code from Remove line breaks in a FASTA file
but this doesn't work for me, and I am not sure how to modify the code from that post...
Any suggestion? Thanks in advance!
Here is another awk script, using awk's internal record parsing.
awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt
Output is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Explanation:
awk '
BEGIN { # initialize awk internal variables
RS=">"; # set `RS`=record separator to `>`
OFS=""; # set `OFS`=output field separator to empty string.
}
NR>1 { # handle from 2nd record (1st record is empty).
$1=$1; # regenerate the output line
print ">"$0 # print out ">" with computed output line
}' input.txt
$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Could you please try the following too.
awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}' Input_file
My original attempt was awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file, but later I saw it had already been answered by dudi boy, so I wrote the other (first mentioned) one.
Similar to my answer here:
$ awk 'BEGIN{RS=">"; FS="\n"}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ print ">" name seq }' file1.fasta file2.fasta file3.fasta ...

Count field separators on each line of input file and if missing/exceeding, output filename to error file

I have to validate the input file, Input.txt, for the proper number of field separators on each row; if even one row, including the header, is missing or exceeding the correct number of field separators, then print the name of the file to errorfiles.txt and exit.
I have another file, valid.txt, to use as a reference for the correct number of field separators; I compare the number of field separators on each row of the input file with the number in valid.txt.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
This is not working.
It is not fully clear what your requirement is; to print the FILENAME when just a single input file is provided, perhaps you wanted to loop over a list of files in a directory running this command?
Anyway, to use the content of the file in the context of awk, just use its -v switch with input redirection on the file:
awk -F '|' -v count="$(<fscount)" -v fname="errorfiles.txt" '(NF-1) != (count+0) { print FILENAME > fname; close(fname); exit}' Input.txt
Notice the use of close(fname) here, which is generally required when you are manipulating files inside awk constructs. The close() call explicitly closes the file descriptor associated with the file pointed to by fname, instead of letting the OS do it.
GNU awk solution:
awk -F '|' 'ARGIND==1{aimNF=NF; nextfile} ARGIND==2{if (NF!=aimNF) {print FILENAME > "errorfiles.txt"; exit}}' valid.txt Input.txt
You can do it with just one command: use awk to read two files, store the NF count of the 1st file, and compare it in the second file.
For other awk you can replace ARGIND==1 with FILENAME==ARGV[1], and so on.
Or if you are sure first file won't be empty, use NR==FNR and NR>FNR instead.
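For example, a sketch of that portable variant (assuming valid.txt is not empty):
awk -F '|' 'NR==FNR{aimNF=NF; next} NR>FNR && NF!=aimNF{print FILENAME > "errorfiles.txt"; exit}' valid.txt Input.txt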

Removing content of a column based on number of occurences

I have a file (; separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only that $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deleting in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
this awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
(NF-- drops the last field, rebuilding the line with OFS, and the trailing 7 is just an always-true pattern, awk shorthand for {print}.)
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b; do
if [[ "$a" -lt "3" ]]; then
# anchor the pattern so only the trailing 3rd field (and its leading ;) is removed
sed -i "s/;$b\$//" b.txt
fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
Operation is based on the output of cut | sort | uniq -c, which counts the occurrences:
$ cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. On the first run you gather a statistic that you use in the second run:
script.awk
FNR == NR { stats[ $3 ]++
next
}
{ if( stats[$3] < 3) print $1 FS $2
else print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).

awk to read specific column from a file

I have a small problem and I would appreciate help with it.
In summary, I have a file:
1,5,6,7,8,9
2,3,8,5,35,3
2,46,76,98,9
I need to read specific columns from it and print them into another text document. I know I can use (awk '{print $2, $3}') to print the second and third columns beside each other. However, when I use two statements, (awk '{print $2}' >> file.text) then (awk '{print $3}' >> file.text), the two columns appear under each other and not beside each other.
How can I make them appear beside each other?
If you must extract the columns in separate processes, use paste to stitch them together. I assume your shell is bash/zsh/ksh, and I assume the blank lines in your sample input should not be there.
paste -d, <(awk -F, '{print $2}' file) <(awk -F, '{print $3}' file)
produces
5,6
3,8
46,76
Without the process substitutions:
awk -F, '{print $2}' file > tmp1
awk -F, '{print $3}' file > tmp2
paste -d, tmp1 tmp2 > output
Update based on your answer:
On first appearance, that's a confusing setup. Does this work?
for (( x=1; x<=$number_of_features; x++ )); do
feature_number=$(sed -n "$x {p;q}" feature.txt)
if [[ ! -f out.txt ]]; then
cut -d, -f$feature_number file.txt > out.txt
else
paste -d, out.txt <(cut -d, -f$feature_number file.txt) > tmp &&
mv tmp out.txt
fi
done
That has to read the file.txt file a number of times. It would clearly be more efficient to only have to read it once:
awk -F, -v numfeat="$number_of_features" '
# read the feature file into an array
NR==FNR {
colno[++i] = $0
next
}
# now, process the file.txt and emit the desired columns
{
sep = ""
for (i=1; i<=numfeat; i++) {
printf "%s%s", sep, $(colno[i])
sep = FS
}
print ""
}
' feature.txt file.txt > out.txt
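As a quick check: if feature.txt contains the two lines 2 and 3 (so number_of_features is 2), out.txt should end up with the same columns as the paste example above:
5,6
3,8
46,76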
Thanks all for contributing answers. I believe I should have been clearer in my question, sorry for that.
My code is as follow:
for (( x = 1; x <= $number_of_features ; x++ )) # the number extracted from a text file
do
feature_number=$(awk 'FNR == "'$x'" {print}' feature.txt)
awk -F, '{print $"'$feature_number'"}' file.txt >> out.txt
done
Basically, I extract the feature number (which is the same as the column number) from a text document and then print that column. The text document may contain many feature numbers.
The thing is, each time I have different feature numbers (which reflect the column numbers), so applying the above solutions is not sufficient for this problem.
I hope it is clearer now.
Waiting for your comments please.
Thanks
Ahmad
Instead of using awk's file redirection, use shell redirection, e.g.
awk -F, '{print $2,$3}' file >> file.text
The comma in the print statement is replaced with the value of the output field separator (a space by default).