Count field separators on each line of input file and if missing/exceeding, output filename to error file - awk

I have to validate an input file, Input.txt, for the correct number of field separators on each row. If even one row (including the header) has too few or too many field separators, I want to print the name of the file to errorfiles.txt and exit.
I have another file, valid.txt, to use as the reference for the correct number of field separators; the number of field separators on each row of the input file should be compared with the number in valid.txt.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
This is not working.

It is not fully clear what your requirement is: you print FILENAME, yet only a single input file is given. Perhaps you wanted to loop over a list of files in a directory and run this command on each?
Anyway, to use the content of a file inside awk, pass it in as a variable with the -v switch, reading the file in the shell with $(<fscount) (a bash shorthand for $(cat fscount)):
awk -F '|' -v count="$(<fscount)" -v fname="errorfiles.txt" '(NF-1) != (count+0) { print FILENAME > fname; close(fname); exit}' Input.txt
Notice the use of close(fname) here, which is generally good practice when you write to files from inside awk. The close() call explicitly closes the file descriptor that awk opened for the file named by fname, instead of leaving it open until awk exits.
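To see why the original attempt reports every file: inside an awk program, "cat fscount" is just a string constant, not a command that gets run, so NF-1 is compared against that text and the test is always true. A minimal sketch with made-up input (the count of 2 is hard-coded here only for the demonstration):

```shell
# Comparing a number with the literal string "cat fscount" makes awk fall back
# to a string comparison, so the two are never equal and != is always true.
literal=$(printf 'a|b|c\n' | awk -F '|' '{ print ((NF-1) != "cat fscount") }')

# Passing the expected count in with -v works; count+0 forces a numeric compare.
passed=$(printf 'a|b|c\n' | awk -F '|' -v count=2 '{ print ((NF-1) != (count+0)) }')

echo "literal string: $literal, -v variable: $passed"
# prints: literal string: 1, -v variable: 0
```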

GNU awk solution:
awk -F '|' 'ARGIND==1{aimNF=NF; nextfile} ARGIND==2{if (NF!=aimNF) {print FILENAME > "errorfiles.txt"; exit}}' valid.txt Input.txt
You can do it with a single command: have awk read both files, store the NF of the first file's first line, and compare it against each line of the second file.
For other awks you can replace ARGIND==1 with FILENAME==ARGV[1], and so on.
Or, if you are sure the first file won't be empty, use NR==FNR and NR>FNR instead.
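A sketch of the FILENAME==ARGV[1] variant mentioned above, for awks without ARGIND. The file contents are made up for the demonstration, and the scratch directory from mktemp is only there to keep the example self-contained:

```shell
tmp=$(mktemp -d)                                     # scratch directory for the demo
printf 'a|b|c\n' > "$tmp/valid.txt"                  # reference: 3 fields, 2 separators
printf 'h1|h2|h3\n1|2|3\n4|5\n' > "$tmp/Input.txt"   # last row is one field short

(
  cd "$tmp" || exit 1
  awk -F '|' '
    FILENAME == ARGV[1] { if (FNR == 1) want = NF; next }    # reference file: keep the count from line 1
    NF != want { print FILENAME > "errorfiles.txt"; exit }   # input file: stop at the first bad row
  ' valid.txt Input.txt
)

result=$(cat "$tmp/errorfiles.txt")
echo "$result"    # prints: Input.txt
```

Comparing NF with NF is equivalent to comparing separator counts, since each row has NF-1 separators.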

Related

awk to extract days from line

I have the following csv file
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
I'm trying to store each line in a file named after the day in that line's date, which is in the 4th column, so the first line ("2020-10-01 00:40:36") should go to output-01.csv, the second line to output-03.csv, etc.
This awk command
awk -F";|-" -vOFS='\t' '{print > "output-"$7".csv"}' testing.csv
half works, but fails on line 3 because of the - in the 3rd column, and on line 4 because of the space in the 3rd column; this produces output-10.csv
Is there a way to run the awk command twice? Then I could extract the date using the ; separator and then split it using -.
Using gawk; this handles an unsorted file too:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){
file=sprintf("output-%s.csv",arr[3]);
if(!seen[file]++){
print >file;
next
}
}{
print >>file;
close(file);
}' infile
Explanation:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){ # match for regex
file=sprintf("output-%s.csv",arr[3]); # file variable using array arr value, 3rd index
if(!seen[file]++){ # if not seen file name before in array seen
print >file; # print content to file
next # go to next line
}
}{
print >>file; # append content to file
close(file); # close file
}' infile
Try this:
$ awk -F';' -v OFS='\t' '{split($4,a,/[- ]/); file = "output-"a[3]".csv";
$1=$1; print > file; close(file)}' testing.csv
split($4,a,/[- ]/) this will split 4th field further based on space or - characters, saved in array a
file = "output-"a[3]".csv" output filename
$1=$1 since there's no other command changing contents of input line, this is needed to rebuild input line, otherwise OFS will not be applied
print > file print input line to required file
close(file) calling close, useful if there are too many file names
You can also use file = "output-" substr($4,10,2) ".csv" instead of split if the 4th column is consistent as shown in the sample.
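A quick check of the two ways of pulling the day out of the 4th field, run on the first sample line. Both rely on the 4th field keeping its surrounding double quotes, as in the sample:

```shell
line='238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640'

# split on "-" or space: a[1] is the year (with the leading quote), a[2] the month, a[3] the day
day_split=$(printf '%s\n' "$line" | awk -F';' '{ split($4, a, /[- ]/); print a[3] }')

# fixed offsets: character 1 of $4 is the opening quote, so the day starts at position 10
day_substr=$(printf '%s\n' "$line" | awk -F';' '{ print substr($4, 10, 2) }')

echo "$day_split $day_substr"    # prints: 01 01
```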
With your shown samples, please try following, written and tested in GNU awk.
awk '
match($0,/[0-9]{4}(-[0-9]{2}){2}/){
outputFile="output-" substr($0,RSTART+8,RLENGTH-8) ".csv"
print >> (outputFile)
close(outputFile)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(-[0-9]{2}){2}/){ ##using match function to match yyyy-mm-dd here in line.
outputFile="output-" substr($0,RSTART+8,RLENGTH-8) ".csv" ##Building the output file name from the day (dd) part of the matched date.
print >> (outputFile) ##Printing current line into outputFile here.
close(outputFile) ##Closing output file to avoid too many files opened error.
}
' Input_file ##Mentioning Input_file name here.
To do this efficiently you should sort on the key field first:
awk -F';' '{print $4, NR, $0}' file |
sort -k1,1 -k3,3n |
awk '
{ curr=$1; sub(/([^ ]+ ){3}/,"") }
curr != prev { close(out); out="output-" (++c) ".csv"; prev=curr }
{ print > out }
'
$ head output*.csv
==> output-1.csv <==
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
==> output-2.csv <==
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
==> output-3.csv <==
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
==> output-4.csv <==
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
The above will work using any awk+sort in any shell on every Unix box. See the many similar examples on this site for an explanation.
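For reference, a sketch of the same decorate, sort, undecorate idea that names each file after the day of the month, as the question asks (output-01.csv and so on). It is a variation, not the answer's exact code: it uses a tab as the temporary delimiter, and the third sample row's timestamp is altered here so that two rows share a day:

```shell
tmp=$(mktemp -d)    # scratch directory for the demo
cat > "$tmp/testing.csv" <<'EOF'
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
238013750030646-2;;"something-else";"2020-10-01 09:00:00";;"started";0;0;283940640
EOF

(
  cd "$tmp" || exit 1
  tab=$(printf '\t')
  # decorate: prefix each row with its day and its original line number
  awk -F';' '{ print substr($4, 10, 2) "\t" NR "\t" $0 }' testing.csv |
  sort -t "$tab" -k1,1 -k2,2n |
  awk -F'\t' '
    $1 != prev { if (out != "") close(out); out = "output-" $1 ".csv"; prev = $1 }
    { sub(/^[^\t]*\t[^\t]*\t/, ""); print > out }   # undecorate, then write the original row
  '
)

rows01=$(awk 'END { print NR }' "$tmp/output-01.csv")
row08=$(cat "$tmp/output-08.csv")
echo "output-01.csv has $rows01 rows"
```

Because the rows arrive grouped by day, only one output file is open at a time and each file is opened once.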

extract and print all occurrences of disk file (.img) from a configuration file

I have vm configuration files from which I need to print all the disks (26 alphanumeric characters followed by .img) existing within each file.
here is an extract of one of the files
[root@~]# cat demo_vm.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000a17dfe12ac74818f.img,xvda,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000e66ace31dac64d98.img,xvdb,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000082fbb45a02e24096.img,xvdd,w']
I want to extract the below (all references of 26alphanum.img in the file) :
0004fb0000120000a17dfe12ac74818f.img
0004fb0000120000e66ace31dac64d98.img
0004fb000012000082fbb45a02e24096.img
Some files have 3 disks and some have only one. For the single-disk case I usually run the command below and get what I want, but when there are multiple occurrences I can only print the first one.
# awk -F [/,] '/disk/ { print $6}' demo_vm.cfg
0004fb0000120000a17dfe12ac74818f.img
Thanks in advance. I spent hours trying splits and regex patterns without a conclusive result.
This is my first question on Stack Overflow.
EDIT
here are the 3 types of content put in separate files (1= one 26[alnum].img occurrence, 2= two 26[alnum].img occurrences , 3= three 26[alnum].img occurrences )
# cat demo_vm_1.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000065a82a4df5e7112b.img,xvda,w']
[root ~]# cat demo_vm_2.cfg
disk = ['file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb0000120000822cb8b0602ee042.img,xvda,w', 'file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb000012000073d5fd864a0ba6b1.img,xvdb,w']
# cat demo_vm_3.cfg
disk = ['file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000a17dfe12ac74818f.img,xvda,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb0000120000e66ace31dac64d98.img,xvdb,w', 'file:/OVS/Repositories/0004fb00000300007b8afb76a3377693/VirtualDisks/0004fb000012000082fbb45a02e24096.img,xvdd,w']
Initial script
My initial script, which creates the rm commands for the .cfg files and for the images referenced inside each of them, had a problem when a cfg had more than one disk reference. I guess I can now adapt it to use grep -Eo instead of awk.
strings=(`find /vm_backup/VirtualMachines/*/vm.cfg`)
for i in "${strings[@]}"; do
echo "rm -f $i" >> drop_vm_final.sh
awk -F [/,] '/disk/ { print $6}' "$i" | awk '{print "rm -f /vm_backup/VirtualDisks/"$0}' >>drop_vm_bkp_final.sh
done
$ grep -Eo '[[:alnum:]]{26}\.img' file
0000120000a17dfe12ac74818f.img
0000120000e66ace31dac64d98.img
000012000082fbb45a02e24096.img
If that's not all you need then edit your question to provide more truly representative sample input/output that that doesn't work for.
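A sketch of how the question's loop might look with grep -Eo. Note that the image names in the sample are 32 characters long, so {26} prints only the last 26 of them; this sketch assumes the whole name is wanted and uses [[:alnum:]]+ instead. The scratch directory stands in for /vm_backup/VirtualMachines, and the commands are only printed, not run:

```shell
tmp=$(mktemp -d)    # stand-in for the directory holding the vm.cfg files
cat > "$tmp/vm.cfg" <<'EOF'
disk = ['file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb0000120000822cb8b0602ee042.img,xvda,w', 'file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb000012000073d5fd864a0ba6b1.img,xvdb,w']
EOF

# One rm line for the cfg itself, then one per image name found in it.
# grep -o prints every match on its own line, however many disks there are.
script=$(
  for cfg in "$tmp"/*.cfg; do
      printf 'rm -f %s\n' "$cfg"
      grep -Eo '[[:alnum:]]+\.img' "$cfg" |
          sed 's|^|rm -f /vm_backup/VirtualDisks/|'
  done
)
printf '%s\n' "$script"
```

Globbing the files directly also avoids the word splitting that comes with filling an array from unquoted find output.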
Could you please try following based on your shown samples.
awk '
match($0,/[[:alnum:]]{26}\.img/){
print substr($0,RSTART,RLENGTH)
}
' Input_file
Or, to get every matched value on a line (not just the first), try the following.
awk '
{
while(match($0,/[[:alnum:]]{26}\.img/)){
print substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
while(match($0,/[[:alnum:]]{26}\.img/)){ ##Running while loop to match alpha numerics 26 in number followed by .img if this match found then do following.
print substr($0,RSTART,RLENGTH) ##Printing matched sub string of that matched regex from current line.
$0=substr($0,RSTART+RLENGTH) ##Saving rest of the line(after matched string) to current line here.
}
}' Input_file ##mentioning Input_file name here.
Based on your code
awk -F [/,] '/disk/ { print $6}' demo_vm.cfg
you can extend the print by adding $14 and $22 (this assumes exactly three disks in the same layout)
awk -F [/,] '{ print $6,$14,$22}' OFS='\n' demo_vm.cfg
0004fb0000120000a17dfe12ac74818f.img
0004fb0000120000e66ace31dac64d98.img
0004fb000012000082fbb45a02e24096.img
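If the number of disks varies, a loop over the fields avoids hard-coding $6, $14 and $22. A minimal sketch with the same field separator, shown on the two-disk sample:

```shell
cfg="disk = ['file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb0000120000822cb8b0602ee042.img,xvda,w', 'file:/OVS/Repositories/0004fb0000030000a079ca25909e5455/VirtualDisks/0004fb000012000073d5fd864a0ba6b1.img,xvdb,w']"

# Split on / and , as before, then print every field that ends in .img
disks=$(printf '%s\n' "$cfg" |
    awk -F '[/,]' '/disk/ { for (i = 1; i <= NF; i++) if ($i ~ /\.img$/) print $i }')

printf '%s\n' "$disks"
# prints:
# 0004fb0000120000822cb8b0602ee042.img
# 0004fb000012000073d5fd864a0ba6b1.img
```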

How to extract word from a string that may/may not start with a single quote

Sample string:
'kernel-rt|kernel-alt|/kernel-' 'headers|xen|firmware|tools|python|utils'
cut -d' ' -f 1 string.txt gives me
'kernel-rt|kernel-alt|/kernel-'
But how do we proceed further to get just the 'kernel' from it?
Assuming you want only the 3rd kernel (the one in /kernel-) and not the others
'kernel-rt|kernel-alt|/kernel-' 'headers|xen|firmware|tools|python|utils'
Here is how to extract it with a single awk command (this uses gawk; the three-argument match() is a gawk extension).
input="kernel-rt|kernel-alt|/kernel-' 'headers|xen|firmware|tools|python|utils"
echo $input|awk -F"|" '{split($3,a,"-");match(a[1],"[[:alnum:]]+",b);print b[0]}'
explanation
-F"|" sets the field separator to |, so that only the 3rd field is needed
split($3,a,"-") split 3rd field by -, left part assigned to a[1]
match(a[1],"[[:alnum:]]+",b) from a[1] extract sequence of alphanumeric string into b[0]
print b[0] output the matched string.
If you want to extract kernel from the 2nd or 1st field instead, change $3 to $2 or $1.
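For awks without the three-argument match(), the same steps can be sketched with the two-argument form, which sets RSTART and RLENGTH. The bracket range [A-Za-z0-9] is used here in place of [[:alnum:]] only because some older awks lack POSIX character classes:

```shell
input="'kernel-rt|kernel-alt|/kernel-' 'headers|xen|firmware|tools|python|utils'"

word=$(printf '%s\n' "$input" | awk -F '|' '{
    split($3, a, "-")                     # a[1] is "/kernel"
    if (match(a[1], /[A-Za-z0-9]+/))      # sets RSTART and RLENGTH on success
        print substr(a[1], RSTART, RLENGTH)
}')

echo "$word"    # prints: kernel
```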
$ cat file
'kernel-rt|kernel-alt|/kernel-' 'headers|xen|firmware|tools|python|utils'
$
$ awk '{print $1}' file
'kernel-rt|kernel-alt|/kernel-'
$
$ awk '{gsub(/\047/,"",$1); print $1}' file
kernel-rt|kernel-alt|/kernel-
$
$ awk '{gsub(/\047/,""); split($1,f,/[|]/); print f[1]}' file
kernel-rt
and just to make you think...
$ awk '{gsub(/\047|\|.*/,"")}1' file
kernel-rt

Remove bad characters from file name while splitting with awk

I have a large file that I split with awk, using the last column as the name for the new files, but one of the values in that column includes a "/", which gives a "can't open file" error.
I have tried to make a function to transform the file name, but awk doesn't use it when I run it. Maybe the error is in this part:
tried_func() {
echo $1 | tr "/" "_"
}
awk -F ',' 'NR>1 {fname="a_map/" tried_func $NF".csv"; print >> fname;
close(fname)}' large_file.csv
Large_file.csv
A, row, I don't, need
plenty, with, columns, good_name
alot, off, them, another_good_name
more, more, more, bad/name
expected res:
list of files in a_map:
good_name.csv
another_good_name.csv
bad_name.csv
actual res:
awk: can't open file a_map/bad/name.csv
It doesn't need to be a function; if I can just skip the "/" within awk, that is fab too.
Awk is not part of the shell, it's an independent programming language, so you can't call shell functions that way. Instead, just do the whole thing within awk:
$ awk -F ',' '
NR>1 {
gsub(/\//,"_",$NF) # replace /s with _s
fname="a_map/" $NF ".csv"
print >> fname
close(fname)
}' file
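One thing to be aware of: gsub() on $NF changes the field itself, so awk rebuilds the row with the output separator (a space by default) and the written row contains bad_name rather than bad/name. If the rows should be written unchanged, a sketch that sanitizes a copy of the field instead (the sample rows here are written without spaces after the commas, which is an assumption about the real file):

```shell
tmp=$(mktemp -d)    # scratch directory for the demo
mkdir "$tmp/a_map"
cat > "$tmp/large_file.csv" <<'EOF'
A,row,I don't,need
plenty,with,columns,good_name
more,more,more,bad/name
EOF

(
  cd "$tmp" || exit 1
  awk -F ',' '
  NR > 1 {
      name = $NF                    # work on a copy so $0 stays untouched
      gsub(/\//, "_", name)         # replace every / with _
      fname = "a_map/" name ".csv"
      print >> fname
      close(fname)
  }' large_file.csv
)

written=$(cat "$tmp/a_map/bad_name.csv")
echo "$written"    # prints: more,more,more,bad/name
```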

Use AWK to search through fasta file, given a second file containing sequence names

I have two files. One is a fasta file containing multiple fasta sequences, while the other contains the names of the candidate sequences I want to search for (example files below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help to fix the one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 seq.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
-A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt seq.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
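The answers above assume each sequence sits on a single line, as in the sample. If a fasta file wraps sequences over several lines or adds a description after the name, a flag-based sketch may be more forgiving. The shortened sequences and the "some description" text below are made up for the demonstration:

```shell
tmp=$(mktemp -d)    # scratch directory for the demo
printf '%s\n' '>Clone_18' 'GTTACG' \
              '>Clone_23 some description' 'GTTACGGG' 'GGCCGAAA' \
              '>Clone_27-1' 'GTTACGGGGACC' > "$tmp/seq.fasta"
printf '%s\n' 'Clone_23' 'Clone_27-1' > "$tmp/name.txt"

picked=$(awk '
    NR == FNR { want[$1]; next }                   # first file: remember the wanted names
    /^>/      { keep = (substr($1, 2) in want) }   # header line: decide for the whole record
    keep                                           # print header and sequence lines while keep is set
' "$tmp/name.txt" "$tmp/seq.fasta")

printf '%s\n' "$picked"
```

Matching on the exact name (the first word of the header without its >) also avoids a shorter name such as Clone_2 picking up Clone_23.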