Understand the code of Split file to fasta

I understand the matching pattern, but the code only matches lines starting with ">chr". How do the sequence lines end up in the output file?
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

Please go through the following explanation.
awk ' ##Starting awk program here.
/^>chr/{ ##If the current line starts with the string >chr (a header line), do the following.
OUT=substr($0,2) ".fa" ##Setting variable OUT to the current line from its 2nd character to the end (dropping the leading >) with .fa appended, e.g. chr22.fa.
} ##Closing block for condition ^>chr here.
{ ##This block has no pattern, so it runs for every line: header lines and sequence lines alike.
print >> OUT ##Appending the current line to the file named by OUT. OUT keeps the value set by the last header line, which is how sequence lines reach the right file.
close(OUT) ##Closing the file after each write, since keeping many files open can lead to a "too many open files" error.
}
' Input_File ##Mentioning Input_file here which we are processing through awk.
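To see why sequence lines follow their header, it can help to print the target file name next to each line instead of writing files. The following is a minimal sketch of that idea, not part of the original answer; the sample input and the `trace` variable are made up for illustration.

```shell
# Trace which output file each input line would be written to.
# The input lines below are sample data, not from a real genome file.
trace=$(printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' '>chr1' 'agse' |
  awk '/^>chr/ {OUT = substr($0, 2) ".fa"}   # runs only on header lines
       {print OUT " <- " $0}')               # runs on every line; OUT keeps its last value
printf '%s\n' "$trace"
```

Each sequence line is reported against the file name set by the most recent header line, because OUT is an ordinary variable that persists between input lines.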

Related

How to extract data in such a pattern using grep or awk?

I have multiple instances of the following pattern in my document:
Dipole Moment: [D]
X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438
I want to extract the total dipole moment, so 1.5438. How can I pull this off?
When I throw in grep "Dipole Moment: [D]" filename, I don't get the line after. I am new to these command line interfaces. Any help you can provide would be greatly appreciated.

Please try the following. Written and tested with the shown samples in GNU awk.
awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/Dipole Moment: \[D\]/{ ##Checking if line contains Dipole Moment: \[D\] escaped [ and ] here.
found=1 ##Setting found to 1 here.
next ##next skips the remaining rules for this line and moves on to the next line.
}
found{ ##If found is set (non-zero and non-empty), this is the line right after the match.
print $NF ##Printing last field of current line here.
found="" ##Resetting found so that later lines are not printed.
}
' Input_file ##Mentioning Input_file name here.
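As a quick check of the flag-based approach, here is a small worked run on two instances of the pattern. This is only a sketch: the second set of dipole values and the `totals` variable are invented for illustration.

```shell
# Feed two sample blocks through the same flag-based awk logic.
# The second block of numbers is hypothetical test data.
totals=$(printf '%s\n' \
  'Dipole Moment: [D]' \
  '    X:     1.5279      Y:     0.1415      Z:     0.1694     Total:     1.5438' \
  'some unrelated line' \
  'Dipole Moment: [D]' \
  '    X:     0.1000      Y:     0.2000      Z:     0.3000     Total:     0.3742' |
  awk '/Dipole Moment: \[D\]/ {found = 1; next}   # remember that the header was seen
       found {print $NF; found = 0}')             # next line: print last field, reset flag
printf '%s\n' "$totals"
```

The unrelated line in the middle is not printed because the flag has already been reset by then.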

Sed alternative:
sed -rn '/^Dipole/{n;s/(^[[:space:]]{5}.*[[:space:]]{5})(.*)(([[:space:]]{5}.*+[:][[:space:]]{5}.*){3})/\2/p}' file
Search for the line beginning with "Dipole", then read the next line. The regular expression splits that line into three sections; the substitution replaces the whole line with the second section only and prints the result.

How to find a match to a partial string and then delete the string from the reference file using awk?

I have a problem that I have been trying to solve, but have not been able to figure out how to do it. I have a reference file that has all of the devices in my inventory by bar code.
Reference file:
PTR10001,PRINTER,SN A
PTR10002,PRINTER,SN B
PTR10003,PRINTER,SN C
MON10001,MONITOR,SN A
MON10002,MONITOR,SN B
MON10003,MONITOR,SN C
CPU10001,COMPUTER,SN A
CPU10002,COMPUTER,SN B
CPU10003,COMPUTER,SN C
What I would like to do is make a file where I only have to put the abbreviation of what I need on it.
File 2 would look like this:
PTR
CPU
MON
MON
The desired output of this would be a file that would tell me what items by barcode that I need to pull off the shelf.
Desired output file:
PTR10001
CPU10001
MON10001
MON10002
As seen in the output, since I cannot have 2 of the same barcode, I need it to look through the reference file and find the first match. After the number is copied to the output file, I would like to remove the number from the reference file so that it doesn't repeat the number.
I have tried several iterations of awk, but have not been able get the desired output.
The closest that I have gotten is the following code:
awk -F'/' '{ key = substr($1,1,3) } NR==FNR {id[key]=$1; next} key in id { $1=id[key] } { print }' $file1 $file2 > $file3
I am writing this in ksh, and would like to use awk as I think it would be the best answer to the problem.
Thanks for helping me with this.

First solution:
From your detailed description, I assume order doesn't matter, as you want to know what to pull off the shelf. So you could do the opposite, first read file2, count the items, and then go to the shelf and get them.
awk -F, 'FNR==NR{c[$0]++; next} c[substr($1,1,3)]-->0{print $1}' file2 file1
output:
PTR10001
MON10001
MON10002
CPU10001
Second solution:
Your awk is very close to what you want, but you need a second dimension in your array so that existing ids are not overwritten. We will do it with a pseudo-2-d array (BTW GNU awk has real 2-dimensional arrays): we store the ids as a comma-separated string like PTR10001,PTR10002,PTR10003, retrieve the first one with split, and then remove it from the shelf.
> cat tst.awk
BEGIN { FS="," }
NR==FNR {
key=substr($1,1,3)
ids[key] = (ids[key]? ids[key] "," $1: $1) #append new id.
next
}
$0 in ids {
split(ids[$0], tmp, ",")
print(tmp[1])
ids[$0]=substr(ids[$0],length(tmp[1])+2) #remove from shelf
}
Output
awk -f tst.awk file1 file2
PTR10001
CPU10001
MON10001
MON10002
Here we keep the order of file2 as this is based on the idea you have tried.

Please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
iniVal[$0]++
next
}
{
counter=substr($0,1,3)
}
counter in iniVal{
if(++currVal[counter]<=iniVal[counter]){
print $1
if(currVal[counter]==iniVal[counter]){ delete iniVal[counter] }
}
}
' Input_file2 FS="," Input_file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which is true when Input_file2 is being read.
iniVal[$0]++ ##Creating array iniVal with index of current line with increment of 1 each time it comes here.
next ##next will skip all further statements from here.
}
{
counter=substr($0,1,3) ##Setting counter to the first 3 characters of the current line of Input_file1.
}
counter in iniVal{ ##Checking if counter is present in iniVal then do following.
if(++currVal[counter]<=iniVal[counter]){ ##Incrementing currVal for this prefix and checking that it is still less than or equal to the requested count in iniVal.
print $1 ##Printing 1st field of current line here.
if(currVal[counter]==iniVal[counter]){ ##Checking if the requested count for this prefix has now been reached.
delete iniVal[counter] ##If so, deleting the prefix from iniVal so that later lines with this prefix are skipped.
}
}
}
' Input_file2 FS="," Input_file1 ##Mentioning Input_file names here.
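The question also asks for the picked items to be removed from the reference file, which the answers above do not write out. A possible sketch, assuming it is acceptable to produce two new files rather than editing the reference in place; the names picked.txt and remaining.txt are hypothetical, and the picked items come out in reference-file order rather than request order.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Reference inventory and request list, copied from the question.
printf '%s\n' \
  'PTR10001,PRINTER,SN A' 'PTR10002,PRINTER,SN B' 'PTR10003,PRINTER,SN C' \
  'MON10001,MONITOR,SN A' 'MON10002,MONITOR,SN B' 'MON10003,MONITOR,SN C' \
  'CPU10001,COMPUTER,SN A' 'CPU10002,COMPUTER,SN B' 'CPU10003,COMPUTER,SN C' > file1
printf '%s\n' PTR CPU MON MON > file2

awk -F, '
FNR == NR {need[$0]++; next}           # first file: count requests per prefix
(need[substr($1, 1, 3)]-- > 0) {       # this prefix is still needed
  print $1 > "picked.txt"              # barcode to pull off the shelf
  next
}
{print > "remaining.txt"}              # everything else stays in the inventory
' file2 file1

cat picked.txt
```

After checking the result, remaining.txt could replace the reference file, for example with mv remaining.txt file1.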

AIX/KSH Extract string from a comma separated line

I want to extract the part "virtual_eth_adapters" from the following comma separated line:
lpar_io_pool_ids=none,max_virtual_slots=300,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1","virtual_scsi_adapters=166/client/1/ibm/166/0,266/client/2/ibm/266/0",virtual_eth_adapters=116/0/263,proc_mode=shared,min_proc_units=0.5,desired_proc_units=2.0,max_proc_units=8.0
I'm using AIX with ksh.
I found a workaround with awk and the -F flag to separate the string on a delimiter and then print the item by its position. But if the input string changes, the position may differ...

1st solution: Please try the following in case you want to print the string virtual_eth_adapters too in the output.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART,RLENGTH)
}
' Input_file
Output will be as follows.
virtual_eth_adapters=116/0/263
2nd solution: In case you want to print only the value for virtual_eth_adapters, try the following. The offset 21 is the length of the string virtual_eth_adapters= which is skipped.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART+21,RLENGTH-21)
}
' Input_file
Output will be as follows.
116/0/263
Explanation: Adding explanation for code.
awk ' ##Starting awk program here.
match($0,/virtual_eth_adapters[^,]*/){ ##Using match function of awk here, to match from string virtual_eth_adapters up to (but not including) the next comma.
print substr($0,RSTART,RLENGTH) ##Printing the sub-string that starts at RSTART and is RLENGTH characters long; match sets RSTART and RLENGTH when the regex is found.
}
' Input_file ##Mentioning Input_file name here.

I use this approach to get data out of the middle of lines.
awk -F'virtual_eth_adapters=' 'NF>1{split($2,a,",");print a[1]}' file
116/0/263
It's short and easy to learn (no counting or regex needed).
-F'virtual_eth_adapters=' splits the line on virtual_eth_adapters=
NF>1 is true if there is more than one field (the line contains virtual_eth_adapters=)
split($2,a,",") splits the part after the separator into array a on ,
print a[1] prints the first element of array a

And one more solution (assuming the position of the string is fixed):
awk -F\, '{print $7}'
If you need only the value, try this:
awk -F\, '{print $7}'|awk -F\= '{print $2}'
It is also possible to get the value this way:
awk -F\, '{split($7,a,"=");print a[2]}'
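Because the position-based form depends on the key being the 7th comma-separated item, a lookup by key name may hold up better if the line changes. The following is a rough sketch along those lines, assuming an awk that supports the POSIX -v option; the variable names `key`, `line`, and `value` are illustrative. It splits on every comma, so it only suits values that contain no commas themselves (the quoted adapter lists would be cut at their first comma).

```shell
# Sample line from the question.
line='lpar_io_pool_ids=none,max_virtual_slots=300,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1","virtual_scsi_adapters=166/client/1/ibm/166/0,266/client/2/ibm/266/0",virtual_eth_adapters=116/0/263,proc_mode=shared,min_proc_units=0.5,desired_proc_units=2.0,max_proc_units=8.0'

value=$(printf '%s\n' "$line" | awk -v key="virtual_eth_adapters" '
{
  n = split($0, parts, ",")              # break the line into comma-separated items
  for (i = 1; i <= n; i++) {
    item = parts[i]
    sub(/^"/, "", item)                  # drop a leading double quote, if any
    if (index(item, key "=") == 1) {     # item starts with key=
      print substr(item, length(key) + 2)   # print what follows the = sign
      next                               # done with this line
    }
  }
}')
printf '%s\n' "$value"
```

Changing the value passed with -v key=... selects a different attribute, for example proc_mode.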

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
I cannot add comments yet, so I am asking here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it will result in three files:
chr22.fa
chr1.fa
chr14.fa
As an example, in chr22.fa:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand how awk splits the input with the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!

Please go through the following and let me know if this helps you.
awk ' ##Starting awk program here.
/^>chr/{ ##If the line starts with the string >chr then do the following.
OUT=substr($0,2) ".fa" ##Setting variable OUT to the substring of the current line from its 2nd character to the end, with .fa concatenated to it.
}
{
print >> OUT ##Appending the current line to the file whose name is the value of variable OUT. This runs for every line, not only header lines.
close(OUT) ##Closing the output file named by OUT. Basically this is to avoid a "too many open files" error.
}' Input_File ##Mentioning Input_file name here.
You could also refer to the awk man page for the functions used, as follows.
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.

The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer to the end of the file OUT
Append the line $0 followed by ORS to the file OUT
Close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should close the file only when you have finished writing to it, not every time you write to it. Currently, if you have a fasta record of 100 lines, the file is opened and closed 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}' Input_File
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.
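Here is a small worked run of the close-on-new-header idea with the sample input from the question. It is a sketch rather than the answer's exact command: the OUT != "" guards are an added assumption so that lines appearing before the first header are skipped instead of being written to an empty file name.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Sample input from the question.
printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' '>chr1' 'aweharhaerh' 'agse' \
              '>chr14' 'gasegaseg' > Input_File

awk '
/^>chr/ {
  if (OUT != "") close(OUT)        # finished with the previous record
  OUT = substr($0, 2) ".fasta"     # file name for the new record
}
OUT != "" {print > OUT}            # write only once a header has been seen
' Input_File

cat chr22.fasta
```

Within one awk run, > truncates a file only when it is opened, so later lines of the same record are added to it. If the same header occurred twice in the input, the second occurrence would overwrite the first because the file is reopened after close.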

remove all lines in a file containing a string from another file

I'd like to remove all the lines of a file based on matching a string from another file. This is what I have used but it only deletes some:
grep -vFf to_delete.csv inputfile.csv > output.csv
Here are sample lines from my input file (inputfile.csv):
Ata,Aqu,Ama3,Abe,0.053475,0.025,0.1,0.11275,0.1,0.15,0.83377
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Ata135,Atb,Aca,Am54,0.14695,0.1,0.2,0.05255,0.025,0.075,0.8005,
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.
My file "to_delete.csv" looks like this for example:
Aqu
Aca
So any line with those strings should get deleted, in this case, lines 1 and 3 should get deleted. Sample desired output:
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.

EDIT: Since the OP had carriage return characters in the files, adding a solution for that too.
cat -v Input_file ##To check if carriage returns are there or not.
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
Since your samples of Input_file and expected output are not fully clear, I couldn't test this completely. If you are OK with awk, please try the following; the > temp_file && mv temp_file Input_file part at the end saves the output back into Input_file itself.
awk -F, 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++){if($i in a){next}}} 1' to_delete.csv Input_file > temp_file && mv temp_file Input_file
Explanation: Adding explanation for above code too now.
awk -F, ' ##Setting field separator as comma here.
FNR==NR{ ##Checking condition FNR==NR, which is TRUE only while the first file (to_delete.csv) is being read.
a[$0] ##Creating an array named a whose index is $0.
next ##next will skip all further statements from here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop from value i=1 to till value of NF.
if($i in a){ ##checking if $i is present in array a if yes then go into this condition block.
next ##next will skip all further statements(since we DO NOt want to print any matching contents)
} ##Closing if block now.
} ##Closing for block here.
} ##Closing block which should be executed for 2nd Input_file here.
1 ##awk works on pattern/action pairs; 1 is an always-true pattern with no action, so the default action (printing the current line) runs.
' to_delete.csv Input_file ##Mentioning Input_file names here now.
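The carriage-return cleanup and the filtering can also be combined in one awk pass, without modifying the files first. The following is a minimal sketch of that variation, not the answer's exact method; the sample rows are shortened versions of the ones in the question, and the `kept` variable is only there to capture the output.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Sample files with Windows-style (CRLF) line endings; rows are shortened.
printf '%s\r\n' Aqu Aca > to_delete.csv
printf '%s\r\n' \
  'Ata,Aqu,Ama3,Abe,0.053475' \
  'Ata135,Aru2,Aba301,A29,0.055525' \
  'Ata135,Atb,Aca,Am54,0.14695' \
  'Adc,Aru7,Ama301,Agr84,0.002075' > inputfile.csv

kept=$(awk -F, '
{sub(/\r$/, "")}                   # strip a trailing carriage return from every line
FNR == NR {del[$0]; next}          # first file: remember the strings to delete
{
  for (i = 1; i <= NF; i++)
    if ($i in del) next            # a field matches exactly: drop the line
  print                            # no field matched: keep the line
}' to_delete.csv inputfile.csv)
printf '%s\n' "$kept"
```

The comparison is per field and exact, so Aqu removes a line only when a whole comma-separated field equals Aqu.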
