Ensuring that opening data before awk condition is dealt with - awk

I have XML which contains:
</body></text></xml>
<?xml version="1.0" encoding="utf-8"><?xml-stylesheet type="text/xsl" href="stylesheetv1_1.xsl" ?><text><body>
I need to split the file at each XML declaration.
I've been trying the following awk line, but it fails, and I don't know why. Any help gratefully received.
awk '/<?xml v/{filename=NR".xml"}; {print >filename}' sourcefile.xml
where sourcefile.xml contains the data to be split.
I thought it might be an issue with escaping the question mark, but that seems not to be the issue.
The xml tag is preceded by \r\n
I'm using Gitbash for Windows.
What I need to end up with are a load of separate files, all of which end with
</body></text></xml>
and begin with
<?xml version="1.0" etc
The shell responds with 'expression for `>' redirection has null string value' but I'm afraid I'm not sure what that means. I also get no output files at all.

The error you are getting means that your redirection out to a file is pointing to a filename that is undefined. Your filename variable is blank at some point during script execution.
Try setting that filename variable at the BEGIN block of the awk script to insure that the records occuring before your first "<?xml v" match has somewhere to go:
awk 'BEGIN{filename="prexmlgarbage.xml"} /<\?xml v/{filename=NR".xml"}; {print >filename}' sourcefile.xml
I've also added an escape character before the question mark so you are properly matching on the string <?xml v
You could also put a condition before your print block, if you don't want to capture records before your first "<?xml v" hit:
awk '/<\?xml v/{filename=NR".xml"}; filename!=""{print >filename}' sourcefile.xml

Related

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which have a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.

replacing fasta headers gives mismatch

probably a simple issue, but I cannot seem to solve it like this.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but headers were adjusted in the process. I want to add the original headers since it includes essential information.
I selected the headers from the subset (subset.fasta) using grep, and used this to match and extract the headers from the original file, giving 'correct.headers'. They are the same number of headers and in the same order, so this should be ok.
I found the code below which should do what I want according to the description. I've only started learning awk, so I can't really control it, though.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
I then used this file to replace the headers in the subset with the original ones using above awk line, but this gives a mismatch at a certain point and adds additional lines.
result
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!

removing unconventional field separators (^#^#^#) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
this solution edits the file in place, important if the file is still being used. passing -i'ext' creates a backup of the original file with 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is example how to remove NULL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursivity, you may use globbing option **/*.txt (if it is supported by your shell).
Useful for scripting since sed and its -i parameter is a non-standard BSD extension.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove trailing null character at the end of a PDF file using PHP, . This is independent of OS
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDF's from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes being appended to the PDF file. When our system processed these files, files that had the trailing NULL value caused the system to crash.
Originally we were using sed but sed behaves differently on Macs and Linux machines. We needed a platform independent method to extract the trailing null value. Php was the best option. Also, it was a PHP application so it made sense :)
This script performs the following operation:
Take the binary file, convert it to HEX (binary files don't like exploding by new lines or carriage returns), explode the string using carriage return as the delimiter, pop the last member of the array if the value is null, implode the array using carriage return, process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and tailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using carriage return as the delminiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using carriage return as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);

Mark duplicate headers in a fasta file

I have a big Fasta file which I want to modify. It basically consists of many sequences with headers that start ">". My Problem is, that some of the Headers are not unique, even though the Sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big Fasta File and append numbers to the duplicates, so that I know which duplicate it is (1,2,3,...,x). When a new duplicate is found (one with another header), the counter should start from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit, that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
THX in advance for the help.
An awk script:
BEGIN {
OFS="\n";
ORS=RS=">";
}
{
name = $1;
$1 = "";
suffix = names[name] ? "-" names[name] : "";
print name suffix $0, "\n";
names[name]++;
}
The above uses the ">" as a record separator, and checks the first field (which is the header name that can be duplicated). For each line it prints, it adds a suffix after the header name for each additional time the field appears (i.e. '-1' for the first dup, '-2' for the second...)

inserting character in file , jython

I have written a simple program where to read the first 4 characters and get the integer of it and read those many character and write xxxx after it . Although the program is working the only issues instead of inserting the character , its replacing.
file = open('C:/40_60.txt','r+')
i=0
while 1:
char = int(file.read(4))
if not char: break
print file.read(char)
file.write('xxxx')
print 'done'
file.close()
I am having issue with writing data .
considering this is my sample data
00146456135451354500107589030015001555854640020
and expected output is
001464561354513545xxxx00107589030015001555854640020
but actually my above program is giving me this output
001464561354513545xxxx7589030015001555854640020
ie. xxxx overwrites 0010.
Please suggest.
Files do not support an "insert"-operation. To get the effect you want, you need to rewrite the whole file. In your case, open a new file for writing; output everything you read and in addition, output your 'xxxx'.