I'm using ragel with C as the host language.
I can recognise a newline simply with '\n', but I need to recognise the start of the file as an alternative.
In other regex implementations this could be given by \A or ^, but in Ragel ^ and $ are reserved for other purposes (class negation and action embedding), '\a' maps to something else (the alarm character), and \A gives a parser error.
I don't think there's an escape sequence for that. However, you can detect it by checking whether Ragel's ts pointer sits at the very start of the input (offset 0).
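For example, inside any scanner action you can compare ts against the buffer you handed to the machine. A minimal sketch, assuming the whole input sits in a single buffer (the name buf is illustrative):

if (ts == buf) {
    /* the current token starts at the very beginning of the file */
}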
In text formats you have three choices:
Old Mac OS (up to Mac OS 9) uses \r (as do Commodore, Apple II, and Microware OS-9)
Unices and modern Mac OS X use \n (as do BeOS, AmigaOS, MorphOS, RISC OS, and Multics)
Windows uses \r\n (as do DOS, OS/2, Symbian, and DEC RT-11)
In Ragel, the end of a line can be defined as follows (using %{ ... } for the leaving action, since # starts a comment in Ragel):
endline = ( "\r" | "\n" )+ %{ increase_line_number(); };
The start of a line is then matched by (any - endline); see the sketch below.
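Putting the two together, a line-by-line machine might look like this (a sketch; the trailing pattern accepts a final line that has no newline):

line = ( any - endline )* endline;
main := line* ( any - endline )*;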
Related
I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
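If you prefer, the conventional placement is equivalent:

tr -d '\000' < file-with-nulls > file-without-nulls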
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' instead creates a backup of the original file with the 'ext' suffix appended.
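For example, with GNU sed:

sed -i.bak 's/\x0//g' null.txt

keeps the untouched original as null.txt.bak.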
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
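For example (assuming the file really is UTF-16; iconv may need the exact variant, such as UTF-16LE, if the file has no BOM):

iconv -f UTF-16 -t UTF-8 file-with-nulls > file-without-nulls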
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
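With GNU grep you can also search for the byte directly, which answers the first question (-P enables Perl-style escapes, -a treats binary data as text, -n prints line numbers):

grep -Pan '\x00' file-with-nulls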
If the lines in the file end with \r\n\000, then what works is to delete the \n\000 and then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you may use the globbing option **/*.txt (if your shell supports it).
This is useful for scripting, since sed's -i parameter is a non-standard extension whose syntax differs between GNU and BSD sed, while ex is POSIX.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zeroes in the file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
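A minimal sketch of the working version, using a context manager so the handle is closed (filePath is assumed to be defined elsewhere):

import codecs as cd

with cd.open(filePath, 'r', 'utf-16') as f:
    text = f.read()  # the UTF-16 decoder consumes the interleaved NUL bytes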
Remove a trailing null character at the end of a PDF file using PHP. This is OS-independent.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crash that was triggered by the NULL value. You can edit the script to remove all NULL characters, but seeing it done once should help you understand how it works.
Backstory
We were receiving PDFs from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes appended to the PDF file. When our system processed these files, the ones with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to strip the trailing null value. PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operations:
1) Convert the binary file to hex (binary data doesn't explode safely on newlines or carriage returns).
2) Explode the hex string using the CRLF sequence (0d0a) as the delimiter.
3) Pop the last member off the array if its value is the null byte (00).
4) Implode the array again using CRLF as the glue.
5) Convert back to binary and process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using the CRLF sequence as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//Look at the last element; if it is equal to 00, we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//We implode the array using CRLF as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
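The write-back is not shown above; a plausible final step (the destination path is an assumption) would be:

//Hypothetical final step: persist the cleaned bytes.
file_put_contents($filename, $fd);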
My Perl version on the Linux server is:
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi
I ran the test below.
echo "mac:abcdefg1234" | grep -Po "(?<=mac:).*(?=\d+)"
The result is abcdefg123.
But the greedy match does not do what I want. The result I want is abcdefg.
How can I get just the content between "mac:" and the first digit?
(?<=mac:)[^\d]*(?=\d+) is the content between them.
[^\d]* means any non-digit characters, with length >= 0. Typing a ^ right after [ negates the character class, so it matches any character that is not in the class. Note that it also matches (invisible) line-break characters.
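So the full command becomes:

echo "mac:abcdefg1234" | grep -Po "(?<=mac:)[^\d]*(?=\d+)"

which prints abcdefg.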
Two related questions.
Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two and more combined symbols (like p̄ and ḏ̣). This little code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo»
say 'åäö'.samemark('a'); # OUTPUT: «aao»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl»
say samemark('aöä', ''); # OUTPUT: «aöä»
This can be used both to remove marks/diacritics from letters, as well as to add them.
For (2), there are a few ways to do this (TIMTOWTDI). If you want all the codepoints in a string, you can use the ords method, which returns them as a List (technically a Positional).
say "p̄".ords; # OUTPUT: «(112 772)»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2»
This is different than chars, which just counts the number of characters in the string:
say "p̄".chars; # OUTPUT: «1»
Also see @hobbs' answer using NFD.
This is the best I was able to come up with from the docs — there might be a simpler way, but I'm not sure.
my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. The Uni.new(...).Str call then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
$in.NFD.map: { Uni.new($_).Str }
will return a list of 1-character strings, each with a single decomposed codepoint, or
$in.NFD.map(&uniname).join("\n")
will make a nice little unicode debugger.
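For example, applied to the two-codepoint "p̄" from earlier:

say "p̄".NFD.map(&uniname).join("\n");
# OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»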
I can't say this is better or faster, but I strip diacritics in this way:
my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"
I have the string below, in which a newline ("enter") character appears at random. Fields are separated by ~$~ and records end with ##&.
Please help me merge the broken lines into one.
In the string below, the newline character occurs in the address field (4/79A).
-------String----------
23510053~$~ABC~$~4313708~$~19072017~$~XYZ~$~CHINNUSAMY~$~~$~R~$~~$~~$~~$~42~$~~$~~$~~$~~$~28022017~$~
4/79A PQR Marg, Mumbai 4000001~$~TN~$~637301~$~Owns~$~RAT~$~31102015~$~12345~$~##&
Thanks in advance.
Rupesh
Seems to be a (more or less) duplicate of https://stackoverflow.com/a/802439/3595749
Note: you should ask your client to remove the stray CRLF characters (rather than applying the code below).
Nevertheless, try this:
cat inputfile | tr -d '\n' | sed 's/##&/##\&\n/g' >outputfile
Explanation:
tr deletes the newline characters,
sed adds one back, but only where ##& is encountered. s/##&/##\&\n/g substitutes "##&" with "##&\n" (a newline is appended; the "&" must be escaped, because an unescaped & in the replacement stands for the whole match). This applies globally (the "g" flag at the end).
Note: depending on the source (Unix or Windows), "\n" may need to be replaced by "\r\n" in some cases.
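For instance, if the input arrives with Windows line endings, delete the carriage returns too (a sketch of the adjusted pipeline, GNU sed as above):

tr -d '\r\n' <inputfile | sed 's/##&/##\&\n/g' >outputfile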
I have a program that iterates over all lines of a text file, adds spaces between the characters, and writes the output to the same file. However, if there are multiple lines in the input, I want the output to have separate lines as well. I tried:
let text = format!(r"{}\n", line); // Add newline character to each line (while iterating)
file.write_all(text.as_bytes()); // Write each line + newline
Here is an example input text file:
foo
bar
baz
And its output:
f o o\n b a r\n b a z
It seems that Rust treats "\n" as an escaped n character, but using r"\n" treats it as a string. How can I have Rust treat \n as a newline character to write multiple lines to a text file?
Note: I can include the rest of my code if you need it, let me know.
Edit: I am on Windows 7 64 bit
The problem is the 'r' in front of your string. Remove it and your program will print newlines instead of '\n'.
Also note that most Unices use '\n' as the newline; Windows uses "\r\n".
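A minimal sketch of the fix, reusing line and file from the question and assuming the enclosing function returns std::io::Result so that ? can propagate write errors:

let text = format!("{}\n", line); // an ordinary string literal, so \n is a real newline
file.write_all(text.as_bytes())?; // propagate the io::Result instead of discarding it

On Windows you could emit "\r\n" instead, though most tools accept a bare \n.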