Line terminator getting added to plain text file

I'm a little confused by some behavior I'm seeing with text files on my Mac. When I open a new file in vim, type a single character (say, the letter "t") with no carriage return, save, and then do a hex dump on the file (using vim's :r !xxd command), I see the following:
00000000: 740a t.
There is a line feed (0a) in the file. And when I look at the file's properties, there are two bytes, not one. How did it get in there if I didn't type it?

OK, so it turns out vim automatically adds a newline character to the last line to comply with the POSIX standard that every line must end with a newline. You can turn this off with :set noeol in vim.
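If you want to double-check what vim wrote, a quick look from the shell works too (a small sketch; file.txt is a placeholder name, and depending on your Vim version you may need :set binary noeol or :set nofixendofline for the setting to take effect on write):
xxd file.txt                           # 00000000: 740a -> 't' plus the LF vim appended
wc -c file.txt                         # reports 2 bytes
head -c 1 file.txt > file-no-eol.txt   # strip the trailing byte outside the editor, if needed
xxd file-no-eol.txt                    # 00000000: 74   -> just 't'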

Related

removing unconventional field separators (^#^#^#) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols interleaved with the normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
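For example, these three invocations are equivalent; the shell strips the redirections out wherever they appear and hands tr the remaining arguments:
tr -d '\000' < file-with-nulls > file-without-nulls
< file-with-nulls tr -d '\000' > file-without-nulls
tr < file-with-nulls -d '\000' > file-without-nulls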
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
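A rough sketch of that conversion, using the same placeholder file names (if the file has no BOM, you may need to name the byte order explicitly):
iconv -f UTF-16 -t UTF-8 file-with-nulls > file-without-nulls
iconv -f UTF-16LE -t UTF-8 file-with-nulls > file-without-nulls   # when you know it is little-endian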
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
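On GNU grep, a Perl-style pattern can also list the offending lines with their numbers (-P is a GNU extension and may be missing elsewhere; -a stops grep from just printing "Binary file matches"):
grep -naP '\x00' file-with-nulls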
If the lines in the file end with \r\n\000, then what works is to delete the \n\000 and then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you may use the globbing option **/*.txt (if it is supported by your shell).
Useful for scripting, since sed's -i parameter is a non-standard BSD extension.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zero bytes in the file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove trailing null character at the end of a PDF file using PHP. This is independent of OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes appended to the end of the PDF file. When our system processed these files, the ones with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value. PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operation:
Take the binary file, convert it to hex (binary files don't like being exploded on newlines or carriage returns), explode the string using the CR-LF pair as the delimiter, pop the last element of the array if its value is null, implode the array again using CR-LF as the glue, then process the file.
//In this case we get the file as a string from another application.
//We use this line to read a sample bad file.
$fd = file_get_contents($filename);
//Trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//Create an array using the CR-LF pair (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//Look at the last element; if it is equal to 00, pop it off
$end = end($bin2hex_ex);
if ($end === '00') {
    array_pop($bin2hex_ex);
}
//Implode the array using CR-LF as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//The new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online website, the character was spotted and identified as U+FEFF (the Unicode byte order mark).
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is a good thing.
The encoding can be changed using Notepad++'s Encoding menu.
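If changing the input encoding is not an option, another route (a hedged sketch, assuming a Unix-like shell such as Git Bash or WSL with GNU sed is available on the build machine) is to strip the UTF-8 BOM bytes (EF BB BF) from the merged file after the COPY step:
sed -i 's/\xEF\xBB\xBF//g' AllFilesMergedTogether.sql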

How to remove unknown line break (special character) in text file?

I have a text file which shows an unknown line break in UltraEdit. If I replace this special character in the text file manually, everything works fine, but I have to change it by hand before processing the files.
Please let me know a way to remove all occurrences of this character with VB.Net code.
If I replace ♀ in UltraEdit, it replaces the line break with my desired string, but I cannot use this character or a line break in my VB string.
The character you have in your file is the form feed character, usually used as a control character for a page break.
In UltraEdit, the Page Setup configuration dialog (a printing-related dialog) has the option Page break code, which by default has the decimal value 12 (hexadecimal 0C), the form feed character.
A page break can be displayed in UltraEdit as a horizontal line across the document window by enabling Show Page Breaks as Lines in the View menu/ribbon.
The form feed character can be removed in UltraEdit by searching for ^b with a normal (non-regular-expression) or UltraEdit regular expression replace, or by searching for \f with a Unix or Perl regular expression replace.
In VB.Net code, ChrW(12) can be used to reference the form feed control character, as already suggested by Hans Passant.
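If the files can also be cleaned up outside VB.Net, the same character is easy to confirm and strip from a command line (a hedged aside, not part of the UltraEdit or VB.Net route; od and tr are assumed to be available):
od -c input.txt | grep '\\f'            # any \f entries are form feeds (0x0C)
tr -d '\014' < input.txt > output.txt   # delete them, or translate them: tr '\014' '\n'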

What is the meaning of the file names flanked by the '#' sign and how can I remove them?

When I run the 'ls' command in the terminal on my Raspberry Pi 2, I see different kinds of file names, some like "#example.cpp#" and others like "homework1.cpp~".
What do these two kinds of files mean, and how can I get rid of them? Simply using the 'rm' command doesn't seem to work for me. Thanks!
Some applications will create a copy of a file and use special characters in the copy's filename. For instance, some text editors make a copy of a file you are starting to edit by using the same name and adding a tilde character (~) to the end. That way you have a backup of the file you are about to edit.
Another reason would be an application processing the file into a temporary file, with the temporary file then being used for the next step. For example, perhaps the C/C++ toolchain reads the file homework1.cpp with the C preprocessor to generate the temporary file #homework1.cpp#, which is then compiled to produce the object code.
I am not familiar with the Raspberry Pi, so I am not sure what may be creating the filenames with the pound sign (#) on the front and back (names of the form #file# are typically auto-save files left behind by Emacs). I am pretty sure the files with the tilde character appended to the end of the name are backup files from a text editor such as Emacs, or vim with backups enabled, containing a copy of the file at the time it was last opened in the editor.
One thing you could do is look inside those files to see what is there, using a Linux command or a text editor. If you use a text editor, I would copy the file to another folder as a backup and then look at it there.
Edit: Someone just posted and then deleted an answer which also mentioned how to remove these files.
What I read was that the rm command is used; however, for some kinds of special characters you will need to put quotes around the name, and you may also need to escape certain special characters.
The command shell reads the command line you type and makes changes to the text before passing it on to the command. So if the filename has a space in it, say jj Johny, then when you remove the file you have to specify rm "jj Johny", since spaces are used by the command processor to separate arguments.
The other poster mentioned that you have to escape the pound sign (#) with a backslash to prevent the shell from treating it as the start of a comment.
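Putting that together, something along these lines should work for the two names in the question (quoting or escaping keeps the shell from treating # as the start of a comment):
rm '#example.cpp#' homework1.cpp~
rm \#example.cpp\# homework1.cpp~
find . -name '*~' -o -name '#*#'                 # list every such backup file under the current directory first
find . \( -name '*~' -o -name '#*#' \) -delete   # then remove them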

Writing/reading a file in binary mode in Clisp

I'm writing a program that's supposed to read from a file, do some stuff with the content, and write to an output file preserving the original line endings. If the input file has CRLF endings, the output file should also have them. My problem is with writing the line ending, especially with the CLISP implementation (it works with GCL). When I try to write a linefeed character (LF), the file ends up having CRLF endings. I'm guessing this is something to do with CLISP's implementation.
I need a way to write the file in binary mode like in other languages. The standard I/O functions in the specification only take an optional stream name and the content to be written.
You can reproduce that behaviour with something like this:
(let ((ending #\Linefeed))
  (with-open-file (out-file "test.dat" :direction :output)
    (princ "First Line" out-file)
    (write-char ending out-file)
    (princ "Second Line" out-file)
    (write-char ending out-file)
    (princ "Second Line" out-file)))
I need a solution that works in windows.
You need to specify the :EXTERNAL-FORMAT argument, giving the line terminator mode:
(with-open-file (out-file "test.dat" :direction :output :external-format :unix)
...)
The external format defaults to :dos on Windows because that is the standard on Microsoft systems.
Note that you do not want binary mode if you are actually writing text. In Common Lisp (as opposed to C and Emacs Lisp), there is a very clear separation between binary I/O (reading and writing bytes) and text I/O (reading and writing characters), just as a number is not a character and vice versa, even though characters have an integer code.
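To confirm which line terminators actually ended up in test.dat, a hex or character dump from a shell is a handy side check (assuming od is available, e.g. via Cygwin or WSL on Windows):
od -c test.dat   # \r \n pairs mean CRLF (:dos); lone \n means LF (:unix)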