dos2unix: Binary symbol 0x04 found at line 1703 - utf-16

I downloaded a file from the OECD http://stats.oecd.org/Index.aspx?datasetcode=CRS1 ('CRS 2013 data.txt') by selecting Export -> Related files. I want to work with this file in Ubuntu (14.04 LTS).
When I run:
dos2unix CRS\ 2013\ data.txt
I see:
dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt
I check the encoding of the file with:
file --mime-encoding CRS\ 2013\ data.txt
and see:
CRS 2013 data.txt: utf-16le
I do:
iconv -l | grep utf-16le
which doesn't return anything so I do:
iconv -l | grep UTF-16LE
which returns:
UTF-16LE//
Then I run:
iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt
and check:
file --mime-encoding crs_2013_data_temp.txt
and see:
crs_2013_data_temp.txt: utf-8
Then I try:
dos2unix crs_2013_data_temp.txt
and get:
dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt
I then try to force it:
dos2unix -f crs_2013_data_temp.txt
It works, i.e., dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".
My question is: why? Is it because the BOM is not visible to dos2unix? Because it's missing?
Have I not done the conversion right?
How do I convert this file correctly so that I can read it?

That 0x0004 character you are seeing in your file has nothing at all to do with the BOM (which is fine, by the way) -- it's an EOT (End of Transmission) character from the C0 control set, and has been at that codepoint since 7-bit ASCII was the new hotness. (It's also the familiar Control-D Unix EOF sequence.)
Unfortunately, the pre-dos2unix trick of applying tr to the file to strip the carriage returns won't work directly, since the file is UTF-16. Since iconv works for you, though, you can use it to convert to UTF-8 (which tr can handle), and then run this tr command:
tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt
in order to get the text file into the Unix line-ending convention. You will have to keep an eye on whatever tools you feed the file to, though, to make sure that they don't choke on the Ctrl-D/EOT character; if they do, you can use
tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt
to get rid of it.
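If you would rather do the whole cleanup in one pass, the conversion, the carriage-return stripping, and the EOT removal chain together (a sketch, using the file name from the question):
iconv -f UTF-16LE -t UTF-8 'CRS 2013 data.txt' | tr -d '\r\004' > crs_2013_data_clean.txt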
As to how it got there in the first place? I blame the Belgians for letting it sneak into the data they gave the OECD, which they probably keyed in with cat - > file or some other similarly underwhelming means. Also, some text editors try to be a bit too helpful by hiding control characters, even though other tools will bail out when they see them, thinking they have been handed a binary file that was pretending to be text.

I think this command is OK for your problem:
tr -d '\r' < file > new_file

That's how I solved it:
find . -type f -exec sed -i 's/\r//' {} \;
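Note that this rewrites every regular file under the current directory, binaries included; if the tree holds anything besides text, you may want to narrow the match (a sketch with a hypothetical *.txt filter):
find . -type f -name '*.txt' -exec sed -i 's/\r//' {} +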

Related

Redirecting files from a directory using awk

I am running an awk command for every text file in a directory. As of now it displays to stdout. I would like it to save those changes to the actual files themselves. My command is
awk{ORS=(/^\- **\ **/?"":RS)}1 *.txt >> *.txt
Every time I redirect the command it saves everything into one file. Is there anyway I can save the changes back to the files themselves?
-s and blanks aren't regexp metacharacters outside of bracket expressions, so there's no need to escape them in your regexp. You do need to enclose your script in single-quote delimiters, though. This will do what your script is apparently trying to do:
awk '{ORS=(/^- ** **/?"":RS)}1'
You cannot write to the same file you are reading. If you try to do that with any command (awk, sed, grep, whatever):
command file > file
then the shell can do whatever it likes, including executing > file before command file and so emptying the file before your command opens it.
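You can see the truncation happen with a throwaway file (a sketch; f is a hypothetical file name):
$ echo hello > f
$ awk '1' f > f
$ cat f
$
The redirection emptied f before awk ever read it, so nothing survives.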
To overwrite the input file with GNU awk 4.1+ (which added the inplace extension), you would use:
awk -i inplace '{ORS=(/^- ** **/?"":RS)}1' *.txt
and with other awks you'd need something like:
for file in *.txt; do
awk '{ORS=(/^- ** **/?"":RS)}1' "$file" > tmp && mv tmp "$file"
done

Issue with genstrings for Swift file

genstrings works well to extract localizable content from .m files:
find . -name \*.m | xargs genstrings -o en.lproj
But it does not work for .swift files:
find . -name \*.swift | xargs genstrings -o en.lproj
The genstrings tool works fine with Swift as far as I can tell. Here is my test:
// MyClass.swift
let message = NSLocalizedString("This is the test message.", comment: "Test")
then, in the folder with the class
# generate strings for all swift files (even in nested directories)
$ find . -name \*.swift | xargs genstrings -o .
# See results
$ cat Localizable.strings
/* Test */
"This is the test message." = "This is the test message.";
$
I believe genstrings works as intended; however, the xargs approach Apple suggests for generating strings from all your project's files is flawed and does not properly handle paths containing spaces.
That might be the reason why it's not working for you.
Try using the following:
find . -name \*.swift | tr '\n' '\0' | xargs -0 genstrings -o .
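Equivalently, if your find supports -print0 (GNU and BSD find both do), you can emit NUL-delimited paths directly instead of translating newlines afterwards:
find . -name '*.swift' -print0 | xargs -0 genstrings -o .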
We wrote a command-line tool that works for Swift files and merges its result with that of Apple's genstrings tool.
It supports both key and value in NSLocalizedString:
https://github.com/KeepSafe/genstrings_swift
There's an alternative tool called SwiftGenStrings
Hello.swift
NSLocalizedString("hello", value: "world", comment: "Hi!")
SwiftGenStrings:
$ SwiftGenStrings Hello.swift
/* Hi! */
"hello" = "world";
Apple genstrings:
$ genstrings Hello.swift
Bad entry in file Hello.swift (line = 1): Argument is not a literal string.
Disclaimer: I worked on SwiftGenStrings.
There is a similar question here:
How to use genstrings across multiple directories?
find ./ -name "*.m" -print0 | xargs -0 genstrings -o en.lproj
The issue I was having with find/genstrings was twofold:
When it reached folder names with spaces (generated by the output of find), it would exit with an error
When it reached the file where I had my custom routine defined, it was giving me an error when trying to parse my actual function definition
To fix both those problems I'm using the following:
find Some/Path/ \( -name "*.swift" ! -name "MyExcludedFile.swift" \) | sed "s/^/'/;s/$/'/" | xargs genstrings -o . -s MyCustomLocalizedStringRoutine
To summarize, we use the find command to both find and exclude your Swift files, then pipe the results into the sed command which will wrap each file path in quotes, then finally pipe that result into the genstrings command
Xcode now includes a powerful tool for extracting localizations.
Just select your project on the left, then Editor menu > Export Localizations.
You'll get a folder with all the text in your files, as well as the Localizable.strings and InfoPlist.strings files.
More details here:
https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/LocalizingYourApp/LocalizingYourApp.html

AWK to process compressed files and printing original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing them and passing them to gawk on the fly,
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of .gz files, each rather large, to process.
I am looking for anomalies (~0.001% of rows) and want to print out the list of found inconsistencies along with the file name and row number that contained them.
If I could have all the files decompressed I would simply use the FILENAME variable to get this.
Because of the large quantity and size of those files I can't decompress them upfront.
Any ideas how to pass the filename (in addition to the gzip stdout) to gawk to produce the required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
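Since you also want the row number, awk's NR plays the role FILENAME would have played for the name (a sketch; the anomaly condition is a hypothetical placeholder):
for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '
        $3 > 9999 { print origname ": row " NR ": " $0 }   # hypothetical anomaly test
    '
done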
Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -r -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
Using xargs instead of the above loop would, I believe, require the body of the command to be in a pre-written script file which xargs can then call with each filename.
This uses a combination of xargs and sh (so that the two commands, gzip and gawk, can be joined with a pipe):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…
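One thing to watch: -I substitutes fname into the sh -c program text, so a filename containing quotes, $, or backticks would be interpreted by the inner shell. Passing the name as a positional argument sidesteps that (a sketch, keeping the printbadrowsonly.awk script from above):
find . -name '*.gz' -print0 | xargs -0 -n1 sh -c 'gzip -dc "$1" | gawk -v origfile="$1" -f printbadrowsonly.awk' sh >> baddata.txt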

How to determine the line ending of a file

I have a bunch (hundreds) of files that are supposed to have Unix line endings. I strongly suspect that some of them have Windows line endings, and I want to programmatically figure out which ones do.
I know I can just run flip -u or something similar in a script to convert everything, but I want to be able to identify those files that need changing first.
You can use the file tool, which will tell you the type of line ending. Or, you could just run dos2unix, which will convert everything to Unix line endings, regardless of what it started with.
You could use grep
egrep -l $'\r$' *
Something along the lines of:
perl -p -e 's[\r\n][WIN\n]; s[(?<!WIN)\n][UNIX\n]; s[\r][MAC\n];' FILENAME
though some of that regexp may need refining and tidying up.
That'll output your file with WIN, MAC, or UNIX at the end of each line. Good if your file is somehow a dreadful mess (or a diff) and has mixed endings.
Here's the most failsafe answer. Stimms' answer doesn't account for subdirectories and binary files:
find . -type f -exec file {} \; | grep "CRLF" | awk -F ':' '{ print $1 }'
Use file to determine the file type. Files with CRLF endings contain Windows carriage-return characters. The output of file is delimited by a colon, and the first field is the path of the file.
Unix uses one byte, 0x0A (LineFeed), while windows uses two bytes, 0x0D 0x0A (Carriage Return, Line feed).
If you never see a 0x0D, then it's very likely Unix. If you see 0x0D 0x0A pairs then it's very likely MSDOS.
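If you want to verify that at the byte level, od will show the raw line endings (a sketch; file.txt stands in for your file):
od -c file.txt | head
A DOS-format file shows \r \n pairs at the ends of lines; a Unix-format file shows only \n.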
Windows uses two characters, 13 and 10 (CR LF), for line endings; Unix uses only character 10 (LF). So you can convert a file by replacing each CR LF pair with a lone LF.
Once you know which files have Windows line endings (0x0D 0x0A, i.e. \r\n), what will you do with them? I suppose you will convert them into Unix line endings (0x0A, i.e. \n). You can convert a file with Windows line endings into Unix line endings with the sed utility, using this command:
$ sed -i 's/\r//' my_file_with_win_line_endings.txt
You can put it into script like this:
#!/bin/bash
function travers()
{
for file in *; do # glob rather than $(ls), so file names with spaces survive
if [ -f "${file}" ]; then
sed -i 's/\r//' "${file}"
elif [ -d "${file}" ]; then
cd "${file}"
travers
cd ..
fi
done
}
travers
If you run it from the root of your directory tree, at the end you can be sure all files have Unix line endings.

DOS filename escaping for use with *nix commands

I want to escape a DOS filename so I can use it with sed. I have a DOS batch file something like this:
set FILENAME=%~f1
sed 's/Some Pattern/%FILENAME%/' inputfile
(Note: %~f1 expands %1 to a fully qualified path name, e.g. C:\utils\MyFile.txt.)
I found that the backslashes in %FILENAME% are just escaping the next letter.
How can I double them up so that they are escaped?
(I have cygwin installed so feel free to use any other *nix commands)
Solution
Combining Jeremy and Alexandru Nedelcu's suggestions, and using | for the delimiter in the sed command I have
set FILENAME=%~f1
cygpath "s|Some Pattern|%FILENAME%|" >sedcmd.tmp
sed -f sedcmd.tmp inputfile
del /q sedcmd.tmp
This will work. It's messy because in BAT files you can't use set var=`cmd` like you can in unix.
The fact that echo doesn't understand quotes is also messy, and could lead to trouble if Some Pattern contains shell metacharacters.
set FILENAME=%~f1
echo s/Some Pattern/%FILENAME%/ | sed -e "s/\\/\\\\/g" >sedcmd.tmp
sed -f sedcmd.tmp inputfile
del /q sedcmd.tmp
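For what it's worth, with the example path from the question, the sed "s/\\/\\\\/g" stage doubles every backslash, so sedcmd.tmp would contain:
s/Some Pattern/C:\\utils\\MyFile.txt/
which is exactly the escaping the s command needs.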
[Edited]: I am surprised that it didn't work for you. I just tested it, and it worked on my machine. I am using sed from http://sourceforge.net/projects/unxutils and using cmd.exe to run those commands in a bat file.
You could try this as an alternative (from the command prompt):
> cygpath -m c:\some\path
c:/some/path
As you can guess, it converts backslashes to slashes.
@Alexandru & @Jeremy, thanks for your help. You both get upvotes.
@Jeremy
Using your method I got the following error:
sed: -e expression #1, char 8:
unterminated `s' command
If you can edit your answer to make it work I'd accept it. (pasting my solution doesn't count)
Update: Ok, I tried it with UnixUtils and it worked. (For reference, the UnixUtils I downloaded was dated March 1, 2007, and uses GNU sed version 3.02, my Cygwin install has GNU sed version 4.1.5)