Text files appearing as binaries on Mac OS X - byte-order-mark

I have >5000 text files, generated on Windows from PDF files, that I need to process on a Mac OS X machine. I run dos2unix on all of them to correct the newlines and to convert the encoding from UTF-16LE to UTF-8.
In 4949 cases everything goes fine, but for 320 files dos2unix skips the conversion, saying they are binary files.
This is consistent with the output of file -c, which gives me data for the 320 skipped files and text for the others. However, they look like text on visual inspection...
How can I repair the 320 files? At first I suspected the presence of the BOM, but it appears also in the files that don't cause problems.
Furthermore, both the data and the text files start with:
0000000 ff fe 3d 00 20 00 70 00 61 00 67 00 65 00 20 00
0000010 31 00 20 00 3d 00 0a 00 0d 00 0d 00 0a 00
Any hint?
Thanks in advance.

I have found that text files sometimes contain unprintable ASCII characters. In such cases, even though the files are "text" files, dos2unix thinks they are binary. If this is the case, you can use the tr command as follows:
tr -cd '\11\12\15\40-\176' < file.txt
This is the basic command; it will clean out those unprintable characters and write your new ASCII-clean text to stdout. To actually save this output as a file, just redirect the output to a file:
tr -cd '\11\12\15\40-\176' < file.txt > newfile.txt
Now newfile.txt is your text file on which you can run dos2unix.
The complement (i.e., -c) of the set '\11\12\15\40-\176' means that the tr command strips out everything except the characters defined in that set, which are:
octal \11: tab
octal \12: new line
octal \15: carriage return
octal \40-\176: all the good/normal keyboard characters
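To apply this to the whole batch, here is a minimal sketch, assuming the problem files are the *.txt files in the current directory (adjust the glob to your layout). Note that on UTF-16LE input this also strips the NUL bytes and the BOM, leaving plain single-byte text:

for f in *.txt; do
  # drop everything unprintable, then replace the original
  tr -cd '\11\12\15\40-\176' < "$f" > "$f.clean" && mv "$f.clean" "$f"
done
dos2unix *.txt    # the conversion should now succeed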

According to dos2unix --help, you can pass the argument --force to dos2unix to "force conversion of binary files". So in your shell, while inside a directory with just the 320 skipped files, you might type dos2unix --force *.
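If the skipped files are mixed in with the converted ones, one way to pick them out is a sketch that leans on file reporting "data" for the problem files, as in the question:

for f in *.txt; do
  # only force-convert the files that file(1) flags as data
  file "$f" | grep -q 'data$' && dos2unix --force "$f"
done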

You could try the latest version of dos2unix (6.0.3). It will print the line number of the first binary symbol. This may help you to analyse the problem.


How to track down Invalid utf8 character string

Running a search in phpMyAdmin for an IP address to unblock from a WordPress plug-in, I get this on one of the tables:
Warning: #1300 Invalid utf8 character string: '\x8B\x08\x00\x00\x00\x00\x00\x00\x03\x14\xD6y8\x15\xEF\x17\x0...'
Warning: #1300 Invalid utf8 character string: '\x8B\x08\x00\x00\x00\x00\x00\x00\x03\x00\x1E\x80\xE1\x7Fa:2:{...'
I tried to search for part of the strings, but cannot find where they are in the db.
These look suspicious to me; I've had some SQL injection compromises in the past and I fear that's what this may indicate.
How do I track down where these strings actually are in the db if I cannot find them with the phpMyAdmin search?
Thank you.
Those look like gzip headers which are missing their leading \x1f. I expect it's there but not part of the warning because \x1f is a valid UTF-8 character but \x8b is not.
1F 2-byte magic number of a gzip file
8B |
08 compression method (08 = deflate)
00 1 byte header flags (00 = it's probably compressed text)
00 4 byte timestamp
00 |
00 |
00 |
00 Extra flags
03 Operating System (03 = Unix)
After that, data begins.
Something is trying to read gzipped text as UTF-8.
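If you can pull the raw bytes of one of those values out of the database, you can test this directly. A minimal sketch, assuming the column's bytes (minus the leading 0x1f that the warning swallowed) have been saved to a file named blob.bin (a hypothetical name):

printf '\037' | cat - blob.bin | gunzip -c    # \037 is octal for 0x1f; re-attach the magic byte and inflate

If it really is gzipped text, the original content comes out.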

Converting string into REG_BINARY

I am making an app in Visual Studio's VB to auto-install a printer in Windows. The problem is that the printer needs a login and password. I found the registry entry where this is stored, but the password is stored in REG_BINARY format.
Here is how it looks after manually entering the password in the printer settings - see UserPass:
Could you please tell me how to convert the password (a string) into REG_BINARY (see attachment - red square)?
The password in this case was 09882 and it was stored as 98 09 e9 4c c3 24 26 35 14 6f 83 67 8c ec c4 90. Is there any function in VB to convert 09882 into this REG_BINARY format?
REG_BINARY means that it is binary data, and binary data in .NET is represented by a Byte array. The values you see in RegEdit are the hexadecimal values of the individual bytes, which is a common representation because every byte can be represented by two hex digits. You need to convert your String to a Byte array and then save it to the Registry like any other data.
How you do that depends on what the application expects. Maybe it is simply converting the text to Bytes based on a specific encoding, e.g. Encoding.ASCII.GetBytes. Maybe it's a hash. You might need to research and/or experiment to find out exactly what's expected.
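One cheap experiment from a shell, purely a guess on my part: the stored value is 16 bytes, which happens to be the size of an MD5 digest, so you could hash the known password in a couple of encodings and compare against the bytes shown in RegEdit:

printf '%s' '09882' | md5sum                        # digest of the ASCII bytes
printf '%s' '09882' | iconv -t UTF-16LE | md5sum    # Windows apps often use UTF-16LE

If neither digest matches 98 09 e9 4c ... 8c ec c4 90, the value is more likely encrypted or salted, and you will have to dig into the application itself.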

How to remove a specific pattern of junk values from a file using awk or sed?

I have two types of junk pattern in my XML file which I want to remove without disturbing any other meaningful content.
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating â~#~\call barredâ~#~]." enabled="true">
I want to change it to
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating call barred." enabled="true">
I tried the code below but it didn't work:
awk '{if(match($0,/#TEST.*" enabled="true">$/))
gsub(/â~#~\\/,"");
gsub(/â~#~\]/,"");
print}' $file >> tmp.jmx && mv tmp.jmx $file
The pattern you are attempting to replace looks like a mangled UTF-8 character viewed in some legacy 8-bit encoding. Because you don't specify which encoding that is, we have to do a fair amount of guesswork.
You are asking about Unix tools, so this answer assumes that you are using some U*x derivative or have access to similar tools on your local box (Cygwin?)
To find the actual bytes in the string you want to replace, you can do something like
bash$ grep -o '...~#~...' -m1 "$file" |
> od -Ax -tx1o1
0000000 67 20 e2 7e 40 7e 5c 63 61 0a
147 040 342 176 100 176 134 143 141 012
000000a
I use od for portability reasons; you might prefer hexdump or xxd or some other tool. The output includes both hex and octal, as octal is preferred in Awk but hex is ubiquitous in programming otherwise. I keep a couple of characters of context around the match in case â would in fact be stored in a multibyte encoding in your sample, but here, in this somewhat speculative example, it turns out it is represented by the single byte 0xE2 (octal 342). (This would identify your terminal encoding as Latin-1 or some close relative; maybe one of the CP125x Windows encodings.)
Armed with this information, we can proceed with
awk '{ gsub(/\342~#~./, "") }1' "$file"
to replace the pesky character sequence, or perhaps
sed $'s/\xe2~#~.//' "$file"
which assumes your shell is Bash or some near-compatible which allows you to use C-style strings $'...' -- alternatively, if you know your sed dialect supports a particular notation for unprintable characters, you can use that, but that's even less portable.
(If your sed supports the -i option, or your Awk supports in-place editing (GNU Awk: -i inplace), you can replace the file in place, i.e. have the script overwrite the file with a modified version without the need for redirection or temporary files. Again, this has portability issues.)
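For instance, a sketch of the two in-place variants, assuming GNU tools (BSD/macOS sed wants -i '' rather than a bare -i):

sed -i $'s/\xe2~#~.//g' "$file"                        # GNU sed: edit the file in place
gawk -i inplace '{ gsub(/\342~#~./, "") } 1' "$file"   # GNU Awk 4.1+: the inplace extension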
I want to emphasize that we cannot guess your encoding so your question should ideally include this information. See further the Stack Overflow character-encoding tag wiki for guidance on what to include in a question like this.

Base64url encoded representation puzzle

I'm writing a cookie authentication library that replicates that of an existing system. I'm able to create authentication tokens that work. However, testing a token of known value created by the existing system, I encountered the following puzzle.
The original encoded string purports to be base64url encoded. And, in fact, using any of several base64url code modules and online tools, the decoded value is the expected result.
However base64url encoding the decoded value (again using any of several tools) doesn't reproduce the original string. Both encoded strings decode to the expected results, so apparently both representations are valid.
How? What's the difference?
How can I replicate the original encoded results?
original encoded string: YWRtaW46NTVGRDZDRUE6vtRbQoEXD9O6R4MYd8ro2o6Rzrc
my base64url decode: admin:55FD6CEA:[encrypted hash]
Encoding doesn't match original but the decoded strings match.
my base64url encode: YWRtaW46NTVGRDZDRUE677-977-9W0Lvv70XD9O6R--_vRh377-977-92o7vv73Otw
my base64url decode: admin:55FD6CEA:[encrypted hash]
(Sorry, SSE won't let me show the Unicode representation of the hash. I assure you, they do match.)
This string:
YWRtaW46NTVGRDZDRUE6vtRbQoEXD9O6R4MYd8ro2o6Rzrc
is not exactly valid Base64. Valid Base64 consists of a sequence of characters drawn from uppercase letters, lowercase letters, digits, '/' and '+'; it must also have a length which is a multiple of 4; one or two final '=' signs may appear as padding so that the length is indeed a multiple of 4. This string contains only Base64-valid characters, but only 47 of them, and 47 is not a multiple of 4. With an extra '=' sign at the end, it becomes valid Base64.
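You can check this from a shell (GNU coreutils base64 shown; macOS spells the decode flag -D):

printf '%s=' YWRtaW46NTVGRDZDRUE6vtRbQoEXD9O6R4MYd8ro2o6Rzrc | base64 -d | od -Ax -tx1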
That string:
YWRtaW46NTVGRDZDRUE677-977-9W0Lvv70XD9O6R--_vRh377-977-92o7vv73Otw
is not valid Base64. It contains several '-' and one '_' sign, neither of which should appear in a Base64 string. If some tool is decoding that string into the "same" result as the previous string, then the tool is not implementing Base64 at all, but something else (and weird).
I suppose that your strings got garbled at some point through some copy&paste mishap, maybe related to a bad interpretation of bytes as characters. This is the important point: bytes are NOT characters.
It so happens that, traditionally, in older times, computers got into the habit of using so-called "code pages", which were direct mappings of characters onto bytes, with each character encoded as exactly one byte. Thus came into existence some tools (such as Windows' notepad.exe) that purport to do the inverse, i.e. show the contents of a file (nominally, some bytes) as their character counterparts. This, however, fails when the bytes are not printable characters (while a code page such as Windows-1252 maps each character to a byte value, there can be byte values that are not the mapping of any printable character). It began to fail even more when people finally realized that there were only 256 possible byte values, and a lot more possible characters, especially when considering Chinese.
Unicode is an evolving standard that maps characters to code points (i.e. numbers), with a bit more than 100000 currently defined. Then some encoding rules (there are several of them, the most common being UTF-8) encode the characters into bytes. Crucially, one character can be encoded over several bytes.
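A quick way to see the multi-byte encoding in action, from a shell running under a UTF-8 locale:

printf 'é' | od -An -tx1    # prints c3 a9: one character, two bytes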
In any case, a hash value (or whatever you call an "encrypted hash", which is probably a confusion, because hashing and encrypting are two distinct things) is a sequence of bytes, not characters, and thus is never guaranteed to be the encoding of a sequence of characters in any code page.
Armed with this knowledge, you may try to put some order into your strings and your question.
Edit: thanks to #marfarma for pointing out the URL-safe Base64 encoding where the '+' and '/' characters are replaced by '-' and '_'. This makes the situation clearer. When adding the needed '=' signs, the first string then decodes to:
00000000 61 64 6d 69 6e 3a 35 35 46 44 36 43 45 41 3a be |admin:55FD6CEA:.|
00000010 d4 5b 42 81 17 0f d3 ba 47 83 18 77 ca e8 da 8e |.[B.....G..w....|
00000020 91 ce b7 |...|
while the second becomes:
00000000 61 64 6d 69 6e 3a 35 35 46 44 36 43 45 41 3a ef |admin:55FD6CEA:.|
00000010 bf bd ef bf bd 5b 42 ef bf bd 17 0f d3 ba 47 ef |.....[B.......G.|
00000020 bf bd 18 77 ef bf bd ef bf bd da 8e ef bf bd ce |...w............|
00000030 b7 |.|
We now see what happened: the first string was decoded to bytes, but someone then fed these bytes to some display system or editor that expected UTF-8. Some of these bytes were not valid UTF-8 encodings of anything, so they were replaced with the Unicode code point U+FFFD REPLACEMENT CHARACTER. The characters were then reencoded as UTF-8, each U+FFFD yielding the EF BF BD sequence of three bytes.
Therefore, the hash value was badly mangled, but the bytes that were altered were precisely the unprintable ones, and what was put in their place renders as little or nothing in many viewers. Hence no visible difference on the screen.
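Putting the two fixes together (restore the standard alphabet, then re-add the padding), here is a small sketch of a general base64url decoder as a bash function; the function name is made up for illustration:

b64url_decode() {
  local s
  s=$(printf '%s' "$1" | tr -- '-_' '+/')   # map the URL-safe alphabet back to standard Base64
  while (( ${#s} % 4 )); do s+='='; done    # re-add the '=' padding that base64url strips
  printf '%s' "$s" | base64 -d              # macOS: use base64 -D
}
b64url_decode 'YWRtaW46NTVGRDZDRUE6vtRbQoEXD9O6R4MYd8ro2o6Rzrc' | od -Ax -tx1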

Redirection rules with special characters

I want to use Redirect 301 rules (i.e. I hope to be able to avoid rewrite rules) to redirect URLs that contain special characters (like é, à, ...), for instance:
redirect 301 /éxàmple http://mydomain.com/example
However, simply adding this doesn't work. Any suggestions?
How to troubleshoot this on a Windows system
On Windows, you can use Notepad++ to enter Unicode characters correctly. After launching Notepad++, select 'Encode in UTF-8 without BOM' from the 'Encoding' menu, then type your Unicode characters and save the file.
To make sure that the characters have been saved properly, download a hex editor for Windows and check that é is saved as c3 a9 and à is saved as c3 a0.
Previous response where I assumed that you are on a Linux system
Most likely the Unicode characters have not been saved properly in .htaccess file.
What do you get when you try this command:
grep -o .x.mple .htaccess | od -t x1 -c
You should get this if your Unicode characters are saved correctly.
0000000 c3 a9 78 c3 a0 6d 70 6c 65 0a 65 78 61 6d 70 6c
303 251 x 303 240 m p l e \n e x a m p l
0000020 65 0a
e \n
0000022
If you have xxd or hd installed, you can get a neater output to do your troubleshooting:
$ grep -o .x.mple .htaccess | xxd -g1
0000000: c3 a9 78 c3 a0 6d 70 6c 65 0a 65 78 61 6d 70 6c ..x..mple.exampl
0000010: 65 0a e.
In all the outputs you can see that é is saved as the two bytes c3 a9. You can see from http://www.fileformat.info/info/unicode/char/e9/index.htm that é encoded in UTF-8 is indeed two bytes: 0xC3 and 0xA9.
Similarly, à in UTF-8 format is: 0xC3 0xA0. See http://www.fileformat.info/info/unicode/char/e0/index.htm. You can see these codes in the output as well.
These should work, but it depends on a few things that you should check as a checklist:
Do you have mod_alias enabled? If not, you should run a2enmod alias
Do you have some redirection to your example page? (Redirections are applied before aliases)
Then, instead of converting it to UTF-8, you can try to put the characters as they're encoded by browsers, for example %C3%A9 for é, etc.
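For example, a sketch of the rule from the question in percent-encoded form (whether mod_alias matches the encoded or the decoded path can vary with the Apache version, so treat this as something to test rather than a guaranteed fix):

Redirect 301 /%C3%A9x%C3%A0mple http://mydomain.com/example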