How to remove a specific pattern of junk values from a file using awk or sed? - awk

I have two types of pattern in my xml file which I want to remove without disturbing any other meaningful patterns.
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating â~#~\call barredâ~#~]." enabled="true">
I want to change it to
testname="#TEST-Loop${c}- 05030502956 #TEST - verify that the Handler returns an error indicating call barred." enabled="true">
I tried the code below, but it didn't work:
awk '{if(match($0,/#TEST.*" enabled="true">$/))
gsub(/â~#~\\/,"");
gsub(/â~#~\]/,"");
print}' $file >> tmp.jmx && mv tmp.jmx $file

The pattern you are attempting to replace looks like a mangled UTF-8 character viewed in some legacy 8-bit encoding. Because you don't specify which encoding that is, we have to do a fair amount of guesswork.
You are asking about Unix tools, so this answer assumes that you are using some U*x derivative or have access to similar tools on your local box (Cygwin?)
To find the actual bytes in the string you want to replace, you can do something like
bash$ grep -o '...~#~...' -m1 "$file" |
> od -Ax -tx1o1
0000000 67 20 e2 7e 40 7e 5c 63 61 0a
147 040 342 176 100 176 134 143 141 012
000000a
I use od for portability reasons; you might prefer hexdump or xxd or some other tool. The output includes both hex and octal, as octal is preferred in Awk but hex is ubiquitous in programming otherwise. I keep a couple of characters of context around the match in case â would in fact be stored in a multibyte encoding in your sample, but here, in this somewhat speculative example, it turns out it is represented by the single byte 0xE2 (octal 342). (This would identify your terminal encoding as Latin-1 or some close relative; maybe one of the CP125x Windows encodings.)
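If od is not at hand, the same inspection can be sketched in Python. This is only an illustration: the helper names `context_around` and `dump_bytes` are made up for this example, and the marker `~#~` is taken from the question as displayed.

```python
def context_around(data: bytes, marker: bytes, width: int = 3) -> bytes:
    """Return the first occurrence of marker plus `width` bytes on each side."""
    pos = data.find(marker)
    if pos < 0:
        return b""
    return data[max(0, pos - width):pos + len(marker) + width]


def dump_bytes(data: bytes) -> tuple[str, str]:
    """Return (hex, octal) renderings of every byte, like od -tx1o1."""
    hex_view = " ".join(f"{b:02x}" for b in data)
    oct_view = " ".join(f"{b:03o}" for b in data)
    return hex_view, oct_view
```

Reading the file in binary mode (`open(path, "rb").read()`) and passing the result to `context_around` keeps Python from guessing an encoding, which is the point of the exercise.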
Armed with this information, we can proceed with
awk '{ gsub(/\342~#~./, "") }1' "$file"
to replace the pesky character sequence, or perhaps
sed $'s/\xe2~#~.//' "$file"
which assumes your shell is Bash or some near-compatible which allows you to use C-style strings $'...' -- alternatively, if you know your sed dialect supports a particular notation for unprintable characters, you can use that, but that's even less portable.
(If your sed supports the -i option, or your Awk supports in-place editing (GNU Awk 4.1 and later offers -i inplace), you can replace the file in-place, i.e. have the script replace the file with a modified version without the need for redirection or temporary files. Again, this has portability issues.)
I want to emphasize that we cannot guess your encoding so your question should ideally include this information. See further the Stack Overflow character-encoding tag wiki for guidance on what to include in a question like this.
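For comparison, here is a minimal Python sketch of the same byte-level replacement. It makes the same guess as the Awk and sed commands above (a single 0xE2 byte followed by the literal text `~#~` and one more byte), and the name `strip_junk` is hypothetical.

```python
import re

# Byte 0xE2, then the literal characters "~#~", then any one byte.
# Working on bytes avoids having to know the file's real encoding.
JUNK = re.compile(rb"\xe2~#~.")


def strip_junk(data: bytes) -> bytes:
    """Remove every occurrence of the mangled sequence from data."""
    return JUNK.sub(b"", data)
```

If the hex dump of your file shows different bytes, the pattern has to be adjusted to match them.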

Related

Mainframe Migration to USS/Github

We are trying to migrate Programs (Not Files) from Mainframe to USS, then ultimately to Github.
We have a program that runs into an issue during the migration. The program contains hex characters and is reformatted during the transfer from the Mainframe PDS to Unix. Is there a command I can insert so that Unix will not reformat the values during the transfer from the MF PDS?
Edit:
The program contains the EBCDIC characters x'15' (newline) and x'0D' (carriage return), which introduce spaces (x'40') into the file as it is transferred from z/OS to USS. These padding x'40' bytes push the rest of the characters onto the next line.
I am using the command below to transfer from Mainframe to Unix. It is triggered inside the Mainframe by a Batch Agent.
cp -U -S a=.CPY -T -O c=IBM-1047 "//'Insert PDS Here'" /data/Github
The two code snippets that have the issue contain the hex values below. The hex values run from 00 to 0F and from 10 to 1F:
444444444444444444444444444444444444470000000000000000744444444444
0000000000000000000000000000000000000D0123456789ABCDEFDB0000000000
444444444444444444444444444444444444471111111111111111744444444444
0000000000000000000000000000000000000D0123456789ABCDEFDB0000000000
Unix reformats with a new line when being viewed in iDZ
In the old days, prior to the 1985 standard, COBOL did not allow hex literals. Back then, when needed, programmers would enter the hex values directly into the editor. Some programmers didn't get the memo, and continue to do things this way. Any supported mainframe COBOL compiler will now allow hex literals in addition to the old way.
I would suggest modifying the offending COBOL program source line to read
05 FILLER PIC X(16)
VALUE X'000102030405060708090A0B0C0D0E0F'.

removing unconventional field separators (^@^@^@) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
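A rough Python sketch of that check and conversion follows. The function names are invented for this example, and what counts as "a large number" of NULs is left to the caller, since no fixed threshold is given here.

```python
def nul_fraction(data: bytes) -> float:
    """Fraction of bytes in data that are NUL (0.0 for empty input)."""
    return data.count(0) / len(data) if data else 0.0


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 input as UTF-8.

    The "utf-16" codec honours a byte order mark; without one it assumes
    the platform's native byte order, so "utf-16-le" or "utf-16-be" may
    be the better choice for BOM-less files.
    """
    return data.decode("utf-16").encode("utf-8")
```

A fraction close to one half for mostly-ASCII text is the "one NUL every other byte" situation described above.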
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
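Both tasks from the question (finding the affected lines and removing the NULs) can also be sketched in a few lines of Python. The helper names are hypothetical, and the input is assumed to be small enough to read into memory.

```python
def lines_with_nul(data: bytes) -> list[int]:
    """Return the 1-based numbers of the lines that contain a NUL byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\x00" in line
    ]


def strip_nul(data: bytes) -> bytes:
    """Return data with every NUL byte removed (like tr -d '\\000')."""
    return data.replace(b"\x00", b"")
```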
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
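The same two-step pipeline, sketched in Python for illustration (the function name is made up): `bytes.translate` with a delete argument plays the role of `tr -d`, and `replace` the role of the second `tr`.

```python
def fix_crlf_nul(data: bytes) -> bytes:
    """Delete every LF and NUL, then turn each remaining CR into LF."""
    without_lf_nul = data.translate(None, b"\n\x00")
    return without_lf_nul.replace(b"\r", b"\n")
```

Like the tr version, this removes every NUL and every original line feed, not only those that directly follow a carriage return.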
Here is an example of how to remove NUL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you can use the globbing pattern **/*.txt (if your shell supports it).
This is useful for scripting, since sed's -i parameter is a non-standard extension (it is not specified by POSIX, and GNU and BSD sed handle it differently).
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove a trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDF's from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes being appended to the PDF file. When our system processed these files, files that had the trailing NULL value caused the system to crash.
Originally we were using sed but sed behaves differently on Macs and Linux machines. We needed a platform independent method to extract the trailing null value. Php was the best option. Also, it was a PHP application so it made sense :)
This script performs the following operation:
Take the binary file, convert it to hex (binary files don't like exploding by new lines or carriage returns), explode the string using the CR LF byte pair (0d0a) as the delimiter, pop the last member of the array if its value is null (00), implode the array using the same pair as the glue, and process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using CR LF (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using CR LF (0d0a) as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
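For readers outside PHP, a much simpler variant of the same idea can be sketched in Python. Note that this is not a port of the script above: it works on the raw bytes without a hex round trip, strips every trailing NUL rather than a single one, and leaves the final CR LF in place. The function name is hypothetical.

```python
def strip_trailing_nul(data: bytes) -> bytes:
    """Remove NUL bytes from the very end of data, leaving the rest intact."""
    return data.rstrip(b"\x00")
```

NUL bytes in the middle of the file are untouched, which matters for binary formats such as PDF.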

Using awk to detect UTF-8 multibyte character

I am using awk (symlinked to gawk on my machine) to read through a file and get a character count per line to test if a file is fixed width. I can then re-use the following script with the -b --characters-as-bytes option to see if the file is fixed width by byte.
#!/usr/bin/awk -f
BEGIN {
width = -1;
}
{
len = length($0);
if (width == -1) {
width = len;
} else if (len != 0 && len != width) {
exit 1;
}
}
I want to do something similar to test whether each line in a file has the same number of bytes and characters, so that I can assume all characters are a single byte (I do realize this is subject to false negatives). The challenge is that I would like to run through the file one time and break out at the first mismatch. Is there a way to set the -b option from within an awk script, similar to how you can adjust FS? If this isn't possible, I'm open to options outside of awk. I can always just write this in C if I have to, but I wanted to make sure there isn't something already available.
Efficiency is what I am aiming for here. Having this information will help me skip a costly process, so I don't want this check itself to be costly. I'm dealing with files that can be over 100 million lines long.
Clarification
I want something like the above. Something like this
#!/usr/bin/awk -f
{
if (length($0) != bytelength($0))
exit 1;
}
I don't need any output. I will just trigger off the return code ($? in bash). So exit 1 if this fails. Obviously bytelength is not a function. I'm just looking for a way to achieve this without running awk twice.
UPDATE
sundeep's solution works for what I have described above:
awk -F '' -l ordchr '{for(i=1;i<=NF;i++) if(ord($i)<0) {exit 1;}}'
I was operating under the assumption that awk would count a higher-end character with a Windows single-byte encoding above 0x7F as a single character, but it actually doesn't count it at all. So byte length would still not be the same as length. I guess I will need to write this in C for something that specific.
Conclusion
So I think I did a poor job of explaining my problem. I receive data that is encoded in either UTF-8 or a Windows-style single-byte encoding like CP1252. I wanted to check if there are any multibyte characters in the file and exit if one is found. I originally wanted to do this in awk, but working with files that may have different encodings has proven difficult.
So in a nutshell if we assume a file with a single character in it:
CHARACTER  FILE_ENCODING  ALL_SINGLE_BYTE  IN_HEX
á          UTF-8          false            0xC3 0xA1
á          CP1252         true             0xE1
a          ANY            true             0x61
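The table can be reproduced with a small Python sketch that compares the character count with the encoded byte count. The function name `all_single_byte` is made up to match the column heading; this only illustrates the property being asked about, not a way to detect it in a file of unknown encoding.

```python
def all_single_byte(text: str, encoding: str) -> bool:
    """True if every character of text encodes to exactly one byte."""
    return len(text.encode(encoding)) == len(text)


for char, encoding in (("\u00e1", "utf-8"), ("\u00e1", "cp1252"), ("a", "ascii")):
    encoded = char.encode(encoding)
    print(char, encoding, all_single_byte(char, encoding), encoded.hex(" "))
```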
You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 has the form 0b11xxxxxx, and the next byte must have the form 0b10xxxxxx, where x represents any bit value (from Wikipedia).
So you can detect such sequence with sed by matching the hex ranges and exit with nonzero exit status if found:
LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
I.e., match bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111].
I think \x?? and q are both GNU extensions to sed.
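Where GNU sed is not available, the same byte-pair test can be sketched in Python with a bytes regular expression. The names below are invented for this example.

```python
import re

# A lead byte (0b11xxxxxx) immediately followed by a continuation byte
# (0b10xxxxxx), the same two ranges as in the sed command.
LEAD_THEN_CONT = re.compile(rb"[\xC0-\xFF][\x80-\xBF]")


def has_lead_continuation_pair(data: bytes) -> bool:
    """True if data contains a UTF-8-style lead byte plus continuation byte."""
    return LEAD_THEN_CONT.search(data) is not None
```

Like the sed version, this only looks at two bytes, so a CP1252 sequence such as E2 84 followed by an ASCII letter is still reported as a match.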
The best answer is imho actually the one with grep provided by Sundeep in the comment. You should try to get that working. The answer below makes use of sed in a similar way. I will probably delete it, as it really doesn't add anything to the grep solution.
What about this?
[[ -z "$(LANG=C sed -z '/[\x80-\xFF]/d' <(echo -e 'one\ntwo\nth⌫ree'))" ]]
echo $?
<(echo -e 'one\ntwo\nth⌫ree') is just an example file with a multibyte character in it
the whole sed command does one of two things:
outputs the empty string if the file contains a multibyte character
outputs the full file if it doesn't
the [[ -z string ]] test returns 0 if the string has length zero (a non-ASCII byte was found) and 1 otherwise.
Quote from the same wikipedia page above :
Fallback and auto-detection: Only a small subset of possible byte
strings are a valid UTF-8 string: the bytes C0, C1, and F5 through FF
cannot appear, and bytes with the high bit set must be in pairs, and
other requirements.
In octal, that means \xC0 = \300, \xC1 = \301, and \xF5 through \xFF = \365 through \377 are never valid in UTF-8.
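As a small illustration, the following Python sketch builds that set of bytes and reports where the first one occurs. The names are hypothetical; the byte values come straight from the quote above.

```python
# 0xC0, 0xC1 and 0xF5..0xFF: bytes that cannot appear in valid UTF-8.
NEVER_IN_UTF8 = bytes([0xC0, 0xC1, *range(0xF5, 0x100)])


def first_impossible_byte(data: bytes) -> int:
    """Return the offset of the first byte that is never valid in UTF-8, or -1."""
    for offset, byte in enumerate(data):
        if byte in NEVER_IN_UTF8:
            return offset
    return -1
```

A result of -1 does not prove that the data is valid UTF-8; it only says that none of these bytes is present.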
Knowing that this space isn't valid UTF-8 is plenty useful in terms of wiggle room for one to insert custom delimiters inside any string :
pick any of those bytes, say \373, and once a quick if statement is used to verify it doesn't exist for that line, you can now perform custom text manipulation tricks of your choice, with a single-byte delimiter, even if it involves inserting them right in between the UTF8 bytes for a single code point, and it won't ruin the unicode at all. once you're done with the logic block, simply use a quick gsub( ) to remove all traces of it.
If that byte (\373, i.e. \xFB) exists, well, you're likely encountering either a binary file, or partially corrupted UTF8 text data.
One use case, such as in my own modules, is a UTF-8 code-point-level-safe* substr( ) function. So instead of manually counting out the points 1 at a time, first use regex to count out max bytes of any code-point. Let's say 3-bytes (since 4-bytes ones are still rare in practice).
Then i apply 1 pad of \373 next to the 2-byte ones (I pad it to the left of [\302-\337]), and 2 pads of it, i.e. \373\373, next to ASCII ones, and voila, now all UTF8 code points have a fixed width, so a substr( ) becomes a mere multiplication exercise of it.
run a byte-level substr( ) on those start and end points, apply gsub(/[\373]+/, "", s) to throw away all the padding bytes, and now you have a usable* UTF-8-safe substr( ) function for all the variants of awk that aren't unicode-aware. This approach also works for multi-line records, and absolutely does not affect how FS and RS interacts with the record.
(if u need 4-bytes, just pad more)
*i haven't incorporated any fancy logic to account for code-points that are post-decomposition components that supposedly grouped together as a single logical unit for string manipulation purposes.
for non-unicode aware versions of awk,
gawk -b/ LC_ALL=C /mawk/mawk2 'BEGIN {
reUTF8="([\\000-\\177]|" \
"[\\302-\\337][\\200-\\277]|" \
"\\340[\\240-\\277][\\200-\\277]|" \
"\\355[\\200-\\237][\\200-\\277]|" \
"[\\341-\\354\\356-\\357][\\200-\\277]" \
"[\\200-\\277]|\\360[\\220-\\277]" \
"[\\200-\\277][\\200-\\277]|" \
"[\\361-\\363][\\200-\\277][\\200-\\277]" \
"[\\200-\\277]|\\364[\\200-\\217]" \
"[\\200-\\277][\\200-\\277])" }'
Set this regex, and you should be able to get the same total UTF8-compliant character count that gnu-wc -lcm reports, even for purely binary files like mp3s, mp4s, or compressed gz/xz/zip archives. As long as a piece of your data is UTF8-compliant, this will count it, as specified in Unicode 13.
Your locale settings don't matter here whatsoever, nor does your platform, OS version, awk version, or awk variant.
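To make the structure of that regex easier to follow, here is a Python transcription of it (octal escapes rewritten as hex) that counts well-formed UTF-8 characters in a byte string. This is a sketch for illustration and has not been compared against wc; the names `UTF8_CHAR` and `count_utf8_chars` are made up.

```python
import re

# One alternative per well-formed UTF-8 shape, mirroring the awk regex above.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"                        # ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"            # 2-byte sequences
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"        # 3-byte, no overlong forms
    rb"|\xed[\x80-\x9f][\x80-\xbf]"        # 3-byte, no surrogates
    rb"|[\xe1-\xec\xee-\xef][\x80-\xbf]{2}"
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"     # 4-byte, no overlong forms
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"     # 4-byte, up to U+10FFFF
)


def count_utf8_chars(data: bytes) -> int:
    """Count the well-formed UTF-8 characters found in data."""
    return sum(1 for _ in UTF8_CHAR.finditer(data))
```

Bytes that do not fit any of the shapes are simply skipped, which is what lets the count run over arbitrary binary input.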
$ echo; time pvE0 < MV84/*BLITZE*webm | gwc -lcm
in0: 449MiB 0:00:10 [44.4MiB/s] [44.4MiB/s] [================================================>] 100%
1827289 250914815 471643928
real 0m10.188s
user 0m10.075s
sys 0m0.352s
$ echo; time pvE0 < MV84/*BLITZE*webm | mawk2x 'BEGIN { FS = "^$"} { bytes += lengthB0(); chars += lengthC0(); } END { print --NR, chars+NR, bytes+NR }'
in0: 449MiB 0:00:16 [27.0MiB/s] [27.0MiB/s] [================================================>] 100%
1827289=250914815=471643928
real 0m16.756s
user 0m16.621s
sys 0m0.449s
the file being tested is a 449 MB .webm music video clip from youtube that's 3840x2160 VP9 + Opus codecs. not too shabby for an interpreted scripting language to be this close to compiled C-binaries.
And it's only this slow for the hideously long regex to account for invalid bytes. If you're extremely sure your data is fully UTF8 compliant text, you can further optimize that regex so that mawk2 can go faster than both gnu-wc and bsd-wc :
$ brc; time pvE0 < "${m3t}" | awkwc4m
in0: 1.85GiB 0:00:14 [ 128MiB/s] [ 128MiB/s] [================================================>] 100%
12,494,275 lines 1,285,316,715 utf8 (349,725,658 uc) 1,891.656 MB ( 1983544693) /dev/stdin
real 0m14.753s <—- Custom Bash function that's entirely AWK
$ brc; time pvE0 < "${m3t}" |gwc -lcm
in0: 1.85GiB 0:00:28 [67.3MiB/s] [67.3MiB/s] [================================================>] 100%
12494275 1285316715 1983544693
real 0m28.165s <—— GNU WC
$ brc; time pvE0 < "${m3t}" |wc -lcm
in0: 1.85GiB 0:00:22 [85.5MiB/s] [85.5MiB/s] [================================================>] 100%
12494275 1285316715
real 0m22.181s <—— BSD WC
ps : "${m3t}" is a 1.85GB flat .txt file that's 12.5 million rows, and 13 fields each, filled to the brim with multibyte unicode characters (349.7 million of them).
gawk -e (in unicode mode) will complain about that regex. To circumvent that annoyance, use this regex which is the same as the one above, but expanded out to make gawk -e happy
([\000-\177]|((\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337)|(\340)(\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\355)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|((\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\356|\357)|(\360)(\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\361|\362|\363)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\364)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277){2})
== update = 9-20-21 ========
so turns out even the pre-slicing isn't necessary at all.
gawk -e 'BEGIN { ORS = ":";
a0 = a = "\354\236\274";
n = 1; # this # is for how many bytes
# you'd like to see
b1 = b = \
sprintf("%.*s",n + 1,a = "\301" a);
sub("^"b, "", a)
sub(/^\301/,"", b)
sub("\236|\270|\271|\272|\273|\274|\275|\276|\277",":&", a)
# for that string,
# chain up every byte in \x80-\xBF range,
# but make sure not to tag on "( )" at the 2 ends.
# that will make the regex a lot slower,
# for reasons unclear to me
printf(":" a0 "|" b1 "|" b ORS a "|") } ' | odview
yielding this output
: 잼 ** ** | 301 354 | 354 : 236 : 274 |
072 354 236 274 174 301 354 174 354 072 236 072 274 174
: ? 9e ? | ? ? | ? : 9e : ? |
58 236 158 188 124 193 236 124 236 58 158 58 188 124
3a ec 9e bc 7c c1 ec 7c ec 3a 9e 3a bc 7c
voila ~~ using only sprintf() and [g]sub(), every individual byte is at ur fingertip, even when in unicode code, without needing to use arrays at all.
===========================
since we're on the topic of awk and UTF8, a quick tip share (only on the multi-byte part):
if you're in gawk unicode-aware mode, and wanna access individual bytes of just a few utf8 chars (e.g. performing URL encoding, analyze them individually, or like packing a DWORD32), but don't wanna use the cost-heavy approach of gsub(//,"&"SUBSEP) then splitting into an array, a quick-n-dirty method is just
gsub(/\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337|\340|\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\355|\356|\357|\360|\361|\362|\363|\364/, "&\300")
잼 ** ** = 354 *300*<---236 274
354 236 274 075 354 300 236 274
? 9e ? = ? ? 9e ?
236 158 188 61 236 192 158 188
ec 9e bc 3d ec c0 9e bc
Basically, "slicing" properly encoded UTF8 characters right between the leading byte and the trailing ones. In my personal trial-and-error, i find the 13 bytes illegal within UTF8 (xC0 xC1 xF5-xFF) to be best suited for this task.
say original var is called b3. then use
b2 = sprintf("%.3s",b3)
to extract out \354 \300 \236.
sub(b2,"",b3)
so now b3 will only have \274.
b1 = sprintf("%.1s", b2)
b1 will now be just \354
sub(b1"\300","",b2)
and finally, b2 will actually just be the 2nd byte of \236
The reason for this painfully tedious process is that 1 gsub doubling every byte then another full array split() plus 3 more array entry lookups can be slightly slow. If you wanna count bytes first,
lenBytes = match($0, /$/) - 1;
# i only recently discovered
# this trick that works decently well
That match trick even works for a random collection of bytes that have no resemblance to Unicode, and gawk is very happy to give you the exact result. That's the only meaningful way to run match( ) against random bytes and not get an error message from gawk (the other being match($0,/^/), but that's quite useless). Try .* or . or .+ and they all end up erroring about a bad character in the locale.
** don't use index( ). if you need exact positions, then just split into array.
And if you need to do byte-level substring
Don't directly use substr() for random bytes in gawk unicode-mode.
Use sprintf("%.53s",b3) instead.
Before slicing, that syntax gives you 53 unicode characters.
After slicing, it's 53 bytes from start of string.
i even chain them up myself as if they're gensub() even though it's good ole' sub() :
if (sub(reANY340357,"&\301",z)||3==b) {
sub((x=sprintf("%.1s",(y=sprintf("%.3s",z))sub(y,"",z)))"\301","",y)
And once you're done with everything you need, a quick gsub(/\300|\301/, "") will restore you the proper UTF8 string.
Hope this is useful =)
Note : The code in this answer can be used to detect valid UTF-8 multi-byte characters. It will also fail if there are invalid UTF-8 byte sequences. However, it does not guarantee that your file is intended to be UTF-8. All valid UTF-8 code is also valid CP1252, but not all CP1252 is valid UTF-8.
So it seems this may be a bit of a niche problem. For me, that means time to resort to C. This should work, but, in the spirit of the question, I won't accept it in case someone can come up with an awk solution.
Here is my C solution I called hasmultibyte:
#include <stdio.h>
#include <stdlib.h>

void check_for_multibyte(FILE* in)
{
    int c = 0;
    while ((c = getc(in)) != EOF) {
        /* Floating continuation byte */
        if ((c & 0xC0) == 0x80)
            exit(5);
        /* utf8 multi-byte start */
        if ((c & 0xC0) == 0xC0) {
            int continuations = 1;
            switch (c & 0xF0) {
            case 0xF0:
                continuations = 3;
                break;
            case 0xE0:
                continuations = 2;
            }
            int i = 0;
            for (; i < continuations; ++i)
                if ((getc(in) & 0xC0) != 0x80)
                    exit(5);
            exit(0);
        }
    }
}

int main (int argc, char** argv)
{
    FILE* in = stdin;
    int i = 1;
    do {
        if (i != argc) {
            in = fopen(argv[i], "r");
            if (!in) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
        }
        check_for_multibyte(in);
        if (in != stdin)
            fclose(in);
    } while (++i < argc);
    return 5;
}
In the shell environment, you could then use it like this:
if hasmultibyte file.txt; then
...
fi
It will also read from stdin if no file is provided, in case you want to use it at the end of a pipeline:
if cat file.txt | hasmultibyte; then
...
fi
TEST
Here is a test of the program. I created 3 files with the name Hernández in it:
name_ascii.txt - Uses a instead of á.
name_cp1252.txt - Encoded in CP1252
name_utf-8.txt - Encoded in UTF-8 (default)
The � you see appears because the terminal expects UTF-8 and the byte is not valid UTF-8. It is, in fact, the character á in CP1252.
> file name_*
name_ascii.txt: ASCII text
name_cp1252.txt: ISO-8859 text
name_utf-8.txt: UTF-8 Unicode text
> cat name_*
Hernandez
Hern�ndez
Hernández
> hasmultibyte name_ascii.txt && echo multibyte
> hasmultibyte name_cp1252.txt && echo multibyte
> hasmultibyte name_utf-8.txt && echo multibyte
multibyte
Update
This code has been updated from the original. It has been changed to read the first byte of a multibyte character and read how many bytes the character should be. This can be determined as follows.
first byte   number of bytes
110xxxxx     2
1110xxxx     3
11110xxx     4
This method is more reliable and will reduce inaccuracies. The original method searched for a byte of the form 11xxxxxx and checked the next byte for a continuation byte (10xxxxxx). That will produce a false positive given something like â„x in a CP1252 file. In binary, this is 11100010 10000100 01111000. The first byte claims a character of 3 bytes, the second is a continuation byte, but the third is not. This is not a valid UTF-8 sequence.
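The decision logic described here can be restated as a short Python sketch, which may be easier to experiment with than the C program. It mirrors the C code's behaviour as read from the listing (the first non-ASCII byte decides the result) but returns a boolean instead of an exit status; the function name is reused only for readability.

```python
def has_multibyte(data: bytes) -> bool:
    """True if the first non-ASCII byte starts a complete multibyte sequence."""
    for i, c in enumerate(data):
        if (c & 0xC0) == 0x80:
            return False              # stray continuation byte
        if (c & 0xC0) == 0xC0:        # lead byte: 11xxxxxx
            if (c & 0xF0) == 0xF0:
                need = 3              # 4-byte character
            elif (c & 0xF0) == 0xE0:
                need = 2              # 3-byte character
            else:
                need = 1              # 2-byte character
            tail = data[i + 1:i + 1 + need]
            return len(tail) == need and all((b & 0xC0) == 0x80 for b in tail)
    return False                      # pure ASCII, nothing multibyte found
```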
Additional testing
> # create files
> echo "â„¢" | iconv -f UTF-8 -t CP1252 > 3byte.txt
> echo "Ââ„¢" | iconv -f UTF-8 -t CP1252 > 3byte_fail.txt
> echo "â„x" | iconv -f UTF-8 -t CP1252 > 3byte_fail2.txt
> hasmultibyte 3byte.txt; echo $?
0
> hasmultibyte 3byte_fail.txt; echo $?
5
> hasmultibyte 3byte_fail2.txt; echo $?
5

SHA256 generation different for file and content of this file

I use online SHA256 converters to calculate a hash for a given file. There, I have seen an effect I don't understand.
For testing purposes, I wanted to calculate the hash for a very simple file. I named it "test.txt", and its only content is the string "abc", followed by a new line (I just pressed enter).
Now, when I put "abc" and newline into a SHA256 generator, I get the hash
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb
But when I put the complete file into the same generator, I get the hash
552bab6864c7a7b69a502ed1854b9245c0e1a30f008aaa0b281da62585fdb025
Where does the difference come from? I used this generator (in fact, I tried several ones, and they always yield the same result):
https://emn178.github.io/online-tools/sha256_checksum.html
Note that this difference does not arise without newlines. If the file just contains the string "abc", the hash is
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
for the file as well as just for the content.
As noted in my comment, the difference is caused by how newline characters are represented across different operating systems (see details here):
On UNIX and UNIX-like systems, newlines are represented by a line feed character (\n).
On DOS and Windows systems, newlines are represented by a carriage return followed by a line feed character (\r\n).
Compare the following two commands and their output, corresponding to the SHA256 values in your question:
echo -en "abc\n" | sha256sum
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb
echo -en "abc\r\n" | sha256sum
552bab6864c7a7b69a502ed1854b9245c0e1a30f008aaa0b281da62585fdb025
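The same comparison can be made without a shell. The sketch below uses Python's standard hashlib module; the helper name `sha256_hex` is made up, and the printed digests can be compared with the values in the question.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Same three inputs as in the question: no newline, LF, and CR LF.
for label, payload in (("abc", b"abc"), ("abc + LF", b"abc\n"), ("abc + CRLF", b"abc\r\n")):
    print(f"{label:12} {sha256_hex(payload)}")
```

Hashing bytes rather than text makes it explicit which line ending is actually being fed to the hash function.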
The issue you are having could come from how the newline is encoded.
On Windows a newline is written as \r\n, and on Linux it is written as \n.
These two characters have different decimal values (\r is 13 and \n is 10).
More info you can find here:
https://en.wikipedia.org/wiki/Newline
https://en.wikipedia.org/wiki/List_of_Unicode_characters
I faced the same issue, and providing the data in hex mode helped me understand the actual behavior.
Canonicalization of the data needs to be performed before the SHA calculation, which eliminates such issues. It needs to be performed both on the generating side and on the verifying side.
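One possible canonicalization, sketched in Python under the assumption that only the CR LF versus LF difference matters: convert every CR LF pair to LF before hashing. The function name is hypothetical, and real protocols usually define their own canonical form.

```python
import hashlib


def canonical_sha256(data: bytes) -> str:
    """Hash data after normalizing CR LF line endings to LF."""
    normalized = data.replace(b"\r\n", b"\n")
    return hashlib.sha256(normalized).hexdigest()
```

With this in place, the Windows and Unix versions of the test file produce the same digest, as long as both sides apply the same normalization.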

Failure to read full line including embedded zero bytes

Lua script:
i=io.read()
print(i)
Command line:
echo -e "sala\x00m" | lua ll.lua
Output:
sala
I want it to print all characters from the input, similar to this:
salam
in HEX editor:
0000000: 7361 6c61 006d 0a sala.m.
How can I print all characters from the input?
You tripped over one of the few places where the Lua standard library is still not 8-bit-clean.
Specifically, file reading line-by-line is not embedded-0 proof.
The reason it isn't yet is an unfortunate combination of:
Only standard C90 or equally portable constructs are allowed for the core, which does not provide for efficient 0-clean text parsing.
Every solution discussed to date on the mailing list under that constraint has considerable overhead.
Embedded 0-bytes in text files are quite rare.
Workarounds:
Use a modified library, fixing these formats: "*l" "*L" for file:read(...)
Parse your raw data yourself (read a block using a number, or as much as possible using "*a").
Badger the Lua developers/maintainers for a bugfix until they give in.
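The second workaround (read the raw data in one go and split it into lines yourself) is language-independent. It is sketched here in Python rather than Lua, purely to illustrate the idea; the function name is made up.

```python
import io


def read_lines_raw(stream) -> list[bytes]:
    """Read a binary stream completely and split it into lines.

    Splitting happens on the bytes after reading, so embedded NUL bytes
    stay inside their line instead of cutting it short.
    """
    return stream.read().splitlines()


# The input from the question: "sala", a NUL byte, "m", and a newline.
print(read_lines_raw(io.BytesIO(b"sala\x00m\n")))
```

In Lua the corresponding steps would be reading with "*a" and splitting the resulting string on newlines.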