How to read text files transferred as binary - file-io

My code copies files from an FTP server (using text transfer mode) to a local disk and then tries to process them.
All files contain only text, and values are separated by newlines. Sometimes files are moved to this FTP server using binary transfer mode, and it looks like this messes up the line endings.
Using a hex editor, I compared the line endings depending on the transfer mode used to send files to the FTP server:
using text mode: line endings are 0D 0A
using binary mode: line endings are 0D 0D 0A
Is it possible to modify my code so it can read files in both cases?
Code from the job that illustrates my problem and shows how I'm reading the file:
(here I use the same file, which contains 14 rows of data)
int i;
container con;
container files = ["c:\\temp\\axa_keio\\ascii.txt", "c:\\temp\\axa_keio\\binary.txt"];
boolean purchLineFirstRow;
IO inFile;
;
for (i = 1; i <= conlen(files); i++)
{
    inFile = new AsciiIO(conpeek(files, i), "R");
    inFile.inFieldDelimiter('\n');
    con = inFile.read();
    info(int2str(conlen(con)));
}
Files come from a Unix system to a Windows system.
I'm not sure, but maybe the question could be: "Which inFieldDelimiter value should I use to read both Unix and Windows line endings?"

Use inRecordDelimiter:
inFile.inRecordDelimiter('\n');
instead of:
inFile.inFieldDelimiter('\n');
There may still be a dangling CR on the last field; you may wish to remove it:
strRem(conpeek(con, conlen(con)), '\r')
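A rough, untested sketch of the adjusted loop body combining both suggestions, reusing the variables from the job above (the rows counter is a new variable introduced here, not part of the original code):

int rows;

inFile = new AsciiIO(conpeek(files, i), "R");
inFile.inRecordDelimiter('\n');

while (inFile.status() == IO_Status::Ok)
{
    con = inFile.read();
    if (con)
    {
        // drop the dangling CR left behind by a 0D 0D 0A line ending
        con = conpoke(con, conlen(con), strRem(conpeek(con, conlen(con)), '\r'));
        rows++;
    }
}
info(int2str(rows));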
See also: http://en.wikipedia.org/wiki/Line_endings

Issues converting a small Hex value to a Binary value

I am trying to take the contents of a file that has a Hex number, convert that number to Binary, and output it to a file.
This is what I am trying but not getting the binary value:
xxd -r -p Hex.txt > Binary.txt
The contents of Hex.txt is: ff
I have also tried FF and 0xFF, but would like to just use ff since the device I am pulling the info from has it in that format.
Instead of 11111111, which it should be, I get a y with 2 dots above it.
If I change it to ee, I get an i with 2 dots. It seems to be reading it just fine, but according to what I have read about the xxd -r -p command, it is not outputting it in the correct format.
The other ways I have found to convert Hex to Binary have either also not worked or involve a pretty big Bash script that seems unnecessary for what I thought would be a simple task.
This also gives me the y with 2 dots.
$ for i in $(cat Hex.txt) ; do printf "\x$i" ; done > Binary.txt
For some reason almost every solution I find gives me this format instead of a human readable Binary value with 1s and 0s.
Any help is appreciated. I am planning to use this in a script that pulls the relay values from Digital Loggers devices using curl and gives Home Assistant a readable file to record the relay state. The Digital Loggers curl command gives the state of all 8 relays at once as Hex, instead of letting you pull the status of a specific relay.
If "file.txt" contains:
fe
0a
and you run this:
perl -ane 'printf("%08b\n",hex($_))' file.txt
You'll get this:
11111110
00001010
If you use it a lot, you might want to make a bash function of it in your login profile, along these lines (being extremely respectful of the spaces and semicolons that might look unnecessary):
bin(){ perl -ane 'printf("%08b\n",hex($_))' $1 ; }
Then you'll be able to do:
bin file.txt
If you dislike Perl for some reason, you can achieve something similar without it as follows:
tr '[:lower:]' '[:upper:]' < file.txt |
while read h ; do
    echo "obase=2; ibase=16; $h" | bc
done
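For the same file.txt as above, that loop should print the following; note that, unlike Perl's %08b format, bc does not zero-pad its output:

11111110
1010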

Writing lines to a binary file

I'm playing further with Raku's CommaIDE and I want to print a binary file line by line.
I've tried this, but it doesn't work:
for "G.txt".IO.lines -> $line {
say $_;
}
How shall I fix it? It's obviously incorrect.
EDIT
This doesn't work either; see the snippet below:
for "G.txt".IO.lines -> $line {
say $line;
}
You're showing us h.raku but Comma is giving you an error regarding c.raku, which is some other file in your Comma project.
It looks like you're working with a text file, not binary. Raku makes a clear distinction here: a text file is treated as text, regardless of encoding. If it's UTF-8, using .lines as you are now should work just fine because that's the default. If it's some other encoding, you can call .lines(:enc<some-other-encoding>). If it's truly binary, then the concept of "lines" really has no meaning, and you want something more like .slurp(:bin), which will give you a Buf[uint8] for working on the byte level.
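A minimal sketch of those two approaches, assuming the file really is named G.txt as in the question (the latin-1 encoding is only an example value):

# Text with an explicit encoding:
for "G.txt".IO.lines(:enc<latin-1>) -> $line {
    say $line;
}

# Truly binary data:
my $bytes = "G.txt".IO.slurp(:bin);   # Buf[uint8] of raw bytes
say $bytes.elems;                     # total byte count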
The question specifically refers to reading a binary file, for which reading line-wise may (or may not) make sense--depending on the file.
Here's code to read a binary file straight from the docs (using class IO::CatHandle):
~$ raku -e '(my $f1 = "foo".IO).spurt: "A\nB\nC\n"; (my $f2 = "foo"); with IO::CatHandle.new: $f2 {.encoding: Nil; .slurp.say;};'
Buf[uint8]:0x<41 0A 42 0A 43 0A>
Compare to reading the file with default encoding (utf8):
~$ raku -e '(my $f1 = "foo".IO).spurt: "A\nB\nC\n"; (my $f2 = "foo"); with IO::CatHandle.new: $f2 {.slurp.say;};'
A
B
C
See:
https://docs.raku.org/routine/encoding
Note: the read method of class IO::Handle reads binary by default, so the code is simply:
~$ raku -e '(my $file1 = "foo".IO).spurt: "A\nB\nC\n"; my $file2 = "foo".IO; given $file2.open { .read.say; .close;};'
Buf[uint8]:0x<41 0A 42 0A 43 0A>
See:
https://docs.raku.org/type/IO::Handle#method_read
For further reading, see discussion of Perl5's <> diamond-operator-equivalent in Raku:
https://docs.raku.org/language/5to6-nutshell#while_until
...and some (older) mailing-list discussion of the same:
https://www.nntp.perl.org/group/perl.perl6.users/2018/11/msg6295.html
Finally, the docs refer to writing a mixed utf8/binary file here (useful for further testing):
https://docs.raku.org/routine/encoding#Examples

sqlQuery in R fails when called via source() [duplicate]

The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
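For the file from the question, that call would look like this (same path as above):

eval(parse("C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8"))
character_test()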
We talked about this a lot in the comments to my previous post, but I don't want this to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, as shown below:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that the file-in fails, and it points to the same character as the original poster's error (the one after "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files with many non-ASCII characters in them. But some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file; you tell it what encoding the file is in, so why should your locale matter?
For me (on Windows), the following works:
source.utf8 <- function(f) {
    l <- readLines(f, encoding = "UTF-8")
    eval(parse(text = l), envir = .GlobalEnv)
}
It works fine.
Building on crow's answer, this solution makes RStudio's Source button work.
When hitting that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding = encoding)
    eval(parse(text = l), envir = .GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.
I encountered this problem when I tried to source an .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.
Note that it's not enough to get the encoding right when you read or write a plain text file with non-ASCII characters in RStudio (or R?); the locale setting counts too.
PS. The command is Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Replace the locale value accordingly.
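Presumably the LC_ALL variant described above would then be:

Sys.setlocale(category = "LC_ALL", locale = "chinese")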
On Windows, when you copy-paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII, or whatever the locale dictates), the unknown bytes will be replaced by question marks. If I take the first 4 characters of your string and copy-paste them into e.g. Notepad and then save it, the file becomes, in hex:
52 3F 3F 3F 3F
What you have to do is find an editor that you can set to UTF-8 before copy-pasting the text into it; then the saved file (of your first 4 characters) becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid UTF-8 by [R].
I used "Notepad2" to try this, but I am sure there are many more.

perl gunzip to buffer and gunzip to file have different byte orders

I'm using Perl v5.22.1, Storable 2.53_01, and IO::Uncompress::Gunzip 2.068.
I want to use Perl to gunzip a Storable file in memory, without using an intermediate file.
I have a variable $zip_file = '/some/storable.gz' that points to this zipped file.
If I gunzip directly to a file, this works fine, and %root is correctly set to the Storable hash.
gunzip($zip_file, '/home/myusername/Programming/unzipped');
my %root = %{retrieve('/home/myusername/Programming/unzipped')};
However if I gunzip into memory like this:
my $file;
gunzip($zip_file, \$file);
my %root = %{thaw($file)};
I get the error
Storable binary image v56.115 more recent than I am (v2.10)
so the Storable magic number has been butchered: it should never be that high.
However, the strings in the unzipped buffer are still correct; the buffer starts with pst, which is the correct Storable header. It only seems to be multi-byte values like integers that are being broken.
Does this have something to do with byte ordering, such that writing to a file works one way while writing to a file buffer works in another? How can I gunzip to a buffer without it ruining my integers?
That's not related to the gunzip step but to using retrieve vs. thaw. They expect different input, i.e. thaw expects the output from freeze, while retrieve expects the output from store.
This can be verified with a simple test:
$ perl -MStorable -e 'my $x = {}; store($x,q[file.store])'
$ perl -MStorable=freeze -e 'my $x = {}; print freeze($x)' > file.freeze
On my machine this gives 24 bytes for the file created by store and 20 bytes for freeze. If I remove the leading 4 bytes from file.store, the file is equivalent to file.freeze, i.e. store just added a 4-byte header. Thus you might try to uncompress the file in memory, remove the leading 4 bytes, and run thaw on the rest.
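A rough, untested sketch of that idea, reusing $zip_file from the question:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Storable qw(thaw);

my $zip_file = '/some/storable.gz';

# gunzip into memory, as in the question
my $buf;
gunzip($zip_file, \$buf)
    or die "gunzip failed: $GunzipError";

# drop the 4-byte header that store() adds, then thaw the rest
my %root = %{ thaw(substr($buf, 4)) };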

Cannot append to file when some other process writes to it on *nix systems

I have a very simple piece of code which just writes a small amount of data to a file at some regular interval. Once my program has created the file and appended some data, when I open this file in vim (or any other editor, for that matter) and edit it, my process cannot seem to update the file anymore. I do not see any errors being returned from the syscall. I tried tracing the system calls and did not observe anything weird even while the file is NOT being updated.
Since each process gets its own file table entry, which holds the current offset, all I was expecting was an output file with data interspersed with writes from the two non-cooperating processes (possibly garbled too). But what I am observing is that my program cannot update the file anymore once any other editor writes to the file.
A couple of other interesting observations:
1) When I cat something to the output file, my program can continue to update it, no problem.
2) When multiple instances of my own program are writing to the same file, everything is fine again.
I understand that there's mandatory locking to prevent multiple writes, but I am trying to understand what's happening underneath. Also, this kind of scenario behaves normally for some loggers (like the system log, Apache logs, etc.).
Any ideas to explain this behavior? Also, any hints on how I can debug this further?
My code is pretty simple:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char** argv)
{
    const char* buf;

    if (argc < 2)
        buf = "test->";
    else
        buf = argv[1];

    int fd;
    if ((fd = open("test.log", O_CREAT|O_WRONLY|O_APPEND, 0644)) == -1) {
        perror("Cannot open test.log");
        exit(1);
    }

    int num_bytes = strlen(buf), num_bytes_written = -1;

    /* append a record, flush it to disk, then wait before the next one */
    while (1) {
        if ((num_bytes_written = write(fd, buf, num_bytes)) == -1) {
            perror("Could not write to fd");
        }
        fsync(fd);
        sleep(5);
    }
}
When the vim(1) editor exits, it's likely replacing the original file with the edited version. Your process is holding the original file open, but that file no longer exists in the sense that its directory entry has been replaced, so no process that doesn't already have the file open can access it. Your process is now appending to a file that can't be accessed by any other process. Once your process closes the file, it will be gone for good (unless you run a partition recovery program).
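One way to observe this from the shell (assuming the log file is test.log, as in the code above) is to compare the inode number before and after saving in vim:

$ ls -i test.log     # note the inode number
$ vim test.log       # make a change and :wq
$ ls -i test.log     # a different inode means vim wrote a new file; the old one
                     # is still held open (and appended to) by your program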
Your vim editor works on a cached version of your file. It modifies this cache while your other program appends to the original file. When you save with vim, you overwrite the original file with the updated cached file and lose all the logs written in the meantime.