Reading characters from a text file

I have an assignment to read characters one by one from one text file and write them to another text file using PostScript. I am at the point where I can read line by line from one text file and write each line to another.
This version works...
%!PS
/infile (input.txt) (r) file def   % open files and save file objects
/outfile (output.txt) (w) file def
/buff 128 string def               % your buffer for reading operations
{ % loop
  infile buff readstring
  { %ifelse
    outfile exch writestring
  }
  { %else
    outfile exch writestring
    infile closefile
    outfile closefile
    exit % exit the loop
  } ifelse
} bind loop
But when I try to read individual characters I get an error saying it's a typemismatch, and I am unsure how to resolve it.
Here is the code:
/infile (input.txt) (r) file def   % open files and save file objects
/outfile (output.txt) (w) file def
/buff 1 string def                 % your buffer for reading operations
{ % loop
  infile buff read
  { %ifelse
    outfile exch write
  }
  { %else
    outfile exch write
    infile closefile
    outfile closefile
    exit % exit the loop
  } ifelse
} bind loop

So... where does the error occur? read? write? something else?
Whenever you debug PostScript, this is an important place to start: figure out which operator is throwing your error, and where in your program the error is thrown. If you aren't sure, sprinkle 'print' and '==' around to follow the execution thread. E.g.
infile buff read
{ %ifelse
  (in read loop, character read is ) print dup == flush
  outfile exch write
}
{ %else
  outfile exch write
  infile closefile
  outfile closefile
  exit % exit the loop
} ifelse
Note that you are trying to write to outfile when 'read' returns false (your %else clause); that really isn't going to work, because 'read' didn't leave anything on the stack (it didn't read anything, which is why it returned false). I suspect this is your problem, though I haven't tried to debug the program other than desk checking.
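For reference, here is a minimal sketch of a character-by-character copy loop, assuming the same file names as above. Note that read takes only the file operand and, on success, leaves an integer character code (not a string) on the stack:
%!PS
/infile (input.txt) (r) file def
/outfile (output.txt) (w) file def
{ % loop
  infile read                 % -> int true (one character code read) | false (end of file)
  { %ifelse: got a character code
    outfile exch write        % write expects: file int
  }
  { %else: end of file, nothing was left on the stack
    infile closefile
    outfile closefile
    exit
  } ifelse
} bind loop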

Related

Non blocking read from GNU awk coprocess?

I would like to implement incremental execution of scripts using gawk in order to interleave script source and script output in a document.
The idea would be to read script lines into awk, print them, and also pipe them into an appropriate interpreter. Then, on a cue from the input file, read any output from the coprocess and print it to standard output. But it seems that I must know how much output has been generated before looping over the coprocess output.
Is there any way to do a non-blocking read from the coprocess?
function script_checkpoint() {
    while (("python3" |& getline output) > 0)
        print output
}
/^# checkpoint/ { script_checkpoint(); next }
{ print; print $0 |& "python3" }
END { script_checkpoint() }
EDIT: I have tried to implement this without a coprocess by buffering the input lines until a checkpoint and just letting the interpreter print to standard output itself, but the interpreter always buffers its output until the stream closes. I don't want to close it until the program ends, in order to preserve its internal state.
EDIT: made it clearer that my first intended use case is running Python scripts. Here is a sample input/output pair.
print('first line')
# checkpoint
print('second line')
should result in
print('first line')
first line
print('second line')
second line
The general issue:
while ((interpreter |& getline output) > 0) runs until it sees an EOF but ...
interpreter does not end/terminate/exit, thus no EOF is sent so ...
awk hangs while waiting for interpreter to send more data so ...
we end up with a deadlock situation (awk waiting for input from interpreter; interpreter waiting for input from awk)
Assumptions:
need to maintain a single invocation of interpreter throughout the run (per a comment from OP); net result: awk cannot depend on interpreter sending an EOF
interpreter can be modified (to generate additional output)
the awk script has no way of knowing how many lines of output will be generated by interpreter
One idea is to set up a handshake between awk and interpreter. Within the while ((interpreter |& getline output) > 0) loop we'll test for our handshake and, when we see it, break out of the loop and return to the main awk script.
For demo purposes I'll use a simple bash script that does some handshake processing otherwise just prints to stdout whatever it reads from stdin:
$ cat interpreter
#!/usr/bin/bash
while read -r line
do
    if [[ "${line}" = 'checkpoint' ]]   # received 'checkpoint' handshake?
    then
        echo "CHECKPOINT"               # send "CHECKPOINT" handshake/acknowledgement
        continue
    else
        echo "interpreter: $line"
    fi
done
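Since the question's target interpreter is python3, a hypothetical Python stand-in for ./interpreter with the same handshake behaviour (assuming a small wrapper script rather than the bare python3 binary) could look like this:
#!/usr/bin/env python3
# hypothetical stand-in for ./interpreter: echo each input line and answer the handshake
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if line == "checkpoint":                      # received 'checkpoint' handshake?
        print("CHECKPOINT", flush=True)           # acknowledge
    else:
        print(f"interpreter: {line}", flush=True)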
Demo awk code with handshake logic:
awk '
function script_checkpoint() {
    while ((cmd |& getline output) > 0) {
        if (output == "CHECKPOINT")   # received "CHECKPOINT" handshake/acknowledgement?
            break
        print output
    }
}
BEGIN { cmd = "./interpreter" }
/^# checkpoint/ {
    print "checkpoint" |& cmd         # send "checkpoint" handshake
    script_checkpoint()
    next
}
{
    print "awk: " $0
    print $0 |& cmd
}
END {
    print "awk: last checkpoint"      # in case the last line of input is not "# checkpoint" we will ...
    print "checkpoint" |& cmd         # send one last "checkpoint" handshake
    script_checkpoint()
    print "awk: done"
}
' test.dat
Sample input file:
$ cat test.dat
line1
line2
# checkpoint
line3
line4
# checkpoint
line5
Output:
awk: line1
awk: line2
interpreter: line1
interpreter: line2
awk: line3
awk: line4
interpreter: line3
interpreter: line4
awk: line5
awk: last checkpoint
interpreter: line5
awk: done
NOTES:
awk will still hang in the event interpreter crashes and/or fails to send back the CHECKPOINT handshake
if the strings checkpoint and/or CHECKPOINT can show up in the 'normal' data streams then update the code to use strings that are not expected in the data streams
It sounds like you're trying to do something like this:
BEGIN { cmd="/my/python/script/path" }
function script_checkpoint( output) {
    close(cmd,"to")
    while ( (cmd |& getline output) > 0 ) {
        print output
    }
    close(cmd)
}
/^# checkpoint/ {
    script_checkpoint()
    next
}
{
    print
    print |& cmd
}
END { script_checkpoint() }

Unix/Perl/Python: substitute list on big data set

I've got a mapping file of about 13491 key/value pairs which I need to use to replace the key with the value in a data set of about 500000 lines divided over 25 different files.
Example mapping:
value1,value2
Example input: field1,field2,**value1**,field4
Example output: field1,field2,**value2**,field4
Please note that the value could be in different places on the line with more than 1 occurrence.
My current approach is with AWK:
awk -F, 'NR==FNR { a[$1]=$2 ; next } { for (i in a) gsub(i, a[i]); print }' mapping.txt file1.txt > file1_mapped.txt
However, this is taking a very long time.
Is there any other way to make this faster? Could use a variety of tools (Unix, AWK, Sed, Perl, Python etc.)
Note: see the second part for a version that uses the Text::CSV module to parse files.
Load mappings into a hash (dictionary), then go through your files and test each field for whether there is such a key in the hash, replace with value if there is. Write each line out to a temporary file, and when done move it into a new file (or overwrite the processed file). Any tool has to do that, more or less.
With Perl, tested with a few small made-up files
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;  # see Notes
    $map{$key} = $val;
}

my $outfile = "tmp.outfile.txt.$$";  # but better use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;  # remove leading/trailing whitespace
        my @fields = split /\s*,\s*/;
        exists($map{$_}) && ($_ = $map{$_}) for @fields;  # see Notes
        say $fh_out join ',', @fields;
    }
    close $fh_out;
    # Change to commented out line once thoroughly tested
    #move($outfile, $file) or die "can't move $outfile to $file: $!";
    move($outfile, 'new_'.$file) or die "can't move $outfile: $!";
}
Notes.
The check of data against mappings is written for efficiency: we must look at each field, there's no escaping that, but then we only check for the field as a key (no regex). For this, all leading/trailing spaces need to be stripped. Thus this code may change whitespace in output data files; if that is important for some reason, it can of course be modified to preserve the original spaces.
It came up in comments that a field in the data can in fact differ, by having extra quotes. In that case, extract the would-be key first
for (@fields) {
    $_ = $map{$1} if /"?([^"]*)/ and exists $map{$1};
}
This starts the regex engine on every check, which affects efficiency. It would help to instead clean that input CSV data of quotes, and run the code as it is above, with no regex. This can be done by reading the files with a CSV-parsing module; see the comment at the end.
For Perls earlier than 5.14 replace
my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;
with
my ($key, $val) = map { s/^\s+|\s+$//g; $_ } split /\s*,\s*/;
since the "non-destructive" /r modifier was introduced only in v5.14
If you'd rather that your whole operation doesn't die for one bad file, replace or die ... with
or do {
    # print warning for whatever failed (warn "Can't open $file: $!";)
    # take care of filehandles and such if/as needed
    next;
};
and make sure to (perhaps log and) review output.
This leaves room for some efficiency improvements, but nothing dramatic.
The data, with commas separating fields, may (or may not) be valid CSV. Since the question doesn't at all address this, and doesn't report problems, it is unlikely that any properties of the CSV data format are used in data files (delimiters embedded in data, protected quotes).
However, it's still a good idea to read these files using a module that honors full CSV, like Text::CSV. That also makes things easier, by taking care of extra spaces and quotes and handing us cleaned-up fields. So here's that -- the same as above, but using the module to parse files
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);
use Text::CSV;

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = $csv->getline($fh)) {
    $map{ $line->[0] } = $line->[1]
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = $csv->getline($fh)) {
        exists($map{$_}) && ($_ = $map{$_}) for @$line;
        say $fh_out join ',', @$line;
    }
    close $fh_out;
    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}
Now we don't have to worry about spaces or overall quotes at all, which simplifies things a bit.
While it is difficult to reliably compare these two approaches without realistic data files, I benchmarked them for (made-up) large data files that involve "similar" processing. The code using Text::CSV for parsing runs either around the same, or (up to) 50% faster.
The constructor option allow_whitespace makes it remove extra spaces, perhaps contrary to what the name may imply, as I do by hand above. (Also see allow_loose_quotes and related options.) There is far more; see the docs. Text::CSV defaults to using Text::CSV_XS, if it is installed.
You're doing 13,491 gsub()s on every one of your 500,000 input lines - that's almost 7 billion full-line regexp search/replaces in total. So yes, that would take some time, and it's almost certainly corrupting your data in ways you just haven't noticed, as the result of one gsub() gets changed by the next gsub() and/or you get partial replacements!
I saw in a comment that some of your fields can be surrounded by double quotes. If those fields can't contain commas or newlines, and assuming you want full string matches, then this is how to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    map[$1] = $2
    map["\""$1"\""] = "\""$2"\""
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I tested the above on a mapping file with 13,500 entries and an input file of 500,000 lines with multiple matches on most lines in cygwin on my underpowered laptop and it completed in about 1 second:
$ wc -l mapping.txt
13500 mapping.txt
$ wc -l file500k
500000 file500k
$ time awk -f tst.awk mapping.txt file500k > /dev/null
real 0m1.138s
user 0m1.109s
sys 0m0.015s
If that doesn't do exactly what you want efficiently, then please edit your question to provide an MCVE and clearer requirements; see my comment under your question.
There is some commentary below suggesting that the OP needs to handle real CSV data, whereas the question says:
Please note that the value could be in different places on the line with more than 1 occurrence.
I have taken this to mean that these are lines, not CSV data, and that a regex-based solution is required. The OP also confirmed that interpretation in a comment above.
As noted in other answers, however, it is faster to break the data into fields and simply lookup the replacement in the map.
#!/usr/bin/env perl
use strict;
use warnings;

# Load mappings.txt into a Perl
# hash %m.
#
open my $mh, '<', './mappings.txt'
    or die "open: $!";
my %m = ();
while (<$mh>) {
    chomp;
    my @f = split ',';
    $m{$f[0]} = $f[1];
}

# Load files.txt into a Perl
# array @files.
#
open my $fh, '<', './files.txt';
chomp(my @files = <$fh>);

# Update each file line by line,
# using a temporary file similar
# to sed -i.
#
foreach my $file (@files) {
    open my $fh, '<', $file
        or die "open: $!";
    open my $th, '>', "$file.bak"
        or die "open: $!";
    while (<$fh>) {
        foreach my $k (keys %m) {
            my $v = $m{$k};
            s/\Q$k/$v/g;
        }
        print $th $_;
    }
    rename "$file.bak", $file
        or die "rename: $!";
}
I assume of course that you have your mappings in mappings.txt and file list in files.txt.
According to your comments, you have proper CSV. The following properly handles quoting and escapes when reading from the map file, when reading from a data file, and when writing to a data file.
It seems you want to match entire fields. The following does this. It even supports fields that contain commas (,) and/or quotes ("). It does the comparisons using a hash lookup, which is much faster than a regex match.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });

sub process {
    my ($map, $in_fh, $out_fh) = @_;
    while ( my $row = $csv->getline($in_fh) ) {
        $csv->say($out_fh, [ map { $map->{$_} // $_ } @$row ]);
    }
}

die "usage: $0 {map} [{file} [...]]\n"
    if @ARGV < 1;

my $map_qfn = shift;

my %map;
{
    open(my $fh, '<', $map_qfn)
        or die("Can't open \"$map_qfn\": $!\n");
    while ( my $row = $csv->getline($fh) ) {
        $map{$row->[0]} = $row->[1];
    }
}

if (@ARGV) {
    for my $qfn (@ARGV) {
        open(my $in_fh, '<', $qfn)
            or warn("Can't open \"$qfn\": $!\n"), next;
        rename($qfn, $qfn."~")
            or warn("Can't rename \"$qfn\": $!\n"), next;
        open(my $out_fh, '>', $qfn)
            or warn("Can't create \"$qfn\": $!\n"), next;
        eval { process(\%map, $in_fh, $out_fh); 1 }
            or warn("Error processing \"$qfn\": $@"), next;
        close($out_fh)
            or warn("Error writing to \"$qfn\": $!\n"), next;
    }
} else {
    eval { process(\%map, \*STDIN, \*STDOUT); 1 }
        or warn("Error processing: $@");
    close(\*STDOUT)
        or warn("Error writing to STDOUT: $!\n");
}
If you provide no files names beyond the map file, it reads from STDIN and outputs to STDOUT.
If you provide one or more file names beyond the map file, it replaces the files in-place (though it leaves a backup behind).
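A hypothetical invocation, assuming the script above is saved as map_fields.pl:
$ perl map_fields.pl mapping.csv file1.csv file2.csv       # in-place; originals kept as file1.csv~ etc.
$ perl map_fields.pl mapping.csv < input.csv > output.csv  # no data files given: filter STDIN to STDOUT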

looks like widechar input from getline in awk

I'm having trouble with AWK that I've never seen before.
I'm reading in a file, no special chars, and printing it back out.
When I read a text file, it prints out with a NUL between every char.
Reading an HTML file works exactly as expected and prints out what was read in.
Code snippet:
while ((getline line < In) > 0) {
print ":0:", line, ":0:" > "out";
reads the line "signature1"
and prints
":0: xFFxFEsNULiNULgNULnNULaNULtNULuNULrNULeNUL1NUL/r
NUL :0:/r/n"
as viewed in Notepad++.
"In" is the input filename.
I assume it is some Language setting on my machine, but I can't find anything.
A second print line, redirected to a file, prints every other line in Chinese.
TL;DR: Complete text of the app:
BEGIN { ProcessFile(); }
function ProcessFile() {
    In = "default.txt";
    Works = "NoProblem.html";
    Out = "quote.txt";
    RS = "/n";
    while ((getline textLine < In) > 0) {
        print "*0*", textLine, "*0:*" > "out.txt";
        print textLine > Out;   # prints every other line in Chinese ???
    }
    close(In);
    close(Out);
}
Output of the second print line:
signature1
਍猀椀最渀愀琀甀爀攀㈀ഀഀ

Endless recursion in gawk-script

Please pardon me in advance for posting such a big part of my problem, but I just can't put my finger on the part that fails...
I've got input files like this (abas-FO, if you care to know):
.fo U|xiininputfile = whatever
.type text U|xigibsgarnich
.assign U|xigibsgarnich
..
..Comment
.copy U|xigibswohl = Spaß
.ein "ow1/UWEDEFTEST.FOP"
.in "ow1/UWEINPUT2"
.continue BOTTOM
.read "SOemthing" U|xttmp
!BOTTOM
..
..
Now I want to recursively follow each .in[put]/.ein[gabe] statement, parse the mentioned file and, if I don't know it yet, add it to an array. My code looks like this:
#!/bin/awk -f
function getFopMap(inputregex, infile, mandantdir, infiles){
    while(getline f < infile){
        #printf "*"
        #don't match if there is a '
        if(f ~ inputregex "[^']"){
            #remove .input-part
            sub(inputregex, "", f)
            #trim right
            sub(/[[:blank:]]+$/, "", f)
            #remove leading and trailing "
            gsub(/(^\"|\"$)/,"" ,f)
            if(!(f in infiles)){
                infiles[f] = "found"
            }
        }
    }
    close(infile)
    for (i in infiles){
        if(infiles[i] == "found"){
            infiles[i] = "parsed"
            cmd = "test -f \"" i "\""
            if(system(cmd) == 0){
                close(cmd)
                getFopMap(inputregex, f, mandantdir, infiles)
            }
        }
    }
}
BEGIN{
    #Matches something like [.input myfile] or [.ein "ow1/myfile"]
    inputregex = "^\\.(in|ein)[^[:blank:]]*[[:blank:]]+"
    #Get absolute path of infile
    cmd = "python -c 'import os;print os.path.abspath(\"" ARGV[1] "\")'"
    cmd | getline rootfile
    close(cmd)
    infiles[rootfile] = "parsed"
    getFopMap(inputregex, rootfile, mandantdir, infiles)
    #output result
    for(infile in infiles) print infile
    exit
}
I call the script (in the same directory the paths are relative to) like this:
./script ow1/UWEDEFTEST.FOP
I get no output. It just hangs up. If I remove the comment before the printf "*" command, I'm seeing stars, without end.
I appreciate any help and hints on how to do it better.
My awk:
gawk Version 3.1.7
idk if it's your only problem, but you're calling getline incorrectly and consequently will go into an infinite loop in some scenarios. Make sure you fully understand all of the caveats at http://awk.info/?tip/getline and you might want to use the recursion example there as the starting point for your code.
The most important item initially for your code is that when getline fails it can return a negative value, so while(getline f < infile) will create an infinite loop, since the failing getline will always return non-zero and so will continue to be called and continue to fail. You need to use while ((getline f < infile) > 0) instead.
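As a minimal stand-alone sketch of that pattern (the file name here is just an example, not the OP's full script):
#!/bin/awk -f
BEGIN{
    infile = "ow1/UWEINPUT2"                # example input file
    while((getline line < infile) > 0)      # > 0: stop on EOF (0) as well as on read errors (-1)
        print line
    close(infile)
}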

Run TCL script in regular intervals with continuous results

I have encountered a problem in one of my TCL scripts. I need to run it in an infinite loop with a terminating condition, and in every loop iteration I need to write some output. This is the basic code that I'm using:
proc wr {i} {
    puts -nonewline "$i"
}
proc do {roof} {
    set end 0
    while {$end < $roof} {
        after 1000
        wr $end
        incr end
    }
}
do 10
The expected behaviour is that every second there will be new output until $end == $roof. But instead, after running this script, the console window is busy for 10 seconds and only after that time does the entire output print out at once.
Thank you for your advice :)
The problem is that you don't flush stdout.
If you modify your script so it flushes stdout:
proc wr {i} {
    puts -nonewline "$i"
    flush stdout
}
proc do {roof} {
    set end 0
    while {$end < $roof} {
        after 1000
        wr $end
        incr end
    }
}
do 10
It will work. You can also change the buffering of the stdout channel to none; the default is line:
fconfigure stdout -buffering none
If you write more than one line, the default buffering will flush stdout when it encounters a newline, but you never write a newline.
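For example, this minimal variant of wr writes a newline, so the default line buffering flushes each write on its own:
proc wr {i} {
    puts "$i"   ;# the trailing newline triggers a flush under line buffering
}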