How to dump GNU screen scrollback buffer while preserving ANSI control sequences? - gnu-screen

In a screen session, the scrollback buffer can be saved to a file by entering ^A:hardcopy -h /path/to/filename. However, this strips all ANSI control sequences from the output.
I want something like less -R but for saving the scrollback buffer.
Example script to produce coloured text:
#!/bin/bash
# both times, the word 'red' is printed in bright red text.
printf 'example \x1b[1;31mred\x1b[m output\n' |tee example.log
cat example.log
You can also view the file with less -R example.log.

When the terminal consumes ANSI sequences, it does not store them in memory verbatim; instead it converts them into display attributes of the individual characters on screen. hardcopy, apparently, was not designed to output those attributes.
You might get what you need if you enable logging, however. See this answer, for example.
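For reference, a minimal sketch of the logging approach, assuming default key bindings (the log file path here is illustrative):
^A:logfile /tmp/screenlog.%n    (set the log file name; %n is the window number)
^A H                            (toggle logging; equivalent to ^A:log on / ^A:log off)
less -R /tmp/screenlog.0        (afterwards, view the captured output with colours intact)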

Related

Can't get Ghostscript "viewraw.ps" or "viewrgb.ps" programs to work (scrambled output)

I've had good results in the past using the "viewjpeg.ps" PostScript program included with Ghostscript to place JPEG images into generated PDFs. Now I'm trying to do the same for bitmaps, and I just haven't been able to make it work. My hunch is that the program I need is either "viewraw.ps" or "viewrgb.ps," and I can see that the parameters expected will be a bit different from those passed to "viewjpeg.ps."
So far this is what I have:
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=pdfwrite -DNOSAFER -r200x200 -sOutputFile=o.pdf z:\home\dell\reporting\viewrgb.ps -c "(out002.bmp) 6800 viewrgb"
This gets pretty close to what I want, but my bitmap (though clearly identifiable) is scrambled in the output PDF: compressed vertically, upside-down, and somewhat wrong in color.
I have attempted to address these issues by tweaking the "width" parameter (6800 above). My bitmap is 1,700 pixels wide, and uses 4 bytes per pixel, so 1,700 * 4 = 6,800 seemed like a logical choice. I've also tried 1,700 (width in pixels) and 54,400 (bits per image row). 5,100 (3 * 1,700) seemed to work best, but it's still wrong.
Note that "viewjpeg.ps" does not expect a "width" parameter, so I haven't had to deal with this before. (It was an examination of "viewrgb.ps" that made me realize this parameter was required.)
Can anyone spot my mistake, or maybe point me to an example that uses "viewraw.ps" or "viewrgb.ps"?
You haven't said (or I missed it) what format your 'bitmaps' are, and you haven't supplied an example to look at so I can't tell (or experiment).
You say your output is 4 bytes per pixel so that's either CMYK or something like RGBa. Either way viewrgb isn't going to work, because it only expects 3 channels. It's intended to view the output of the Ghostscript bitrgb device.
viewraw.ps just reads raw data: straight image samples with no header, IIRC, and it's CMYK, so unless your 4 bytes are CMYK it's not going to be correct either.
Since both of these are raw formats, they don't expect a header; if your image format includes a header, it will be treated as image data, which will certainly cause the image to be drawn incorrectly.
Both of these PostScript programs will display a usage message on the back channel if you invoke them incorrectly.
You don't need -dNOSAFER with such an old version of Ghostscript (9.10).
-r has little effect on pdfwrite and will have no effect at all when you feed it an image as input; you should probably omit that.
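If the bitmap really is plain 24-bit RGB, a rough sketch of a workflow is below. It assumes ImageMagick is available to strip the BMP header (the magick/convert name depends on the installed version), and it passes the pixel width (1700); check the usage message viewrgb.ps prints on the back channel to confirm whether it expects pixels or bytes per row:
magick out002.bmp -depth 8 rgb:out002.raw
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=pdfwrite -sOutputFile=o.pdf z:\home\dell\reporting\viewrgb.ps -c "(out002.raw) 1700 viewrgb"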

Avoiding fragmenting of text extracted from PDF after processing with Ghostscript

After processing with Ghostscript, I sometimes see whitespace breaking up the words, as seen with pdftotext or in a PDF viewer when searching or selecting. Possibly unrelated, but the anomalies seem to correspond to kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf gives:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source: PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is that, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file, what you actually get is a series of character codes and positions.
It's up to the text extraction code to try to make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacings in PDF. You could do something like:
(Te) Tj
10 0 Td
(st) Tj
Or:
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test".
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
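For example (a sketch; txtwrite is a standard Ghostscript output device, and the output file name is illustrative):
gs -sDEVICE=txtwrite -o extracted.txt output.pdf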
As with your other question, it would be best to supply examples when asking these kinds of questions, because without that it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.

How to get bounding boxes of elements in EPS files

I need to check if an EPS/PDF file contains any vector elements.
First I convert the PDF to EPS and remove all text elements and images from the file like this:
pdftocairo -f $page_number -l $page_number -eps $input - | sed '/BT/,/ET/ d' | sed '/^8 dict dup begin$/,/^Q$/ c Q' > $output
But how can I then check if any elements are written to the canvas?
What do you mean, exactly, by 'vector elements'? Anything except an actual bitmap image? Why do you care? Perhaps if you explained what you want to achieve it would be easier to help you.
Note that the approach you are using is by no means guaranteed to work; there can easily be 'elements' in the file which won't be removed by your rather basic approach to finding images.
You could use Ghostscript; run the file to a bitmap and specify -dFILTERTEXT and -dFILTERIMAGES. Then examine the pixels of the bitmap to see if any are non-white. If they are, then there was vector content in the file. You could probably use something like ImageMagick to count the colours and see if there's more than 1.
Or run the file to bitmap twice, once normally, and once with -dFILTERVECTOR. Compare the two bitmaps (MD5 on them would be sufficient). If there are no differences then there was no vector content.
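A rough sketch of that second approach, assuming a single-page input (for multi-page files add %d to the output names); the file names and resolution are illustrative:
gs -q -o full.png -sDEVICE=png16m -r72 input.pdf
gs -q -o novector.png -sDEVICE=png16m -r72 -dFILTERVECTOR input.pdf
md5sum full.png novector.png    # identical checksums => no vector content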
Any PDF that has vector elements will use at least one of the path painting operators. According to chapter 8 of the PDF standard, those are:
S, s, f, F, f*, B, B*, b, b*, n
Of course, since PDF files can be complex, you'll also need it in a standard form. You can do that using the qpdf program's QDF format. (apt install qpdf if you don't have it).
qpdf -qdf schedule.pdf - | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup
That'll print "Yup" if the file schedule.pdf has vector graphics in it.
Note: I think this will do the job for you, but it is not foolproof. It's possible to get false negatives if your PDF is loading vectors from an external file, embedding raw PostScript, or doing some other trickiness. And, of course, it can have false positives (for example, a file that draws a completely transparent, 0pt dot in white ink on a white background).
Other answers have addressed identifying the drawing operators in a plain text stream. For the other question,
But how can I then check if any elements are written to the canvas?
For this, the elements need to be part of a content stream that is referred to
in the /Contents member of the Page object.
If you read in all of the pdf objects, there will be a tree connecting all the content streams to the Root object declared in the trailer.
Trailer : /Root is a reference to the Document Catalog object
Document Catalog : /Pages is an array of Page objects or Pages nodes
Page : /Contents is an array of references to Content Stream objects that draw the elements of the page
It is possible for there to be stray content stream objects that are not referenced in the Document tree. By traversing the Pages tree you could collect any and all actual content and then feed that result to one of the solutions from the other answers.
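One way to do that inspection without writing a PDF parser is qpdf itself. A sketch (the object number 5 below is hypothetical; take the real content stream numbers from the --show-pages listing):
qpdf --show-pages schedule.pdf
qpdf --show-object=5 --filtered-stream-data schedule.pdf | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup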

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script and get the output file, the number I am interested in ends up in a different position (due to the log nature of the output file).
I have tried several awk, sed, grep commands but I can't get any to work as many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
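Alternatively, if the desired number is the final energy and the numbers sit on the line after the 'Energy initial, next-to-last, final =' label (as in the excerpt above), a pattern match avoids relying on absolute positions. A sketch, with output.log standing in for the real file name:
awk '/Energy initial, next-to-last, final/ {getline; print $3}' output.log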

Is there a tool to clean the output of the script(1) tool?

script(1) is a tool for keeping a record of an interactive terminal session; by default it writes to the file transcript. My problem is that I use ksh93, which has readline features, and so the transcript is mucked up with all sorts of terminal escape sequences and it can be very difficult to reconstruct the command that was actually executed. Not to mention the stray ^M's and the like.
I'm looking for a tool that will read a transcript file written by script, remove all the junk, and reconstruct what the shell thought it was executing, so I have something that shows $PS1 and the commands actually executed. Failing that, I'm looking for suggestions on how to write such a tool, ideally using knowledge from the terminfo database, or failing that, just using ANSI escape sequences.
A cheat that looks in shell history, as long as it really really works, would also be acceptable.
Doesn't cat/more work by default for browsing the transcript? Do you intend to create a script out of the commands actually executed (which in my experience can be dangerous)?
Anyway, 3 years without an answer, so I will give it a shot with an incomplete solution. If you are only interested in the commands actually typed, remove the non-printable characters, then replace PS1' with something readable and unique, and grep for that unique string. Like this:
$ sed -i 's/[^[:print:]]//g' transcript
$ sed 's/]0;cartman@southpark: ~cartman@southpark:~/CARTMAN/g' transcript | grep CARTMAN
Explanation: after the first sed, PS1' can be taken from one of the first few lines of the transcript file as-is -- PS1' is different from PS1 -- and can be replaced with a unique readable string ("CARTMAN" here). Note that the dollar sign at the end of the prompt was left out intentionally.
In the few examples that I tried, this didn't solve everything but took care of most issues.
This is essentially the same question asked recently in Can I programmatically “burn in” ANSI control codes to a file using unix utils? -- removing all nonprinting characters will not fix:
embedded escape sequences
backspace/overstriking for underlining
use of carriage-returns for overstriking
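As a partial clean-up covering those points, something along these lines can help; it is a sketch, not a full terminal emulation (GNU sed is assumed for the \x1b escape, and cursor-movement or title-setting sequences may still leave residue):
sed -e 's/\x1b\[[0-9;?]*[A-Za-z]//g' -e 's/\r//g' typescript | col -b > cleaned.txt
Here the sed expression strips CSI escape sequences and carriage returns, and col -b resolves backspace overstrikes, keeping only the last character written to each column.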