Add first column if all other columns are the same (AWK) - awk

I have a file with the following data:
25 POSIX shell script, ASCII text executable
25 POSIX shell script, ASCII text executable
3 PostScript document text conforming DSC level 3.0, type EPS, Level 2
2 PostScript document text conforming DSC level 3.0, type EPS, Level 2
23 PostScript document text conforming DSC level 3.0, type EPS, Level 2
4 SVG Scalable Vector Graphics image
4 SVG Scalable Vector Graphics image
and would like to sum the first field when all the other fields are the same, so the output should be:
50 POSIX shell script, ASCII text executable
28 PostScript document text conforming DSC level 3.0, type EPS, Level 2
8 SVG Scalable Vector Graphics image
I tried this awk command:
awk '{ a[$2]+=$1 }END{ for(i in a) print a[i],i }' inputfile
which prints:
25 POSIX
28 PostScript
8 SVG
but I can't find a way to print the rest of the line.

This is one way:
$ awk '{v=$1;$1="";s[$0]+=v}END{for(i in s)print s[i] i}' file
8 SVG Scalable Vector Graphics image
50 POSIX shell script, ASCII text executable
28 PostScript document text conforming DSC level 3.0, type EPS, Level 2
Explained:
$ awk '{
v=$1 # store value in $1
$1="" # empty $1, record gets rebuilt
s[$0]+=v # sum indexing on $1less record
}
END { # in the end
for(i in s) # loop all
print s[i] i # ... and output
}' file
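For comparison, here is a minimal Python sketch of the same idea (strip the first field, use the rest of the line as the key, sum the counts). The function name sum_by_rest and the inline sample list are only illustrative, and unlike awk's for(i in s) loop, a Python dict keeps first-seen order:

```python
from collections import defaultdict


def sum_by_rest(lines):
    """Sum the leading integer of each line, grouped by the rest of the line."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        count, _, rest = line.partition(" ")
        totals[rest.strip()] += int(count)
    return dict(totals)


# Sample data copied from the question.
sample = [
    "25 POSIX shell script, ASCII text executable",
    "25 POSIX shell script, ASCII text executable",
    "3 PostScript document text conforming DSC level 3.0, type EPS, Level 2",
    "2 PostScript document text conforming DSC level 3.0, type EPS, Level 2",
    "23 PostScript document text conforming DSC level 3.0, type EPS, Level 2",
    "4 SVG Scalable Vector Graphics image",
    "4 SVG Scalable Vector Graphics image",
]

for rest, total in sum_by_rest(sample).items():
    print(total, rest)
```

The sketch assumes every non-blank line starts with an integer followed by a space; a line that does not will raise ValueError rather than being silently skipped.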

$ awk '{n=$1; sub(/^[0-9]+ +/,""); a[$0]+=n} END{ for(i in a) print a[i],i }' file
28 PostScript document text conforming DSC level 3.0, type EPS, Level 2
50 POSIX shell script, ASCII text executable
8 SVG Scalable Vector Graphics image

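Another awk solution, using 'sort' so that identical lines are adjacent (awk reads the sorted data from the pipe, so it must not be given the file name again):
$ sort -k2 sergio.txt | awk ' { t=$1; $1=""; c=$0;if(c==p) { s+=b} else { if(NR>1) print s+b,p; s=0} p=c;b=t} END { print s+b,p } '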
50 POSIX shell script, ASCII text executable
28 PostScript document text conforming DSC level 3.0, type EPS, Level 2
8 SVG Scalable Vector Graphics image
$
Input file:
$ cat sergio.txt
25 POSIX shell script, ASCII text executable
25 POSIX shell script, ASCII text executable
3 PostScript document text conforming DSC level 3.0, type EPS, Level 2
2 PostScript document text conforming DSC level 3.0, type EPS, Level 2
23 PostScript document text conforming DSC level 3.0, type EPS, Level 2
4 SVG Scalable Vector Graphics image
4 SVG Scalable Vector Graphics image
$
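The sort-then-accumulate approach can also be sketched in Python with itertools.groupby, which (like the awk above) only merges records that are adjacent, hence the sort. The helper names split_record and sum_sorted are made up for this illustration:

```python
from itertools import groupby


def split_record(line):
    """Split '25 some text' into (25, 'some text')."""
    count, _, rest = line.strip().partition(" ")
    return int(count), rest.strip()


def sum_sorted(lines):
    """Sort records by their text, then sum the counts of adjacent equal texts."""
    records = sorted(
        (split_record(line) for line in lines if line.strip()),
        key=lambda rec: rec[1],
    )
    return [
        (sum(count for count, _ in group), rest)
        for rest, group in groupby(records, key=lambda rec: rec[1])
    ]


data = [
    "4 SVG Scalable Vector Graphics image",
    "25 POSIX shell script, ASCII text executable",
    "4 SVG Scalable Vector Graphics image",
    "25 POSIX shell script, ASCII text executable",
]

for total, rest in sum_sorted(data):
    print(total, rest)
```

Python sorts by code point here, so the order can differ from what sort -k2 gives under a locale-aware collation.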

Related

Ghostscript converting PostScript to PDF seems to ignore the page size / BoundingBox

I have created a PostScript file from a TIFF image using ImageMagick.
The command-line I am using is:
convert input.tif[0] -density 600 -alpha Off -size 5809x9408 -depth 16 intermediate.ps
This takes my input tiff image (just the main image, and not the thumbnail via using [0]) and creates a .ps file from the bitmap.
When I look at the header of my PostScript file, I can see that it has the correct page size:
%!PS-Adobe-3.0
%%Creator: (ImageMagick)
%%Title: (intermediate.ps)
%%CreationDate: (2017-05-22T08:43:44+10:00)
%%BoundingBox: -0 -0 697 1129
%%HiResBoundingBox: 0 0 697.08 1129
%%DocumentData: Clean7Bit
%%LanguageLevel: 1
%%Orientation: Portrait
%%PageOrder: Ascend
%%Pages: 1
%%EndComments
Yet, when I use Ghostscript to convert this to a PDF, unless I go to a lot of trouble to specify otherwise, gs crops it and puts it on a US Letter sized page.
gs -dPDFA=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sDefaultRGBProfile=AdobeRGB1998.icc -dOverrideICC -sOutputFile=output.pdf -r600 -P PDFA_def.ps intermediate.ps
When I open the resulting PDF, the crop box is 612 x 792 pt, which is US Letter. It should be 697 x 1129 pt, the size of the BoundingBox in the PostScript file.
I have created a custom .joboptions file using Acrobat Distiller that sets image compression and the like, and in this file if I specify the page size at the end, then the resulting PDF comes out the correct size:
<<
/HWResolution [600 600]
/PageSize [697.080 1128.960]
>> setpagedevice
Now this isn't a huge issue for a one-off conversion, but I have to convert a large number of images and I don't want to set the page size manually for every single file.
The lines you quote above are comments and, from the comments present, suggest that this is an EPS file, not a PostScript program.
The main difference is that EPS is 'encapsulated', which means it's intended to be placed verbatim inside a PostScript program. The enclosing program contains the intelligence regarding the media size, and arranges to set the context such that the EPS is scaled, rotated and translated so that it fits appropriately on the media.
In order to do this successfully, the EPS file must follow certain rules; in particular it must not set any media size itself (because that would mess with the enclosing program).
So it seems likely to me that what you have is an EPS file which does not request any media size at all. So it's hardly surprising that you have to tell Ghostscript what you want to do with it.
Now in order for the enclosing program to place the EPS it needs to know its characteristics, the size and shape of the content. That's what the comments are for. Ordinarily an EPS file is read by an application (eg MS Word, LibreOffice etc) which parses out those comments and uses the information when generating the final PostScript program. The reason an EPS uses comments to store this information is precisely so that it has no effect on the actual content of the EPS and so the entire EPS can be included without further processing by the application.
The short answer is that if you read the Ghostscript documentation here you will find descriptions of the EPSCrop and EPSFitPage command line switches which will do all the work for you.
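To illustrate what an enclosing application does with those comments, here is a rough Python sketch that reads the bounding box from the DSC header and derives a media size from it. The function names are hypothetical, and this is not what Ghostscript does internally; it only shows that the information sits in the comments and has to be interpreted by whoever embeds the file:

```python
import re

NUM = r"(-?\d+(?:\.\d+)?)"


def page_size_from_dsc(header_text):
    """Return (width, height) in points, preferring %%HiResBoundingBox."""
    for key in ("HiResBoundingBox", "BoundingBox"):
        pattern = rf"^%%{key}:[ \t]*{NUM}[ \t]+{NUM}[ \t]+{NUM}[ \t]+{NUM}"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match:
            llx, lly, urx, ury = (float(v) for v in match.groups())
            return urx - llx, ury - lly
    return None  # no usable bounding box comment, e.g. "(atend)"


def setpagedevice_snippet(width, height):
    """Format a PostScript media request like the one quoted in the question."""
    return f"<< /PageSize [{width:g} {height:g}] >> setpagedevice"


header = (
    "%!PS-Adobe-3.0\n"
    "%%BoundingBox: -0 -0 697 1129\n"
    "%%HiResBoundingBox: 0 0 697.08 1129\n"
    "%%EndComments\n"
)
size = page_size_from_dsc(header)
if size is not None:
    print(setpagedevice_snippet(*size))
```

For batch conversion the command line switches mentioned above are the simpler route; the sketch is only meant to make the role of the comments concrete.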

Raw pdf color conversion (with known conversion formula) from RGB to CMYK

This question is related to
Script (or some other means) to convert RGB to CMYK in PDF?
but it is far more specific. Bear in mind that I am not an expert in print production ;)
Situation: For printing I am only allowed to use two colors, Cyan and Black. The printery requests the final PDF to be in DeviceCMYK with only the Channels C and K used.
pdflatex automatically does that (with the xcolor package) for all fonts and drawn objects, however I have more than 100 sketches/figures in PDF format which are embedded in the manuscript. Due to an admittedly badly designed workflow (late realization that Inkscape cannot export CMYK PDFs), all these figures were created in Inkscape, and thus are RGB PDFs.
However, the only used colors within Inkscape were RGB complements of CMY(K), e.g. 100% Cyan is (0,255,255) RGB and 50% K is (127,127,127) etc.
Problem: I need to convert all these PDF figures from RGB to DeviceCMYK (or alternatively the whole PDF of the final manuscript) with a specific conversion formula.
I did a lot of google research and tried the often suggested ways of using e.g. Ghostscript or various print production tools in Adobe Acrobat, however all of the conversion techniques I found so far wanted to use ICC color profiles or used some other conversion strategy which filled the channels MY and spared some C and K, for example.
I know the exact conversion formula for the raw color numbers from our Inkscape-RGBs to the channels C and K, however I do not know or find any program or tool that allows me to manually specify conversion formulas.
Question: Is there any workflow to convert my PDFs from RGB to C(MY)K manually with my own specific conversion formula for the raw numbers with the converted PDF being in DeviceCMYK using a tool, script or Adobe product?
Due to the large number of figures I would prefer a batched solution which doesn't require too much coding from my side, but if it should be the only solution, I'd also be open minded for a workflow like "load/convert/save" within a program for every single figure or writing a small program with an easy-to-handle C++ PDF API for example.
Limitations and additional info: A different file format (like TikZ figures) is not possible any more since it does not work perfectly and the necessary adaptions to the figures would create too much overhead. A maybe helpful information: Since the figures are created in Inkscape, there are no raster images within the PDFs. I also do not want all figures to be converted to raster images during the color conversion.
Edit:
I have created an example of a RGB PDF-figure created with inkscape.
I also did a manual object-by-object color conversion to a CMYK-PDF with Illustrator, to show how the result should look like. Illustrator stores the axial shading in a DeviceN colorspace with the colors cyan and black, which is close enough^^
Here is an idea, I think it will work if your PDF files are using exclusively the colorspaces DeviceGray, DeviceRGB and DeviceCMYK:
1- Convert all your PDF files to PostScript (with pdf2ps from Ghostscript, for example)
2- Write a PostScript program that redefines the operators setrgbcolor, setgray and setcolor with your own implementation in the PostScript language. Your implementation will internally use setcmykcolor and compute the values using your custom formula.
Here is an example for redefining the setgray operator:
% The operator setcmykcolor expects 4 values on the stack: C M Y K
% When setgray is called, we can expect 1 value on the stack, where 0 is
% black and 1 is white, so we invert it (1 - gray) to get the black component,
% push 3 zeros for C, M and Y, and roll the top 4 elements 3 times so the
% stack reads 0 0 0 K
/setgray { 1 exch sub 0 0 0 4 3 roll setcmykcolor } bind def
3- Paste your PostScript program at the beginning of each resulting ps file from step 1.
4- Convert all your files back to PDF (with ps2pdf, for example)
See it in action by saving this piece of code as sample.ps:
/setgray { 1 exch sub 0 0 0 4 3 roll setcmykcolor } bind def
0.5 setgray
0 0 moveto
600 600 lineto
stroke
showpage
Convert it to PDF with ghostscript using this command line (I used version 9.14):
gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=sample.pdf sample.ps
The resulting PDF will have the following page content:
q 0.1 0 0 0.1 0 0 cm
/R7 gs
10 w
% The K operator is the PDF equivalent of setcmykcolor in postscript
0 0 0 0.5 K
0 0 m
3000 3000 l
S
Q
As you can see, the ps -> pdf conversion preserves the CMYK colors specified in PostScript with the setcmykcolor operator.
Maybe you can post your formula as a new question and someone could help you out translating it to postscript.
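Since the actual formula is not given in the question, here is a Python sketch of one plausible candidate, the naive RGB to CMYK conversion, just to have something to check values against before translating to PostScript. This is an assumption, not the asker's formula; it does reproduce the two examples from the question (100% cyan and 50% black):

```python
def rgb_to_cmyk(r, g, b):
    """Naive conversion of 0-255 RGB values to CMYK fractions in 0..1.

    ASSUMPTION: this is the textbook formula, used only as a stand-in for
    the custom formula mentioned in the question.
    """
    rf, gf, bf = (v / 255 for v in (r, g, b))
    k = 1 - max(rf, gf, bf)
    if k >= 1:  # pure black: avoid dividing by zero
        return (0.0, 0.0, 0.0, 1.0)
    c = (1 - rf - k) / (1 - k)
    m = (1 - gf - k) / (1 - k)
    y = (1 - bf - k) / (1 - k)
    return (c, m, y, k)


# The two colors described in the question.
print(rgb_to_cmyk(0, 255, 255))    # cyan
print(rgb_to_cmyk(127, 127, 127))  # mid gray
```

For a palette made only of cyan tints and grays the M and Y components come out as zero, which is the property the printer asks for; any other input color would need a formula of your own.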
Since you have access to Illustrator, you might want to try importing the PDF into Illustrator and using Illustrator's scripting capabilities to iterate over the elements and replace fill/stroke RGB colors with their CMYK replacement colors.
The difficulty will be with the shading patterns (Gradients) used in the PDF; if they are imported as GradientColor, then in theory it's a matter of digging into the GradientColor to find the base RGB colors and substitute their CMYK replacement.
A very similar problem was solved using the ActivePDF.dll with C++ (or C#??).

Tesseract identifies a 0 as a Q

I am using Tesseract OCR for getting an exclusively numeric string in a PDF file.
The PDF contains: 6660003377.pdf
but Tesseract recognizes: 66600Q3377.pdf
The input is a TIFF file, the quality is good enough (see the screenshot).
Is there a way to improve the Tesseract accuracy? I could always replace Q with 0, but I'm afraid of further unexpected mistakes.
This is in Tesseract FAQ:
Run a tesseract command like this to only permit digits in input image:
tesseract imagename outputbase digits

Rotating a PDF file by n degrees, where n is not a multiple of 90

The problem I am facing is as following. I have a source document, src.pdf.
I need to insert the contents of src.pdf into target.pdf, rotated by n degrees, where n is NOT a multiple of 90.
Any help would be appreciated, thanks.
EDIT 1:
PDF contains no annotations.
I can use any solution which relies on utilities, or write my own code, preferably in C#/Python/Ruby/Perl, but not limited to a language.
The platform is Windows Server 2008 R2, I prefer to stick to the existing server but Linux is also an option. Latest (stable) GhostScript and pdftk are already installed.
If a new language is not a problem, LaTeX could be an option. You can include a PDF as a figure in a .tex file and use the dedicated options for rescaling and rotating it. Then compile it to obtain a new PDF.
The following very simple code works for me:
\documentclass[a4paper]{article}
\usepackage{graphicx}
\begin{document}
\includegraphics[scale=0.5,angle=10]{test.pdf}
\end{document}
From this pdf:
I get this new one:
It will, however, need some manual adjustments to get exactly what you want...
You can do it with TexLive like this:
\documentclass{article}
\usepackage{pdfpages}
\begin{document}
\includepdf[pages={-},angle=30]{main}
\end{document}
It will rotate the entire pdf - every page!
I'm not the one who figured this out, however - check this thread for the original solution (and give that fellow a point!)
This is an example showing how to do that using Java and the iText library. With minimal changes that code should be usable with C# and iTextSharp, too, giving the sample @neo could not provide on short notice in his answer.
The sample takes the first page of source.pdf and inserts it into target.pdf in all multiples of 30°, i.e. of 2*pi/12, but as that angle is explicitly given in the code, you can rotate by any angle.
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("target.pdf"));
document.open();
PdfReader origPdfReader = new PdfReader("source.pdf");
PdfImportedPage importedPage = writer.getImportedPage(origPdfReader, 1);
PdfContentByte canvas = writer.getDirectContent();
for (int i = 0; i < 12; i++)
{
    AffineTransform transform = AffineTransform.getRotateInstance(Math.PI * i / 6.0,
        importedPage.getWidth() / 2, importedPage.getHeight() / 2);
    canvas.addTemplate(importedPage, transform);
    document.newPage();
}
document.close();
Depending on your use case you may not only want to rotate (as you asked for) but also to scale it down to fit the page. In that case simply add transform.scale(scaleX, scaleY) before using the transform.
Since you do not have to deal with annotations, you could try using any PDF library of your choice that allows you to decompose PDF dictionaries and decode the page content. Once you get the page content, you can insert a transformation matrix at the beginning of the page: [ cos θ sin θ −sin θ cos θ 0 0 ]
I would recommend taking a look at the PDF Reference Document from Adobe, specifically the section about the transformation matrix.
For example if you have the following page content object (40 0 obj):
10 0 obj % Page object
<< /Type /Page
/Parent 5 0 R
/Resources 20 0 R
/Contents 40 0 R
>>
endobj
40 0 obj % Page content
BT
/F1 1 Tf
12 0 0 12 100 600 Tm
(Hello) Tj
ET
endobj
And you want to rotate the whole page by 45 degrees, approximating cos(45°) = sin(45°) ≈ 0.7, your resulting page content will be:
40 0 obj
0.7 0.7 -0.7 0.7 0 0 cm
BT
/F1 1 Tf
12 0 0 12 100 600 Tm
(Hello) Tj
ET
endobj
After you finish adding the transformation matrix, you can re-compose your PDF file. The library you have chosen should then add compression filters and encoding filters as needed.
iText for example can decompose and recompose PDF files. See the method PdfReader.getPageContent for details.
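The six numbers for the cm operator can be computed for any angle. Below is a small Python sketch; the function names are made up, and the optional center arguments are an addition that is not in the answer above (they rotate about a point such as the page center instead of the origin, similar in spirit to the three-argument getRotateInstance call in the iText sample):

```python
import math


def rotation_cm(degrees, cx=0.0, cy=0.0):
    """Return (a, b, c, d, e, f) for a counterclockwise rotation about (cx, cy)."""
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # e and f move the rotated copy of (cx, cy) back onto (cx, cy).
    e = cx - cos_t * cx + sin_t * cy
    f = cy - sin_t * cx - cos_t * cy
    return (cos_t, sin_t, -sin_t, cos_t, e, f)


def cm_line(degrees, cx=0.0, cy=0.0):
    """Format the matrix as a content stream line ending in the cm operator."""
    return " ".join(f"{v:.4f}" for v in rotation_cm(degrees, cx, cy)) + " cm"


print(cm_line(45))
print(cm_line(30, 297.5, 421.0))  # rotate about an assumed page center
```

With the default center this gives the same matrix as in the 45 degree example, only with more digits than 0.7.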
I wrote some software which can do this:
cpdf -rotate-contents 45 in.pdf -o out.pdf
Commercial, I'm afraid. See Chapter 3 of the manual.

How to adjust BoundingBox of an EPS file?

I want to crop the main area of a PS or PDF file to create an EPS file without white space. Commands such as Ghostscript, ps2pdf and epstools can crop the main drawing out of the document file.
The problem is that they crop the drawing in place, keeping its original coordinates, but I want to create an EPS file with BoundingBox 0 0 x y; that is, cropped and moved to the bottom left corner.
The difference matters when we want to insert the resulting EPS file inside a PS document. With BoundingBox x0 y0 x y, the PS document places the EPS content at point x0 y0 instead of at the current position.
EXAMPLE:
Consider a simple PS file as
%!
/Times-Roman findfont
11 scalefont setfont
72 700 moveto
(This is a test)show
If it is converted to EPS with a command like
ps2eps test.ps test.eps
It will produce
%!PS-Adobe-2.0 EPSF-2.0
%%BoundingBox: 72 700 127 708
%%HiResBoundingBox: 72.000000 700.000000 127.000000 707.500000
%%EndComments
% EPSF created by ps2eps 1.68
%%BeginProlog
save
countdictstack
mark
newpath
/showpage {} def
/setpagedevice {pop} def
%%EndProlog
%%Page 1 1
/Times-Roman findfont
11 scalefont setfont
72 700 moveto
(This is a test)show
%%Trailer
cleartomark
countdictstack
exch sub { end } repeat
restore
%%EOF
It has been cropped in its original coordinates, so the resulting BoundingBox is 72 700 127 708. If this EPS file is then inserted into a PS document, it is placed at those coordinates.
It would be more useful to create an EPS file with BoundingBox: 0 0 55 8. Of course, all drawing coordinates (here the moveto) would have to be shifted to the new origin.
NOTE: As stated, my purpose from fixing the BoundingBox reference point is to make it importable within PS document. Thus, an alternative answer to this question is: how to insert an EPS file inside PS document regardless of its BoundingBox.
For example, how to insert this EPS file at location 200 200 255 208 of a PS document. I try to insert the EPS with the following code, but it will not work unless the BoundingBox is started from 0 0:
200 200 translate
save
/showpage {} bind def
(test.eps)run
restore
What about simply un-translating?
-72 -700 translate
Either in the eps itself, or in the prep section before the inclusion?
AWKward!
The following typescript illustrates an awk script which performs the desired modifications to the eps, guided by the DSC comments (just like Mama used to do!).
The advantage is: if you can guarantee that the input EPS conforms sufficiently to DSC to provide these markers, this approach will be orders-of-magnitude faster than passing the file through ghostscript.
Simplicity is both the advantage and the limitation of this program. It scans for DSC comments, extracts values from the BoundingBox comment, suppresses the HiResBoundingBox, and adds postscript 'translate' and 'rectclip' commands just after the Page comment. This should produce the correct results so long as the EPS really is bona fide. But the ghostscript approach in the other answer will also produce results on input files with less reliable DSC conformance (because it takes no shortcuts: it treats DSC lines as comments and ignores them completely).
Strictly speaking the 'rectclip' shouldn't be necessary, but the question asks that the output be "cropped".
592(1)11:27 AM:~ 0> cat epscrop.awk
/%%BoundingBox: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)/{x=$2;y=$3;w=$4-x;h=$5-y;print $1,0,0,w,h}
!/%%BoundingBox:/&&!/%%HiRes/{print}
/%%Page /{print -x,-y,"translate"; print 0,0,w,h,"rectclip"}
593(1)11:27 AM:~ 0> awk -f epscrop.awk etest.eps
%!PS-Adobe-2.0 EPSF-2.0
%%BoundingBox: 0 0 55 8
%%EndComments
% EPSF created by ps2eps 1.68
%%BeginProlog
save
countdictstack
mark
newpath
/showpage {} def
/setpagedevice {pop} def
%%EndProlog
%%Page 1 1
-72 -700 translate
0 0 55 8 rectclip
/Times-Roman findfont
11 scalefont setfont
72 700 moveto
(This is a test)show
%%Trailer
cleartomark
countdictstack
exch sub { end } repeat
restore
%%EOF
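For anyone who prefers it, the same three rules can be sketched in Python. This is an illustrative port, not a tested replacement: the function name is invented, and it assumes integer BoundingBox values that appear in the header before the Page comment, just as the awk script does:

```python
def shift_eps_to_origin(lines):
    """Rewrite an EPS so that its BoundingBox starts at 0 0."""
    out = []
    x = y = w = h = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("%%BoundingBox:"):
            x, y, urx, ury = (int(v) for v in line.split()[1:5])
            w, h = urx - x, ury - y
            out.append(f"%%BoundingBox: 0 0 {w} {h}")
        elif line.startswith("%%HiResBoundingBox:"):
            continue  # suppressed, as in the awk version
        else:
            out.append(line)
            # Accept both '%%Page: 1 1' and the '%%Page 1 1' form shown above.
            if line.startswith(("%%Page:", "%%Page ")):
                out.append(f"{-x} {-y} translate")
                out.append(f"0 0 {w} {h} rectclip")
    return out


eps = [
    "%!PS-Adobe-2.0 EPSF-2.0",
    "%%BoundingBox: 72 700 127 708",
    "%%HiResBoundingBox: 72.000000 700.000000 127.000000 707.500000",
    "%%EndComments",
    "%%Page 1 1",
    "72 700 moveto",
    "(This is a test)show",
    "%%EOF",
]
print("\n".join(shift_eps_to_origin(eps)))
```

Like the awk script, this trusts the DSC comments completely, so a file with a BoundingBox of (atend) or a missing Page comment needs the Ghostscript route instead.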
To convert it to an EPS with the BoundingBox-style you want, I would use Ghostscript and let the EPS make a roundtrip: EPS => PDF => EPS.
The trick is to make the PDF's media size match the width and height of the BoundingBox, which is what the -dEPSCrop parameter does.
These two commands create your 'EPS without white space':
1st step: convert EPS to PDF:
gs \
-o so#12682621.pdf \
-sDEVICE=pdfwrite \
-dEPSCrop \
so#12682621.eps
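2nd step: convert PDF back to EPS (newer Ghostscript releases dropped the epswrite device in favor of eps2write, so substitute that name if epswrite is not available):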
gs \
-o so#12682621.roundtripped.eps \
-sDEVICE=epswrite \
so#12682621.pdf
To test the fidelity of your resulting EPS, you could use ImageMagick's compare to show the differences, pixel-wise in red, as a PNG file:
compare \
-density 600 \
so#12682621.roundtripped.eps \
so#12682621.eps \
-compose src \
so#12682621.png
which results in:
You'll notice that there are some pixel differences. They are caused by the value of 707.500000 from the %%HiResBoundingBox, which leads to a rounding error later on (PNG can't have 'half pixels').
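As a rough check of the half-pixel remark, here is the arithmetic in Python, assuming the height from the %%HiResBoundingBox is what ends up being rasterized at -density 600 (1 pt = 1/72 inch). The function name is only for illustration:

```python
def points_to_pixels(points, dpi):
    """Convert a length in PostScript points (1/72 inch) to pixels at a given dpi."""
    return points * dpi / 72.0


hires_height_pt = 707.5 - 700.0  # from %%HiResBoundingBox
plain_height_pt = 708 - 700      # from %%BoundingBox

print(points_to_pixels(hires_height_pt, 600))
print(points_to_pixels(plain_height_pt, 600))
```

Neither height maps to a whole number of pixels at 600 dpi, so one of the two renderings has to be rounded, which fits the small differences described above.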