Do the empty spaces in the code add to the size of the application?

I am sorry if my question is weird,
but I would like to know whether the spaces in the code add to the size of the application.

Related

How to convert a multi-page PDF table to a spreadsheet format?

I have a huge PDF file with 300+ pages across which a big table of 10+ columns is spread. I am using Linux and would like a simple command-line command that converts this table to text I can import into a spreadsheet.
Currently I am using pdftotext -layout, which gives quite good results, except that every page is treated independently and the column widths and positions change from page to page (due to the different maximum column content width on each page), so I cannot simply import the resulting text file into a spreadsheet application and split it into columns by a fixed column width.
I have tried to crop every column on every page (their position is identical across the whole PDF file), but in the result the empty rows are merged together, so the rows with content are shifted with respect to each other.
If pdftotext had an option to convert the file with a STRICT LAYOUT (not by column content width), that would help. Or if I could stack all pages in PDF file to a single page, that could also solve it.
What are the options to solve this problem?
You are misunderstanding the nature of the content of a PDF file. There are no tables in PDF, and there is (generally) no metadata to describe the content as a table. The text you see on the page may not be laid out in the reading order.
For example the PDF file might contain a line of text drawn at the top of the page, then one at the bottom, then a paragraph in the middle before jumping back up to the top for a headline.
In addition there may be no spaces between two text fragments. Text is drawn at an absolute position on the page, so you can draw (for example) cell A, then move the current point by say 1 cm, then draw cell B and so on. Since there are no 'space' characters between the two cells, a naive text extraction will, naturally, assume the two pieces of text are continuous.
The STRICT LAYOUT you want isn't impossible, but you can't do it with a simple text file, because the original layout isn't made up of simple text characters. Sometimes the space between two characters, or between two fragments of text, is produced by moving the current point before drawing the text.
Ghostscript's txtwrite device in its simplest mode attempts to replicate the layout by replacing the white space with actual space characters in a fixed pitch font. This 'might' be good enough for you, but it equally well might not. That's because it operates by defining the smallest distance used on the page as being one space character. All distances between text are then replaced by a number of space characters, as many as are required to make up the space. This can (and often does) result in very wide output files with a lot of white space.
Essentially what you seem to want isn't really possible: you can't take a rich format like PDF and replicate it, including the layout, with nothing more than text characters.
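To make the fixed pitch idea concrete, here is a rough Python sketch of the general technique described above. It is not Ghostscript's actual implementation; the function name, the (x, text) fragment format and the `unit` parameter are all made up for illustration.

```python
def render_fixed_pitch(fragments, unit):
    """Lay out (x, text) fragments on one line, one character cell per `unit`.

    fragments: list of (x_position, text) pairs (hypothetical input format).
    unit: the distance treated as one space character (hypothetical parameter).
    """
    line = ""
    for x, text in sorted(fragments):
        column = round(x / unit)  # target column for this fragment
        # Pad with spaces up to the target column; never overwrite earlier text.
        line += " " * max(column - len(line), 0)
        line += text
    return line


cells = [(0, "A"), (30, "B"), (60, "C")]
print(render_fixed_pitch(cells, 10))  # A  B  C
print(render_fixed_pitch(cells, 5))   # A     B     C
```

Halving the unit doubles the number of columns the same fragments occupy, which is why a page with one very small gap somewhere can produce very wide output.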

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to mobi format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
With a regular expression I managed to correctly indent the lines that were indented using white spaces. I did so by transforming every two white spaces into two non-breaking spaces.
Some of the code blocks unfortunately are indented using the tab character, so the regular expression is not working in these cases.
I came to realize that during the conversion from PDF to mobi there is an intermediate step in which the PDF is converted to HTML, and that is where the tab information is lost, because no special tag is generated to carry this information.
So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces (\s\s). This way the regular expression I mentioned before will work for all the code blocks and the code will be indented properly.
But I have no idea which software has this functionality of substituting PDF elements.
I doubt that the 'tabs' are contained in the PDF as tabs. The 'tab' character (0x09 in ASCII) has no special significance in PDF, and in particular it does not move the current point, it simply draws a glyph. As a result, if you do (A\tB) what you will see when the PDF is rendered is 'AB'. Or 'A*B' where the * is some other character you didn't expect (often a square).
So you would probably actually have to convert current point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you, MuPDF is one, I think pdftk is another.
Then you will need to locate the position where you want to insert space, this could be challenging, as the font may be re-encoded to something other than ASCII so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset. This means that a space may not be 0x20 and indeed the font may not even contain a space glyph. And of course you need to remove the operations to reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the position of every element in the file. If you've edited the file it's likely that you will have altered the length of one or more elements, which will change the offset of any following elements, so you will need to recalculate those and update the xref table.
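To illustrate the xref bookkeeping, here is a minimal Python sketch that writes a few object bodies and then an xref table with freshly computed byte offsets. The function name and its input are hypothetical, and the output is a skeleton for showing the offsets only, not a complete, viewable PDF (a real file needs a valid catalog and page tree).

```python
def build_pdf(objects):
    """Serialize object bodies plus an xref table with recomputed offsets.

    objects: list of bytes, the bodies of objects 1, 2, ... (hypothetical input).
    Returns the serialized bytes and the list of object offsets.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # object 0: head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets


pdf, offsets = build_pdf([b"<< /Type /Catalog >>", b"(AB)"])
edited, new_offsets = build_pdf([b"<< /Type /Catalog >>   ", b"(AB)"])
print(offsets, new_offsets)  # the second object's offset moves by 3 bytes
```

Lengthening the first object by three bytes shifts every later offset (and the startxref value) by three, which is exactly the recalculation a hand edit forces on you.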
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or modify the HTML, than to try and alter the PDF file.
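If you go the route of fixing up the intermediate text or HTML, a sketch along these lines might do. It assumes you can get at the code lines while they still carry their leading tabs or spaces, which may not hold for your conversion; the function name and the `tab_width` default of 2 are just illustrative.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def preserve_indent(line, tab_width=2):
    """Replace the leading tabs/spaces of one code line with non-breaking spaces.

    tab_width is a hypothetical parameter: how many spaces one tab stands for.
    """
    match = re.match(r"[ \t]*", line)  # always matches, possibly empty
    indent = match.group(0).replace("\t", " " * tab_width)
    return NBSP * len(indent) + line[match.end():]


for raw in ["def f(x):", "\tif x:", "\t\treturn 1"]:
    print(repr(preserve_indent(raw)))
```

Only the leading whitespace is touched, so spaces inside the code itself stay ordinary spaces.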

Increase font for pdf using Inkscape

I usually produce PDF graphs with R and then modify them using Inkscape.
Yet when I increase the font size, the letters get bigger but the letter spacing doesn't, as you can see in the example.
I have the same problem when I do the same with a PDF from LaTeX.
Thanks for your help.
Perhaps you have broken the text into individual letters, and are applying the new font size to those, rather than to the entire word? You may need to recreate the Xlabel text/group the letters back together.
Although there is an answer for exactly the same question here, I will duplicate it:
You should select the text you want to resize and then remove manual kerning either before or after the resizing. This can be done by clicking Text -> Remove Manual Kerns.

How to remove the white margin?

Recently, I've noticed that my theme has a white margin on the right side and I don't want it. I've posted an image so that you can see what I'm talking about. Do you know why this white stripe is there and how I can remove it?
I ran tests with JavaScript, capturing the document size at many widths, and found that the blank space begins to appear below 1260px.
For example, at 1259px a 1px-wide space appeared. The maximum size it reached was about 60px. So, with the minimum width of html set to 1260px, the blank space no longer showed up, at least for me.
In your theme, open the global.css file and add this rule at the end of the file:
html{
min-width:1260px;
}

how to fix square boxes in pdf? [closed]

While going through my PDF about regular expressions, I see that in many places some characters are replaced by square boxes, which seem to be some ASCII code.
Is there any way I can fix this?
I have checked these links
http://www.tableausoftware.com/support/knowledge-base/square-boxes
http://acrobatusers.com/tutorials/text-matching-regular-expressions
and others, but did not find any solution. Attached is how the square boxes look.
As stema said, this has nothing to do with regular expressions.
Neither is it about some "pdf escape sequences", as PDF uses binary safe text encodings.
These square blocks are usually shown in place of characters that don't have a representation in the chosen font. Often, the typesetting software replaces some quotes or other characters with a 'nicer' Unicode alternative, but the font doesn't have those characters.
You could try to copy/paste the text from the PDF into some other document and replace the font, or even use some PDF editing tool (Enfocus PitStop is one of the most popular; it's cheap but not free) to replace the font with a more complete one.
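If you have the text available (pasted from the PDF or taken from the source document), a quick way to spot candidates is to list the non-ASCII characters in it. This is only a diagnostic sketch with a made-up function name; it does not inspect the font, so it shows which characters might be missing, not which ones actually are.

```python
import unicodedata


def non_ascii_report(text):
    """Return (character, code point, Unicode name) for each distinct non-ASCII character."""
    found = {ch for ch in text if ord(ch) > 127}
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(found)
    ]


for entry in non_ascii_report("match \u201c\\s+\u201d here"):
    print(entry)
```

Curly quotes such as U+201C and U+201D are typical of the 'nicer' replacements mentioned above.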
First, this has nothing to do with regex, except that the document you are writing is about regular expressions.
I assume the sequence that is replaced by a square is \s, isn't it?
I think the problem here is that some regular expression shortcuts are interpreted as escape sequences in the PDF creation process and are therefore not printed literally.
You don't say how you create your PDF, but I would assume it will be OK if you escape the backslashes that you want printed literally.
So when you want to see a \s in the PDF, type \\s in your source format. (If you have an escaped backslash somewhere that you want to print, like \\, then write \\\\.)
Javier's answer is nearly complete. But let me add this:
You have a small chance of getting Acrobat Reader to render the characters shown as square boxes with a "substitute" font by toggling a certain setting in its application preferences.
IIRC, the setting is called 'Use local fonts'. You can usually find it in the Page display section of the preferences settings, but over the different releases Adobe kept adding, removing or re-locating different settings...
Background info: If you have NOT enabled Use local fonts, then you require the Reader to only use the PDF-embedded fonts for displaying all text. In case the font is embedded, but misses some required glyphs, enabling said setting may find the required font on your system to render the text, or the Reader may use its built-in Multiple Master fonts which will try to fake the look of the original glyph, more or less....
Copy and paste the text into a Word document.
Select the text that contains the squares.
Press Control + Space.
Choose the font that you want for your text again.
Save the Word document as a PDF again.
Voila: it is done!