I have a private PDF document of about 0.6 MB, but when I watermark it with PyPDF2 it grows to 12 MB (the watermark document is < 0.4 MB). I think this is related to compression, but I don't understand how.
What especially confuses me is why the original PDF is so large when uncompressed:
No images
No embedded files
Just 15 pages and the extracted text has 1467 characters
I was thinking that it might be embedded fonts:
$ pdffonts example.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
AAAAAB+ArialMT CID TrueType Identity-H yes yes yes 8 0
AAAAAC+OpenSans-Regular TrueType MacRoman yes yes no 13 0
AAAAAD+MyriadPro-Regular Type 1C MacRoman yes yes no 14 0
AAAAAE+MyriadPro-Regular Type 1C MacRoman yes yes no 15 0
AAAAAF+OpenSans-Regular TrueType MacRoman yes yes no 16 0
AAAAAG+OpenSans-Regular TrueType MacRoman yes yes no 17 0
AAAAAH+OpenSans-Regular TrueType MacRoman yes yes no 18 0
AAAAAI+OpenSans-Bold TrueType MacRoman yes yes no 19 0
AAAAAJ+OpenSans-Regular TrueType MacRoman yes yes no 20 0
AAAAAK+OpenSans-Italic TrueType MacRoman yes yes no 21 0
AAAAAL+OpenSans-Regular TrueType MacRoman yes yes no 31 0
AAAAAM+OpenSans-Regular TrueType MacRoman yes yes no 35 0
AAAAAN+MyriadPro-Regular Type 1C MacRoman yes yes no 36 0
AAAAAO+MyriadPro-Regular Type 1C MacRoman yes yes no 37 0
AAAAAP+OpenSans-Regular TrueType MacRoman yes yes no 38 0
AAAAAQ+OpenSans-Regular TrueType MacRoman yes yes no 39 0
AAAAAR+OpenSans-Regular TrueType MacRoman yes yes no 40 0
AAAAAS+OpenSans-Bold TrueType MacRoman yes yes no 41 0
AAAAAT+OpenSans-Regular TrueType MacRoman yes yes no 42 0
AAAAAU+Arial-BoldMT CID TrueType Identity-H yes yes yes 53 0
AAAAAV+ArialMT CID TrueType Identity-H yes yes yes 54 0
AAAAAW+Arial-ItalicMT CID TrueType Identity-H yes yes yes 60 0
How can I check the (uncompressed) size of the embedded fonts?
With
mutool extract example.pdf
you can extract all images / fonts.
In my case, the sum of all fonts (and two images I missed) was 0.3 kB... so my search continues.
I have a PDF document (my schoolbook). Although the text prints normally, it is copied as seemingly random glyphs. I found that this is because the text is encoded in cp1251 but decoded as cp1252 (or vice versa, I'm not sure, but the copied glyphs belong to cp1252). Pasting the text into a 1252-to-1251 decoder gives me the original text (pic related).
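The re-decoding step described here can be sketched in Python. This is only an illustration of the code-page swap, assuming the copied text really is cp1251 bytes shown as cp1252; the function name `fix_mojibake` and the sample word are made up for the example.

```python
def fix_mojibake(garbled: str) -> str:
    """Re-interpret text that was decoded as cp1252 but is really cp1251."""
    # Recover the original bytes, then decode them with the intended code page.
    return garbled.encode("cp1252").decode("cp1251")


# Build a sample of the garbling, then undo it.
original = "Привет"
garbled = original.encode("cp1251").decode("cp1252")
print(garbled)                # Ïðèâåò
print(fix_mojibake(garbled))  # Привет
```

A few byte values have no character in Python's cp1252 codec, so text containing the corresponding cp1251 letters may raise an error instead of converting cleanly.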
To solve my problem of searching and copying text I just used OCR, but maybe there is a way to change its encoding in some PDF headers? I also need to copy some of the illustrations for school seminars, but Inkscape and AI still output these glyphs in 1252.
Opening the file in Adobe Acrobat DC, I saw that it complained about the font 1251 Times. In Notepad++ I found these objects:
1146 0 obj
<<
/Ascent 756
/CapHeight 750
/Descent -195
/Flags 32
/FontBBox [-91 -224 1237 943]
/FontFamily (1251 Times)
/FontFile2 1147 0 R
/FontName /OGAHOK+1251Times
/FontStretch /Normal
/FontWeight 400
/ItalicAngle 0
/StemV 90
/Type /FontDescriptor
>>
endobj
1145 0 obj
<<
/BaseFont /OGAHOK+1251Times
/Encoding /WinAnsiEncoding
/FirstChar 32
/FontDescriptor 1146 0 R
/LastChar 255
/Subtype /TrueType
/Type /Font
/Widths [351 0 0 0 0 0 828 0 392 392 0 0 326 448 288 455 531 533 532 532 532 532 532 531 531 532 288 0 0 0 0 0 864 724 714 776 0 706 0 0 875 417 0 0 0 0 882 0 661 0 770 599 678 0 0 983 0 0 0 0 0 0 0 0 0 495 539 499 565 489 322 491 583 294 0 532 287 887 590 566 563 0 376 385 332 568 486 729 0 503 476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 554 554 0 952 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 896 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 699 714 0 747 0 0 597 886 0 812 0 1034 875 0 877 0 776 678 729 0 0 858 0 0 0 0 0 0 759 0 0 495 559 523 434 539 489 757 449 622 622 577 550 715 636 566 622 563 499 468 503 764 500 621 553 880 880 0 760 501 517 820 546]
>>
endobj
1150 0 obj
<<
/Filter /FlateDecode
/Length1 32416
/Length 24094
>>
stream
Replacing all occurrences of 1251 with 1252 achieved nothing. What is the right way to do this? And is there a right way at all?
A name like OGAHOK+1251Times (six random characters followed by a font name) very often indicates that the source was recognised by OCR, so each letter, line of letters, or page of letters can have its own font, which here looks like Times Roman in, as you discovered, 1251-style lettering.
So changing the name to 1252 would be like saying that Times is Verdana; it cannot change the raw data.
I am surprised, but pleased for you, that you can get some readable 1251 text to convert to 1252. However, a reasonable conversion within the potentially corrupted font metrics, replacing one symbol at a time while maintaining the string shape, would be nigh on impossible; see the varying /Widths.
However, without your base PDF file, that is based on experience rather than on a test with your source.
[Update]
Wow! That file has 600 fonts! Something has processed those badly.
The problem seems to stem from the use of WinAnsiEncoding rather than some UTF-8 or compatible coding method. I am looking to see if there is any way to modify, but not sure if it could help or make things worse. Here I can try editing settings but in this screenshot from Tracker PDF X-change Editor making changes does not help, unless the text is cut, converted and pasted back.
I have a dataframe in the form of:
no ans freq
1 Yes 23
No 89
2 Yes 45
No 76
3 Yes 99
I would like to drop the groups that have only Yes or only No as the second index (no and ans are indices). This would give:
no ans freq
1 Yes 23
No 89
2 Yes 45
No 76
You could groupby "no" and transform to get the group size. Keep the groups of size 2 and drop the rest:
df[df.groupby(level='no')['freq'].transform('size').eq(2)]
input:
freq
no ans
1 Yes 23
No 89
2 Yes 45
No 76
3 Yes 99
output:
freq
no ans
1 Yes 23
No 89
2 Yes 45
No 76
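For completeness, here is a self-contained sketch that rebuilds the input shown above and applies the same filter. The frame construction is an assumption about how the data was created; only the `groupby`/`transform` line comes from the answer.

```python
import pandas as pd

# Rebuild the example frame with a two-level index (no, ans).
index = pd.MultiIndex.from_tuples(
    [(1, "Yes"), (1, "No"), (2, "Yes"), (2, "No"), (3, "Yes")],
    names=["no", "ans"],
)
df = pd.DataFrame({"freq": [23, 89, 45, 76, 99]}, index=index)

# Size of each "no" group, broadcast back to every row of that group.
group_size = df.groupby(level="no")["freq"].transform("size")

# Keep only the rows whose group has both answers.
result = df[group_size.eq(2)]
print(result)
```

The size check relies on each answer appearing at most once per "no"; if duplicates were possible, counting the distinct "ans" values per group would be a stricter test.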
I am studying ELF and have had a doubt for a while. I tried to search for the answer, but in vain. I'd appreciate it if somebody could give me the answer or point me to a place where I can find one.
Almost all the documents I read about ELF say that the .text section contains executable binary code (and .data contains data...). However, when I used readelf to see the sections contained in an object file, I saw no .text section but a section called i.main which contains the executable code (I found the machine code in this section). The following shows the sections parsed by readelf:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] i.main PROGBITS 00000000 000034 00000a 00 AX 0 0 2
[ 2] .arm_vfe_header PROGBITS 00000000 000040 000004 00 0 0 4
[ 3] .comment PROGBITS 00000000 000044 0001c6 00 0 0 1
[ 4] .debug_frame PROGBITS 00000000 00020a 00003c 00 0 0 1
[ 5] .debug_info PROGBITS 00000000 000246 000088 00 0 0 1
[ 6] .debug_info PROGBITS 00000000 0002ce 0000dc 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 0003aa 000030 00 0 0 1
[ 8] .debug_line PROGBITS 00000000 0003da 000044 00 0 0 1
[ 9] .debug_loc PROGBITS 00000000 00041e 000014 00 0 0 1
[10] .debug_macinfo PROGBITS 00000000 000432 000308 00 0 0 1
[11] .debug_pubnames PROGBITS 00000000 00073a 00001b 00 0 0 1
[12] __ARM_grp..debug_ GROUP 00000000 000758 000008 04 14 14 4
[13] .debug_abbrev PROGBITS 00000000 000760 0005a4 00 G 0 0 1
[14] .symtab SYMTAB 00000000 000d04 000110 10 21 13 4
[15] .rel.debug_frame REL 00000000 000e14 000010 08 14 4 4
[16] .rel.debug_info REL 00000000 000e24 000018 08 14 5 4
[17] .rel.debug_info REL 00000000 000e3c 000038 08 14 6 4
[18] .rel.debug_line REL 00000000 000e74 000008 08 14 8 4
[19] .rel.debug_pubnam REL 00000000 000e7c 000008 08 14 11 4
[20] .shstrtab STRTAB 00000000 000e84 0000f2 00 0 0 1
[21] .strtab STRTAB 00000000 000f76 0001b3 00 0 0 1
[22] .ARM.attributes ARM_ATTRIBUTES 00000000 001129 000044 00 0 0 1
It seems that the section name can be chosen arbitrarily (am I right?). If so, my questions are:
How do I tell which section contains what (for example, which section contains code and which contains read-only data)?
How do I find the definition of each section? For example, how do I know what the section "[12] __ARM_grp..debug_" is for?
Thanks in advance.
As for the first part of your question, when determining what sections contain code and which sections contain read only data, a good thing to look for is the section attribute flags.
With the readelf -S command, an X indicates that the section contains executable instructions, an A indicates that the section occupies memory during process execution, and a W indicates that the section should be writable.
So in your object file, there is one section, i.main, that is executable, and it is also read-only. The other sections aren't writable, but they aren't read-only in any meaningful sense either, since they aren't loaded into memory at all.
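The flag rules above can be written down as a small classifier. This is a rough sketch: the three constants are the standard ELF `sh_flags` bits, while the function name and the category labels are invented for illustration.

```python
# Standard ELF section flag bits (sh_flags).
SHF_WRITE = 0x1      # shown as W by readelf
SHF_ALLOC = 0x2      # shown as A by readelf
SHF_EXECINSTR = 0x4  # shown as X by readelf


def describe_section(sh_flags: int) -> str:
    """Classify a section by its flags, ignoring its name."""
    if not sh_flags & SHF_ALLOC:
        return "not loaded (debug info, symbol tables, ...)"
    if sh_flags & SHF_EXECINSTR:
        return "code"
    if sh_flags & SHF_WRITE:
        return "writable data"
    return "read-only data"


print(describe_section(SHF_ALLOC | SHF_EXECINSTR))  # i.main has flags AX
print(describe_section(0))                          # .comment has no flags
```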
I'm not very familiar with ARM binaries, so I can't really address the other parts of your question.
I have formatted a thumb drive with FAT32 and placed a file named sampleFile.txt, with the contents "oblique", in the root directory. I looked at the drive in Disk Investigator and found the following in the root directory sector (sector 4096):
0040 53 41 4D 50 4C 45 7E 31 S A M P L E ~ 1 83 65 77 80 76 69 126 49
0048 54 58 54 20 00 36 81 5B T X T . 6 . [ 84 88 84 32 0 54 129 91
0050 2E 45 2E 45 00 00 89 5B . E . E . . . [ 46 69 46 69 0 0 137 91
0058 2E 45 03 00 07 00 00 00 . E . . . . . . 46 69 3 0 7 0 0 0
How do I find the sector or cluster where the actual data of the file is located? Here is some additional info:
Logical drive: G
Size: 3 Gb (popularly 3 Gb)
Logical sectors: 3889016
Bytes per sector: 1024
Sectors per Cluster: 8
Cluster size: 8192
File system: FAT32
Number of copies of FAT: 2
Sectors per FAT: 1899
Start sector for FAT1: 298
Start sector for FAT2: 2197
Root DIR Sector: 4096
Root DIR Cluster: 2
2-nd Cluster Start Sector: 4096
Ending Cluster: 485616
Media Descriptor: 248
Root Entries: 0
Heads: 255
Hidden sectors: 0
Backup boot sector: 6
Reserved sectors: 298
FS Info sector: 1
Sectors per track: 63
File system version: 0
SerialVolumeID: 4A95395B
Volume Label: NO NAME
The "Short File Name Entry" contains the starting cluster of the file. Because the test file is very small, it requires only one cluster of disk space.
In this case, that is 8192 bytes for a 7-byte string. The FAT therefore does not matter, because the file does not span multiple clusters. However, your file entry is incomplete. A FAT32 file name entry is 32 bytes long.
Offset 1Ah contains the starting cluster (2 bytes). If offset 14h (2 bytes) contains a value, then 1Ah is the low word and 14h the high word of the starting cluster.
I'm not sure, but I think the system area is counted sector-wise and the data area cluster-wise. The data area begins after FAT2. Unusually, your disk has a sector size of 1024 bytes.
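As a worked sketch of the offsets just described, the snippet below takes the 32 bytes listed in the question at offsets 0040 to 005F as the short-name entry and combines them with the volume figures from the question (second-cluster start sector 4096, 8 sectors per cluster, 1024 bytes per sector). It assumes those figures are accurate and that cluster numbering starts at 2; the variable names are illustrative.

```python
import struct

# The 32 bytes shown in the question's dump (offsets 0x40-0x5F).
entry = bytes.fromhex(
    "53 41 4D 50 4C 45 7E 31"
    "54 58 54 20 00 36 81 5B"
    "2E 45 2E 45 00 00 89 5B"
    "2E 45 03 00 07 00 00 00"
)

name = entry[0:8].decode("ascii").rstrip()
ext = entry[8:11].decode("ascii").rstrip()
(high_word,) = struct.unpack_from("<H", entry, 0x14)  # high 16 bits of cluster
(low_word,) = struct.unpack_from("<H", entry, 0x1A)   # low 16 bits of cluster
(file_size,) = struct.unpack_from("<I", entry, 0x1C)  # size in bytes

start_cluster = (high_word << 16) | low_word

# Volume figures copied from the question.
first_data_sector = 4096   # "2-nd Cluster Start Sector"
sectors_per_cluster = 8
bytes_per_sector = 1024

# Cluster numbers start at 2, so cluster 2 maps to the first data sector.
first_sector = first_data_sector + (start_cluster - 2) * sectors_per_cluster
byte_offset = first_sector * bytes_per_sector

print(name, ext, start_cluster, first_sector, byte_offset, file_size)
```

With these inputs the entry points at cluster 3, which under the stated assumptions begins at sector 4104.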
In my project I need to know what a zlib header looks like. I've heard it's rather simple but I cannot find any description of the zlib header.
For example, does it contain a magic number?
zlib magic headers
78 01 - No Compression/low
78 9C - Default Compression
78 DA - Best Compression
Link to RFC
0 1
+---+---+
|CMF|FLG|
+---+---+
CMF (Compression Method and flags)
This byte is divided into a 4-bit compression method and a 4-
bit information field depending on the compression method.
bits 0 to 3 CM Compression method
bits 4 to 7 CINFO Compression info
CM (Compression method)
This identifies the compression method used in the file. CM = 8
denotes the "deflate" compression method with a window size up
to 32K. This is the method used by gzip and PNG and almost everything else.
CM = 15 is reserved.
CINFO (Compression info)
For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
size, minus eight (CINFO=7 indicates a 32K window size). Values
of CINFO above 7 are not allowed in this version of the
specification. CINFO is not defined in this specification for
CM not equal to 8.
In practice, this means the first byte is almost always 78 (hex)
FLG (FLaGs)
This flag byte is divided as follows:
bits 0 to 4 FCHECK (check bits for CMF and FLG)
bit 5 FDICT (preset dictionary)
bits 6 to 7 FLEVEL (compression level)
The FCHECK value must be such that CMF and FLG, when viewed as
a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
is a multiple of 31.
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
ZLIB/GZIP headers
Level | ZLIB | GZIP
1 | 78 01 | 1F 8B
2 | 78 5E | 1F 8B
3 | 78 5E | 1F 8B
4 | 78 5E | 1F 8B
5 | 78 5E | 1F 8B
6 | 78 9C | 1F 8B
7 | 78 DA | 1F 8B
8 | 78 DA | 1F 8B
9 | 78 DA | 1F 8B
Deflate doesn't have common headers
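The CMF/FLG layout quoted above can be decoded with a few bit operations. A minimal sketch follows; the function name and the dictionary keys are illustrative, not part of any library.

```python
def parse_zlib_header(header: bytes) -> dict:
    """Split the two-byte zlib header into the fields named in RFC 1950."""
    cmf, flg = header[0], header[1]
    cinfo = cmf >> 4
    return {
        "CM": cmf & 0x0F,                 # bits 0-3 of CMF
        "CINFO": cinfo,                   # bits 4-7 of CMF
        "window_size": 1 << (cinfo + 8),  # meaningful for CM = 8 only
        "FCHECK": flg & 0x1F,             # bits 0-4 of FLG
        "FDICT": (flg >> 5) & 1,          # bit 5 of FLG
        "FLEVEL": flg >> 6,               # bits 6-7 of FLG
        "check_ok": (cmf * 256 + flg) % 31 == 0,
    }


print(parse_zlib_header(bytes([0x78, 0x9C])))
```

For 78 9C this reports CM = 8, CINFO = 7 (a 32K window), FLEVEL = 2 and a passing check.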
The ZLIB header (as defined in RFC1950) is a 16-bit, big-endian value - in other words, it is two bytes long, with the higher bits in the first byte and the lower bits in the second.
It contains these bitfields from most to least significant:
CINFO (bits 12-15, first byte)
Indicates the window size as a power of two, from 0 (256 bytes) to 7 (32768 bytes). This will usually be 7. Higher values are not allowed.
CM (bits 8-11)
The compression method. Only Deflate (8) is allowed.
FLEVEL (bits 6-7, second byte)
Roughly indicates the compression level, from 0 (fast/low) to 3 (slow/high)
FDICT (bit 5)
Indicates whether a preset dictionary is used. This is usually 0.
(1 is technically allowed, but I don't know of any Deflate formats that define preset dictionaries.)
FCHECK (bits 0-4)
A checksum (5 bits, 0..31), whose value is calculated such that the entire 16-bit value is divisible by 31 with no remainder.*
Typically, only the CINFO and FLEVEL fields can be freely changed, and FCHECK must be calculated based on the final value. Assuming no preset dictionary, there is no choice in what the other fields contain, so a total of 32 possible headers are valid. Here they are:
FLEVEL:   0       1       2       3
CINFO:
  0       08 1D   08 5B   08 99   08 D7
  1       18 19   18 57   18 95   18 D3
  2       28 15   28 53   28 91   28 CF
  3       38 11   38 4F   38 8D   38 CB
  4       48 0D   48 4B   48 89   48 C7
  5       58 09   58 47   58 85   58 C3
  6       68 05   68 43   68 81   68 DE
  7       78 01   78 5E   78 9C   78 DA
The CINFO field is rarely, if ever, set by compressors to be anything other than 7 (indicating the maximum 32KB window), so the only values you are likely to see in the wild are the four in the bottom row (beginning with 78).
* (You might wonder if there's a small amount of leeway on the value of FCHECK - could it be set to either of 0 or 31 if both pass the checksum? In practice though, this can only occur if FDICT=1, so it doesn't feature in the above table.)
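The table can be regenerated from the field definitions, which is a handy cross-check. The sketch below assumes CM = 8 and, by default, FDICT = 0; the function name is illustrative.

```python
def zlib_header(cinfo: int, flevel: int, fdict: int = 0) -> bytes:
    """Build the two header bytes for Deflate (CM = 8) with a valid FCHECK."""
    cmf = (cinfo << 4) | 8
    flg = (flevel << 6) | (fdict << 5)
    # Choose FCHECK so that CMF*256 + FLG becomes a multiple of 31.
    fcheck = (31 - (cmf * 256 + flg) % 31) % 31
    return bytes([cmf, flg | fcheck])


# Print one row per CINFO value, one column per FLEVEL value.
for cinfo in range(8):
    row = [zlib_header(cinfo, flevel).hex(" ").upper() for flevel in range(4)]
    print(cinfo, *row, sep="   ")
```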
Following is the Zlib compressed data format.
+---+---+
|CMF|FLG| (2 bytes - Defines the compression mode - More details below)
+---+---+
+---+---+---+---+
| DICTID | (4 bytes. Present only when FLG.FDICT is set.) - Mostly not set
+---+---+---+---+
+=====================+
|...compressed data...| (variable size of data)
+=====================+
+---+---+---+---+
| ADLER32 | (4 bytes of checksum)
+---+---+---+---+
In most cases FLG.FDICT (the dictionary flag) is not set, and DICTID is then simply not present. So the total header is just 2 bytes.
The header values (CMF and FLG) with no dictionary are defined as follows.
CMF | FLG
0x78 | 0x01 - No Compression/low
0x78 | 0x9C - Default Compression
0x78 | 0xDA - Best Compression
More at ZLIB RFC
All the answers here are most probably correct. However, if you want to manipulate the compressed stream directly and it was produced with the gz_open, gzwrite and gzclose functions, there is an extra 10-byte header before the compressed stream. It is written by gz_open and looks like this:
fprintf(s->file, "%c%c%c%c%c%c%c%c%c%c", gz_magic[0], gz_magic[1],
Z_DEFLATED, 0 /*flags*/, 0,0,0,0 /*time*/, 0 /*xflags*/, OS_CODE);
And results in following hex dump: 1F 8B 08 00 00 00 00 00 00 0B
followed by the compressed stream.
There are also 8 trailing bytes: a uLong CRC over the whole file and a uLong uncompressed file size. Look for the following bytes at the end of the stream:
putLong (s->file, s->crc);
putLong (s->file, (uLong)(s->in & 0xffffffff));
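The same 10-byte header and 8-byte trailer can be inspected from Python. This is a sketch that uses the standard `gzip` module rather than the C functions above, and it assumes no optional header fields are present (flags = 0), so the fixed header is exactly 10 bytes; the sample data is arbitrary.

```python
import gzip
import struct
import zlib

data = b"hello zlib"
blob = gzip.compress(data)

# Fixed header: magic (2), method, flags, mtime (4), xflags, OS code.
magic1, magic2, method, flags, mtime, xflags, os_code = struct.unpack(
    "<BBBBIBB", blob[:10]
)

# Trailer: CRC-32 of the uncompressed data, then its length mod 2**32,
# both stored little-endian.
crc, isize = struct.unpack("<II", blob[-8:])

print(hex(magic1), hex(magic2), method, flags)
print(crc == zlib.crc32(data), isize == len(data))
```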