Detect PDF form field radio button (radiobutton) shape / style - pdf

I need to programmatically determine which shape a PDF form field radio button has. To that end I created a test PDF using Acrobat and added a radio button group in which each widget uses a different style.
One way could be to check the CA key of the appearance characteristics dictionary (MK) which is mapped to the ZapfDingbats font:
/MK<</BC[0.0]>> //CIRCLE (normally l)
/MK<</BC[0.0]/CA(4)>> //CHECK
/MK<</BC[0.0]/CA(8)>> //CROSS
/MK<</BC[0.0]/CA(u)>> //DIAMOND
/MK<</BC[0.0]/CA(n)>> //SQUARE
/MK<</BC[0.0]/CA(H)>> //STAR
However in the example PDF for the circle the CA key does not exist (it should have been /CA(l)). To implicitly assume a round shape does not seem correct.
Another idea would be to look at the appearance dictionary. For the example given in the pdf spec it seems possible:
stream
q
0 0 1 rg
BT
/ZaDb 12 Tf
0 0 Td
(l) Tj
ET
Q
endstream
However, the normal appearance generated by Acrobat looks like this:
stream
q
1 0 0 1 9 9 cm
8.5 0 m
8.5 4.6946 4.6946 8.5 0 8.5 c
-4.6946 8.5 -8.5 4.6946 -8.5 0 c
-8.5 -4.6946 -4.6946 -8.5 0 -8.5 c
4.6946 -8.5 8.5 -4.6946 8.5 0 c
s
Q
0.501953 G
q
0.7071 0.7071 -0.7071 0.7071 9 9 cm
7.5 0 m
7.5 4.1423 4.1423 7.5 0 7.5 c
-4.1423 7.5 -7.5 4.1423 -7.5 0 c
S
Q
0.75293 G
q
0.7071 0.7071 -0.7071 0.7071 9 9 cm
-7.5 0 m
-7.5 -4.1423 -4.1423 -7.5 0 -7.5 c
4.1423 -7.5 7.5 -4.1423 7.5 0 c
S
Q
q
1 0 0 1 9 9 cm
3.5 0 m
3.5 1.9331 1.9331 3.5 0 3.5 c
-1.9331 3.5 -3.5 1.9331 -3.5 0 c
-3.5 -1.9331 -1.9331 -3.5 0 -3.5 c
1.9331 -3.5 3.5 -1.9331 3.5 0 c
f
Q
endstream
My question: Is there a way to detect that a widget annotation has a round shape / circular style? I know that any arbitrary shape can be defined as an appearance however for the use case at hand the differentiation of those 6 styles is more than enough.
If the answer somehow depends on the pdf lib (due to certain functionality): currently openPDF is used and other libs like pdfbox or iText are in use, too.

First, check whether the /CA entry is 'l'. If /CA does not exist, check whether the appearance stream contains the curve operators 'c', 'v', or 'y'. If any of them are present, you can assume a circular style.
This is an empirical approach, but it might work for your situation.
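To make the two checks concrete, here is a minimal Python sketch. It assumes you have already pulled the /CA string (or None if it is missing) and the decoded "on" appearance stream out of the widget with whichever library you use; the names guess_radio_style, strip_strings and CA_STYLES are made up for this illustration and are not part of OpenPDF, PDFBox or iText. The tokenizer is deliberately naive (whitespace splitting after removing simple literal strings), so treat it as a heuristic only.

```python
import re

# Characters found in /MK /CA in the question's test PDF (ZapfDingbats codes).
CA_STYLES = {
    "l": "circle",
    "4": "check",
    "8": "cross",
    "u": "diamond",
    "n": "square",
    "H": "star",
}

# Path construction operators that append Bezier curves.
CURVE_OPERATORS = {"c", "v", "y"}


def strip_strings(content):
    """Remove simple (non-nested) literal strings so their text is not read as operators."""
    return re.sub(r"\((?:\\.|[^\\()])*\)", " ", content)


def guess_radio_style(ca, appearance_stream):
    """Return a style name, or None if neither check gives a hint.

    ca: the /CA string from the widget's /MK dictionary, or None if absent.
    appearance_stream: the decoded "on" appearance stream as text.
    """
    if ca in CA_STYLES:
        return CA_STYLES[ca]
    if ca is None:
        tokens = set(strip_strings(appearance_stream).split())
        if tokens & CURVE_OPERATORS:
            return "circle"
    return None
```

For the Acrobat stream shown in the question, guess_radio_style(None, stream) would report "circle" because the stream contains 'c' operators. An appearance that draws some other shape with curves would be misclassified, which is the price of the heuristic.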

Related

PDF m l operators

I am using a PDF parser to extract lines from a pdf document. It fails on a specific doc generated pdf. The smallest pdf that it fails for has a 1 cell 1 row table, but the stream shows a 2 cell 1 row table. I have these questions:-
Why does the stream show 2 cells instead of just 1?
What are those re operators for, as there are no rectangles?
Who generates these instructions, is it MS Word? Or the PDF Printing application (Cute PDF Writer)?
Here is the pdf :-
Here is the relevant stream:-
stream
q 0.12 0 0 0.12 0 0 cm
/R7 gs
q
647 5996 m
700 5996 l
700 5885 l
647 5885 l
h
W n
0 0 0 rg
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 77.64 709.2 Tm
()Tj
ET
Q
Q
q
700 5996 m
746 5996 l
746 5885 l
700 5885 l
h
W n
0 0 0 rg
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 84 709.2 Tm
()Tj
ET
Q
Q
0 0 0 rg
600 5996 4 4 re
f
600 5996 4 4 re
f
604 5996 3892 4 re
f
4496 5996 4 4 re
f
4496 5996 4 4 re
f
600 5884 4 112 re
f
600 5880 4 4 re
f
600 5880 4 4 re
f
604 5880 3892 4 re
f
4496 5884 4 112 re
f
4496 5880 4 4 re
f
4496 5880 4 4 re
f
q
8.33333 0 0 8.33333 0 0 cm BT
/R8 11.04 Tf
0.998087 0 0 1 72 695.28 Tm
()Tj
ET
Q
Q
endstream
and here is the image drawn using the m and l instructions above :-
Why does the stream show 2 cells instead of just 1?
The stream does not show any cells at all. Only tagged PDFs may have a certain awareness of tables and table cells but your PDF does not look tagged.
What you (considering your question title) appear to mean are the sequences
647 5996 m
700 5996 l
700 5885 l
647 5885 l
h
W n
and
700 5996 m
746 5996 l
746 5885 l
700 5885 l
h
W n
But all they do is intersect the current clip path with a rectangle. Thus, the drawing operations that follow are restricted to the respective rectangle. Such restrictions can be found in PDFs in many situations, table cells are only one of them, and such clip path changes are not even necessary for table cells...
Furthermore, considering the preceding transformation matrix change
0.12 0 0 0.12 0 0 cm
the rectangles above are fairly small, each probably large enough for a single character.
What are those re operators for, as there are no rectangles?
Well, they are rectangles.
Very small in height and/or width, but rectangles nonetheless.
And they are filled rectangles, cf. the f operator.
To make a long story short, the "lines" around the area we perceive as a table cell, are actually filled rectangles:
604 5996 3892 4 re
600 5884 4 112 re
604 5880 3892 4 re
4496 5884 4 112 re
Furthermore the corners of the cell are drawn as tiny squares (and each corner twice):
600 5996 4 4 re
600 5996 4 4 re
4496 5996 4 4 re
4496 5996 4 4 re
600 5880 4 4 re
600 5880 4 4 re
4496 5880 4 4 re
4496 5880 4 4 re
Thus, these re instructions give you the border edges and corners of what we perceive as table cell.
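If you want your parser to treat such thin filled rectangles as lines, a rough Python sketch could look like the following. It assumes the CTM is a pure scaling (as with the 0.12 0 0 0.12 0 0 cm above) and that every re is followed by a fill; the function name filled_rectangles and the three "kind" labels are invented for this example, and classifying every square as a corner is a simplification that only fits streams like this one.

```python
import re

NUMBER = r"(-?\d+(?:\.\d+)?)"
RE_PATTERN = re.compile(NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER + r"\s+re\b")


def filled_rectangles(content, scale):
    """Yield (x, y, width, height, kind) in user space for each `re` operator.

    Assumes a pure scaling CTM (`scale 0 0 scale 0 0 cm`) and that each
    rectangle is subsequently filled.
    """
    for match in RE_PATTERN.finditer(content):
        x, y, w, h = (float(v) * scale for v in match.groups())
        if w == h:
            kind = "corner"
        elif w > h:
            kind = "horizontal edge"
        else:
            kind = "vertical edge"
        yield x, y, w, h, kind
```

With scale 0.12, the instruction 604 5996 3892 4 re becomes a strip roughly 467 units wide and about half a unit high, which is why it looks like a line on the page.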
Who generates these instructions, is it MS Word? Or the PDF Printing application (Cute PDF Writer)?
The concrete instructions you see are PDF instructions. Thus, your printing application creates them.
Of course, though, your printing application creates them because that is how it interprets the MS Word output...
Cute PDF Writer apparently (from a quick glance on their web page) uses the Windows printing system. In general, in cases like this, you print from MS Word, and MS Word will try to use Windows methods to draw the lines and other items, which the printer driver (Cute PDF Writer in this case) will then translate to PDF commands. An intermediate stage with first rendering to PostScript and then translating to PDF is also possible.
So, that would mean that MS Word is responsible for the fact that two cells are drawn.
I only see one rectangle in the image of the PDF that you posted, so I'm not sure what is happening here. Also, I can't explain the other re commands. The rectangles in the second image look like they might be a frame around a two-on-one printed page, but the coordinates look strange, so it could also be something else.

remove redundant signals in pandas

I want to build a correspondence between col1 and col2 according to a certain rule.
Label1 is like an on switch, and label2 is like an off switch. Once label1 is on, further operation on label1 will not re-open the switch until it is switched off by label2. Then label1 can switch on again.
For example, I have a following table:
index label1 label2 note
1 F T label2 is invalid because not switch on yet
2 T F label1 switch on
3 F F
4 T F useless action because it's on already
5 F T switch off
6 F F
7 T F switch on
8 F F
9 F T switch off
10 F F
11 F T invalid off operation, not on
The correct output is something like:
label1ix label2ix
2 5
7 9
What I tried is:
df['label2ix'] = df.loc[df.label2 == 'T', 'index'] # find the label2==True index
df['label2ix'] = df['label2ix'].bfill() # backfill the column
mask = (df['label1'] == 'T') # label1==True, then get the index and label2ix
newdf = pd.DataFrame(df.loc[mask, ['index', 'label2ix']])
This is not correct, because what I get is:
label1ix label2ix note
2 5 correct
4 5 wrong operation
7 9 correct
I am not sure how to filter out row 4.
I have got another idea,
df['label2ix'] = df.loc[df.label2 == 'T', 'index'] # find the label2==True index
df['label2ix'] = df['label2ix'].bfill() # backfill the column
groups = df.groupby('label2ix')
firstlabel1 = groups['label1'].first()
But for this solution, I don't know how to get the first label1=T in each group.
And I am not sure if there is any more efficient way to do that? Grouping is usually slow.
Not tested yet, but here are a few things you can try:
Option 1: For the first approach, you can filter out the 4 by:
newdf.groupby('label2ix').min()
but this approach might not work with more general data.
Option 2: This might work better in general:
import numpy as np
# copy all on and off switches to a common column
# 0 - off, 1 - on
df['state'] = np.select([df.label1=='T', df.label2=='T'], [1,0], default=np.nan)
# ffill will fill the na with the state before it
# until changed by a new switch
df['state'] = df['state'].ffill().fillna(0)
# mark the changes of states
df['change'] = df['state'].diff()
At this point, df will be:
index label1 label2 state change
0 1 F T 0.0 NaN
1 2 T F 1.0 1.0
2 3 F F 1.0 0.0
3 4 T F 1.0 0.0
4 5 F T 0.0 -1.0
5 6 F F 0.0 0.0
6 7 T F 1.0 1.0
7 8 F F 1.0 0.0
8 9 F T 0.0 -1.0
9 10 F F 0.0 0.0
10 11 F T 0.0 0.0
which should be easy to track all the state changes:
switch_ons = df.loc[df['change'].eq(1), 'index']
switch_offs = df.loc[df['change'].eq(-1), 'index']
# build the result frame
new_df = pd.DataFrame({'label1ix': switch_ons.values,
                       'label2ix': switch_offs.values})
and output:
label1ix label2ix
0 2 5
1 7 9
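As a cross-check for the pandas approaches, the on/off rule from the question can also be written as a plain loop. This is only a sketch: pair_switches is a name invented here, it takes (index, label1, label2) tuples with boolean labels, and it assumes that a row never needs to switch on and off at the same time.

```python
def pair_switches(rows):
    """Pair each effective 'on' (label1) with the next 'off' (label2).

    rows: iterable of (index, label1, label2) with boolean labels.
    """
    pairs = []
    on_index = None  # index of the row that switched on, or None while off
    for index, label1, label2 in rows:
        if on_index is None:
            if label1:
                on_index = index  # switch on; repeated 'on' rows are ignored
        elif label2:
            pairs.append((on_index, index))  # switch off and record the pair
            on_index = None
    return pairs


# The table from the question: (index, label1, label2)
rows = [
    (1, False, True), (2, True, False), (3, False, False), (4, True, False),
    (5, False, True), (6, False, False), (7, True, False), (8, False, False),
    (9, False, True), (10, False, False), (11, False, True),
]
print(pair_switches(rows))
```

For the example table this gives the pairs (2, 5) and (7, 9). A final 'on' without a matching 'off' is simply dropped here, whereas in Option 2 the two arrays would then have different lengths and the DataFrame constructor would raise an error.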

To One-Hot encode or not to One-Hot encode

My data set has the day of the week number (Mon = 1, Tue = 2, Wed = 3 ...)
My data look like this
WeekDay Col1 Col2 Target
1 2.2 8 126
6 3.5 4 354
1 8.0 2 322
3 7.2 4 465
7 3.2 5 404
6 3.8 3 134
1 3.6 5 455
1 5.5 8 345
6 7.0 6 442
Shall I one-hot encode WeekDay so it will look like this ?
WeekDay Col1 Col2 Target Mo Tu We Th Fr Sa Su
1 2.2 8 126 1 0 0 0 0 0 0
6 3.5 4 354 0 0 0 0 0 1 0
1 8.0 2 322 1 0 0 0 0 0 0
3 7.2 4 465 0 0 1 0 0 0 0
7 3.2 5 404 0 0 0 0 0 0 1
6 3.8 3 134 0 0 0 0 0 1 0
1 3.6 5 455 1 0 0 0 0 0 0
1 5.5 8 345 1 0 0 0 0 0 0
6 7.0 6 442 0 0 0 0 0 1 0
I am going to use Random Forest
You should not use one-hot encoding since you are using a random forest model. An RF model can find the patterns from label encoding as well, and RF models generally perform worse with one-hot encoding because they might lose a few days when building a tree. One-hot encoding also adds dimensions to your data, which can bring on the curse of dimensionality.
One-hot encoding is better for methods like linear regression or logistic regression, where 1, i.e. Monday, might get more importance than 6, i.e. Saturday, as these models multiply the feature values on the backend.
Generally, it's preferable to use one-hot encoding before using a random forest. If this is the only categorical variable in your dataset, then go for one-hot encoding. If you use R's random forest then, as far as I know, R's library deals with it itself. For scikit-learn that's not the case, and you have to one-hot encode yourself. There is a trade-off: one-hot encoding introduces sparsity, which is undesirable for tree-based models if the cardinality of the categorical variable is big, in other words, if there are many unique values in the categorical variable. However, Python's catboost deals with categorical variables itself.
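For reference, this is what the two encodings discussed above amount to for a single WeekDay value. The sketch uses plain Python so it does not depend on any library; one_hot_weekday and DAY_NAMES are names chosen for this example, and the column names follow the table in the question.

```python
DAY_NAMES = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]


def one_hot_weekday(day_number):
    """Turn a label-encoded weekday (Mon = 1 ... Sun = 7) into seven 0/1 columns."""
    if not 1 <= day_number <= 7:
        raise ValueError("day_number must be between 1 and 7")
    return {name: int(position == day_number)
            for position, name in enumerate(DAY_NAMES, start=1)}


# Label encoding keeps the single integer column; one-hot encoding replaces it by seven.
print(one_hot_weekday(6))
```

Which of the two works better for a random forest is exactly what the answers above disagree on, so it is probably worth comparing both on a validation set.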

My program reads PDF and try to find coordinate of each glyph in user space

it goes like this
q
0.1199951 0 0 0.1199951 0 0 cm
1 g
824 4101 267 389 re
f
Q
q
0.1199951 0 0 0.1199951 0 0 cm
1 g
824 4853 267 25 re
f
Q
q
0.1199951 0 0 0.1199951 0 0 cm
1 g
824 5241 267 25 re
f
Q
q
0.1199951 0 0 0.1199951 0 0 cm
1 g
1090 578 3081 1988 re
f
Q
q
0.1199951 0 0 0.1199951 0 0 cm
603 586 m
603 1800 l
649 1800 l
649 586 l
h
W n
8.3336724 0 0 8.3336724 0 0 cm
BT
/T1_0 5.04 Tf
0 1.0002 -1 0 76.8 70.32 Tm
(J)Tj
I want to ask what the coordinates of J should be.
My CropBox is 0 0 612 792 and the Rotate value is 90.
According to my understanding:
Th=1 default,
Tfs=5.04, from {/T1_0 5.04 Tf}
Trise=0 default,
textstatematrix
5.04 1 0
0 5.04 0
0 0 1
Tm
0 1.0002 0
-1 0 0
76.8 70.32 1
TRM = textstatematrix X Tm
-1 5.041 0
-5.040 0 0
76.800 70.320 1
So
[x,y,1] = [76.8, 70.32, 1] X TRM = [-354.413 457.469 1]
So the x coordinate in user space comes out as a negative number. Can you please explain what mistake I am making?
The matrix Trm calculated by the OP as
-1 5.041 0
-5.040 0 0
76.800 70.320 1
is the text rendering matrix described as follows in the PDF specification:
Conceptually, the entire transformation from text space to device space may be represented by a text rendering matrix, Trm:
(section 9.4.2, ISO 32000-1:2008)
The OP's mistake is not in calculating this matrix but in using it: this matrix contains the entire transformation from text space to device space, and the specification also states:
Tj and other text-showing operators shall position the origin of the first glyph to be painted at the origin of text space.
(section 9.2.4 ISO 32000-1:2008)
and
The glyph origin is the point (0, 0) in the glyph coordinate system
(ibidem)
To determine, therefore, where the OP's
(J)Tj
puts the origin of the glyph J, one has to apply that matrix to the origin (0, 0), not to (76.8, 70.32) as the OP did.
Thus,
[x,y,1] = [0, 0, 1] X Trm = [76.8, 70.32, 1]
i.e. the coordinates of J are (76.8, 70.32) in device space. As the OP assumed the initial transformation matrix to have been the identity matrix, this device space essentially is the default user space.
Unfortunately, the OP did not say in which coordinate system he needs the coordinates. Thus, these may not be the coordinates he was looking for.
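The calculation can be replayed with a few lines of Python. This is a sketch under some assumptions: it uses the formula from ISO 32000-1, Trm = [Tfs×Th 0 0; 0 Tfs 0; 0 Trise 1] × Tm × CTM, takes the CTM to be just the two cm scalings from the question's stream, and ignores the page's Rotate entry. The helper names matmul, transform and scale are made up for this example.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform(point, m):
    """Apply matrix m to the row vector [x, y, 1] and return (x', y')."""
    x, y = point
    return (x * m[0][0] + y * m[1][0] + m[2][0],
            x * m[0][1] + y * m[1][1] + m[2][1])


def scale(s):
    """Matrix for the operator `s 0 0 s 0 0 cm`."""
    return [[s, 0, 0], [0, s, 0], [0, 0, 1]]


tfs, th, trise = 5.04, 1.0, 0.0
text_state = [[tfs * th, 0, 0], [0, tfs, 0], [0, trise, 1]]
tm = [[0, 1.0002, 0], [-1, 0, 0], [76.8, 70.32, 1]]
# Each cm pre-multiplies the current CTM: CTM' = M x CTM.
ctm = matmul(scale(8.3336724), scale(0.1199951))
trm = matmul(matmul(text_state, tm), ctm)

# Tj places the glyph origin at the origin of text space, so transform (0, 0).
glyph_origin = transform((0, 0), trm)
print(glyph_origin)
```

Because the two scalings nearly cancel each other, the CTM is very close to the identity and the origin of J ends up at approximately (76.8, 70.32), as stated above. Only the bottom row of Trm matters for this point.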

Pdf setting the font color to the text

I am trying to add some text to a PDF file manually. I was able to add new text with a specific font, but I am not able to set the font color. How can I do it manually?
(I just want to change these manually, as I already have the code that writes these bytes to make the PDF file.)
Also, how can I use the graphics state specified in the PDF standard to manipulate the text so that feature changes do not affect the color changes etc.? How exactly can I use the graphics state?
Source pdf file click here
Modified pdf file click here
The PDF color operators are listed in Table 74 of the PDF specification ISO 32000-1:2008.
In your case your added content stream is
104 0 obj
<</Length 105 0 R>>stream
/Helv 8 Tf
BT
1 0 0 1 15.67 150 Tm
(l)Tj
ET
/Helv 8 Tf
BT
1 0 0 1 17.88 190 Tm
(abcdefghijklmnopqr)Tj
ET
endstream
endobj
If, for example, you want the writing to be filled with red in an RGB color space, you add a 1 0 0 rg:
104 0 obj
<</Length 105 0 R>>stream
BT
1 0 0 1 15.67 150 Tm
/Helv 8 Tf
1 0 0 rg
[...]
EDIT
If you are afraid that that change may affect later text, remember to use the Graphics State Stack operators q and Q (cf. section 8.4.2 of the PDF specification). E.g.
q
0 1 -1 0 595.22 0 cm
q
BT
1 0 0 1 36 540 Tm
/Xi0 12 Tf
0.75 g
(Hello people!)Tj
0 g
ET
Q
Q
(Copied from How to add text object to existing pdf)
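Since you already write the bytes yourself, a small helper that emits such a block might look like this. It is only a sketch: colored_text and escape_pdf_string are names invented for the example, the font name must exist in the page's resources (Helv in your file), the text is assumed to contain only characters the font's encoding can represent, and you still have to update the stream's /Length afterwards.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def colored_text(text, x, y, font, size, rgb):
    """Return content stream operators that draw text in an RGB fill color.

    The q/Q pair saves and restores the graphics state, so the color set
    by rg does not leak into content drawn afterwards.
    """
    r, g, b = rgb
    return "\n".join([
        "q",
        "BT",
        f"/{font} {size} Tf",
        f"{r} {g} {b} rg",
        f"1 0 0 1 {x} {y} Tm",
        f"({escape_pdf_string(text)}) Tj",
        "ET",
        "Q",
    ])


print(colored_text("abcdefghijklmnopqr", 17.88, 190, "Helv", 8, (1, 0, 0)))
```

Wrapping each block in its own q ... Q keeps the red fill color local to that text, so whatever is drawn afterwards keeps the color it had before.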