LibreOffice: Identifying 'Named Destinations' - pdf

I am working on an application that opens and displays PDF pages using Poppler. I understand that named destinations are the right way to open a particular page and, more specifically, to show a given area within that page.
I figured out that headings and bookmarks can be exported to the PDF file by enabling the "Export outlines as named destinations" option. However, the names of these destinations look like this:
13 [ XYZ 96 726 null ] "5F5FRefHeading5F5F5FToc178915F2378596536"
14 [ XYZ 92 688 null ] "5F5FRefHeading5F5F5FToc179995F2378596536"
14 [ XYZ 92 655 null ] "5F5FRefHeading5F5F5FToc180015F2378596536"
14 [ XYZ 92 622 null ] "5F5FRefHeading5F5F5FToc187075F2378596536"
14 [ XYZ 92 721 null ] "5F5FRefHeading5F5F5FToc187095F2378596536"
There is no way to identify which heading is mapped to which destination. Page numbers are there but if there are multiple headings on the same page it would again take trial and error to identify the right one.
Questions
Is there any way in LibreOffice (writer) to find out what WILL BE the name of the destination once exported? Adobe Acrobat or PDF Studio Viewer have options to navigate through the list of destinations and 'see where they go'. To the best of my knowledge the navigation pane in LibreOffice does not show destination names.
Is there a guarantee that the names are maintained unique irrespective of any sections (headings) or pages that may get inserted before them?
I understand that LibreOffice writes 5F (the hexadecimal code of the underscore character) in place of _ because underscores are not allowed in PDF bookmarks. If I replace those, I am left with:
13 [ XYZ 96 726 null ] "__RefHeading___Toc17891_2378596536"
14 [ XYZ 92 688 null ] "__RefHeading___Toc17999_2378596536"
14 [ XYZ 92 655 null ] "__RefHeading___Toc18001_2378596536"
14 [ XYZ 92 622 null ] "__RefHeading___Toc18707_2378596536"
14 [ XYZ 92 721 null ] "__RefHeading___Toc18709_2378596536"
15 [ XYZ 96 726 null ] "__RefHeading___Toc18492_2378596536"
Decoding further, the prefix (RefHeading) indicates that the destination comes from a heading, and the suffix (2378596536) is probably a unique number identifying the entire document, since it is the same for all entries. The middle portion appears to be a unique key; however, I am unable to identify the heading (or its section number) from this part.
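The decoding described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes every destination follows the __RefHeading___Toc<key>_<suffix> pattern seen in the listing, the function names decode_destination and split_destination are hypothetical, and the blanket replacement of 5F would corrupt any name that legitimately contains those two characters.

```python
import re

# Pattern inferred from the listing above; an assumption, not a documented format.
NAME_RE = re.compile(r"^__RefHeading___Toc(\d+)_(\d+)$")


def decode_destination(raw):
    """Replace every literal '5F' (hex code of '_') with an underscore."""
    return raw.replace("5F", "_")


def split_destination(raw):
    """Return the decoded name with its middle key and trailing suffix, or None."""
    decoded = decode_destination(raw)
    match = NAME_RE.match(decoded)
    if match is None:
        return None
    return {
        "name": decoded,
        "key": int(match.group(1)),         # middle portion, e.g. 17891
        "doc_suffix": int(match.group(2)),  # same for all entries in the listing
    }


print(split_destination("5F5FRefHeading5F5F5FToc178915F2378596536"))
```

Grouping the parsed entries by doc_suffix and sorting by key is one way to compare exports of the same document, but it does not by itself reveal which heading a key belongs to.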


Search values in a Pandas DataFrame with values from another DataFrame

I have 2 dataframes.
df_dora
     content         feature              id
1    cyber hygien    risk management      1
2    cyber risk      risk management      2
...  ...             ...                  ...
59   intellig share  information sharing  63
60   inform share    information sharing  64
df_corpus
        content                                          id                                meta.name          meta._split_id
0       market grow cyber attack...                      56a2a2e28954537131a4aa734f49e361  14_Group_AG_2021   0
1       sec form file index                              7aedfd4df02687d3dff9897c925da508  14_Group_AG_2021   1
...     ...                                              ...                               ...                ...
213769  cyber secur alert parent compani fina...         ab10325601597f203f3f0af7aa647112  17_La_Banque_2021  8581
213770  intellig share statement parent compani fina...  6af5687ac31849d19d2048e0b2ca472d  17_La_Banque_2021  8582
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name'].
I tried to use isin:
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
This returns only 17 rows:
        content      id                                meta.name                  meta
41474   incid        a4c478e0fad1b9775c05e01d871b3aaf  3_Agricole_2021            10185
68690   oper risk    2e5139d82c242c89523110cc1110647a  10_Banking_Group_PLC_2021  5525
...     ...          ...                               ...                        ...
99259   risk report  a84eefb9a4772d13eb67f2d6ae5215cb  31_Building_Society_2021   4820
105662  risk manag   e8050be841fedb6dd10599e8b4892a9f  43_Bank_SA_2021            131
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
This also returns 17 rows.
If I search df_corpus directly for two of the terms that exist in df_dora:
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file).
So, given the above, my question is: how do I extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name']?
Thanks in advance for any help.
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
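The gap between 17 and 243 rows comes from exact versus substring matching: isin keeps only cells that equal a whole term, while str.contains and str.findall look inside the text. Below is a library-free sketch of the same counting logic in plain Python. The terms and corpus rows are toy stand-ins (assumptions, not the real data), and the regex escaping is an addition that the answer above does not perform.

```python
import re
from collections import Counter, defaultdict

# Toy stand-ins for df_dora.content and for (meta.name, content) pairs of df_corpus.
terms = ["cyber risk", "inform share", "intellig share"]
corpus = [
    ("14_Group_AG_2021", "market grow cyber attack"),
    ("17_La_Banque_2021", "intellig share statement parent compani"),
    ("17_La_Banque_2021", "cyber risk report inform share"),
    ("3_Agricole_2021", "cyber risk"),
]

# Exact matching (what isin does): only cells equal to a whole term qualify.
exact_rows = [text for _, text in corpus if text in set(terms)]

# Substring matching: one alternation pattern, with every term regex-escaped.
pattern = re.compile("|".join(re.escape(term) for term in terms))

# Count every occurrence of every term, grouped by document name.
counts = defaultdict(Counter)
for name, text in corpus:
    counts[name].update(pattern.findall(text))

for name, counter in counts.items():
    print(name, dict(counter))
```

Here exact matching keeps a single row, while the pattern search finds terms in three of the four rows.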

Splitting the Field Name (Table's Header) into Two Separate Lines

I have a dataset of the following structure:
Company.ID DDR (25632) PTL (89567)
2512 89 74
9875 78 96
7892 14 73
I would like to split the header into two separate lines. In other words, the second part of each header should (or could) become the first row of values. How can I transform the dataset into the desired form (see below)?
Company.ID DDR PTL
- (25632) (89567)
2512 89 74
9875 78 96
7892 14 73
To replicate the above example in Qlik, run the code below:
LOAD * Inline [
[Company.ID], [DDR (25632)], [PTL (89567)]
2512,89,74
9875,78,96
7892,14,73
];
Any help or tip would be highly appreciated!
You need to loop over the columns, rename them, and concatenate a row with the new values. Here is an example I've written:
table:
LOAD * Inline [
Company.ID, DDR (25632), PTL (89567)
2512,89,74
9875,78,96
7892,14,73
];
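// Split each field name at the first space: the first part becomes the new
// field name, the second part is kept as a value for the extra row.
For i=1 to NoOfFields('table')
    LET vField = FieldName($(i),'table');
    LET vFieldName_$(i) = SubField('$(vField)',' ',1);
    LET vFieldValue_$(i) = SubField('$(vField)',' ',2);
    // Rename only when the name actually changed (Company.ID has no second part)
    If '$(vField)' <> '$(vFieldName_$(i))' THEN
        Rename Field '$(vField)' TO '$(vFieldName_$(i))';
    EndIf
next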
Concatenate(table)
Load * Inline [
'$(vFieldName_1)', '$(vFieldName_2)', '$(vFieldName_3)'
'$(vFieldValue_1)', '$(vFieldValue_2)', '$(vFieldValue_3)'
];
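To make the logic of the loop easier to follow outside Qlik, here is the same header-splitting idea as a small Python sketch. It is an illustration only, not Qlik code: the function name split_headers is hypothetical, and using "-" as the placeholder for headers without a second part is an assumption taken from the desired output in the question.

```python
def split_headers(headers, rows, placeholder="-"):
    """Split each header at its first space and push the remainder into a new first row."""
    names = []
    extras = []
    for header in headers:
        name, _, extra = header.partition(" ")
        names.append(name)
        # Headers without a second part (e.g. Company.ID) get the placeholder.
        extras.append(extra if extra else placeholder)
    return names, [extras] + rows


headers = ["Company.ID", "DDR (25632)", "PTL (89567)"]
rows = [[2512, 89, 74], [9875, 78, 96], [7892, 14, 73]]

new_headers, new_rows = split_headers(headers, rows)
print(new_headers)
for row in new_rows:
    print(row)
```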

PDF How to get Font object with id not in cross reference table

Like in this discussion,
Tj command with angle brackets
I'm faced with a Tj operator whose operand is between angle brackets:
<00030037005200570044004F000300550048004600520051005100580056>Tj
The parent page gives the list of font object IDs like this:
Font /C2_0 39 0 R/T1_0 41 0 R/T1_1 43 0 R/T1_2 44 0 R
For the content stream containing the angle-bracket string, a Tf operator specifies that the font is C2_0.
So from the font list, I know the C2_0 font is object 39.
OK, but now, what is the fastest way to access object 39, which is embedded in a stream object with ID 16? Object 16 contains the list of embedded objects:
32 0 33 106 34 131 35 141 36 193 37 436 38 16720 39 16728 ....
So my question is: how do I get the value 16 when all I know is that font object 39 is not in the cross-reference table? Do I have to parse all stream objects and read their embedded-object lists to detect which one contains object 39?
Thanks for your attention.
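The brute-force scan described in the question can be sketched as follows. The integers at the start of an object stream are pairs of (object number, byte offset), so finding the container means checking each stream's pair list. This is a minimal sketch that assumes the pair lists have already been decompressed into text; the helper names parse_objstm_index and find_container are hypothetical, and the sample list is only the visible part of the one quoted above.

```python
def parse_objstm_index(index_text):
    """Turn 'objnum offset objnum offset ...' into a {objnum: offset} mapping."""
    numbers = [int(token) for token in index_text.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even count of integers")
    return dict(zip(numbers[0::2], numbers[1::2]))


def find_container(obj_num, streams):
    """Return the ID of the first object stream whose index lists obj_num, else None."""
    for stream_id, index_text in streams.items():
        if obj_num in parse_objstm_index(index_text):
            return stream_id
    return None


# Object 16 with the visible part of its pair list from the question.
streams = {16: "32 0 33 106 34 131 35 141 36 193 37 436 38 16720 39 16728"}

print(find_container(39, streams))
print(parse_objstm_index(streams[16])[39])
```

In files that use object streams, the cross-reference stream normally records compressed objects as type 2 entries that name the containing object stream, so a full scan like this is a fallback rather than the only route.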

Microsoft Internet Controls

I am using:
IE.ExecWB 17, 0 '// SelectAll
IE.ExecWB 12, 2 '// Copy selection
in an Excel VBA program successfully, but I am having trouble finding a reference for all ExecWB methods. Can anyone point me in the right direction?
Here is something from my database. I doubt you will find this on the web anymore. I will be surprised if you do...
ExecWB syntax is as follows:
object.ExecWB nCmdID, nCmdExecOpt, [pvaIn], [pvaOut]
The ExecWB method requires an OLE Command ID to be passed in to identify the command to execute. This value nCmdID is of type Long. The nCmdExecOpt parameter represents the value for the command execution option. Together, these values instruct the control as to what supported command to execute and what degree of user prompting should occur.
The last two parameters, pvaIn and pvaOut, are optional and are usually set to either NULL or an empty string.
Here is a complete list for the 1st parameter
OLECMDID_OPEN 1 Open
OLECMDID_NEW 2 Create a new document
OLECMDID_SAVE 3 Save
OLECMDID_SAVEAS 4 Save as
OLECMDID_SAVECOPYAS 5  
OLECMDID_PRINT 6 Print
OLECMDID_PRINTPREVIEW 7 Print preview
OLECMDID_PAGESETUP 8 Page setup
OLECMDID_SPELL 9 Spell check
OLECMDID_PROPERTIES 10 Properties
OLECMDID_CUT 11 Cut
OLECMDID_COPY 12 Copy
OLECMDID_PASTE 13 Paste
OLECMDID_PASTESPECIAL 14 Paste special
OLECMDID_UNDO 15 Undo
OLECMDID_REDO 16 Redo
OLECMDID_SELECTALL 17 Select all
OLECMDID_CLEARSELECTION 18 Clear selection
OLECMDID_ZOOM 19
OLECMDID_GETZOOMRANGE 20
OLECMDID_UPDATECOMMANDS 21 Update commands
OLECMDID_REFRESH 22 Refresh
OLECMDID_STOP 23 Stop
OLECMDID_HIDETOOLBARS 24 Hide toolbar
OLECMDID_SETPROGRESSMAX 25 Progress bar maximum
OLECMDID_SETPROGRESSPOS 26 Progress bar position
OLECMDID_SETPROGRESSTEXT 27 Progress bar text
OLECMDID_SETTITLE 28 Set the title
OLECMDID_SETDOWNLOADSTATE 29 Set download status
OLECMDID_STOPDOWNLOAD 30 Stop downloading
OLECMDID_ONTOOLBARACTIVATED 31
OLECMDID_FIND 32 Search
OLECMDID_DELETE 33 Delete
OLECMDID_HTTPEQUIV 34
OLECMDID_HTTPEQUIV_DONE 35
OLECMDID_ENABLE_INTERACTION 36 Enable interaction
OLECMDID_ONUNLOAD 37 On unload
OLECMDID_PROPERTYBAG2 38
OLECMDID_PREREFRESH 39
OLECMDID_SHOWSCRIPTERROR 40
OLECMDID_SHOWMESSAGE 41 Display a message
OLECMDID_SHOWFIND 42 Show Find dialog
OLECMDID_SHOWPAGESETUP 43 Show Page Setup dialog
OLECMDID_SHOWPRINT 44 Show Print dialog
OLECMDID_CLOSE 45 Close
OLECMDID_ALLOWUILESSSAVEAS 46
OLECMDID_DONTDOWNLOADCSS 47
OLECMDID_UPDATEPAGESTATUS 48
OLECMDID_PRINT2 49 Print 2
OLECMDID_PRINTPREVIEW2 50 Print preview
OLECMDID_SETPRINTTEMPLATE 51 Set the print template
OLECMDID_GETPRINTTEMPLATE 52 Get a print template
OLECMDID_PAGEACTIONBLOCKED 55
OLECMDID_PAGEACTIONUIQUERY 56
OLECMDID_FOCUSVIEWCONTROLS 57
OLECMDID_FOCUSVIEWCONTROLSQUERY 58
OLECMDID_SHOWPAGEACTIONMENU 59
OLECMDID_ADDTRAVELENTRY 60
OLECMDID_UPDATETRAVELENTRY 61
OLECMDID_UPDATEBACKFORWARDSTATE 62
OLECMDID_OPTICAL_ZOOM 63
OLECMDID_OPTICAL_GETZOOMRANGE 64
OLECMDID_WINDOWSTATECHANGED 65 Window state changed
Here is a complete list for the 2nd parameter
OLECMDEXECOPT_DODEFAULT 0 Default parameters
OLECMDEXECOPT_PROMPTUSER 1 Prompt the user, namely the pop-up dialog box
OLECMDEXECOPT_DONTPROMPTUSER 2 User is not prompted
OLECMDEXECOPT_SHOWHELP 3 Display help
Examples
WebBrowser.ExecWB 6, 1 '<~~ Print
WebBrowser.ExecWB 7, 1 '<~~ Print preview
WebBrowser.ExecWB 8, 1 '<~~ Page setup

How to find the "lexical file" in Wordnet?

If you look at the original WordNet search and select "Display options: Show Lexical File Info", you'll see an extremely useful classification of words called the lexical file. E.g., for "filling" we have:
<noun.substance>S: (n) filling, fill (any material that fills a space or container)
<noun.process>S: (n) filling (flow into something (as a container))
<noun.food>S: (n) filling (a food mixture used to fill pastry or sandwiches etc.)
<noun.artifact>S: (n) woof, weft, filling, pick (the yarn woven across the warp yarn in weaving)
<noun.artifact>S: (n) filling ((dentistry) a dental appliance consisting of ...)
<noun.act>S: (n) filling (the act of filling something)
The first thing in angle brackets is the "lexical file". Unfortunately, I have not been able to find a SPARQL endpoint that provides this info.
The latest RDF translation of WordNet 3.0 points to two things:
Talis SPARQL endpoint. Use, e.g., this query to check that there's no such info:
DESCRIBE <http://purl.org/vocabularies/princeton/wn30/synset-chair-noun-1>
W3C's mapping description. Appendix D "Conversion details" describes something useful: wn:classifiedByTopic.
But it's not the same as the lexical file, and it is quite incomplete. E.g., "chair" has nothing, while one of the senses of "completion" is in the topic "American Football":
DESCRIBE <http://purl.org/vocabularies/princeton/wn30/synset-completion-noun-1> ->
<j.1:classifiedByTopic rdf:resource="http://purl.org/vocabularies/princeton/wn30/synset-American_football-noun-1"/>
The question: is there a public Wordnet query API, or a database, that provides the lexical file information?
Using the Python NLTK interface:
from nltk.corpus import wordnet as wn
for synset in wn.synsets('can'):
    print(synset.lexname())  # lexname() is a method in NLTK 3
I don't think you can find it in the RDF/OWL Representation of WordNet. It's in the WordNet distribution though: dict/lexnames. Here is the content of the file as of WordNet 3.0:
00 adj.all 3
01 adj.pert 3
02 adv.all 4
03 noun.Tops 1
04 noun.act 1
05 noun.animal 1
06 noun.artifact 1
07 noun.attribute 1
08 noun.body 1
09 noun.cognition 1
10 noun.communication 1
11 noun.event 1
12 noun.feeling 1
13 noun.food 1
14 noun.group 1
15 noun.location 1
16 noun.motive 1
17 noun.object 1
18 noun.person 1
19 noun.phenomenon 1
20 noun.plant 1
21 noun.possession 1
22 noun.process 1
23 noun.quantity 1
24 noun.relation 1
25 noun.shape 1
26 noun.state 1
27 noun.substance 1
28 noun.time 1
29 verb.body 2
30 verb.change 2
31 verb.cognition 2
32 verb.communication 2
33 verb.competition 2
34 verb.consumption 2
35 verb.contact 2
36 verb.creation 2
37 verb.emotion 2
38 verb.motion 2
39 verb.perception 2
40 verb.possession 2
41 verb.social 2
42 verb.stative 2
43 verb.weather 2
44 adj.ppl 3
For each entry of dict/data.*, the second number is the lexical file info. For example, this filling entry contains the number 13, which is noun.food.
07883031 13 n 01 filling 0 002 # 07882497 n 0000 ~ 07883156 n 0000 | a food mixture used to fill pastry or sandwiches etc.
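The lookup described above can be sketched in Python without any WordNet library. This is a minimal sketch: the excerpt holds only a few lines of the lexnames table listed earlier, the data line is the "filling" entry quoted above, and the function names parse_lexnames and lexname_of are hypothetical.

```python
# A few lines copied from the lexnames table above (number, name, part-of-speech code).
LEXNAMES_EXCERPT = """\
04 noun.act 1
06 noun.artifact 1
13 noun.food 1
22 noun.process 1
27 noun.substance 1
"""

DATA_LINE = (
    "07883031 13 n 01 filling 0 002 # 07882497 n 0000 ~ 07883156 n 0000 "
    "| a food mixture used to fill pastry or sandwiches etc."
)


def parse_lexnames(text):
    """Map lexical file numbers to names, e.g. {13: 'noun.food'}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            table[int(fields[0])] = fields[1]
    return table


def lexname_of(data_line, table):
    """The second whitespace-separated field of a data.* entry is the lexical file number."""
    fields = data_line.split()
    return table[int(fields[1])]


print(lexname_of(DATA_LINE, parse_lexnames(LEXNAMES_EXCERPT)))
```

With a local WordNet distribution, the same two functions could be fed the contents of dict/lexnames and the lines of dict/data.noun instead of the inline strings.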
It can be done through MIT JWI (MIT Java Wordnet Interface), a Java API for querying WordNet. There's a topic in this link showing how to implement a Java class to access lexicographic
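This is what worked for me:
Synset[] synsets = database.getSynsets(wordStr);
for (int i = 0; i < synsets.length; i++) {
    // The cast exposes the lexical file number of each synset
    ReferenceSynset referenceSynset = (ReferenceSynset) synsets[i];
    int lexicalCode = referenceSynset.getLexicalFileNumber();
}
Then use the table above to look up the lexname for that number, e.g. noun.time.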
If you're on Windows, chances are it is under your AppData directory. To get there, open File Explorer, click the address bar at the top, and type %appdata%
This opens the Roaming folder; in it, find the nltk_data directory. In there, you will have your corpora directory. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora
and the lexnames file will be present under
C:\Users\yourname\AppData\Roaming\nltk_data\corpora\wordnet.