Coordinates extracted from PDF are not exact - pdf

I'm working on rendering a georeferenced PDF on a map. I was able to retrieve the geolocation information from the PDF, but the coordinates I get are not correct: they are a few meters away from where they should be.
Opening the same PDF in Avenza Maps, it indicates this list of coordinates, and these are correct:
[-26.413082, -51.561534, -26.435838, -51.561643, -26.435909, -51.543773,-26.413152, -51.543667]
With my approach (reading the PDF as a String and applying a RegEx) I get these values:
[-26.43302 -51.56133 -26.41418 -51.56124 -26.41424 -51.54409 -26.43309 -51.54418]
[-26.45579 -51.59842 -26.41777 -51.59822 -26.41811 -51.51036 -26.45613 -51.51053]
Unfortunately, neither of the two matches the correct location (as shown in Avenza).
That said, I opened the PDF in Notepad and found other values (related to the conversion and the projection), and I believe there may be a way to use them to convert the coordinates I extracted into the correct ones.
Here is that information:
<?xpacket end="w"?>
endstream
endobj
294 0 obj
3495
endobj
295 0 obj
/DeviceRGB
endobj
296 0 obj
<</Length 297 0 R>>stream
/GS_init gs
/Group_6 Do
endstream
endobj
297 0 obj
24
endobj
298 0 obj
<</ExtGState 2 0 R/ColorSpace << /CS_P 295 0 R >>/XObject << /Group_6 6 0 R >>>>endobj
299 0 obj
<</Type /Group/S /Transparency/CS 295 0 R/I false/K false>>endobj
300 0 obj
<</Type /Page/Parent 301 0 R/Contents 296 0 R/Resources 298 0 R/MediaBox [0 0 841.88808 1190.5488]/ArtBox [0 0 841.88808 1190.5488]/UserUnit 1/Group 299 0 R/VP[<</Type /Viewport/BBox [14.1732 147.400915455 822.0456 1133.350548016]/Name (þÿ T S B I I)/Measure<</Type /Measure/Subtype /GEO/Bounds [0 0 0 1 1 1 1 0 0 0]/GPTS [ -26.43302 -51.56133 -26.41418 -51.56124 -26.41424 -51.54409 -26.43309 -51.54418]/LPTS [ 0 0 0 1 1 1 1 0]/GCS<</Type /PROJCS/WKT (PROJCS["SIRGAS_2000_UTM_Zone_22S",GEOGCS["GCS_SIRGAS_2000",DATUM["D_SIRGAS_2000",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Transverse_Mercator"],PARAMETER["False_Easting",500000.0],PARAMETER["False_Northing",10000000.0],PARAMETER["Central_Meridian",-51.0],PARAMETER["Scale_Factor",0.9996],PARAMETER["Latitude_Of_Origin",0.0],UNIT["Meter",1.0]])>>>>>><</Type /Viewport/BBox [14.1732 14.1732 239.961243463 122.688692878]/Name (þÿ R e f e r e n c i a _ M a p a)/Measure<</Type /Measure/Subtype /GEO/Bounds [0 0 0 1 1 1 1 0 0 0]/GPTS [ -26.45579 -51.59842 -26.41777 -51.59822 -26.41811 -51.51036 -26.45613 -51.51053]/LPTS [ 0 0 0 1 1 1 1 0]/GCS<</Type /PROJCS/WKT (PROJCS["SIRGAS_2000_UTM_Zone_22S",GEOGCS["GCS_SIRGAS_2000",DATUM["D_SIRGAS_2000",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Transverse_Mercator"],PARAMETER["False_Easting",500000.0],PARAMETER["False_Northing",10000000.0],PARAMETER["Central_Meridian",-51.0],PARAMETER["Scale_Factor",0.9996],PARAMETER["Latitude_Of_Origin",0.0],UNIT["Meter",1.0]])>>>>>>]>>endobj
301 0 obj
<</Type /Pages/Kids [ 300 0 R ]/Count 1>>endobj
302 0 obj
<<>>endobj
303 0 obj
<</Type /Catalog/Pages 301 0 R/PageMode /UseNone/PageLayout /SinglePage/ViewerPreferences <</PrintScaling /None /FitWindow true /DisplayDocTitle true>>/OpenAction [300 0 R /Fit]/OCProperties<</OCGs [ 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 35 0 R 36 0 R 43 0 R 44 0 R 47 0 R 50 0 R 53 0 R 56 0 R 59 0 R 62 0 R 63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 68 0 R 69 0 R 76 0 R 77 0 R 80 0 R 83 0 R 90 0 R 93 0 R 96 0 R 99 0 R 102 0 R 105 0 R 108 0 R 111 0 R 114 0 R 117 0 R 120 0 R 123 0 R 126 0 R 129 0 R 132 0 R 135 0 R 138 0 R 141 0 R 148 0 R 149 0 R 152 0 R 155 0 R 158 0 R 161 0 R 176 0 R ]/D<</Name (Layers Tree)/Order [ 176 0 R 161 0 R 158 0 R 148 0 R [ 155 0 R 152 0 R 149 0 R ] 141 0 R 138 0 R 135 0 R 132 0 R 129 0 R 126 0 R 123 0 R 120 0 R 117 0 R 114 0 R 111 0 R 108 0 R 105 0 R 102 0 R 99 0 R 96 0 R 93 0 R 90 0 R 83 0 R 80 0 R 62 0 R [ 76 0 R [ 77 0 R ] 63 0 R [ 69 0 R 68 0 R 67 0 R 64 0 R [ 66 0 R 65 0 R ] ] ] 10 0 R [ 59 0 R 43 0 R [ 56 0 R 53 0 R 50 0 R 47 0 R 44 0 R ] 11 0 R [ 36 0 R 35 0 R 22 0 R 21 0 R 12 0 R [ 20 0 R 19 0 R 18 0 R 17 0 R 16 0 R 15 0 R 14 0 R 13 0 R ] ] ] ]/ListMode /VisiblePages>>>>/Metadata 293 0 R>>endobj
304 0 obj
<</Type/XRef/Size 305/W[1 4 2]/Filter/FlateDecode/Info 292 0 R/Root 303 0 R/ID [<c9167b70223726438d277b1b4409c053> <c9167b70223726438d277b1b4409c053>]/Length 923>>stream
I need someone to point me to a way to get the correct coordinates. I hope this information helps.

The PDF content in your question includes two Viewport dictionaries.
Each dictionary maps a rectangle on the page ("BBox")
onto the geographic points in GPTS, expressed in the coordinate system described by the WKT.
This is covered in the PDF 2.0 specification, ISO 32000-2, sections 12.9 and 12.10.
Unfortunately, this spec is not freely available, and it's not cheap.
Here are some definitions from the spec:
BBox:
A rectangle in default user space coordinates specifying the location of the viewport on the page.
The two coordinate pairs of the rectangle shall be specified in normalised form; that is, lower-left followed by upper-right, relative to the measuring coordinate system. This ordering shall determine the orientation of the measuring coordinate system (that is, the direction of the positive x and y axes) in this viewport, which may have a different rotation from the page.
GPTS:
(Required; PDF 2.0) An array of numbers that shall be taken pairwise, defining points in geographic space as degrees of latitude and longitude, respectively when defining a geographic coordinate system. These values shall be based on the geographic coordinate system described in the GCS dictionary. When defining a projected coordinate system, this array contains values in a planar projected coordinate space as eastings and northings. For Geospatial3D, when Geospatial feature information is present (requirement type Geospatial3D) in a 3D annotation, the GPTS array is required to hold 3D point coordinates as triples rather than pairwise where the third value of each triple is an elevation value.
NOTE 2 Any projected coordinate system includes an underlying geographic coordinate system.
WKT:
A string of Well Known Text describing the geographic coordinate system.
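The assumption is that, if you're interested in geospatial coordinates,
you know what a WKT is and what the projection means.
Note that with LPTS [0 0 0 1 1 1 1 0] the four GPTS points are the corners of the viewport's BBox rectangle, not the corners of the whole page, and that in this file they are latitude/longitude pairs in degrees.
This may be enough information for you to map the geo coordinates of the
separate viewports to their locations on the page.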
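As a rough sketch of that mapping, the Python below extrapolates the first viewport's GPTS from its BBox out to the page's MediaBox. It rests on two assumptions that the question does not confirm: that Avenza reports the corners of the full page, and that a plain bilinear interpolation in degrees is accurate enough over such a small area (a stricter version would convert to UTM using the WKT, interpolate there, and convert back). The function names are made up for this illustration. With the numbers from the question, the four page corners it produces agree with the Avenza list to within about 0.00001 degrees.

```python
def bilinear(corners, u, v):
    """Interpolate (or extrapolate) between four (lat, lon) corners.

    corners follow the order of LPTS [0 0 0 1 1 1 1 0]:
    lower-left, upper-left, upper-right, lower-right.
    u and v are unit-square coordinates; values outside 0..1 extrapolate.
    """
    ll, ul, ur, lr = corners
    weights = ((1 - u) * (1 - v), (1 - u) * v, u * v, u * (1 - v))
    lat = sum(w * c[0] for w, c in zip(weights, (ll, ul, ur, lr)))
    lon = sum(w * c[1] for w, c in zip(weights, (ll, ul, ur, lr)))
    return lat, lon


def page_point_to_latlon(x, y, bbox, gpts):
    """Map a page point (default user space) to (lat, lon) via one viewport."""
    x0, y0, x1, y1 = bbox
    u = (x - x0) / (x1 - x0)
    v = (y - y0) / (y1 - y0)
    corners = list(zip(gpts[0::2], gpts[1::2]))  # GPTS is taken pairwise
    return bilinear(corners, u, v)


# Values copied from the first viewport (TSBII) and the page's MediaBox.
MEDIA_BOX = (0.0, 0.0, 841.88808, 1190.5488)
BBOX = (14.1732, 147.400915455, 822.0456, 1133.350548016)
GPTS = [-26.43302, -51.56133, -26.41418, -51.56124,
        -26.41424, -51.54409, -26.43309, -51.54418]

# Assumption: the reference coordinates describe the whole page, so the
# viewport corners are extrapolated to the MediaBox corners.
page_corners = {
    "upper_left": page_point_to_latlon(MEDIA_BOX[0], MEDIA_BOX[3], BBOX, GPTS),
    "lower_left": page_point_to_latlon(MEDIA_BOX[0], MEDIA_BOX[1], BBOX, GPTS),
    "lower_right": page_point_to_latlon(MEDIA_BOX[2], MEDIA_BOX[1], BBOX, GPTS),
    "upper_right": page_point_to_latlon(MEDIA_BOX[2], MEDIA_BOX[3], BBOX, GPTS),
}

for name, (lat, lon) in page_corners.items():
    print(f"{name}: {lat:.6f}, {lon:.6f}")
```

If the map should show only the map frame rather than the whole sheet, the GPTS values can be used as they are, placed at the BBox corners.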
Here are the PDF Viewports in more readable form:
/VP [
  <<
    /Type /Viewport
    /BBox [14.1732 147.400915455 822.0456 1133.350548016]
    /Name (TSBII)
    /Measure <<
      /Type /Measure
      /Subtype /GEO
      /Bounds [0 0 0 1 1 1 1 0 0 0]
      /GPTS [ -26.43302 -51.56133 -26.41418 -51.56124
              -26.41424 -51.54409 -26.43309 -51.54418]
      /LPTS [ 0 0 0 1 1 1 1 0]
      /GCS <<
        /Type /PROJCS
        /WKT (
          PROJCS["SIRGAS_2000_UTM_Zone_22S",
            GEOGCS["GCS_SIRGAS_2000",
              DATUM["D_SIRGAS_2000",SPHEROID["GRS_1980",6378137.0,298.257222101]],
              PRIMEM["Greenwich",0.0],
              UNIT["Degree",0.0174532925199433]
            ],
            PROJECTION["Transverse_Mercator"],
            PARAMETER["False_Easting",500000.0],
            PARAMETER["False_Northing",10000000.0],
            PARAMETER["Central_Meridian",-51.0],
            PARAMETER["Scale_Factor",0.9996],
            PARAMETER["Latitude_Of_Origin",0.0],
            UNIT["Meter",1.0]
          ]
        )
      >>
    >>
  >>
  <<
    /Type /Viewport
    /BBox [14.1732 14.1732 239.961243463 122.688692878]
    /Name (Referencia_Mapa)
    /Measure <<
      /Type /Measure
      /Subtype /GEO
      /Bounds [0 0 0 1 1 1 1 0 0 0]
      /GPTS [ -26.45579 -51.59842 -26.41777 -51.59822
              -26.41811 -51.51036 -26.45613 -51.51053]
      /LPTS [ 0 0 0 1 1 1 1 0]
      /GCS <<
        /Type /PROJCS
        /WKT (
          PROJCS["SIRGAS_2000_UTM_Zone_22S",
            GEOGCS["GCS_SIRGAS_2000",
              DATUM["D_SIRGAS_2000",SPHEROID["GRS_1980",6378137.0,298.257222101]],
              PRIMEM["Greenwich",0.0],
              UNIT["Degree",0.0174532925199433]
            ],
            PROJECTION["Transverse_Mercator"],
            PARAMETER["False_Easting",500000.0],
            PARAMETER["False_Northing",10000000.0],
            PARAMETER["Central_Meridian",-51.0],
            PARAMETER["Scale_Factor",0.9996],
            PARAMETER["Latitude_Of_Origin",0.0],
            UNIT["Meter",1.0]
          ]
        )
      >>
    >>
  >>
]
>>
Note that a PDF file is a structured binary document and cannot reliably be parsed as a string with regular expressions. These specific elements could be compressed (for example inside object streams), or might occur multiple times for different pages. You'll need a toolkit that can access pages, resources and dictionaries in order to locate the viewports.
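For illustration only, here is a minimal Python sketch of walking the page dictionaries with a toolkit instead of a regular expression. It assumes the pypdf package is available (the question does not say which language or library is in use), and the helper names are hypothetical. The extraction function only relies on dictionary-style access, so it can be tried on plain dicts as well.

```python
def _resolve(obj):
    """Follow an indirect reference when the object supports it (pypdf objects do)."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def geo_viewports(page):
    """Yield (bbox, gpts, lpts) for each geospatial viewport in a page dictionary.

    lpts is None when the Measure dictionary has no /LPTS entry.
    """
    for viewport in _resolve(page.get("/VP", [])):
        viewport = _resolve(viewport)
        measure = _resolve(viewport.get("/Measure", {}))
        if measure.get("/Subtype") != "/GEO":
            continue  # not a geospatial measure dictionary
        bbox = [float(n) for n in _resolve(viewport["/BBox"])]
        gpts = [float(n) for n in _resolve(measure["/GPTS"])]
        raw_lpts = measure.get("/LPTS")
        lpts = None if raw_lpts is None else [float(n) for n in _resolve(raw_lpts)]
        yield bbox, gpts, lpts


def read_geo_viewports(path):
    """Return one list of viewports per page of the PDF at `path`."""
    # Assumption: pypdf is installed; it is imported here so the helper above
    # stays usable without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [list(geo_viewports(page)) for page in reader.pages]
```

Because the library decodes compressed streams and resolves object references, this keeps working when the viewport dictionaries are not visible as plain text in the file.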

How to add multiple signature fields that share the same /AP and /V with iText 7 (Java)

I'm new to iText 7, and I have already spent two days on this.
I generated a PDF with 3 pages, like this:
...
4 0 obj
<</Contents 5 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
7 0 obj
<</Contents 8 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
9 0 obj
<</Contents 10 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
...
Then I signed it, and the PDF gained some objects like these:
...
1 0 obj
<</AcroForm 11 0 R/Pages 2 0 R/Type/Catalog>>
endobj
9 0 obj
<</Annots[13 0 R]/Contents 10 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
11 0 obj
<</Fields[13 0 R]/SigFlags 3>>
endobj
13 0 obj
<</AP<</N 18 0 R>>/F 132/FT/Sig/P 9 0 R/Rect[280.5 810 314.5 842]/Subtype/Widget/T(sig)/V 12 0 R>>
endobj
...
I noticed that the third page changed: it now has an /Annots entry (object 13).
Now I want to modify the iText code to add a widget to each page, all using the same /AP and /V, so that the same signature graphic appears on every page while the document is signed only once, like this:
...
4 0 obj
<</Annots[13 0 R]/Contents 5 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
7 0 obj
<</Annots[14 0 R]/Contents 8 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
9 0 obj
<</Annots[15 0 R]/Contents 10 0 R/MediaBox[0 0 595 842]/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>>>/TrimBox[0 0 595 842]/Type/Page>>
endobj
13 0 obj
<</AP<</N 18 0 R>>/F 132/FT/Sig/P 4 0 R/Rect[280.5 810 314.5 842]/Subtype/Widget/T(sig)/V 12 0 R>>
endobj
14 0 obj
<</AP<</N 18 0 R>>/F 132/FT/Sig/P 7 0 R/Rect[200 400 300 420]/Subtype/Widget/T(sig)/V 12 0 R>>
endobj
15 0 obj
<</AP<</N 18 0 R>>/F 132/FT/Sig/P 9 0 R/Rect[100 200 150 480]/Subtype/Widget/T(sig)/V 12 0 R>>
endobj
11 0 obj
<</Fields[13 0 R 14 0 R 15 0 R]/SigFlags 3>>
endobj
...
I'm not sure the example above is exactly right. What I want is a single signature whose graphic appears on every page, with a different /Rect on each page.
How can I do that using iText 7 in Java?
Can such a signature be added to a PDF multiple times and still pass signature verification?
After doing so, can the signature's AcroForm entry and field be deleted, and the widgets cleared?
Can I crop the signature graphic to show different parts of it on different pages?
Here's some code:
PdfDocument pdfDocument = new PdfDocument(new PdfReader(FILE));
final int pageCount = pdfDocument.getNumberOfPages();
pdfDocument.close();
PdfReader pdfReader = new PdfReader(FILE);
PdfSigner pdfSigner = new PdfSigner(pdfReader, new FileOutputStream(SIGN), new StampingProperties().useAppendMode());
File imageFile = new File(IMAGE);
java.awt.Image image = ImageIO.read(imageFile);
ImageData imageData = ImageDataFactory.create(image, null);
Rectangle rect = new Rectangle(
(pdfDocument.getDefaultPageSize().getRight() / 2) - (imageData.getWidth() / 2),
pdfDocument.getDefaultPageSize().getTop() - imageData.getHeight(),
imageData.getWidth(),
imageData.getHeight()
);
PdfSignatureAppearance appearance = pdfSigner.getSignatureAppearance();
appearance.setPageNumber(pageCount);
appearance.setSignatureGraphic(imageData);
appearance.setRenderingMode(PdfSignatureAppearance.RenderingMode.GRAPHIC);
appearance.setPageRect(rect);
appearance.setReason("reason");
appearance.setLocation("location");
appearance.setReuseAppearance(false);
pdfSigner.setFieldName("sig");
KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
ks.load(new FileInputStream(KEYSTORE), PASSWORD);
String alias = ks.aliases().nextElement();
PrivateKey pk = (PrivateKey) ks.getKey(alias, PASSWORD);
Certificate[] chain = ks.getCertificateChain(alias);
BouncyCastleProvider provider = new BouncyCastleProvider();
Security.addProvider(provider);
IExternalSignature pks = new PrivateKeySignature(pk, DigestAlgorithms.SHA256, provider.getName());
IExternalDigest digest = new BouncyCastleDigest();
pdfSigner.signDetached(digest, pks, chain, null, null, null, 0, PdfSigner.CryptoStandard.CMS);
Update:
Inside the com.itextpdf.signatures.PdfSigner.preClose() method I tried the following code. It adds the desired objects, but only the widget annotation that was added first is shown in Adobe Reader. What should I do?
if (fieldExist) ... else {
    PdfDictionary ap = new PdfDictionary();
    for (int i = 1; i <= document.getNumberOfPages(); i++) {
        PdfWidgetAnnotation widget = new PdfWidgetAnnotation(appearance.getPageRect());
        widget.setFlags(PdfAnnotation.PRINT | PdfAnnotation.LOCKED);
        PdfSignatureFormField sigField = PdfFormField.createSignature(document);
        sigField.setFieldName(name);
        sigField.put(PdfName.V, cryptoDictionary.getPdfObject());
        sigField.addKid(widget);
        if (this.fieldLock != null) {
            this.fieldLock.getPdfObject().makeIndirect(document);
            sigField.put(PdfName.Lock, this.fieldLock.getPdfObject());
            fieldLock = this.fieldLock;
        }
        widget.setPage(document.getPage(i));
        widget.put(PdfName.AP, ap);
        if (1 == i)
            ap.put(PdfName.N, appearance.getAppearance().getPdfObject());
        acroForm.addField(sigField, document.getPage(i));
    }
    ...
}

Data formatting for grouped boxplot using seaborn or matplotlib

I have 3 dataframes whose column names and number of rows are exactly the same. I want to plot all the columns from all three dataframes as a grouped boxplot in one image using seaborn or matplotlib, but I am having difficulty combining and formatting the data so that I can plot it as a grouped boxplot.
df=
A B C D E F G H I J
0 0.031810 0.000556 0.007798 0.000741 0 0 0 0.000180 0.002105 0
1 0.028687 0.000571 0.009356 0.000000 0 0 0 0.000183 0.001250 0
2 0.029635 0.001111 0.009121 0.000000 0 0 0 0.000194 0.001111 0
3 0.030579 0.002424 0.007672 0.000000 0 0 0 0.000194 0.001176 0
4 0.028544 0.002667 0.007973 0.000000 0 0 0 0.000179 0.001333 0
5 0.027286 0.003226 0.006881 0.000000 0 0 0 0.000196 0.001111 0
6 0.031597 0.003030 0.006695 0.000000 0 0 0 0.000180 0.002353 0
7 0.034226 0.003030 0.010804 0.000667 0 0 0 0.000179 0.003333 0
8 0.035105 0.002941 0.010176 0.000645 0 0 0 0.000364 0.003529 0
9 0.035171 0.003125 0.012666 0.001250 0 0 0 0.000612 0.005556 0
df1 =
A B C D E F G H I J
0 0.034898 0.003750 0.014091 0.001290 0 0 0 0.001488 0.005333 0
1 0.042847 0.003243 0.011559 0.000625 0 0 0 0.002272 0.010769 0
2 0.046087 0.005455 0.013101 0.000588 0 0 0 0.002147 0.008750 0
3 0.042719 0.003684 0.010496 0.001333 0 0 0 0.002627 0.004444 0
4 0.042410 0.004211 0.011580 0.000645 0 0 0 0.003007 0.006250 0
5 0.044515 0.003500 0.013990 0.000000 0 0 0 0.003954 0.007000 0
6 0.046062 0.004865 0.013278 0.000714 0 0 0 0.004035 0.011111 0
7 0.043666 0.004444 0.013460 0.000625 0 0 0 0.003826 0.010000 0
8 0.039888 0.006857 0.014351 0.000690 0 0 0 0.004314 0.011474 0
9 0.048203 0.006667 0.016338 0.000741 0 0 0 0.005294 0.013603 0
df3 =
A B C D E F G H I J
0 0.048576 0.006471 0.020130 0.002667 0 0 0 0.005536 0.015179 0
1 0.056270 0.007179 0.021519 0.001429 0 0 0 0.005524 0.012333 0
2 0.054020 0.008235 0.024464 0.001538 0 0 0 0.005926 0.010445 0
3 0.047297 0.008649 0.026650 0.002198 0 0 0 0.005870 0.010000 0
4 0.049347 0.009412 0.022808 0.002838 0 0 0 0.006541 0.012222 0
5 0.052026 0.010000 0.019935 0.002714 0 0 0 0.005062 0.012222 0
6 0.055124 0.010625 0.022950 0.003499 0 0 0 0.005954 0.008964 0
7 0.044411 0.010909 0.019129 0.005709 0 0 0 0.005209 0.007222 0
8 0.047697 0.010270 0.017234 0.008800 0 0 0 0.004808 0.008355 0
9 0.048562 0.010857 0.020219 0.008504 0 0 0 0.005665 0.004862 0
I can do single boxplots by using the following:
g = sns.boxplot(data=df, color = 'white', fliersize=1, linewidth=2, meanline = True, showmeans=True)
Getting all three into one figure is harder. I understand that I need to rearrange the data and use hue to plot everything from a combined dataframe, but how exactly should I format the data? Any help?
You can do it all in one sns.boxplot call by concatenating the dataframes and passing hue:
import pandas as pd
import seaborn as sns
tmp = (
    pd.concat(
        # assign adds a column `data` holding the index i of the source dataframe;
        # enumerate yields the pairs (0, df), (1, df1), (2, df3)
        [d.assign(data=i) for i, d in enumerate([df, df1, df3])]
    )
    # melt keeps `data` as an identifier column and stacks the other columns
    # into long format: `variable` (original column name) and `value`
    .melt(id_vars='data')
)
sns.boxplot(data=tmp, x='variable', hue='data', y='value')

What does each number represent in MNIST?

I have successfully downloaded the MNIST data as files with the .npy extension. When I print a few columns of the first image, I get the following result. What does each number represent here?
a= np.load("training_set.npy")
print(a[:1,100:200])
print(a.shape)
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 18
18 18 126 136 175 26 166 255 247 127 0 0 0 0 0 0 0 0
0 0 0 0 30 36 94 154 170 253 253 253 253 253 225 172 253 242
195 64 0 0 0 0 0 0 0 0]]
(60000, 784)
These are the intensity values (0-255) for each of the 784 pixels (28x28) of an MNIST image; the total number of training images is 60,000 (you'll find 10,000 more images in the test set).
(60000, 784) means 60,000 samples (images), each one consisting of 784 features (pixel values).
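To make the pixel layout concrete, the small sketch below maps the flat indices from the printed slice back to (row, column) positions in the 28x28 image. It assumes the usual row-major flattening (the question does not say how the .npy file was produced), and the list of values is copied by hand from the printout above. With NumPy the same view would come from a[0].reshape(28, 28).

```python
WIDTH = 28  # an MNIST image is 28 x 28 pixels; row-major storage is assumed

# Values printed by a[:1, 100:200] in the question, i.e. flat indices 100..199.
printed = (
    [0] * 52
    + [3, 18, 18, 18, 126, 136, 175, 26, 166, 255, 247, 127]
    + [0] * 12
    + [30, 36, 94, 154, 170, 253, 253, 253, 253, 253, 225, 172, 253, 242, 195, 64]
    + [0] * 8
)
START = 100  # first flat index covered by the slice

# divmod(flat_index, WIDTH) gives (row, column) for row-major storage.
nonzero = [
    (divmod(START + offset, WIDTH), value)
    for offset, value in enumerate(printed)
    if value
]

for (row, col), value in nonzero:
    print(f"row {row:2d}, col {col:2d}: intensity {value}")
```

Under that assumption the nonzero values sit in two adjacent image rows, which is what a stroke of a handwritten digit looks like in the raw data.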

Open PDF, save DOCX bugs out after a few dozen documents and outputs garbled/corrupted files

I have a few thousand PDF files that I need to convert to DOCX. I wrote the following macro:
Sub convertPDFtoDOCX()
'
' convertPDFtoDOCX Macro
'
'
    Dim docDirectory As String
    Dim pdfDirectory As String
    Dim docPath As String
    Dim doc As Document
    docDirectory = "C:\Users\<USER>\DOCX\"
    pdfDirectory = "C:\Users\<USER>\PDF\"
    pdfFile = Dir(pdfDirectory & "*.*")
    Do While pdfFile <> ""
        docPath = docDirectory & pdfFile & ".docx"
        Set doc = Documents.Open(FileName:=pdfDirectory & pdfFile)
        ActiveDocument.SaveAs2 FileName:=docPath, FileFormat:=wdFormatXMLDocument
        Documents.Close
        pdfFile = Dir
    Loop
End Sub
It works fine for the first few dozen documents, but then starts outputting "corrupted" files that aren't DOCX and can't be opened with a PDF viewer either. There is no error message when this starts happening. The problem doesn't come from the PDF files: if I stop the macro and start it again on the same documents, they are converted correctly the second time.
"Corrupted" files look like this:
%PDF-1.5
%µµµµ
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(fr-FR) /StructTreeRoot 91 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 21/Kids[ 3 0 R 27 0 R 31 0 R 42 0 R 44 0 R 46 0 R 48 0 R 55 0 R 59 0 R 61 0 R 63 0 R 65 0 R 67 0 R 69 0 R 71 0 R 73 0 R 75 0 R 77 0 R 79 0 R 81 0 R 88 0 R] >>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 11 0 R/F4 16 0 R/F5 18 0 R/F6 20 0 R/F7 25 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.2 841.8] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<</Filter/FlateDecode/Length 4428>>
stream
xœ­\Ën7Ýð?Ô.Ý ¨Ä7«‚ ¹%e4ð+²’Y$Yt¤¶£A,9RÛÈüÕ|Æ|ÆìÙäæ^²ÈzðQ-¦ È]U¼$//:<yØÞ¾__o«££Ã“ív}ýóæ¦úþðÅýv{ÿñÇë}Ú¾]¸½[ooïï
What causes the issue and how can I fix it?
I use Word 2016 on Windows 10.
I don't think you can fix the issue without a patch from Microsoft. Meanwhile, you can move your code to run outside Word and create a new Word.Application object for each iteration.
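As a hedged sketch of that workaround, the Python below drives Word from outside through COM and starts a fresh Word.Application for every file. It assumes Windows with Word and the pywin32 package installed; the function names and the `make_word` parameter are invented for this example, and whether a new instance per file actually avoids the corruption is something to verify on your own documents. Unlike the macro, it only picks up *.pdf files and names the output file.docx rather than file.pdf.docx.

```python
from pathlib import Path

WD_FORMAT_XML_DOCUMENT = 12  # numeric value of Word's wdFormatXMLDocument


def new_word():
    """Start a separate Word process (assumes pywin32 on Windows)."""
    import win32com.client

    return win32com.client.DispatchEx("Word.Application")


def convert_all(pdf_dir, docx_dir, make_word=new_word):
    """Convert every PDF in pdf_dir to DOCX, one Word instance per file."""
    pdf_dir = Path(pdf_dir).resolve()
    docx_dir = Path(docx_dir).resolve()
    converted = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out = docx_dir / (pdf.stem + ".docx")
        word = make_word()  # fresh instance, so no state carries over
        try:
            # ConfirmConversions=False suppresses the PDF conversion prompt.
            doc = word.Documents.Open(str(pdf), ConfirmConversions=False, ReadOnly=True)
            doc.SaveAs2(str(out), FileFormat=WD_FORMAT_XML_DOCUMENT)
            doc.Close(False)  # close without saving changes to the source
        finally:
            word.Quit()  # always shut the instance down, even on errors
        converted.append(out)
    return converted
```

Starting Word for every file is slow, so a middle ground would be to restart it every few dozen files, which is roughly where the macro started to fail.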

How can I change my index vector into sparse feature vector that can be used in sklearn?

I am building a news recommendation system and I need to build a table of users and the news items they have read. My raw data looks like this:
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
The first column is the userID and the second column is the newsID. newsID is an index column: for example, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news items (in the sparse vector, columns 12, 456 and 157 are 1, while all other columns are 0). I want to convert this data into a sparse vector format that can be used as input for the KMeans or DBSCAN algorithms in sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244])
]
NUMBER_MOVIES = 1000  # number of columns; must be larger than the highest item index in the data
NUMBER_USERS = len(input_data)  # number of users in the model
# you'll probably want to have a way to look up the row index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for COO format
I, J, data = [], [], []
for user, movies in input_data:
    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index += 1
    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1)  # 1 marks that the user viewed this item
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I, J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
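Alternatively, use MultiLabelBinarizer from sklearn.preprocessing. This assumes the data is in a pandas DataFrame df whose newsID column holds the lists; pass sparse_output=True to MultiLabelBinarizer if you want a SciPy sparse matrix instead of a dense array.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)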
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0