Uncompilable source code - Erroneous sym type: org.apache.lucene.document.FieldType.setIndexed when indexing pdf files error - pdf

I'm trying to index PDF files with Lucene 6.6.0 and PDFBox 2.0.7,
and I'm getting the following errors. (EDITED)
run:
Indexing ke folder: 'D:\Kuliah\rancangan document indexing\dir-index\'...
Indexing PDF document: D:\Kuliah\rancangan document indexing\dir-pdf\dua.pdf
Exception in thread "main" java.lang.ExceptionInInitializerError
at tigasepuluh.Playground.indexDocs(Playground.java:110)
at tigasepuluh.Playground.indexDocs(Playground.java:88)
at tigasepuluh.Playground.main(Playground.java:65)
Caused by: java.lang.RuntimeException: Uncompilable source code - Erroneous sym type: org.apache.lucene.document.FieldType.setIndexed
at org.apache.pdfbox.examples.lucene.LucenePDFDocument.<clinit>(LucenePDFDocument.java:123)
... 3 more
C:\Users\abc\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)
My complete code is available on GitHub.

Change this line in your copy of org.apache.pdfbox.examples.lucene.LucenePDFDocument:
TYPE_STORED_NOT_INDEXED.setIndexed(false);
to
TYPE_STORED_NOT_INDEXED.setIndexOptions(IndexOptions.NONE);
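The problem occurs because the PDFBox example was written for Lucene 4, and FieldType.setIndexed is no longer available in Lucene 6.6.0. You may also need to import org.apache.lucene.index.IndexOptions.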

Related

Extract an attribute in GPKG

I am trying to extract rivers from OSM. I downloaded the waterway GPKG where I believe there are over 21 million entries (see link) with a file size of 19.9 GB.
I have tried using the split vector layer in QGIS, but it would crash.
I was thinking of using GDAL's ogr2ogr, but I am having trouble constructing the command line.
I first isolated the multiline string with the following command.
ogr2ogr -f gpkg water.gpkg waterway_EPSG4326.gpkg waterway_EPSG4326_line -nlt linestring
ogrinfo water.gpkg
INFO: Open of `water.gpkg'
      using driver `GPKG' successful.
1: waterway_EPSG4326_line (Line String)
I then tried the command below, but it does not work.
ogr2ogr -f GPKG SELECT * FROM waterway_EPSG4326_line - where waterway="river" river.gpkg water.gpkg
Please let me know what is missing, or whether there is an easier way to perform the task. I also tried opening the file with the R sf package, but it still had not loaded after a long time.
Thanks

PDFBOX - Invalid characters codes with AR PL Zenkai Uni Font

PdfBox 2.0.24.
Hi, I'm developing a PDF writer and I need to use the "AR PL Zenkai Uni" font.
When I try to load it, PDFBox crashes with the following error:
Exception in thread "main" java.io.IOException: Invalid characters codes
at org.apache.fontbox.ttf.CmapSubtable.processSubtype12(CmapSubtable.java:257)
at org.apache.fontbox.ttf.CmapSubtable.initSubtable(CmapSubtable.java:111)
at org.apache.fontbox.ttf.CmapTable.read(CmapTable.java:86)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:361)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.pdfbox.pdmodel.font.PDType0Font.load(PDType0Font.java:97)
at com.vgs.pdf.PDFCreatorSandbox.main(PDFCreatorSandbox.java:166)
To load this font I'm using the following code:
PDType0Font brokenFont = PDType0Font.load(document, new FileInputStream("font/ukai.ttf"), false);
This code was run on Windows 10 with Java 1.8.0_291.
Any suggestions?
Thanks in advance

Google BigQuery - a failed bq load displays only a file number; how to get the file name

I'm running the following bq command
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json
and I get the following message when loading data into BigQuery:
*************************************
Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE
BigQuery error in load operation: Error processing job
'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected
11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ
Failure details:
- File: 844 / Line:1: Too few columns: expected 11 column(s) but got
1 column(s). For additional help: http://goo.gl/RWuPQ
**********************************
The message displays only the file number.
I checked the files' content, and most of them are good.
gsutil ls and the Cloud Console, on the other hand, display file names.
How can I find out which file it is from the file number?
There seems to be some odd spacing introduced in the question, but if the desired path to ingest is "*/order.json", that won't work: you can only use "*" at the end of the path when ingesting data into BigQuery.

Voice recognition with Julius. How to make .voca file?

I'm building a voice recognition system, and Julius gives fairly good results for this work.
Words from the sample .voca file are recognized perfectly, but how do I add my own words and transcriptions to the file?
I've tried the VoxForge (http://www.voxforge.org/) latest release and nightly builds of the acoustic models with their vocabulary, but I get a lot of errors like this when Julius starts:
Error: voca_load_htkdict: line 19: triphone "r-d+v" not found
Error: voca_load_htkdict: line 19: triphone "d-v+aa" not found
Error: voca_load_htkdict: the line content was: 2 [AARDVARK] aa r d v aa r k
Error: voca_load_htkdict: begin missing phones
Error: voca_load_htkdict: r-d+v
Error: voca_load_htkdict: d-v+aa
Error: voca_load_htkdict: end missing phones
Error: init_voca: error in reading /usr/src/custom/julius/quickstart/grammar/sample.dict
ERROR: failed to read dictionary "/usr/src/custom/julius/quickstart/grammar/sample.dict"
ERROR: m_fusion: some error occured in reading grammars
ERROR: Error in loading model
Does anyone know the rules of word transcription for .voca files?
Reason for the error:
Julius outputs these messages when your word dictionary contains words that the acoustic model was not trained on. The code in "voca_load_htkdict.c" tries to match the triphones in the dict file against the triphone list in the acoustic model; when it cannot find one, it shows this error and stops the program.
Possible solutions:
1. Enable the -forcedict option, or uncomment it in the jconf file, to skip the erroneous words in the dictionary and force Julius to run.
or
2. Map each "not found" triphone to the closest physical triphone in the HMM list file ("tiedlist").
For example:
b-ey+t v-eh+t
The first column is the name of the triphone (generated from your dictionary), and the second column is the name of the HMM actually defined in your acoustic model.
This solution is only practical when there are just a few "not found" triphones.
The best solution is not to include words in your dict file that are not covered by the acoustic model.
Note that the first two solutions are for testing Julius only; for production or commercial projects you must train the acoustic model and the language model on the same corpus.
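To find out in advance which dictionary entries will trigger the "triphone not found" error, the matching step described above can be approximated outside Julius. The following is only a rough sketch, not Julius's actual code: it assumes triphone names of the form L-C+R built from the word-internal phones only, and it assumes each line of the tiedlist starts with the logical triphone name. The function names and the sample tiedlist entries are hypothetical and chosen for illustration.

```python
def expand_triphones(phones):
    """Expand a monophone sequence into word-internal triphone names (L-C+R)."""
    names = []
    for i, center in enumerate(phones):
        name = center
        if i > 0:
            name = f"{phones[i - 1]}-{name}"   # add left context
        if i < len(phones) - 1:
            name = f"{name}+{phones[i + 1]}"   # add right context
        names.append(name)
    return names


def load_logical_names(tiedlist_lines):
    """Collect the first column (logical triphone name) of each non-empty line."""
    return {line.split()[0] for line in tiedlist_lines if line.strip()}


def missing_triphones(phones, known):
    """Return the triphones of one word that the tiedlist does not define."""
    return [t for t in expand_triphones(phones) if t not in known]


# Hypothetical tiedlist content, for illustration only.
tiedlist = [
    "aa+r",
    "aa-r+d",
    "v-aa+r",
    "aa-r+k",
    "r-k",
    "b-ey+t v-eh+t",  # logical name mapped to a physical HMM
]
known = load_logical_names(tiedlist)

# Phone sequence of AARDVARK, taken from the error message in the question.
print(missing_triphones("aa r d v aa r k".split(), known))
# → ['r-d+v', 'd-v+aa']
```

Running such a check over every line of the dict file gives a list of words to remove, or of triphones to map by hand in the tiedlist, before starting Julius.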

error when trying to import ps file by grImport in R

I need to create a PDF file with several charts created by ggplot2, arranged on an A4 page, and repeat this 20-30 times.
I export each ggplot2 chart to a PS file and try to run PostScriptTrace on it as instructed in the grImport documentation, but it just keeps giving me the error "Unrecoverable error, exit code 1".
If I ignore the error and try to import the generated XML file into an R object, I get another error:
attributes construct error
Couldn't find end of Start Tag text line 21
Premature end of data in tag picture line 3
Error: 1: attributes construct error
2: Couldn't find end of Start Tag text line 21
3: Premature end of data in tag picture line 3
What's wrong here?
Thanks!
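If you have no time to deal with Sweave, you could also write a simple TeX document from R after generating the plots, and compile it to PDF later.
E.g.:
pdf_name <- paste0('filename', id, '.pdf')  # paste0 avoids spaces in the file name
ggsave(filename = pdf_name, plot = p)
# Append one \includegraphics line per plot to the TeX source file
cat(paste0('\\includegraphics{', pdf_name, '}\n'),
    file = 'report.tex', append = TRUE)
Once report.tex also contains a preamble (\documentclass and \usepackage{graphicx}) and the \begin{document} ... \end{document} wrapper, you can compile it to PDF with, for example, pdflatex.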