I am trying to compile a test class to test a simple grammar.
import org.antlr.v4.runtime.*;
public class Test {
public static void main(String[] args) throws Exception
{
CharStream input = null;
// pick an input stream (filename from commandline or stdin)
if(args.length > 0) input = new ANTLRFileStream(args[0]);
else input = new ANTLRInputStream(System.in);
// create the lexer
DrinkLexer lex = new DrinkLexer(input);
// create a buffer of tokens between the lexer and parser
CommonTokenStream tokens = new CommonTokenStream(lex);
// create the parser, attaching it to the token buffer
DrinkParser p = new DrinkParser(tokens);
p.drinkSentence(); // launch parser at drinkSentence file
}
}
How would I go about replacing the deprecated class?
Use the various static methods from CharStreams:
CharStream input = null;
// pick an input stream (filename from commandline or stdin)
if(args.length > 0) input = CharStreams.fromFileName(args[0]);
else input = CharStreams.fromStream(System.in);
// create the lexer
DrinkLexer lex = new DrinkLexer(input);
// create a buffer of tokens between the lexer and parser
CommonTokenStream tokens = new CommonTokenStream(lex);
// create the parser, attaching it to the token buffer
DrinkParser p = new DrinkParser(tokens);
p.drinkSentence(); // launch parser at drinkSentence file
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;
import java.io.FileInputStream;
import java.io.InputStream;
public class Calc {
public static void main(String[] args) throws Exception {
String inputFile = null;
if ( args.length>0 ) inputFile = args[0];
InputStream is = System.in;
if ( inputFile!=null ) is = new FileInputStream(inputFile);
ANTLRInputStream input = new ANTLRInputStream(is);
CalculatorLexer lexer = new CalculatorLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CalculatorParser parser = new CalculatorParser(tokens);
ParseTree tree = parser.program(); // parse
CalcVisitor calcc = new CalcVisitor();
calcc.visit(tree);
}
}
As far as I know, ANTLRFileStream is what is deprecated. I have tried replacing it with CharStreams, but the code I run keeps resulting in an error. How can I fix this?
Try something like this:
public static void main(String[] args) throws Exception {
CharStream charStream = args.length > 0
? CharStreams.fromFileName(args[0])
: CharStreams.fromStream(System.in);
CalculatorLexer lexer = new CalculatorLexer(charStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CalculatorParser parser = new CalculatorParser(tokens);
ParseTree tree = parser.program(); // parse
CalcVisitor calcc = new CalcVisitor();
calcc.visit(tree);
}
I am parsing PDFs with Apache Tika, but I get stray characters in the result string.
I think it's an enumeration dash. How can I remove it from the result string, or parse it the right way?
Other characters work fine, e.g. $, % and &.
Here is the code to parse from an InputStream:
private SolrInputDocument getDocument(InputStream stream, String docPath) throws SolrServerException {
SolrInputDocument solrDocument = new SolrInputDocument();
Parser parser = new AutoDetectParser(); // Should auto-detect!
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(false);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser);
Metadata metadata = new Metadata();
metadata.add("encoding", "unicode");
try {
parser.parse(stream, handler, metadata, parseContext);
String body = handler.toString();
body = CharMatcher.JAVA_ISO_CONTROL.removeFrom(body);
solrDocument.addField("id", docPath);
solrDocument.addField("body", body);
solrDocument.addField("url", docPath);
solrDocument.addField("extension", PDF_EXTENSION);
} catch (IOException | SAXException | TikaException e) {
throw new SolrServerException(e);
}
return solrDocument;
}
I'm using PDFBox (1.8.8) to add attachments to a PDF. My problem is that when one of the attachments is of type .pdf and I save the PDDocument to an OutputStream, the final PDF document does not include the attachments. If I save the PDDocument to a file instead of an OutputStream, everything works fine, and if none of the attachments is a PDF, saving to either a file or an OutputStream works fine.
I would like to know if there is any way to add embedded PDF files and save the PDDocument to an OutputStream while keeping the attached files in the generated PDF.
The code I'm using is:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
Boolean hasPdfAttach = false;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
// final PDFTextStripper pdfStripper = new PDFTextStripper();
// final String text = pdfStripper.getText(doc);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
if ("application/pdf".equals(attach.getMimetype())) {
hasPdfAttach = true;
}
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
// final String text2 = pdfStripper.getText(doc);
}
// final String text3 = pdfStripper.getText(doc);
efTree.setNames(embeddedFileMap);
// ((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS); (this does not work for me)
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
// final ByteArrayOutputStream pdfboxToDocumentStream = new ByteArrayOutputStream();
final String tmpfile = "temporary.pdf";
if (hasPdfAttach) {
final File f = new File(tmpfile);
doc.save(f);
doc.close();
// I have tried with the parser too, without success
// PDFParser parser = new PDFParser(new FileInputStream(tmpfile));
// parser.parse();
// PDDocument doc2 = parser.getPDDocument();
final PDDocument doc2 = PDDocument.loadNonSeq(f, new RandomAccessFile(new File(getHomeTMP()
+ "tempppp.pdf"), "r"));
doc2.save(out);
doc2.close();
} else {
doc.save(out);
doc.close();
}
// this does not work either
// final InputStream in = new FileInputStream(tmpfile);
// IOUtils.copy(in, out);
// out = new FileOutputStream(tmpFile);
// doc.save (out);
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}
Solution:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
((ByteArrayOutputStream) out).reset();
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
}
efTree.setNames(embeddedFileMap);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
doc.save(out);
doc.close();
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}
You store the new PDF after the original PDF in out:
Look at all the uses of out in your method:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
...
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
...
doc2.save(out);
...
doc.save(out);
So you get as input a ByteArrayOutputStream and take its current content as input (i.e. the ByteArrayOutputStream is not empty but already contains a PDF) and after some processing you append the modified PDF to the ByteArrayOutputStream. Depending on the PDF viewer you present this to, you will be shown either the original or the manipulated PDF or a (very correct) error message that the file is garbage.
If you want the ByteArrayOutputStream to contain only the manipulated PDF, simply add
((ByteArrayOutputStream) out).reset();
or (if you are not sure about the state of the stream)
out = new ByteArrayOutputStream();
right after
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
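The append behaviour described above can be reproduced with the JDK alone. This is a minimal sketch that does not involve PDFBox at all; the class name ResetDemo, its helper methods and the strings standing in for the two PDFs are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ResetDemo {
    // Builds a stream that already holds some content,
    // like the ByteArrayOutputStream passed into insertAttachments.
    public static ByteArrayOutputStream streamWith(String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out;
    }

    // Stand-in for doc.save(out): writes the modified content into the same
    // stream, optionally discarding the buffered bytes first.
    public static String saveInto(ByteArrayOutputStream out, String modified, boolean resetFirst) {
        if (resetFirst) {
            out.reset(); // drop everything written so far
        }
        byte[] bytes = modified.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Without reset() the new content lands after the old content.
        System.out.println(saveInto(streamWith("ORIGINAL"), "MODIFIED", false)); // ORIGINALMODIFIED
        // With reset() only the new content remains.
        System.out.println(saveInto(streamWith("ORIGINAL"), "MODIFIED", true));  // MODIFIED
    }
}
```

Note that assigning a new ByteArrayOutputStream to the parameter out only changes the local variable inside the method; a caller that still holds the old stream object will not see the new content, so reset() is the option that affects the stream the caller passed in.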
PS: According to the comments the OP tried the above proposed changes to his code without success.
I cannot run the code as presented in the question because it is not self-contained. Thus, I reduced it to the essentials to get a self-contained test:
@Test
public void test() throws IOException, COSVisitorException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (
InputStream sourceStream = getClass().getResourceAsStream("test.pdf");
InputStream attachStream = getClass().getResourceAsStream("artificial text.pdf"))
{
final PDDocument document = PDDocument.load(sourceStream);
final PDEmbeddedFile embeddedFile = new PDEmbeddedFile(document, attachStream);
embeddedFile.setSubtype("application/pdf");
embeddedFile.setSize(10993);
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile("artificial text.pdf");
fileSpecification.setEmbeddedFile(embeddedFile);
final Map<String, PDComplexFileSpecification> embeddedFileMap = new HashMap<String, PDComplexFileSpecification>();
embeddedFileMap.put("artificial text.pdf", fileSpecification);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
efTree.setNames(embeddedFileMap);
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(document.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
document.getDocumentCatalog().setNames(names);
document.save(baos);
document.close();
}
Files.write(Paths.get("attachment.pdf"), baos.toByteArray());
}
As you can see, PDFBox uses only streams here.
Thus, PDFBox has no problem storing a PDF into which it has embedded a PDF file attachment.
The problem, therefore, most likely has nothing to do with this workflow as such.
Is there any way to read a chunk at a time (instead of reading the entire file) from a file using the Tika API?
The following is my code. As you can see, I am reading the entire file at once. I would like to read a chunk at a time and write the content to a text file.
InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
....
p.parse(stream, handler, meta, context);
...
String content = handler.toString();
There's (now) an Apache Tika example which shows how you can capture the plain text output and return it in chunks, based on the maximum allowed size of a chunk. You can find it in ContentHandlerExample; the method is parseToPlainTextChunks.
Based on that, if you wanted to output to a file instead, on a per-chunk basis, you'd tweak it to be something like:
final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024 * 1024;
final File outputDir = new File("/tmp/");
private class ChunkHandler extends ContentHandlerDecorator {
private int size = 0;
private int fileNumber = -1;
private OutputStreamWriter out = null;
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
try {
if (out == null || size + length > MAXIMUM_TEXT_CHUNK_SIZE) {
// current chunk is full (or none is open yet): start the next file
if (out != null) out.close();
fileNumber++;
size = 0;
File f = new File(outputDir, "output-" + fileNumber + ".txt");
out = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
}
out.write(ch, start, length);
size += length;
} catch (IOException e) {
// ContentHandler.characters may only throw SAXException
throw new SAXException(e);
}
}
public void close() throws IOException {
if (out != null) out.close();
}
}
public void parse(File file) throws IOException, SAXException, TikaException {
try (InputStream stream = new FileInputStream(file)) {
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
ChunkHandler handler = new ChunkHandler();
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
handler.close();
}
}
That will give you plain text files in the given directory, each no larger than the maximum size. All HTML tags will be ignored, so you'll only get the plain textual content.
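The file rollover logic in the handler can be tried out without Tika. The following is a minimal sketch using only the JDK; the class name ChunkedTextWriter, the helper splitToFiles and the tiny limit of 12 characters are assumptions made for illustration and are not part of any Tika API:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ChunkedTextWriter implements AutoCloseable {
    private final Path outputDir;
    private final int maxChunkChars;
    private int size = 0;        // characters written to the current file
    private int fileNumber = -1; // index of the current file, -1 before the first write
    private Writer out = null;

    public ChunkedTextWriter(Path outputDir, int maxChunkChars) {
        this.outputDir = outputDir;
        this.maxChunkChars = maxChunkChars;
    }

    // Same shape as ContentHandler.characters(char[], int, int).
    public void write(char[] ch, int start, int length) throws IOException {
        if (out == null || size + length > maxChunkChars) {
            // Roll over to the next numbered file and restart the counter.
            if (out != null) out.close();
            fileNumber++;
            size = 0;
            Path f = outputDir.resolve("output-" + fileNumber + ".txt");
            out = new OutputStreamWriter(Files.newOutputStream(f), StandardCharsets.UTF_8);
        }
        out.write(ch, start, length);
        size += length;
    }

    @Override
    public void close() throws IOException {
        if (out != null) out.close();
    }

    // Hypothetical helper: writes the pieces into a fresh temp directory
    // and returns the content of each chunk file in order.
    public static List<String> splitToFiles(int maxChunkChars, String... pieces) {
        try {
            Path dir = Files.createTempDirectory("chunks");
            try (ChunkedTextWriter w = new ChunkedTextWriter(dir, maxChunkChars)) {
                for (String piece : pieces) {
                    w.write(piece.toCharArray(), 0, piece.length());
                }
            }
            List<String> contents = new ArrayList<>();
            for (int i = 0; Files.exists(dir.resolve("output-" + i + ".txt")); i++) {
                byte[] bytes = Files.readAllBytes(dir.resolve("output-" + i + ".txt"));
                contents.add(new String(bytes, StandardCharsets.UTF_8));
            }
            return contents;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "hello " and "world " fit in 12 characters; "again" starts a second file.
        System.out.println(splitToFiles(12, "hello ", "world ", "again")); // [hello world , again]
    }
}
```

Two details of this sketch are worth keeping in mind: the limit is counted in characters rather than bytes, and a single call that is larger than the limit is still written to one file instead of being split.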
Here is my code to implement a Pig UDF that uses the distributed cache.
public class Regex extends EvalFunc<Integer> {
static HashMap<String, String> map = new HashMap<String, String>();
public List<String> getCacheFiles() {
Path lookup_file = new Path(
"hdfs://localhost.localdomain:8020/user/cloudera/top");
List<String> list = new ArrayList<String>(1);
list.add(lookup_file + "#id_lookup");
return list;
}
public void VectorizeData() throws IOException {
FileReader fr = new FileReader("./id_lookup");
BufferedReader brd = new BufferedReader(fr);
String line;
while ((line = brd.readLine()) != null) {
String str[] = line.split("#");
map.put(str[0], str[1]);
}
fr.close();
}
@Override
public Integer exec(Tuple input) throws IOException {
// TODO Auto-generated method stub
return map.size();
}
}
Given below is my distributed cache input file (hdfs://localhost.localdomain:8020/user/cloudera/top):
Impetigo|Streptococcus pyogenes#Impetigo
indeterminate leprosy|Uncharacteristic leprosy#indeterminate leprosy
The output I get is:
(0)
(0)
(0)
(0)
(0)
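This means that my HashMap is empty.
How do I fill my HashMap using the distributed cache?
This was because VectorizeData() was never called, so the map was never populated. It has to be called before exec() reads from the map, for example on the first invocation of exec().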
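One common way to make sure the loader runs is to call it lazily on the first exec() call. The sketch below shows only that pattern with plain JDK classes, so it runs without Pig or Hadoop; the class LazyLookup, the parameterless exec() and the string that stands in for the ./id_lookup file are hypothetical stand-ins, not the Pig API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class LazyLookup {
    private final String fileContents; // stand-in for the cached "./id_lookup" file
    private Map<String, String> map;   // stays null until the first exec() call
    private int loads = 0;

    public LazyLookup(String fileContents) {
        this.fileContents = fileContents;
    }

    // Parses "key#value" lines, the format of the cache file in the question.
    public static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> result = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("#", 2);
                if (parts.length == 2) {
                    result.put(parts[0], parts[1]);
                }
            }
        }
        return result;
    }

    // Stand-in for exec(Tuple): loads the lookup once, then reuses it.
    public int exec() {
        if (map == null) {
            try {
                map = load(new StringReader(fileContents));
                loads++;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return map.size();
    }

    public int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        LazyLookup udf = new LazyLookup(
                "Impetigo|Streptococcus pyogenes#Impetigo\n"
                + "indeterminate leprosy|Uncharacteristic leprosy#indeterminate leprosy\n");
        System.out.println(udf.exec());      // 2
        System.out.println(udf.exec());      // 2
        System.out.println(udf.loadCount()); // 1
    }
}
```

In the UDF from the question, the equivalent change would be to call VectorizeData() at the start of exec() when the map has not been filled yet, which keeps the file from being re-read for every tuple.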