stanford nlp tokenizer - tokenize

How can i tokenize a string in java class using stanford parser?
I am only able to find examples of documentProcessor and PTBTokenizer taking text from external file.
DocumentPreprocessor dp = new DocumentPreprocessor("hello.txt");
for (List sentence : dp) {
System.out.println(sentence);
}
// option #2: By token
PTBTokenizer ptbt = new PTBTokenizer(new FileReader("hello.txt"),
new CoreLabelTokenFactory(), "");
for (CoreLabel label; ptbt.hasNext(); ) {
label = (CoreLabel) ptbt.next();
System.out.println(label);
}
Thanks.

PTBTokenizer constructor takes a java.io.Reader, then you can use a StringReader to parse your text

Related

Getting 3rd party software disturbance in selenium automation

I am using selenium to automate. I have also integrated Asprise OCR in my automation to scan texts and this is my code
{
Ocr.setUp();
Ocr ocr = new Ocr();
ocr.startEngine("eng", Ocr.SPEED_FASTEST); // English
String path = "C:\\Users\\Vishalpatel\\Desktop\\screen\\captcha.png";
String text = ocr.recognize(new File[] { new File(path) },Ocr.RECOGNIZE_TYPE_TEXT, Ocr.OUTPUT_FORMAT_PLAINTEXT);
System.out.println(text+" my code");
text = text.replace("x", " ");
text = text.replace(" ", "");
CharSequence str1= text.subSequence(0, 1);
CharSequence str2= text.subSequence(2, text.length()-1);
int x = Integer.parseInt((String) str1) + Integer.parseInt((String) str2);
ocr.stopEngine(); // Stop OCR engine
return x+"";
}
My scipt get stuck on this dialog and I have to handle it manually. Any idea to handle such issue? Thank you in advance

allow arabic text in pdf table using itext7 (xamarin android)

I have to put my list data in a table in a pdf file. My data has some Arabic words. When my pdf is generated, the Arabic words don't appear. I searched and found that I need itext7.pdfcalligraph so I installed it in my app. I found this code too https://itextpdf.com/en/blog/technical-notes/displaying-text-different-languages-single-pdf-document and tried to do something similar to allow Arabic words in my table but I couldn't figure it out.
This is a trial code before I apply it to my real list:
var path2 = global::Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
filePath = System.IO.Path.Combine(path2.ToString(), "myfile2.pdf");
stream = new FileStream(filePath, FileMode.Create);
PdfWriter writer = new PdfWriter(stream);
PdfDocument pdf2 = new iText.Kernel.Pdf.PdfDocument(writer);
Document document = new Document(pdf2, PageSize.A4);
FontSet set = new FontSet();
set.AddFont("ARIAL.TTF");
document.SetFontProvider(new FontProvider(set));
document.SetProperty(Property.FONT, "Arial");
string[] sources = new string[] { "يوم","شهر 2020" };
iText.Layout.Element.Table table = new iText.Layout.Element.Table(2, false);
foreach (string source in sources)
{
Paragraph paragraph = new Paragraph();
Bidi bidi = new Bidi(source, Bidi.DirectionDefaultLeftToRight);
if (bidi.BaseLevel != 0)
{
paragraph.SetTextAlignment(iText.Layout.Properties.TextAlignment.RIGHT);
}
paragraph.Add(source);
table.AddCell(new Cell(1, 1).SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER).Add(paragraph));
}
document.Add(table);
document.Close();
I updated my code and added the arial.ttf to my assets folder . i'm getting the following exception:
System.InvalidOperationException: 'FontProvider and FontSet are empty. Cannot resolve font family name (see ElementPropertyContainer#setFontFamily) without initialized FontProvider (see RootElement#setFontProvider).'
and I still can't figure it out. any ideas?
thanks in advance
- C #
I have a similar situation for Turkish characters, and I've followed these
steps :
Create a folder under projects root folder which is : /wwwroot/Fonts
Add OpenSans-Regular.ttf under the Fonts folder
Path for font is => ../wwwroot/Fonts/OpenSans-Regular.ttf
Create font like below :
public static PdfFont CreateOpenSansRegularFont()
{
var path = "{Your absolute path for FONT}";
return PdfFontFactory.CreateFont(path, PdfEncodings.IDENTITY_H, true);
}
and use it like :
paragraph.Add(source)
.SetFont(FontFactory.CreateOpenSansRegularFont()); //set font in here
table.AddCell(new Cell(1, 1)
.SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER)
.Add(paragraph));
This is how I used font factory for Turkish characters ex: "ü,i,ç,ş,ö"
For Xamarin-Android, you could try
string documentsPath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
var path = Path.Combine(documentsPath, "Fonts/Arial.ttf");
Look i fixed it in java,this may help you:
String font = "your Arabic font";
//the magic is in the next 4 lines:
PdfFontFactory.register(font);
FontProgram fontProgram = FontProgramFactory.createFont(font, true);
PdfFont f = PdfFontFactory.createFont(fontProgram, PdfEncodings.IDENTITY_H);
LanguageProcessor languageProcessor = new ArabicLigaturizer();
//and look here how i used setBaseDirection and don't use TextAlignment ,it will work without it
com.itextpdf.kernel.pdf.PdfDocument tempPdfDoc = new com.itextpdf.kernel.pdf.PdfDocument(new PdfReader(pdfFile.getPath()), TempWriter);
com.itextpdf.layout.Document TempDoc = new com.itextpdf.layout.Document(tempPdfDoc);
com.itextpdf.layout.element.Paragraph paragraph0 = new com.itextpdf.layout.element.Paragraph(languageProcessor.process("الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية"))
.setFont(f).setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
.setFontSize(15);

How to move text in PDF?

Is there a Java or Nodejs library that can move existing text in a PDF file?
I'd like to extract all the text nodes, then move some of them to a new location based on some conditions.
I tried PdfClown, galkahana/HummusJS, Hopding/pdf-lib, but seems they don't have exactly what I need.
can anyone help? thanks
After inspecting the variables, I figured out how to move text, here is the code
PrimitiveComposer composer = new PrimitiveComposer(page);
ContentScanner scanner = composer.getScanner();
tranverse(scanner);
composer.flush();
...
while (level.moveNext()){
ContentObject content = level.getCurrent();
if (content instanceof Text){
...
List<ContentObject> objects = text.getBaseDataObject().getObjects();
for(ContentObject co: objects){
if(co instanceof SetTextMatrix){
List<PdfDirectObject> operands = ((SetTextMatrix)co).getOperands();
PdfInteger y = (PdfInteger)operands.get(5);
operands.set(5, new PdfInteger(y.getIntValue()-100));
}
}

Using pdfbox - how to get the font from a COSName?

How to get the font from a COSName?
The solution I'm looking for looks somehow like this:
COSDictionary dict = new COSDictionary();
dict.add(fontname, something); // fontname COSName from below code
PDFontFactory.createFont(dict);
If you need more background, I added the whole story below:
I try to replace some string in a pdf. This succeeds (as long as all text is stored in one token). In order to keep the format I like to re-center the text. As far as I understood I can do this by getting the width of the old string and the new one, do some trivial calculation and setting the new position.
I found some inspiration on stackoverflow for replacing https://stackoverflow.com/a/36404377 (yes it has some issues, but works for my simple pdf's. And How to center a text using PDFBox. Unfortunatly this example uses a font constant.
So using the first link's code I get a handling for operator 'TJ' and one for 'Tj'.
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
java.util.List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof Operator)
{
Operator op = (Operator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
// Tj takes one operator and that is the string to display so lets
// update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
String replaced = prh.getReplacement(string);
if (!string.equals(replaced))
{ // if changes are there, replace the content
previous.setValue(replaced.getBytes());
float xpos = getPosX(tokens, j);
//if (true) // center the text
if (6 * xpos > page.getMediaBox().getWidth()) // check if text starts right from 1/xth page width
{
float fontsize = getFontSize(tokens, j);
COSName fontname = getFontName(tokens, j);
// TODO
PDFont font = ?getFont?(fontname);
// TODO
float widthnew = getStringWidth(replaced, font, fontsize);
setPosX(tokens, j, page.getMediaBox().getWidth() / 2F - (widthnew / 2F));
}
replaceCount++;
}
}
Considering the code between the TODO tags, I will get the required values from the token list. (yes this code is awful, but for now it let's me concentrate on the main issue)
Having the string, the size and the font I should be able to call the getWidth(..) method from the sample code.
Unfortunatly I run into trouble to create a font from the COSName variable.
PDFont doesn't provide a method to create a font by name.
PDFontFactory looks fine, but requests a COSDictionary. This is the point I gave up and request help from you.
The names are associated with font objects in the page resources.
Assuming you use PDFBox 2.0.x and that page is a PDPage instance, you can resolve the name fontname using:
PDFont font = page.getResources().getFont(fontname);
But the warning from the comments to the questions you reference remain: This approach will work only for very simple PDFs and might even damage other ones.
try {
//Loading an existing document
File file = new File("UKRSICH_Mo6i-Spikyer_z1560-FAV.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
System.out.println(pageResources.getFontNames() );
for (COSName key : pageResources.getFontNames())
{
PDFont font = pageResources.getFont(key);
System.out.println("Font: " + font.getName());
}
document.close();
}

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK imagemagick won't work for this. May itext do this? I never used and it's
pretty complex so would need some hints.
EDIT:
Marked Answer led me to solution. For completeness here my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i < n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additonal page to Document...");
if (removeBlankPages && !isBlankPage(reader, i)){
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF in single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple loop index and page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplacee')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.