parser.getTokens() gives out junk data and singlecharacters PDFBox-1.8.9 version - pdf

I am new to pdfbox. I am using pdfbox-app-2.0.0-RC1 version to fetch the entire text from the pdf using PDFTextStripperByArea. Is it possible for me to get each string separately?
For Example,
In the following text,
Nomination : Name&Address
Shipper : shipper name
I need Nomination as seperate string and "Name&Address" as separate string. Instead I am getting each character separately. I have tried with different Pdfs. For most pdfs I am able to get the exact string but for few pdfs I don't.
I am using the following code to get the separate string.
for (PDPage page : doc.getPages()) {
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
if (op.getName().equals("Tj")) {
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
System.out.println("string1===" + string);
if (string.contains("Plant")) {
int size = al.size();
al.add(string);
stop = false;
continue;
}
if (!string.contains("_") && !stop) {
if (string.contains("Nomination")) {
stop = true;
} else {
al.add(string);
}
}
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString)arrElement;
String string = cosString.getString();
System.out.println("string2====>>"+string);
al.add(string);
}
}
}
}
}
}
I am getting the following output:
string2====>>Nom
string2====>>i
string2====>>na
string2====>>t
string2====>>i
string2====>>on
string1===
string2====>>(
string2====>>T
string2====>>o
string1===
string2====>>Loa
string2====>>di
string2====>>ng
string1===
string2====>>Fa
string2====>>c
string2====>>i
string2====>>l
string2====>>i
string2====>>t
string2====>>y
string2====>>)

Related

Update PDF using pdfBox

I would like to ask if having a PDF it is possible, using pdfbox libraries, to update it at a specific point.
I am trying to use a solution already online but seems the gettoken() method does not enter code heresection the words properly to allow me to find the part I would like to modify.
This is the code(Groovy):
for( int i = 0; i < dataContext.getDataCount(); i++ ) {
InputStream is = dataContext.getStream(i);
Properties props = dataContext.getProperties(i);
String searchString= "Hours worked";
String replacement = "Hours worked: 2";
File file = new File("\\\\****\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\Template\\***.pdf");
PDDocument doc = PDDocument.load(file);
for ( PDPage page : doc.getPages() )
{
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
logger.info("in Page");
for (int j = 0; j < tokens.size(); j++)
{
logger.info("tokens:"+tokens[j]);
Object next = tokens.get(j);
//logger.info("in Object");
if (next instanceof Operator)
{
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
logger.info("in Tj");
// Tj takes one operator and that is the string to display so lets update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
logger.info("previousString:"+string);
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else
if (op.getName().equals("TJ"))
{
logger.info("in TJ:"+ op.getName());
COSArray previous = (COSArray) tokens.get(j - 1);
logger.info("previous:"+previous);
for (int k = 0; k < previous.size(); k++)
{
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString)
{
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
logger.info("string:"+string);
if (j == prej || string.equals(" ") || string.equals(":") || string.equals("-")) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
logger.info("pstring:"+pstring);
if (searchString.equals(pstring.trim()))
{
logger.info("in searchString");
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size()-1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
logger.info("in updatedStream");
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
logger.info("in tokenWriter");
out.close();
page.setContents(updatedStream);
doc.save("\\\\***\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\***1.pdf");
}
Executing the code I am trying to search "Hours worked" String and update with
"Hours worked: 2"
There are 2 questions:
1.When I execute and check the logs can see the Tokens are not created properly:
enter image description here
enter image description here
So are created two different COSArrays meantime I have all in one Line:
enter image description here
and this can be a problem if I have to search a specific word.
When it find the word it seems it is working but it apply a strange char:
enter image description here
So Here 2 questions:
How to manage to specify the token behaviour (or maybe for the parser) to get an entire phrase in the same token until a special char happen?
Hot to format the new char in the new PDF?
Hope you can help me, thanks for your support.

How to find empty rows from excel sheet given by user and delete them in asp.net

public async Task<List<IndiaCIT>> Import(IFormFile file)
{
var list = new List<IndiaCIT>();
using (var stream = new MemoryStream())
{
await file.CopyToAsync(stream);
ExcelPackage.LicenseContext = LicenseContext.NonCommercial;
using (var package=new ExcelPackage(stream))
{
ExcelWorksheet worksheet = package.Workbook.Worksheets[0];
var rowcount = worksheet.Dimension.Rows;
for (int row = 1; row <= rowcount; row++)
{
list.Add(new IndiaCIT {
NameCH = worksheet.Cells[row, 1].Value.ToString().Trim(),
City= worksheet.Cells[row, 2].Value.ToString().Trim(),
Age = worksheet.Cells[row, 3].Value.ToString().Trim(),
});
}
}
}
return list;
}
this is controller code and in model class declared the columns name and used it as IndiaCIT list in controller,. I want empty rows to get deleted
This could help:
For each row in the sheet, check if the cell in each column is empty. If so delete it.
var rowcount = worksheet.Dimension.Rows;
var maxColums = worksheet.Dimension.Columns;
for(int row = rowcount; row > 0; row--)
{
bool isRowEmpty = true;
for(int column = 1; column <= maxColumns; column++)
{
var cellEntry = worksheet.Cells[row, column].Value.ToString();
if(!string.IsNullOrEmpty(cellEntry)
{
isRowEmpty = false;
break;
}
}
if(!isRowEmpty)
continue;
else
worksheet.DeleteRow(row);
}
I generally use Spire.Xls to handle such problems. It works for me.
public IActionResult Test() {
//init workbook
Workbook workbook = new Workbook();
// load file
workbook.LoadFromFile("11.xlsx");
// get the first sheet
Worksheet sheet = workbook.Worksheets[0];
//delete blank row
for (int i = sheet.Rows.Count() - 1; i >= 0; i--)
{
if (sheet.Rows[i].IsBlank)
{
sheet.DeleteRow(i + 1);
}
}
//delete blank column
for (int j = sheet.Columns.Count() - 1; j >= 0; j--)
{
if (sheet.Columns[j].IsBlank)
{
sheet.DeleteColumn(j + 1);
}
}
//save as a new xls file
workbook.SaveToFile("new.xlsx", ExcelVersion.Version2016);
return Ok();
}

Null pointer exception on Processing (ldrValues)

My code involves both Processing and Arduino. 5 different photocells are triggering 5 different sounds. My sound files play only when the ldrvalue is above the threshold.
The Null Pointer Exception is highlighted on this line
for (int i = 0; i < ldrValues.length; i++) {
I am not sure which part of my code should be changed so that I can run it.
import processing.serial.*;
import processing.sound.*;
SoundFile[] soundFiles = new SoundFile[5];
Serial myPort; // Create object from Serial class
int[] ldrValues;
int[] thresholds = {440, 490, 330, 260, 450};
int i = 0;
boolean[] states = {false, false, false, false, false};
void setup() {
size(200, 200);
println((Object[])Serial.list());
String portName = Serial.list()[3];
myPort = new Serial(this, portName, 9600);
soundFiles[0] = new SoundFile(this, "1.mp3");
soundFiles[1] = new SoundFile(this, "2.mp3");
soundFiles[2] = new SoundFile(this, "3.mp3");
soundFiles[3] = new SoundFile(this, "4.mp3");
soundFiles[4] = new SoundFile(this, "5.mp3");
}
void draw()
{
background(255);
//serial loop
while (myPort.available() > 0) {
String myString = myPort.readStringUntil(10);
if (myString != null) {
//println(myString);
ldrValues = int(split(myString.trim(), ','));
//println(ldrValues);
}
}
for (int i = 0; i < ldrValues.length; i++) {
println(states[i]);
println(ldrValues[i]);
if (ldrValues[i] > thresholds[i] && !states[i]) {
println("sensor " + i + " is activated");
soundFiles[i].play();
states[i] = true;
}
if (ldrValues[i] < thresholds[i]) {
println("sensor " + i + " is NOT activated");
soundFiles[i].stop();
states[i] = false;
}
}
}
You're approach is shall we say optimistic ? :)
It's always assuming there was a message from Serial, always formatted the right way so it could be parsed and there were absolutely 0 issues buffering data (incomplete strings, etc.))
The simplest thing you could do is check if the parsing was successful, otherwise the ldrValues array would still be null:
void draw()
{
background(255);
//serial loop
while (myPort.available() > 0) {
String myString = myPort.readStringUntil(10);
if (myString != null) {
//println(myString);
ldrValues = int(split(myString.trim(), ','));
//println(ldrValues);
}
}
// double check parsing int values from the string was successfully as well, not just buffering the string
if(ldrValues != null){
for (int i = 0; i < ldrValues.length; i++) {
println(states[i]);
println(ldrValues[i]);
if (ldrValues[i] > thresholds[i] && !states[i]) {
println("sensor " + i + " is activated");
soundFiles[i].play();
states[i] = true;
}
if (ldrValues[i] < thresholds[i]) {
println("sensor " + i + " is NOT activated");
soundFiles[i].stop();
states[i] = false;
}
}
}else{
// print a helpful debugging message otherwise
println("error parsing ldrValues from string: " + myString);
}
}
(Didn't know you could parse a int[] with int(): nice!)

NullPointerException for statement

public void loadFileRecursiv(String pathDir)
{
File fisier = new File(pathDir);
File[] listaFisiere = fisier.listFiles();
for(int i = 0; i < listaFisiere.length; i++)
{
if(listaFisiere[i].isDirectory())
{
loadFileRecursiv(pathDir + File.separatorChar + listaFisiere[i].getName());
}
else
{
String cuExtensie = listaFisiere[i].getName();
String nume = cuExtensie.split(".")[0];
String acronimBanca = nume.split("_")[0];
String tipAct = nume.split("_")[1];
String dataActString = nume.split("_")[2];
//Date dataAct = new SimpleDateFormat("dd-MM-yyyy").parse(dataActString);
//String denBanca = inlocuireAcronim(acronimBanca);
insertData(listaFisiere[i], cuExtensie, acronimBanca, tipAct, dataActString);
//fisiere[i].renameTo(new File("u02/ActeConstitutive/Mutate"));
}
}
}
I have a simple code that checks all files and folders recursevely when a path is given. Unfortunately, i have a NULLPOINTEREXCEPTION for for(int i = 0; i < listaFisiere.length; i++) this line. What can be the problem? Thank you!
check whether listaFisiere is null or not
If not null, change this line as for(int i = 0; i < listaFisiere.length(); i++)
and
you can change your code as below
for(File path:listaFisiere)
{
if(path.isDirectory())
{
loadFileRecursiv(pathDir + File.separatorChar + path.getName());
}
else
{
String cuExtensie = path.getName();
String nume = cuExtensie.split(".")[0];
String acronimBanca = nume.split("_")[0];
String tipAct = nume.split("_")[1];
String dataActString = nume.split("_")[2];
//Date dataAct = new SimpleDateFormat("dd-MM-yyyy").parse(dataActString);
//String denBanca = inlocuireAcronim(acronimBanca);
insertData(path, cuExtensie, acronimBanca, tipAct, dataActString);
//fisiere[i].renameTo(new File("u02/ActeConstitutive/Mutate"));
}
}

Label Printing using iTextSharp

I have a logic to export avery label pdf. The logic exports the pdf with labels properly but when i print that pdf, the page size measurements (Page properties) that i pass isn't matching with the printed page.
Page Properties
Width="48.5" Height="25.4" HorizontalGapWidth="0" VerticalGapHeight="0" PageMarginTop="21" PageMarginBottom="21" PageMarginLeft="8" PageMarginRight="8" PageSize="A4" LabelsPerRow="4" LabelRowsPerPage="10"
The above property values are converted equivalent to point values first before applied.
Convert to point
private float mmToPoint(double mm)
{
return (float)((mm / 25.4) * 72);
}
Logic
public Stream SecLabelType(LabelProp _label)
{
List<LabelModelClass> Model = new List<LabelModelClass>();
Model = RetModel(_label);
bool IncludeLabelBorders = false;
FontFactory.RegisterDirectories();
Rectangle pageSize;
switch (_label.PageSize)
{
case "A4":
pageSize = iTextSharp.text.PageSize.A4;
break;
default:
pageSize = iTextSharp.text.PageSize.A4;
break;
}
var doc = new Document(pageSize,
_label.PageMarginLeft,
_label.PageMarginRight,
_label.PageMarginTop,
_label.PageMarginBottom);
var output = new MemoryStream();
var writer = PdfWriter.GetInstance(doc, output);
writer.CloseStream = false;
doc.Open();
var numOfCols = _label.LabelsPerRow + (_label.LabelsPerRow - 1);
var tbl = new PdfPTable(numOfCols);
var colWidths = new List<float>();
for (int i = 1; i <= numOfCols; i++)
{
if (i % 2 > 0)
{
colWidths.Add(_label.Width);
}
else
{
colWidths.Add(_label.HorizontalGapWidth);
}
}
var w = iTextSharp.text.PageSize.A4.Width - (doc.LeftMargin + doc.RightMargin);
var h = iTextSharp.text.PageSize.A4.Height - (doc.TopMargin + doc.BottomMargin);
var size = new iTextSharp.text.Rectangle(w, h);
tbl.SetWidthPercentage(colWidths.ToArray(), size);
//var val = System.IO.File.ReadLines("C:\\Users\\lenovo\\Desktop\\test stock\\testing3.txt").ToArray();
//var ItemNoArr = Model.Select(ds => ds.ItemNo).ToArray();
//string Header = Model.Select(ds => ds.Header).FirstOrDefault();
int cnt = 0;
bool b = false;
int iAddRows = 1;
for (int iRow = 0; iRow < ((Model.Count() / _label.LabelsPerRow) + iAddRows); iRow++)
{
var rowCells = new List<PdfPCell>();
for (int iCol = 1; iCol <= numOfCols; iCol++)
{
if (Model.Count() > cnt)
{
if (iCol % 2 > 0)
{
var cellContent = new Phrase();
if (((iRow + 1) >= _label.StartRow && (iCol) >= (_label.StartColumn + (_label.StartColumn - 1))) || b)
{
b = true;
try
{
var StrArr = _label.SpineLblFormat.Split('|');
foreach (var x in StrArr)
{
string Value = "";
if (x.Contains(","))
{
var StrCommaArr = x.Split(',');
foreach (var y in StrCommaArr)
{
if (y != "")
{
Value = ChunckText(cnt, Model, y, Value);
}
}
if (Value != null && Value.Replace(" ", "") != "")
{
cellContent.Add(new Paragraph(Value));
cellContent.Add(new Paragraph("\n"));
}
}
else
{
Value = ChunckText(cnt, Model, x, Value);
if (Value != null && Value.Replace(" ", "") != "")
{
cellContent.Add(new Paragraph(Value));
cellContent.Add(new Paragraph("\n"));
}
}
}
}
catch (Exception e)
{
var fontHeader1 = FontFactory.GetFont("Verdana", BaseFont.CP1250, true, 6, 0);
cellContent.Add(new Chunk("NA", fontHeader1));
}
cnt += 1;
}
else
iAddRows += 1;
var cell = new PdfPCell(cellContent);
cell.FixedHeight = _label.Height;
cell.HorizontalAlignment = Element.ALIGN_LEFT;
cell.Border = IncludeLabelBorders ? Rectangle.BOX : Rectangle.NO_BORDER;
rowCells.Add(cell);
}
else
{
var gapCell = new PdfPCell();
gapCell.FixedHeight = _label.Height;
gapCell.Border = Rectangle.NO_BORDER;
rowCells.Add(gapCell);
}
}
else
{
var gapCell = new PdfPCell();
gapCell.FixedHeight = _label.Height;
gapCell.Border = Rectangle.NO_BORDER;
rowCells.Add(gapCell);
}
}
tbl.Rows.Add(new PdfPRow(rowCells.ToArray()));
_label.LabelRowsPerPage = _label.LabelRowsPerPage == null ? 0 : _label.LabelRowsPerPage;
if ((iRow + 1) < _label.LabelRowsPerPage && _label.VerticalGapHeight > 0)
{
tbl.Rows.Add(CreateGapRow(numOfCols, _label));
}
}
doc.Add(tbl);
doc.Close();
output.Position = 0;
return output;
}
private PdfPRow CreateGapRow(int numOfCols, LabelProp _label)
{
var cells = new List<PdfPCell>();
for (int i = 0; i < numOfCols; i++)
{
var cell = new PdfPCell();
cell.FixedHeight = _label.VerticalGapHeight;
cell.Border = Rectangle.NO_BORDER;
cells.Add(cell);
}
return new PdfPRow(cells.ToArray());
}
A PDF document may have very accurate measurements, but then those measurements get screwed up because the page is scaled during the printing process. That is a common problem: different printers will use different scaling factors with different results when you print the document using different printers.
How to avoid this?
In the print dialog of Adobe Reader, you can choose how the printer should behave:
By default, the printer will try to "Fit" the content on the page, but as not every printer can physically use the full page size (due to hardware limitations), there's a high chance the printer will scale the page down if you use "Fit".
It's better to choose the option "Actual size". The downside of using this option is that some content may get lost because it's too close to the border of the page in an area that physically can't be reached by the printer, but the advantage is that the measurements will be preserved.
You can set this option programmatically in your document by telling the document it shouldn't scale:
writer.AddViewerPreference(PdfName.PRINTSCALING, PdfName.NONE);
See How to set initial view properties? for more info about viewer preferences.