StyledDocument adding extra count to indexof for each line of file - highlighting

I have a strange problem (at least it appears that way): when searching for a string in a textPane, I get an extra index for each line that is searched when using a StyledDocument versus just getting the text from the textPane. I get the same text from the same pane; the only difference is that one is the plain text and the other comes from the styled doc. Am I missing something here? I'll try to list as many of the differences between the two versions I am working with as I can.
The plain text version:
public int displayXMLFile(String path, int target){
InputStreamReader inputStream;
FileInputStream fileStream;
BufferedReader buffReader;
if(target == 1){
try{
File file = new File(path);
fileStream = new FileInputStream(file);
inputStream = new InputStreamReader(fileStream,"UTF-8");
buffReader = new BufferedReader(inputStream);
StringBuffer content = new StringBuffer("");
String line = "";
while((line = buffReader.readLine())!=null){
content.append(line+"\n");
}
buffReader.close();
xhw.txtDisplay_1.setText(content.toString());
}
catch(Exception e){
e.printStackTrace();
return -1;
}
}
}
versus the Styled Doc version (without the styles applied):
protected void openFile(String path, StyledDocument sDoc, int target)
throws BadLocationException {
FileInputStream fileStream;
String file;
if(target == 1){
file = "Opening First File";
} else {
file = "Opening Second File";
}
try {
fileStream = new FileInputStream(path);
// Get the object of DataInputStream
//DataInputStream in = new DataInputStream(fileStream);
ProgressMonitorInputStream in = new ProgressMonitorInputStream(
xw.getContentPane(), file, fileStream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
sDoc.insertString(sDoc.getLength(), strLine + "\n", sDoc.getStyle("regular"));
xw.updateProgress(target);
}
//Close the input stream
in.close();
} catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
This is how I search:
public int searchText(int sPos, int target) throws BadLocationException{
String search = xhw.textSearch.getText();
String contents;
JTextPane searchPane;
if(target == 1){
searchPane = xhw.txtDisplay_1;
} else {
searchPane = xhw.txtDisplay_2;
}
if(xhw.textSearch.getText().isEmpty()){
xhw.displayDialog("Nothing to search for");
highlight(searchPane, null, 0,0);
} else {
contents = searchPane.getText();
// Search for the desired string starting at cursor position
int newPos = contents.indexOf( search, sPos );
// cycle cursor to beginning of doc window
if (newPos == -1 && sPos > 0){
sPos = 0;
newPos = contents.indexOf( search, sPos );
}
if ( newPos >= 0 ) {
// Select occurrence if found
highlight(searchPane, contents, newPos, target);
sPos = newPos + search.length()+1;
} else {
xhw.displayDialog("\"" + search + "\"" + " was not found in File " + target);
}
}
return sPos;
}
The sample file:
<?xml version="1.0" encoding="UTF-8"?>
<AlternateDepartureRoutes>
<AlternateDepartureRoute>
<AdrName>BOIRR</AdrName>
<AdrRouteAlpha>..BROPH..</AdrRouteAlpha>
<TransitionFix>
<FixName>BROPH</FixName>
</TransitionFix>
</AlternateDepartureRoute>
<AlternateDepartureRoute>
</AlternateDepartureRoutes>
And my highlighter:
public void highlight(JTextPane tPane, String text, int position, int target) throws BadLocationException {
Highlighter highlighter = new DefaultHighlighter();
Highlighter.HighlightPainter painter = new DefaultHighlighter.DefaultHighlightPainter(Color.LIGHT_GRAY);
tPane.setHighlighter(highlighter);
String searchText = xhw.textSearch.getText();
String document = tPane.getText();
int startOfSString = document.indexOf(searchText,position);
if(startOfSString >= 0){
int endOfSString = startOfSString + searchText.length();
highlighter.addHighlight(startOfSString, endOfSString, painter);
tPane.setCaretPosition(endOfSString);
int caretPos = tPane.getCaretPosition();
javax.swing.text.Element root = tPane.getDocument().getDefaultRootElement();
int lineNum = root.getElementIndex(caretPos) +1;
if (target == 1){
xhw.txtLineNum1.setText(Integer.toString(lineNum));
} else if (target == 2){
xhw.txtLineNum2.setText(Integer.toString(lineNum));
} else {
xhw.txtLineNum1.setText(null);
xhw.txtLineNum2.setText(null);
}
} else {
highlighter.removeAllHighlights();
}
}
When I search for Alt with indexOf(), I get 40 for the plain text (which is what it should return) and 41 when searching with the styled doc. For each additional line that Alt appears on I get an extra index (so the indexOf() call returns 2 more than it should on line 3). This happens for every additional line it finds. Am I missing something obvious? (If I need to reduce this to a smaller single class to make it easier to check, I can do that later when I have more time.)
Thanks in advance...

If you are on Windows, then the TextComponent text (searchPane.getText()) can contain carriage-return+newline characters (\r\n), but the TextComponent's Styled Document (sSearchPane.getText(0, sSearchPane.getLength())) contains only newline characters (\n). That's why your newPos is always larger than newPosS by the number of newlines at that point. To fix this, in your search function you can change:
contents = searchPane.getText();
to:
contents = searchPane.getText().replaceAll("\r\n","\n");
That way the search occurs with the same indices that the Styled Document is using.
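The offset can be reproduced with plain strings, without any Swing components. The sketch below is only an illustration: the class name NewlineOffsetDemo and the helper normalize are made up for this example, and the "\r\n" string merely assumes what getText() may return on Windows.

```java
public class NewlineOffsetDemo {
    // Normalise Windows line endings so that indices match the Document model,
    // which stores a single '\n' per line break.
    static String normalize(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String second = "<AlternateDepartureRoutes>";
        // What the Document holds internally.
        String docText = header + "\n" + second + "\n";
        // What getText() may return on Windows (assumed here for illustration).
        String paneText = header + "\r\n" + second + "\r\n";

        System.out.println(docText.indexOf("Alt"));             // 40
        System.out.println(paneText.indexOf("Alt"));            // 41
        System.out.println(normalize(paneText).indexOf("Alt")); // 40
    }
}
```

Each "\r\n" before the match adds one character, which is why the difference grows with the line number.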

OK, I have found a solution (basically). I approached this from the aspect that I am getting text from the same text component in two different ways...
String search = xw.textSearch.getText();
String contents;
String contentsS;
JTextPane searchPane;
StyledDocument sSearchPane;
searchPane = xw.txtDisplay_left;
sSearchPane = xw.txtDisplay_left.getStyledDocument();
contents = searchPane.getText();
contentsS = sSearchPane.getText(0, sSearchPane.getLength());
// Search for the desired string starting at cursor position
int newPos = contents.indexOf( search, sPos );
int newPosS = contentsS.indexOf(search, sPos);
So when comparing the two variables "newPos" and "newPosS", newPos returned 1 more than newPosS for each line up to where the search string was found. Looking at the sample file and searching for "Alt", the first instance is found on line 2: "newPos" returns 41 and "newPosS" returns 40 (which then highlights the correct text). For the next occurrence (found on line 3), "newPos" returns 71 and "newPosS" returns 69. As you can see, the difference grows by one with each new line before the occurrence. I suspect that an extra character is being added for each new line in the text from the textPane that is not present in the StyledDoc.
I'm sure there is a reasonable explanation, but I don't have it at this time.

Related

Update PDF using pdfBox

I would like to ask whether it is possible, using the PDFBox libraries, to update an existing PDF at a specific point.
I am trying to use a solution that is already online, but it seems the getTokens() method does not split the words in a way that lets me find the part I would like to modify.
This is the code(Groovy):
for( int i = 0; i < dataContext.getDataCount(); i++ ) {
InputStream is = dataContext.getStream(i);
Properties props = dataContext.getProperties(i);
String searchString= "Hours worked";
String replacement = "Hours worked: 2";
File file = new File("\\\\****\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\Template\\***.pdf");
PDDocument doc = PDDocument.load(file);
for ( PDPage page : doc.getPages() )
{
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
logger.info("in Page");
for (int j = 0; j < tokens.size(); j++)
{
logger.info("tokens:"+tokens[j]);
Object next = tokens.get(j);
//logger.info("in Object");
if (next instanceof Operator)
{
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
//Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
logger.info("in Tj");
// Tj takes one operator and that is the string to display so lets update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
logger.info("previousString:"+string);
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else
if (op.getName().equals("TJ"))
{
logger.info("in TJ:"+ op.getName());
COSArray previous = (COSArray) tokens.get(j - 1);
logger.info("previous:"+previous);
for (int k = 0; k < previous.size(); k++)
{
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString)
{
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
logger.info("string:"+string);
if (j == prej || string.equals(" ") || string.equals(":") || string.equals("-")) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
logger.info("pstring:"+pstring);
if (searchString.equals(pstring.trim()))
{
logger.info("in searchString");
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size()-1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
logger.info("in updatedStream");
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
logger.info("in tokenWriter");
out.close();
page.setContents(updatedStream);
doc.save("\\\\***\\UKDC\\GFS\\PRE\\PREPROD\\Alchemer\\***1.pdf");
}
When I execute the code, I am trying to find the string "Hours worked" and replace it with "Hours worked: 2".
There are two problems:
1. When I execute it and check the logs, I can see that the tokens are not created as I expect: two different COSArrays are created, even though the text is all on one line in the PDF. This is a problem if I have to search for a specific phrase.
2. When it finds the word, the replacement seems to work, but a strange character appears in the output.
So here are my two questions:
How can I control the tokenizer (or the parser) so that an entire phrase ends up in the same token until a special character occurs?
How do I format the new characters in the new PDF?
Hope you can help me, thanks for your support.

Using Lucene's highlighting, getting too much highlighted, is there a workaround for this?

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field, and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate because so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
@Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index I can see that the index contains the separate terms I expect, including the synonym values. For example the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show all the search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work but once I had it all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
@Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
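To see what the regular expression does on its own, it can be run through java.util.regex without Lucene. This is only a sketch: the class TokenRegexDemo and the method tokens are hypothetical names for this illustration. It prints each match with its start and end offsets. A plausible reason the change helps (an assumption, not something stated in the post) is that a tokenizer driven by regex matches can report offsets per match, whereas a filter that only rewrites the term text keeps the offsets of the original whitespace-delimited token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // Same expression as tokenRegex in the analyzer above.
    static final Pattern TOKEN = Pattern.compile("(([\\w]+-)*[\\w]+)|[^\\w\\s]");

    // Returns each match as "text[start,end)" so the offsets are visible.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("event);"));   // [event[0,5), )[5,6), ;[6,7)]
        System.out.println(tokens("job.queue")); // [job[0,3), .[3,4), queue[4,9)]
    }
}
```

Words (optionally joined by hyphens) come out as one token and every other non-whitespace character becomes a token of its own, which matches the terms listed for the index earlier in the question.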

Insert line breaks in VariableReplace in docx4j

I have been trying to fill up a word template(.docx) file which has placeholders which needs to be replaced.
I was able to rewrite the template but the text does not come with line breaks
I understand that carriage return or new line (\r\n) does not work in .docx files. I used the VariableReplace method to convert but I was unable to place br or factory.createBr() while using the variable replace.
Any suggestions would be really helpful. Below is the piece of code what i tried
Map<String,String> variableReplaceMap = new HashMap<>();
Map<String, String> textContent = readTextContentAfterDBExtractionToFillUpTemplate();
ObjectFactory factory = Context.getWmlObjectFactory();
P para = factory.createP();
R rspc = factory.createR();
String power= textContent.get("Power & Energy");
String[] powerWithNewLine = power.split("\\\\n");
for (String eachLineOfPower : powerWithNewLine) {
Text eachLineOfPowerTxt = factory.createText();
eachLineOfPowerTxt .setValue( eachLineOfPower );
rspc.getContent().add( eachLineOfPowerTxt );
Br br = factory.createBr();
rspc.getContent().add(br);
para.getParagraphContent().add(rspc);
documentPart.addObject(para);
}
String str = "";
for (Object eachLineOfPgrph : para.getParagraphContent()) {
str = str + eachLineOfPgrph;
}
variableReplaceMap.put("POWER", str);
return variableReplaceMap;
The link from Jason is dead.
Here is the current link: https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/VariableReplace.java
In case it gets changed in the future, simply use the following function and apply it to your string, that contains linebreaks:
/**
* Hack to convert a new line character into w:br.
* If you need this sort of thing, consider using
* OpenDoPE content control data binding instead.
*
* @param r
* @return
*/
private static String newlineToBreakHack(String r) {
StringTokenizer st = new StringTokenizer(r, "\n\r\f"); // tokenize on the newline character, the carriage-return character, and the form-feed character
StringBuilder sb = new StringBuilder();
boolean firsttoken = true;
while (st.hasMoreTokens()) {
String line = (String) st.nextToken();
if (firsttoken) {
firsttoken = false;
} else {
sb.append("</w:t><w:br/><w:t>");
}
sb.append(line);
}
return sb.toString();
}
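A minimal usage sketch, assuming the placeholder is called POWER as in the question: the converted string is what goes into the replacement map, which is then handed to the variable replacement step as in the linked VariableReplace sample. The class name NewlineToBreakDemo and the sample text are invented for this illustration; the method body is copied from above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class NewlineToBreakDemo {
    // Same logic as newlineToBreakHack above: every run of line-break
    // characters becomes the end of one w:t, a w:br, and the start of a new w:t.
    static String newlineToBreakHack(String r) {
        StringTokenizer st = new StringTokenizer(r, "\n\r\f");
        StringBuilder sb = new StringBuilder();
        boolean firstToken = true;
        while (st.hasMoreTokens()) {
            String line = st.nextToken();
            if (firstToken) {
                firstToken = false;
            } else {
                sb.append("</w:t><w:br/><w:t>");
            }
            sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        // Hypothetical value; in the question it comes from the database extract.
        mappings.put("POWER", newlineToBreakHack("Solar\nWind\r\nHydro"));
        System.out.println(mappings.get("POWER"));
        // Solar</w:t><w:br/><w:t>Wind</w:t><w:br/><w:t>Hydro
    }
}
```

Because StringTokenizer skips empty tokens, consecutive line breaks (including "\r\n" pairs and blank lines) collapse into a single w:br.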
See newlineToBreakHack method at https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java#L122

Capitalise first letter of each word in string + lowercase all other letters [duplicate]

Is there a function built into Java that capitalizes the first character of each word in a String, and does not affect the others?
Examples:
jon skeet -> Jon Skeet
miles o'Brien -> Miles O'Brien (B remains capital, this rules out Title Case)
old mcdonald -> Old Mcdonald*
*(Old McDonald would be fine too, but I don't expect it to be THAT smart.)
A quick look at the Java String Documentation reveals only toUpperCase() and toLowerCase(), which of course do not provide the desired behavior. Naturally, Google results are dominated by those two functions. It seems like a wheel that must have been invented already, so it couldn't hurt to ask so I can use it in the future.
WordUtils.capitalize(str) (from apache commons-text)
(Note: if you need "fOO BAr" to become "Foo Bar", then use capitalizeFully(..) instead)
If you're only worried about the first letter of the first word being capitalized:
private String capitalize(final String line) {
return Character.toUpperCase(line.charAt(0)) + line.substring(1);
}
The following method converts all the letters into upper/lower case, depending on their position near a space or other special chars.
public static String capitalizeString(String string) {
char[] chars = string.toLowerCase().toCharArray();
boolean found = false;
for (int i = 0; i < chars.length; i++) {
if (!found && Character.isLetter(chars[i])) {
chars[i] = Character.toUpperCase(chars[i]);
found = true;
} else if (Character.isWhitespace(chars[i]) || chars[i]=='.' || chars[i]=='\'') { // You can add other chars here
found = false;
}
}
return String.valueOf(chars);
}
Try this very simple way
example givenString="ram is good boy"
public static String toTitleCase(String givenString) {
String[] arr = givenString.split(" ");
StringBuffer sb = new StringBuffer();
for (int i = 0; i < arr.length; i++) {
sb.append(Character.toUpperCase(arr[i].charAt(0)))
.append(arr[i].substring(1)).append(" ");
}
return sb.toString().trim();
}
Output will be: Ram Is Good Boy
I made a solution in Java 8 that is IMHO more readable.
public String firstLetterCapitalWithSingleSpace(final String words) {
return Stream.of(words.trim().split("\\s"))
.filter(word -> word.length() > 0)
.map(word -> word.substring(0, 1).toUpperCase() + word.substring(1))
.collect(Collectors.joining(" "));
}
The Gist for this solution can be found here: https://gist.github.com/Hylke1982/166a792313c5e2df9d31
String toBeCapped = "i want this sentence capitalized";
String[] tokens = toBeCapped.split("\\s");
toBeCapped = "";
for(int i = 0; i < tokens.length; i++){
char capLetter = Character.toUpperCase(tokens[i].charAt(0));
toBeCapped += " " + capLetter + tokens[i].substring(1);
}
toBeCapped = toBeCapped.trim();
I've written a small Class to capitalize all the words in a String.
Optional multiple delimiters, each one with its own behavior (capitalize before, after, or both, to handle cases like O'Brian);
Optional Locale;
Doesn't break with surrogate pairs.
Output:
====================================
SIMPLE USAGE
====================================
Source: cApItAlIzE this string after WHITE SPACES
Output: Capitalize This String After White Spaces
====================================
SINGLE CUSTOM-DELIMITER USAGE
====================================
Source: capitalize this string ONLY before'and''after'''APEX
Output: Capitalize this string only beforE'AnD''AfteR'''Apex
====================================
MULTIPLE CUSTOM-DELIMITER USAGE
====================================
Source: capitalize this string AFTER SPACES, BEFORE'APEX, and #AFTER AND BEFORE# NUMBER SIGN (#)
Output: Capitalize This String After Spaces, BeforE'apex, And #After And BeforE# Number Sign (#)
====================================
SIMPLE USAGE WITH CUSTOM LOCALE
====================================
Source: Uniforming the first and last vowels (different kind of 'i's) of the Turkish word D[İ]YARBAK[I]R (DİYARBAKIR)
Output: Uniforming The First And Last Vowels (different Kind Of 'i's) Of The Turkish Word D[i]yarbak[i]r (diyarbakir)
====================================
SIMPLE USAGE WITH A SURROGATE PAIR
====================================
Source: ab 𐐂c de à
Output: Ab 𐐪c De À
Note: first letter will always be capitalized (edit the source if you don't want that).
Please share your comments and help me find bugs or improve the code...
Code:
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
public class WordsCapitalizer {
public static String capitalizeEveryWord(String source) {
return capitalizeEveryWord(source,null,null);
}
public static String capitalizeEveryWord(String source, Locale locale) {
return capitalizeEveryWord(source,null,locale);
}
public static String capitalizeEveryWord(String source, List<Delimiter> delimiters, Locale locale) {
char[] chars;
if (delimiters == null || delimiters.size() == 0)
delimiters = getDefaultDelimiters();
// If Locale specified, i18n toLowerCase is executed, to handle specific behaviors (eg. Turkish dotted and dotless 'i')
if (locale!=null)
chars = source.toLowerCase(locale).toCharArray();
else
chars = source.toLowerCase().toCharArray();
// First character ALWAYS capitalized, if it is a Letter.
if (chars.length>0 && Character.isLetter(chars[0]) && !isSurrogate(chars[0])){
chars[0] = Character.toUpperCase(chars[0]);
}
for (int i = 0; i < chars.length; i++) {
if (!isSurrogate(chars[i]) && !Character.isLetter(chars[i])) {
// Current char is not a Letter; going to check if it is a delimiter.
for (Delimiter delimiter : delimiters){
if (delimiter.getDelimiter()==chars[i]){
// Delimiter found, applying rules...
if (delimiter.capitalizeBefore() && i>0
&& Character.isLetter(chars[i-1]) && !isSurrogate(chars[i-1]))
{ // previous character is a Letter and I have to capitalize it
chars[i-1] = Character.toUpperCase(chars[i-1]);
}
if (delimiter.capitalizeAfter() && i<chars.length-1
&& Character.isLetter(chars[i+1]) && !isSurrogate(chars[i+1]))
{ // next character is a Letter and I have to capitalize it
chars[i+1] = Character.toUpperCase(chars[i+1]);
}
break;
}
}
}
}
return String.valueOf(chars);
}
private static boolean isSurrogate(char chr){
// Check if the current character is part of an UTF-16 Surrogate Pair.
// Note: not validating the pair, just used to bypass (any found part of) it.
return (Character.isHighSurrogate(chr) || Character.isLowSurrogate(chr));
}
private static List<Delimiter> getDefaultDelimiters(){
// If no delimiter specified, "Capitalize after space" rule is set by default.
List<Delimiter> delimiters = new ArrayList<Delimiter>();
delimiters.add(new Delimiter(Behavior.CAPITALIZE_AFTER_MARKER, ' '));
return delimiters;
}
public static class Delimiter {
private Behavior behavior;
private char delimiter;
public Delimiter(Behavior behavior, char delimiter) {
super();
this.behavior = behavior;
this.delimiter = delimiter;
}
public boolean capitalizeBefore(){
return (behavior.equals(Behavior.CAPITALIZE_BEFORE_MARKER)
|| behavior.equals(Behavior.CAPITALIZE_BEFORE_AND_AFTER_MARKER));
}
public boolean capitalizeAfter(){
return (behavior.equals(Behavior.CAPITALIZE_AFTER_MARKER)
|| behavior.equals(Behavior.CAPITALIZE_BEFORE_AND_AFTER_MARKER));
}
public char getDelimiter() {
return delimiter;
}
}
public static enum Behavior {
CAPITALIZE_AFTER_MARKER(0),
CAPITALIZE_BEFORE_MARKER(1),
CAPITALIZE_BEFORE_AND_AFTER_MARKER(2);
private int value;
private Behavior(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
}
Using org.apache.commons.lang.StringUtils makes it very simple.
capitalizeStr = StringUtils.capitalize(str);
From Java 9+
you can use String::replaceAll like this :
public static void upperCaseAllFirstCharacter(String text) {
String regex = "\\b(.)(.*?)\\b";
String result = Pattern.compile(regex).matcher(text).replaceAll(
matche -> matche.group(1).toUpperCase() + matche.group(2)
);
System.out.println(result);
}
Example :
upperCaseAllFirstCharacter("hello this is Just a test");
Outputs
Hello This Is Just A Test
With this simple code:
String example="hello";
example=example.substring(0,1).toUpperCase()+example.substring(1, example.length());
System.out.println(example);
Result: Hello
I'm using the following function. I think it is faster in performance.
public static String capitalize(String text){
String c = (text != null)? text.trim() : "";
String[] words = c.split(" ");
String result = "";
for(String w : words){
result += (w.length() > 1? w.substring(0, 1).toUpperCase(Locale.US) + w.substring(1, w.length()).toLowerCase(Locale.US) : w) + " ";
}
return result.trim();
}
Use the Split method to split your string into words, then use the built in string functions to capitalize each word, then append together.
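For example:
String string = "the sentence you want to apply caps to";
String[] words = string.split(" ");
StringBuilder result = new StringBuilder();
for (String w : words) {
if (w.isEmpty()) continue; // skip empty pieces produced by repeated spaces
// Upper-case the first character and lower-case the rest of the word
String word = w.substring(0, 1).toUpperCase() + w.substring(1).toLowerCase();
if (result.length() > 0) result.append(' ');
result.append(word);
}
string = result.toString();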
In the end string looks something like
"The Sentence You Want To Apply Caps To"
This might be useful if you need to capitalize titles. It capitalizes each substring delimited by " ", except for specified strings such as "a" or "the". I haven't run it yet because it's late, but it should be fine. It uses Apache Commons StringUtils.join() at one point; you can substitute a simple loop if you wish.
private static String capitalize(String string) {
if (string == null) return null;
String[] wordArray = string.split(" "); // Split string to analyze word by word.
int i = 0;
lowercase:
for (String word : wordArray) {
if (i > 0) { // First word is always capitalized
String [] lowercaseWords = {"a", "an", "as", "and", "although", "at", "because", "but", "by", "for", "in", "nor", "of", "on", "or", "so", "the", "to", "up", "yet"};
for (String word2 : lowercaseWords) {
if (word.equals(word2)) {
wordArray[i] = word;
i++;
continue lowercase;
}
}
}
char[] characterArray = word.toCharArray();
characterArray[0] = Character.toTitleCase(characterArray[0]);
wordArray[i] = new String(characterArray);
i++;
}
return StringUtils.join(wordArray, " "); // Re-join string
}
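If you would rather not pull in Apache Commons, the same idea can be sketched with only the JDK. This is an illustrative variant, not the code above: TitleCaseSketch and titleCase are hypothetical names, String.join (Java 8+) stands in for StringUtils.join, and empty tokens from consecutive spaces are left untouched:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TitleCaseSketch {
    // Words kept in lower case unless they start the title (same list as above).
    private static final Set<String> MINOR_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "as", "and", "although", "at", "because", "but", "by", "for",
            "in", "nor", "of", "on", "or", "so", "the", "to", "up", "yet"));

    static String titleCase(String title) {
        if (title == null) {
            return null;
        }
        String[] words = title.split(" ");
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            // The first word is always capitalized, even if it is a minor word.
            boolean keepLower = i > 0 && MINOR_WORDS.contains(word);
            if (!word.isEmpty() && !keepLower) {
                words[i] = Character.toTitleCase(word.charAt(0)) + word.substring(1);
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(titleCase("a tale of two cities")); // prints: A Tale of Two Cities
    }
}
```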
public static String toTitleCase(String word){
return Character.toUpperCase(word.charAt(0)) + word.substring(1);
}
public static void main(String[] args){
String phrase = "this is to be title cased";
String[] splitPhrase = phrase.split(" ");
String result = "";
for(String word: splitPhrase){
result += toTitleCase(word) + " ";
}
System.out.println(result.trim());
}
1. Java 8 Streams
public static String capitalizeAll(String str) {
if (str == null || str.isEmpty()) {
return str;
}
return Arrays.stream(str.split("\\s+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
}
Examples:
System.out.println(capitalizeAll("jon skeet")); // Jon Skeet
System.out.println(capitalizeAll("miles o'Brien")); // Miles O'Brien
System.out.println(capitalizeAll("old mcdonald")); // Old Mcdonald
System.out.println(capitalizeAll(null)); // null
For foo bAR to Foo Bar, replace the map() method with the following:
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1).toLowerCase())
2. String.replaceAll() (Java 9+)
public static String capitalizeAll(String str) {
if (str == null || str.isEmpty()) {
return str;
}
return Pattern.compile("\\b(.)(.*?)\\b")
.matcher(str)
.replaceAll(match -> match.group(1).toUpperCase() + match.group(2));
}
Examples:
System.out.println(capitalizeAll("12 ways to learn java")); // 12 Ways To Learn Java
System.out.println(capitalizeAll("i am atta")); // I Am Atta
System.out.println(capitalizeAll(null)); // null
3. Apache Commons Text
System.out.println(WordUtils.capitalize("love is everywhere")); // Love Is Everywhere
System.out.println(WordUtils.capitalize("sky, sky, blue sky!")); // Sky, Sky, Blue Sky!
System.out.println(WordUtils.capitalize(null)); // null
For titlecase:
System.out.println(WordUtils.capitalizeFully("fOO bAR")); // Foo Bar
System.out.println(WordUtils.capitalizeFully("sKy is BLUE!")); // Sky Is Blue!
For details, check out this tutorial.
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Enter the sentence : ");
try
{
String str = br.readLine();
char[] str1 = new char[str.length()];
for(int i=0; i<str.length(); i++)
{
str1[i] = Character.toLowerCase(str.charAt(i));
}
str1[0] = Character.toUpperCase(str1[0]);
for(int i=0;i<str.length();i++)
{
if(str1[i] == ' ' && i + 1 < str.length())
{
str1[i+1] = Character.toUpperCase(str1[i+1]);
}
System.out.print(str1[i]);
}
}
catch(Exception e)
{
System.err.println("Error: " + e.getMessage());
}
I decided to add one more solution for capitalizing words in a string:
words are defined here as adjacent letter-or-digit characters;
surrogate pairs are supported as well;
the code has been optimized for performance; and
it is still compact.
Function:
public static String capitalize(String string) {
final int sl = string.length();
final StringBuilder sb = new StringBuilder(sl);
boolean lod = false;
for(int s = 0; s < sl; s++) {
final int cp = string.codePointAt(s);
sb.appendCodePoint(lod ? Character.toLowerCase(cp) : Character.toUpperCase(cp));
lod = Character.isLetterOrDigit(cp);
if(!Character.isBmpCodePoint(cp)) s++;
}
return sb.toString();
}
Example call:
System.out.println(capitalize("An à la carte StRiNg. Surrogate pairs: 𐐪𐐪."));
Result:
An À La Carte String. Surrogate Pairs: 𐐂𐐪.
Use:
String text = "jon skeet, miles o'brien, old mcdonald";
Pattern pattern = Pattern.compile("\\b([a-z])([\\w]*)");
Matcher matcher = pattern.matcher(text);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, matcher.group(1).toUpperCase() + matcher.group(2));
}
String capitalized = matcher.appendTail(buffer).toString();
System.out.println(capitalized);
There are many ways to capitalize the first letter of each word. I have an idea. It's very simple:
public String capitalize(String str){
/* The first thing we do is collapse runs of whitespace and trim the string */
String c = str.replaceAll("\\s+", " ");
String s = c.trim();
StringBuilder l = new StringBuilder(s.length());
for(int i = 0; i < s.length(); i++){
char ch = s.charAt(i);
/* Uppercase the first letter of the string and every letter that follows a space */
if(i == 0 || s.charAt(i - 1) == ' '){
l.append(Character.toUpperCase(ch));
} else {
l.append(ch);
}
}
return l.toString();
}
package com.test;
/**
* @author Prasanth Pillai
* @date 01-Feb-2012
* @description : Below is the test class details
*
* inputs a String from a user. Expect the String to contain spaces and alphanumeric characters only.
* capitalizes all first letters of the words in the given String.
* preserves all other characters (including spaces) in the String.
* displays the result to the user.
*
* Approach : I have followed a simple approach. However there are many string utilities available
* for the same purpose. Example : WordUtils.capitalize(str) (from apache commons-lang)
*
*/
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class Test {
public static void main(String[] args) throws IOException{
System.out.println("Input String :\n");
InputStreamReader converter = new InputStreamReader(System.in);
BufferedReader in = new BufferedReader(converter);
String inputString = in.readLine();
int length = inputString.length();
StringBuffer newStr = new StringBuffer(0);
int i = 0;
int k = 0;
/* This is a simple approach
* step 1: scan through the input string
* step 2: capitalize the first letter of each word in string
* The integer k, is used as a value to determine whether the
* letter is the first letter in each word in the string.
*/
while( i < length){
if (Character.isLetter(inputString.charAt(i))){
if ( k == 0){
newStr = newStr.append(Character.toUpperCase(inputString.charAt(i)));
k = 2;
}//this else branch avoids repeating the first letter in the output string
else {
newStr = newStr.append(inputString.charAt(i));
}
} // non-letter characters are appended unchanged and reset k, so the next letter starts a new word
else {
newStr = newStr.append(inputString.charAt(i));
k=0;
}
i+=1;
}
System.out.println("new String ->"+newStr);
}
}
Here is a simple function
public static String capEachWord(String source){
String result = "";
String[] splitString = source.split(" ");
for(String target : splitString){
result += Character.toUpperCase(target.charAt(0))
+ target.substring(1) + " ";
}
return result.trim();
}
This is just another way of doing it:
private String capitalize(String line)
{
StringTokenizer token =new StringTokenizer(line);
String CapLine="";
while(token.hasMoreTokens())
{
String tok = token.nextToken().toString();
CapLine += Character.toUpperCase(tok.charAt(0))+ tok.substring(1)+" ";
}
return CapLine.substring(0,CapLine.length()-1);
}
Reusable method for initCap:
public class YarlagaddaSireeshTest{
public static void main(String[] args) {
String FinalStringIs = "";
String testNames = "sireesh yarlagadda test";
String[] name = testNames.split("\\s");
for(String nameIs :name){
FinalStringIs += getIntiCapString(nameIs) + ",";
}
System.out.println("Final Result "+ FinalStringIs);
}
public static String getIntiCapString(String param) {
if(param != null && param.length()>0){
char[] charArray = param.toCharArray();
charArray[0] = Character.toUpperCase(charArray[0]);
return new String(charArray);
}
else {
return "";
}
}
}
Here is my solution.
I ran across this problem tonight and decided to search it. I found an answer by Neelam Singh that was almost there, so I decided to fix the issue (it broke on empty strings and caused a system crash).
The method you are looking for is named capString(String s) below.
It turns "It's only 5am here" into "It's Only 5am Here".
The code is pretty well commented, so enjoy.
package com.lincolnwdaniel.interactivestory.model;
public class StringS {
/**
* @param s is a string of any length, ideally only one word
* @return a capitalized string.
* only the first letter of the string is made to uppercase
*/
public static String capSingleWord(String s) {
if(s.isEmpty()) {
return s;
}
else if(s.length()<2) {
return Character.toUpperCase(s.charAt(0))+"";
}
else {
return Character.toUpperCase(s.charAt(0)) + s.substring(1);
}
}
/**
*
* @param s is a string of any length
* @return a title cased string.
* The first letter of each word is made uppercase
*/
public static String capString(String s) {
// Check if the string is empty, if it is, return it immediately
if(s.isEmpty()){
return s;
}
// Split string on space and create array of words
String[] arr = s.split(" ");
// Create a string buffer to hold the new capitalized string
StringBuffer sb = new StringBuffer();
// Check if the array is empty (this happens when s consists only of spaces, e.g. " ").
// If it is, return the original string immediately
if( arr.length < 1 ){
return s;
}
for (int i = 0; i < arr.length; i++) {
// Skip empty tokens produced by consecutive spaces
if (arr[i].isEmpty()) continue;
sb.append(Character.toUpperCase(arr[i].charAt(0)))
.append(arr[i].substring(1)).append(" ");
}
return sb.toString().trim();
}
}
Here is a way to capitalize the first character of each word:
public static void main(String[] args) {
String input ="my name is ranjan";
String[] inputArr = input.split(" ");
for(String word : inputArr) {
System.out.print(word.substring(0, 1).toUpperCase()+word.substring(1)+" ");
}
}
}
//Output : My Name Is Ranjan
For those of you using Velocity in your MVC, you can use the capitalizeFirstLetter() method from the StringUtils class.
String s="hi dude i want apple";
s = s.replaceAll("\\s+"," ");
String[] split = s.split(" ");
s="";
for (int i = 0; i < split.length; i++) {
split[i]=Character.toUpperCase(split[i].charAt(0))+split[i].substring(1);
s+=split[i]+" ";
System.out.println(split[i]);
}
System.out.println(s);
package corejava.string.intern;
import java.io.DataInputStream;
import java.util.ArrayList;
/*
* Write a program to accept only 3 sentences and convert the first character of each word into upper case
*/
public class Accept3Lines_FirstCharUppercase {
static String line;
static String words[];
static ArrayList<String> list=new ArrayList<String>();
/**
* @param args
*/
public static void main(String[] args) throws java.lang.Exception{
DataInputStream read=new DataInputStream(System.in);
System.out.println("Enter only three sentences");
int i=0;
while((line=read.readLine())!=null){
method(line); //main logic of the code
if((i++)==2){
break;
}
}
display();
System.out.println("\n End of the program");
}
/*
* this will display all the elements in an array
*/
public static void display(){
for(String display:list){
System.out.println(display);
}
}
/*
* this divide the line of string into words
* and first char of the each word is converted to upper case
* and to an array list
*/
public static void method(String lineParam){
words=lineParam.split("\\s");
for(String s:words){
String result=s.substring(0,1).toUpperCase()+s.substring(1);
list.add(result);
}
}
}
If you prefer Guava...
String myString = ...;
String capWords = Joiner.on(' ').join(Iterables.transform(Splitter.on(' ').omitEmptyStrings().split(myString), new Function<String, String>() {
public String apply(String input) {
return Character.toUpperCase(input.charAt(0)) + input.substring(1);
}
}));
String toUpperCaseFirstLetterOnly(String str) {
String[] words = str.split(" ");
StringBuilder ret = new StringBuilder();
for(int i = 0; i < words.length; i++) {
ret.append(Character.toUpperCase(words[i].charAt(0)));
ret.append(words[i].substring(1));
if(i < words.length - 1) {
ret.append(' ');
}
}
return ret.toString();
}

How to Detect table start in itextSharp?

I am trying to convert a PDF to a CSV file. The PDF has data in tabular format with the first row as a header. I have got as far as extracting text from a cell, comparing the baseline of text in the table, and detecting newlines, but I need to compare table borders to detect the start of the table. I do not know how to detect and compare lines in a PDF. Can anyone help me?
Thanks!!!
As you've seen (hopefully), PDFs have no concept of tables, just text placed at specific locations and lines drawn around them. There is no internal relationship between the text and the lines. This is very important to understand.
Knowing this, if all of the cells have enough padding you can look for gaps between characters that are large enough such as the width of 3 or more spaces. If the cells don't have enough spacing this will unfortunately probably break.
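To make the gap idea concrete, here is a rough sketch in Java rather than C#. It does not use iTextSharp at all: Glyph is a hypothetical holder for one extracted character and its left/right x-coordinates, and the space width and the "3 spaces" threshold are assumptions you would have to measure or tune for your own PDF:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GapSplitter {
    // Hypothetical holder for one extracted character and its horizontal extent, in points.
    static final class Glyph {
        final String text;
        final float left;
        final float right;

        Glyph(String text, float left, float right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    // Splits one text line (glyphs sorted left to right) into cells wherever the
    // horizontal gap is at least minSpaces times the width of a space.
    static List<String> splitIntoCells(List<Glyph> line, float spaceWidth, int minSpaces) {
        float threshold = spaceWidth * minSpaces;
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null && g.left - previous.right >= threshold) {
                cells.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            cells.add(current.toString().trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("I", 10f, 13f), new Glyph("d", 13f, 18f),
                new Glyph("Q", 60f, 67f), new Glyph("t", 67f, 70f), new Glyph("y", 70f, 75f));
        System.out.println(splitIntoCells(line, 3f, 3)); // prints [Id, Qty]
    }
}
```

As noted above, this breaks down when cells are packed tightly, because the gap between columns is then no wider than the gap between words.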
You could also look at every line in the PDF and try to figure out what represents your "table-like" lines. See this answer for how to walk every token on a page to see what's being drawn.
I was also searching for an answer to a similar question, but unfortunately I didn't find one, so I did it on my own.
A PDF page like this
Will give the output as
Here is the github link for the dotnet Console Application I made.
https://github.com/Justabhi96/Detect_And_Extract_Table_From_Pdf
This application detects the table in the specific page of the PDF and prints them in a table format on the console.
Here is the code that I used to make this application.
First of all, I extracted the text from the PDF along with its coordinates using a class that extends the iTextSharp.text.pdf.parser.LocationTextExtractionStrategy class of iTextSharp. The code is as follows:
This is the class that stores the chunks with their coordinates and text.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class RectAndText
{
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text)
{
this.Rect = rect;
this.Text = text;
}
}
}
And this is the class that extends the LocationTextExtractionStrategy class.
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
}
This class overrides the RenderText method of the LocationTextExtractionStrategy class, which is called for each chunk of text when you extract a PDF page using the PdfTextExtractor.GetTextFromPage() method.
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
var path = "F:\\sample-data.pdf";
//Parse every page of the document above
using (var r = new PdfReader(path))
{
for (var i = 1; i <= r.NumberOfPages; i++)
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, i, t);
}
}
//Here you can loop over the chunks of PDF
foreach(var i in t.myPoints){
Console.WriteLine("character {0} is at {1}*{2}",i.Text,i.Rect.Left,i.Rect.Top);
}
Now, to detect the start and end of the table, you can use the coordinates of the chunks extracted from the PDF.
If a line does not contain a table, there will be no jump between the right coordinate of the current chunk and the left coordinate of the next chunk. Lines that contain a table will have coordinate jumps of at least 3 points.
For example, a line containing a table will have chunk coordinates something like this:
right coord of current chunk -> 12.75pts
left coord of next chunk -> 20.30pts
You can use this logic to detect tables in the PDF.
The code is as follows:
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApp1
{
class LineUsingCoordinates
{
public static List<List<string>> getLineText(string path, int page, float[] coord)
{
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse the requested page of the document
using (var r = new PdfReader(path))
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, page, t);
}
// List of columns in one line
List<string> lineWord = new List<string>();
// temporary list for working around appending the <List<List<string>>
List<string> tempWord;
// List of rows. rows are list of string
List<List<string>> lineText = new List<List<string>>();
// List consisting list of chunks related to each line
List<List<RectAndText>> lineChunksList = new List<List<RectAndText>>();
//List consisting the chunks for whole page;
List<RectAndText> chunksList;
// List consisting the list of Bottom coord of the lines present in the page
List<float> bottomPointList = new List<float>();
//Getting List of Coordinates of Lines in the page no matter it's a table or not
foreach (var i in t.myPoints)
{
Console.WriteLine("character {0} is at {1}*{2}", i.Text, i.Rect.Left, i.Rect.Top);
// If the coords passed to the function is not null then process the part in the
// given coords of the page otherwise process the whole page
if (coord != null)
{
if (i.Rect.Left >= coord[0] &&
i.Rect.Bottom >= coord[1] &&
i.Rect.Right <= coord[2] &&
i.Rect.Top <= coord[3])
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// else process the whole page
else
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// Sometimes the above list will contain elements that come from the same line but
// have slightly different coordinates because of characters like " ", ".", etc.
// These coordinates differ by less than 4 points in their bottom values.
// To remove those elements, we build two helper lists of values to drop from the original list.
// This list holds the elements whose coordinates differ only slightly
List<float> removeList = new List<float>();
// This list is having the elements which are having the same coordinates
List<float> sameList = new List<float>();
// Here we are adding the elements in those two lists to remove the elements
// from the original list later
for (var i = 0; i < bottomPointList.Count; i++)
{
var basePoint = bottomPointList[i];
for (var j = i+1; j < bottomPointList.Count; j++)
{
var comparePoint = bottomPointList[j];
//here we are getting the elements with same coordinates
if (Math.Abs(comparePoint - basePoint) == 0)
{
sameList.Add(comparePoint);
}
// here we are getting the elements whose coordinates are different but differ
// by less than 4 points
else if (Math.Abs(comparePoint - basePoint) < 4)
{
removeList.Add(comparePoint);
}
}
}
// Here we are removing the matching elements of remove list from the original list
bottomPointList = bottomPointList.Where(item => !removeList.Contains(item)).ToList();
//Here we are removing the first matching element of same list from the original list
foreach (var r in sameList)
{
bottomPointList.Remove(r);
}
// Here we are getting the characters of the same line in a List 'chunkList'.
foreach (var bottomPoint in bottomPointList)
{
chunksList = new List<RectAndText>();
for (int i = 0; i < t.myPoints.Count; i++)
{
// If the character is having same bottom coord then add it to chunkList
if (bottomPoint == t.myPoints[i].Rect.Bottom)
{
chunksList.Add(t.myPoints[i]);
}
// If character is having a difference of less than 3 in the bottom coord then also
// add it to chunkList because the coord of the next line will differ at least 10 points
// from the coord of current line
else if (Math.Abs(t.myPoints[i].Rect.Bottom - bottomPoint) < 3)
{
chunksList.Add(t.myPoints[i]);
}
}
// Here we are adding the chunkList related to each line
lineChunksList.Add(chunksList);
}
bool sameLine = false;
//Here we are looping through the lines consisting the chunks related to each line
foreach(var linechunk in lineChunksList)
{
var text = "";
// Here we are looping through the chunks of the specific line to put the texts
// that are having a cord jump in their left coordinates.
// because only the line having table will be having the coord jumps in their
// left coord not the line having texts
for (var i = 0; i< linechunk.Count-1; i++)
{
// If the coord is having a jump of less than 3 points then it will be in the same
// column otherwise the next chunk belongs to different column
if (Math.Abs(linechunk[i].Rect.Right - linechunk[i + 1].Rect.Left) < 3)
{
if (i == linechunk.Count - 2)
{
text += linechunk[i].Text + linechunk[i+1].Text ;
}
else
{
text += linechunk[i].Text;
}
}
else
{
if (i == linechunk.Count - 2)
{
// add the text to the column and set the value of next column to ""
text += linechunk[i].Text;
// this is the list of columns in other word its the row
lineWord.Add(text);
text = "";
text += linechunk[i + 1].Text;
lineWord.Add(text);
text = "";
}
else
{
text += linechunk[i].Text;
lineWord.Add(text);
text = "";
}
}
}
if(text.Trim() != "")
{
lineWord.Add(text);
}
// creating a temporary list of strings for the List<List<string>> manipulation
tempWord = new List<string>();
tempWord.AddRange(lineWord);
// "lineText" is the type of List<List<string>>
// this is our list of rows. and rows are List of strings
// here we are adding the row to the list of rows
lineText.Add(tempWord);
lineWord.Clear();
}
return lineText;
}
}
}
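The line-grouping step in the class above (a chunk whose bottom coordinate is within 3 points of the current line's first bottom value belongs to that line) can be looked at in isolation. Here is a small standalone sketch in Java, not C# and not part of the application; LineGrouper and groupByBottom are hypothetical names, and it assumes the bottom coordinates arrive in reading order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineGrouper {
    // Groups bottom coordinates (in reading order) into lines: a value within
    // `tolerance` points of the current line's first value joins that line,
    // anything further away starts a new line.
    static List<List<Float>> groupByBottom(List<Float> bottoms, float tolerance) {
        List<List<Float>> lines = new ArrayList<>();
        for (float bottom : bottoms) {
            List<Float> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0) - bottom) < tolerance) {
                last.add(bottom);
            } else {
                List<Float> line = new ArrayList<>();
                line.add(bottom);
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Float> bottoms = Arrays.asList(700f, 700.5f, 699f, 685f, 684.2f, 670f);
        System.out.println(groupByBottom(bottoms, 3f).size()); // prints 3
    }
}
```

The tolerance only works if the line spacing in your PDF is clearly larger than it, which is the same assumption the class above makes.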
You can call getLineText() method of the above class and run the following loop to see the output in the table structure on the console.
var testFile = "F:\\sample-data.pdf";
float[] limitCoordinates = { 52, 671, 357, 728 };//{LowerLeftX,LowerLeftY,UpperRightX,UpperRightY}
// This line gives the list of rows, each consisting of one or more columns
// if you pass the third parameter as null then it returns the content for the whole page
// but if you pass the coordinates then it returns the content for those coords only
var lineText = LineUsingCoordinates.getLineText(testFile, 1, null);
//var lineText = LineUsingCoordinates.getLineText(testFile, 1, limitCoordinates);
// To detect the table we use the fact that a 'lineText' item with fewer than
// two elements is surely not part of the table, and an item with two or more
// elements is part of the table
foreach (var row in lineText)
{
if (row.Count > 1)
{
for (var col = 0; col < row.Count; col++)
{
string trimmedValue = row[col].Trim();
if (trimmedValue != "")
{
Console.Write("|" + trimmedValue + "|");
}
}
Console.WriteLine("");
}
}
Console.ReadLine();
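If what you need is the position where the table starts and ends, rather than just printing it, the same "more than one column" rule can be turned into row indices. A minimal sketch in Java (again, not C#); TableBounds and findFirstTable are hypothetical names, and it assumes the rows have the same shape as the list returned by getLineText(), i.e. one list of column strings per line:

```java
import java.util.Arrays;
import java.util.List;

class TableBounds {
    // Returns {firstRow, lastRow} (inclusive indices) of the first run of
    // consecutive rows with more than one column, or null if there is none.
    static int[] findFirstTable(List<List<String>> rows) {
        int start = -1;
        for (int i = 0; i < rows.size(); i++) {
            boolean tableRow = rows.get(i).size() > 1;
            if (tableRow && start < 0) {
                start = i; // first row that looks like a table row
            } else if (!tableRow && start >= 0) {
                return new int[] {start, i - 1}; // table ended on the previous row
            }
        }
        return start >= 0 ? new int[] {start, rows.size() - 1} : null;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Report title"),
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Apple", "3"),
                Arrays.asList("Total items: 3"));
        int[] bounds = findFirstTable(rows);
        System.out.println(bounds[0] + ".." + bounds[1]); // prints 1..2
    }
}
```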