Getting the RIGHT word count of a PDF file

Getting the RIGHT word count of a PDF file - pdf

The response in this topic helped me understand why sometimes my
PDF fails to find a word and why I keep getting different word counts when using
different PDF word count programs. I decided to use xpdf. I converted it to text
and added the -layout tag and then opened the resulting text file with Word 2003.
I noted the word count. Then I decided, unfortunately, to remove the -layout tag.
This time, though, the word count is different.
Why did that tag affect the word count? Is there an accurate way to find the word count
of a PDF file? I would even pay for such software if I have to so long as it gives me
the right number of words.
(I checked another topic but thought I'd find out if the solution I just offered would solve everything. There was another topic where advancedpdf was recommended.)

I'd like to argue that there is no reliable word counting. One could, for example, just to make your life harder, put each character of this lovely Stackoverflow answer into a single text object and position such objects such that, only when rendered, gives a meaningful paragraph to humans. Like this:
<html><body><style>
div {float: left;}
</style><div><p>S</p></div><div><p>t</p></div><div><p>a</p></div>
<div><p>c</p></div><div><p>k</p></div>

I would suggest an open source solution using Java. First you would have to parse the pdf file and extract all the text using Tika.
Then i believe you can achieve this simply by scanning the extracted text and counting the words.
Sample code would look like this:
if (f.getName().endsWith(".txt"))
{
in = new BufferedReader(new FileReader(f));
StringBuilder sb = new StringBuilder();
String s = null;
while ((s = in.readLine()) != null)
sb.append(s);
String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
}
In tokenizedTerms array , you wil have all the terms(words) of the document and you can count them by calling tokenizedTerms.length(). Hope this was useful. :-)

Related

Split and find specific text?

ok so i've made a HTTPWEBREQUEST and i've made the source of the result show in a richtextbox, Now say i have this in the richtextbox
<p>Short URL: <code>http://URL.me/u/eywnp</code></p>
How would i go about just getting the "http://URL.me/u/eywnp" ive tried split but didnt work, guess i'm doing it wrong?
NOTE the URL will be different everytime

Split isn’t the right tool for the job. It will result in a rather complex piece of code that’s quite brittle (meaning it will break as soon as there’s the slightest change in the input).
For a robust, well-written solution you need to parse the HTML properly. Luckily there exist canned solutions for that: The HtmlAgilityPack library.
Dim doc As New HtmlDocument()
doc.LoadHtml(yourCode)
Dim result = doc.DocumentElement.SelectNodes("//a[#href]")(0)("href")
The only complicated part here is the string "//a[#href]". This is an XPath string. XPath strings are a mini-language that is used to address elements in an HTML or XML document. They are conceptually similar to file paths (like C:\Users\foo\Documents\file.txt) but with a slightly different syntax.
The XPath simply selects all the <a> elements having a href attribute from your document. Then you can grab the first of that collection and retrieve the href attribute’s value.

Thanks for all your help, i did find a solution and i used
Dim iStartIndex, iEndIndex As Integer
With RichTextBox1.Text
iStartIndex = .IndexOf("<p>Short URL: <code><a href=") + 29
iEndIndex = .IndexOf(""">", iStartIndex)
Clipboard.SetText(.Substring(iStartIndex, iEndIndex - iStartIndex))
End With
works perfect so far

Searching Multiple Results using Streamreader

Does anyone know of a method that allows you to search a string through a text file using StreamReader that allows you to account for multiple instances of finding the results. Basically I am creating a booking application and each time a customer books a seat, their PrimaryKey, FirstName, LastName and the co-ordinates of the seat on a data grid (which I have used as a method to book seats) are generated then saved to a text file.
I want the ability to be able to read multiple instances of a PrimaryKey then find the seat co-ordinates of each line that this PrimaryKey is listed on and repopulate another similar datagridview with these co-ordinates which is all going to be driven by a combobox index change.
It seems a bit complicated to understand but if anyone can help then please let me know.
I just need the knowhow of how to search multiple instances, so after its found the string once then look through the rest of the file to find another instance, I can do the rest by myself.
I'm coding using Visual Basic.Net

Yes it's possible to search multiple times through a file, but you'd either have to reopen the file, or rewind the stream (FileStream.Seek).
Wildly inefficient though.
If it has to remain an unsorted and unstructured file, build an in memory index to it.
If your key is an integer, create a Dictionary<int,int> of Key and Position in the stream.
Then when you want find key X you use FileStream.Seek to move to it, and read a line to get the data. If you find yourself grouping by say aeroplaneID, build a Dictionary<Int, List<Int,Int>>
where the key is the aeroplane id and the list is a list of primary keys and positions in the file.
You could push all that off to a background thread. You could try and get really clever and build them up as you need them. Personally though I'd be trying to move my storage to a more suitable format. You aren't struggling to do this because you've missed a class, you are struggling because you shouldn't.
Something like
Dictionary<int, int> _fileIndex = new Dictionary<int,int>();
using(FileStream fs = new FileStream(DataFileName,FileMode.Open,FileAccess.Read))
{
StreamReader reader = new StreamReader(fs);
int lastPosition = 0;
string currentLine = null;
while(currentLine = reader.ReadLine() != null)
{
String[] data = currentLine.Split(new char[] {','});
int key = int.Parse(data[0]);
fileIndex.Add(key,lastPosition);
lastPosition = fs.Position;
}
}
NB didn't test the above and there should be a bit more error checking in it. If there's alot of data in the line, then might be better off not suing split and just pulling out everything up to the correct delimiter. Also be careful how many indexes you keep live, wouldn't be long before they used up more space than just reading the entire thing in to memory.
Then you could create a class or structure to implement a line in the file, and write a bit of code
to use FileStream.Seek) to get there. If you wanted to load up a bunch of 'em it would make sense to get your list of positions of each one in the file and then sort them in order, then you could rip through the file in it's 'order' picking them out.

How would you get count of a given word in a given PDF?

Interview Question
I have been asked this question in an interview, and the answer doesn't have to be specific programming language, platform- or tool- specific.
The question was phrased as following:
How would you get the instance count of a given word in a PDF. The answer doesn't have to be programming, platform, or tool specific. Just let me know how would you do it in a memory and speed efficient way
I am posting this question for following reasons:
To better understand the context - I still fail to understand the context of this question, what might the interviewer be looking for by asking this question?
To get diverse opinions - I tend to answer such questions based on my skills on a programming language (C#), but there might be other valid options to get this done.
Thanks for your interest.

If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf and then count the words.
If this was a one-of task or something that needed to be automated for a non-production quality task, I'd just feed the file into pdftotext program and then parsed the output file with python, splitting into words, putting them in a dictionary and counting number of occurances.
If I was asking this interviewing question, I'd be looking for a couple of things:
understanding the difference between the setting for this task:
one-off script thingy vs production code
not attempting to
implement PDF rendered yourself and trying to find a library
instead.
Now I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what PDF is and what a "word" is. You see, PDF stored text as a bunch of string with coordinates. Each string is not necessarily a word. Often times, the words will be split into a couple of completely separate strings which are absolutely positioned in the document to make a single word. This is why sometimes when searching for words in a PDF document you get strange looking results. So to implement word searching in a document you'd have to glue these strings back together (pdftotext takes care of that for you).
It's not a bad question at all.

You can use Trie It is very easy to get the count of given word.

I would suggest an open source solution using Java. First you would have to parse the pdf file and extract all the text using Tika.
Then I believe the correct question is how to to find the TF(term frequency) of a word in a text. I will not trouble you with definitions because you can achieve this simply by scanning the extracted text and counting the frequency of word.
Sample code would look like this:
while(scan.hasNext())
{
word = scan.next();
ha += (" " + word + " ");
int countWord = 0;
if(!listOfWords.containsKey(word))
{
listOfWords.put(word, 1); //first occurance of this word
}
else
{
countWord = listOfWords.get(word) + 1; //get current count and increment
//now put the new value back in the HashMap
listOfWords.remove(word); //first remove it (can't have duplicate keys)
listOfWords.put(word, countWord); //now put it back with new value
}
}

Converting a PDF file to a nice table

I have this PDF file which is arranged in 5 columns.
I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself).
However, for some reason I cannot get those 5 columns in csv/xls format - as I need them arranged. Usually when I export them, the format is horrible and all the entries are arranged line by line with some data loss.
http://www.2shared.com/document/PagE4A1T/ex1.html
Here is a link to an excerpt of the file above, but I am really getting frustrated and am running out of options.

iText (or iTextSharp) could do this, if you can give it the boundaries of those 5 columns, and are willing to deal with some overhead (namely reparsing the page's text for each column)
Rectangle2D columnBoxArray[] = buildColumnBoxes();
ArrayList<String> columnTexts = new ArrayList<String>(columnBoxArray.length);
For (Rectangle2D columnBBox : columnBoxArray) {
FilteredTextRenderListener textInRectStrategy =
new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
new RegionTextRenderFilter( columnBBox ) );
columnTexts.add(PdfTextExtractor.extractText( reader, pageNum, textInRectStrategy));
}
Each line of text should be separated by \n, so it becomes a simple matter of string parsing.
If you wanted to not reparse the whole page for each column, you could probably come up with a custom implementation of FilteredTextRenderListener that would take multiple listener/filter pairs. You could then parse the whole thing once rather than once for each column.

Proportional font IDE

I would really like to see a proportional font IDE, even if I have to build it myself (perhaps as an extension to Visual Studio). What I basically mean is MS Word style editing of code that sort of looks like the typographical style in The C++ Programming Language book.
I want to set tab stops for my indents and lining up function signatures and rows of assignment statements, which could be specified in points instead of fixed character positions. I would also like bold and italics. Various font sizes and even style sheets would be cool.
Has anyone seen anything like this out there or know the best way to start building one?

I'd still like to see a popular editor or IDE implement elastic tabstops.

Thinking with Style suggests to use your favorite text-manipulation software like Word or Writer. Create your programme code in rich XML and extract the compiler-relevant sections with XSLT. The "Office" software will provide all advanced text-manipulation and formatting features.

i expected you'll get down-modded and picked on for that suggestion, but there's some real sense to the idea.
The main advantage of the traditional 'non-proportional' font requirement in code editors is to ease the burden of performing code formatting.
But with all of the interactive automatic formatting that occurs in modern IDE's, it's really possible that a proportional font could improve the readability of the code (rather than hampering it, as i'm sure many purists would expect).
A character called Roedy Green (famous for his 'how to write unmaintainable code' articles) wrote about a theoretical editor/language, based on Java and called Bali. It didn't include non-proportional fonts exactly, but it did include the idea of having non-uniform font-sizes.
Also, this short Joel Spolsky post posts to a solution, elastic tab stops (as mentioned by another commentor) that would help with the support of non-proportional (and variable sized) fonts.

#Thomas Owens
I don't find code formatted like that easier to read.
That's fine, it is just a personal preference and we can disagree. Format it the way you think is best and I'll respect it. I frequently ask myself 'how should I format this or that thing?' My answer is always to format it to improve readability, which I admit can be subjective.
Regarding your sample, I just like having that nicely aligned column on the right hand side, its sort of a quick "index" into the code on the left. Having said that, I would probably avoid commenting every line like that anyway because the code itself shouldn't need that much explanation. And if it does I tend to write a paragraph above the code.
But consider this example from the original poster. Its easier to spot the comments in the second one in my opinion.
for (size-type i = 0; i<v.size(); i++) { // rehash:
size-type ii = has(v[i].key)%b.size9); // hash
v[i].next = b[ii]; // link
b[ii] = &v[i];
}
for (size-type i = 0; i<v.size(); i++) { // rehash:
size-type ii = has(v[i].key)%b.size9); // hash
v[i].next = b[ii]; // link
b[ii] = &v[i];
}

#Thomas Owens
But do people really line comments up
like that? ... I never try to
line up declarations or comments or
anything, and the only place I've ever
seen that is in textbooks.
Yes people do line up comments and declarations and all sorts of things. Consistently well formatted code is easier to read and code that is easier to read is easier to maintain.

I wonder why nobody actually answers your question, and why the accepted answer doesn't really have anything to do with your question. But anyway...
a proportional font IDE
In Eclipse you can cchoose any font on your system.
set tab stops for my indents
In Eclipse you can configure the automatic indentation, including setting it to "tabs only".
lining up function signatures and rows of assignment statements
In Eclipse, automatic indentation does that.
which could be specified in points instead of fixed character positions.
Sorry, I don't think Eclipse can help you there. But it is open source. ;-)
bold and italics
Eclipse has that.
Various font sizes and even style sheets would be cool
I think Eclipse only uses one font and font-size for each file type (for example Java source file), but you can have different "style sheets" for different file types.

When I last looked at Eclipse (some time ago now!) it allowed you to choose any installed font to work in. Not so sure whether it supported the notion of indenting using tab stops.
It looked cool, but the code was definitely harder to read...

Soeren: That's kind of neat, IMO. But do people really line comments up like that? For my end of line comments, I always use a single space then // or /* or equivalent, depending on language I'm using. I never try to line up declarations or comments or anything, and the only place I've ever seen that is in textbooks.

#Brian Ensink: I don't find code formatted like that easier to read.
int var1 = 1 //Comment
int longerVar = 2 //Comment
int anotherVar = 4 //Command
versus
int var2 = 1 //Comment
int longerVar = 2 //Comment
int anotherVar = 4 //Comment
I find the first lines easier to read than the second lines, personally.

The indentation part of your question is being done today in a real product, though possibly to even a greater level of automation than you imagined, the product I mention is an XSLT IDE, but the same formatting principles would work with most (but not all) conventional code syntaxes.
This really has to be seen in video to get the sense of it all (sorry about the music back-track). There's also a light XML editor spin-off product, XMLQuire, that serves as a technology demonstrator.
The screenshot below shows XML formatted with quite complex formatting rules in this XSLT IDE, where all indentation is performed word-processor style, using the left margin - not space or tab characters.
To emphasise this formatting concept, all characters have been highlighted to show where the left-margin extends to keep indentation. I use the term Virtual Formatting to describe this - it's not like elastic tab stops, because there simply are no tabs, just margin information which is part of the 'paragraph' formatting (RTF codes are used here). The parser reformats continuously, in the same pass as syntax coloring.
A proportional font hasn't been used here, but it could have been quite easily - because the indentation is set in TWIPS. The editing experience is quite compelling because, as you refactor the code (XML in this case), perhaps through drag and drop, or by extending the length of an attribute value, the indentation just re-flows itself to fit - there's no tab-key or 'reformat' button to press.
So, the indentation is there, but the font work is a more complex problem. I've experimented with this, but found that if fonts are re-selected as you type, the horizontal shifting of the code is too distracting - there would need to be a user-initiated 'format fonts' command probably. The product also has Ink/Handwriting technology built-in for annotating code, but I've yet to exploit this in the live release.

Folks are all complaining about comments not lining up.
Seems to me that there's a very simple solution: Define the unit space as the widest character in the font. Now, proportionally space all characters except the space. the space takes up as much room so as to line up the next character where it would be if all preceeding characters on the line were the widest in the font.
ie:
iiii_space_Foo
xxxx_space_Foo
would line up the "Foo", with the space after the "i" being much wider than after the "x".
So call it elastic spaces. rather than tab-stops.
If you're a smart editor, treat comments specially, but that's just gravy

Let me recall arguments about using the 'var' keyword in C#. People hated it, and thought it would make code less clear. For example, you couldn't know the type in something like:
var x = GetResults("Main");
foreach(var y in x)
{
WriteResult(x);
}
Their argument was, that you couln't see if x was an array, an List or any other IEnumerable. Or what the type of y was. In my opinion the unclearity did not arise from using var, but from picking unclear variable names. Why not just type:
var electionResults = GetRegionalElactionResults("Main");
foreach(var result in electionResults)
{
Write(result); // you can see what you're writing!!
}
"But you still cannot see the type of electionResults!" - does it really matter? If you want to change the return type of GetRegionalElectionResults, you can do so. Any IEnumerable will do.
Fast forward to now. People want to align comments en similar code:
int var2 = 1; //The number of days since startup, including the first
int longerVar = 2; //The number of free days per week
int anotherVar = 38; //The number of working hours per week
So without the comment everything is unclear. And if you don't align the values, you cannot seperate them from the variales. But do you? What about this (ignore the bullets please)
int daysSinceStartup = 1; // including first
int freeDaysPerWeek = 2;
int workingHoursPerWeek = 38;
If you need a comment on EVERY LINE, you're doing something wrong. "But you still need to align the VALUES" - do you? what does 38 have to do with 2?
In C# Most code blocks can easily be aligned using only tabs (or acually, multiples of four spaces):
var regionsWithIncrease =
from result in GetRegionalElectionResults()
where result.TotalCount > result > PreviousTotalCount &&
result.PreviousTotalCount > 0 // just new regions
select result.Region;
foreach (var region in regionsWithIncrease)
{
Write(region);
}
You should never use line-to-line comments and you should rarely need to vertically align things. Rarely, not never. So I understand if some of you guys prefer a monospaced font. I prefer the readibility of font Noto Sans or Source Sans Pro. These fonts are available freely from Google, and resemble Calibri, but are designed for programming and thus have all the neccesary characteristics:
Big : ; . , so you can clearly see the difference
Clearly distinct 0Oo and distinct Il|

The major problem with proportional fonts is they destroy the vertical alignment of the code and this is a fairly major loss when it comes to writing code.
The vertical alignment makes it possible to manipulate rectangular blocks of code that span multiple lines by allowing block operations like cut, copy, paste, delete and indent, unindent etc to be easily performed.
As an example consider this snippet of code:
a1 = a111;
B2 = aaaa;
c3 = AAAA;
w4 = wwWW;
W4 = WWWW;
In a mono-spaced font the = and the ; all line up.
Now if this text is loded into Word and display using a proportional font the text effectively turns into this:
NOTE: Extra white space added to show how the = and ; no longer line up:
a1 = a1 1 1;
B2 = aaaa;
c3 = A A A A;
w4 = w w W W;
W4 = W W W W;
With the vertical alignment gone those nice blocks of code effectively disappear.
Also because the cursor is no longer guaranteed to move vertically (i.e. the column number is not always constant from one line to the next) it makes it more difficult to write throw away macro scripts designed to manipulated similar looking lines.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas