PhantomJS - Finding words at a given coordinate on rendered pages

Using PhantomJS, I am trying to find out which single word is rendered at a given x/y coordinate on a web page. I have seen several examples that achieve this kind of localization at the element level, but I am looking for a way to break it down to word-level precision.
Any suggestion would be greatly appreciated.

This basically involves finding the bounds of all the words on the page, and then figuring out which one contains said point. This is nearly identical to finding the word under the cursor, which has been thoroughly answered in this question: How to get a word under cursor using JavaScript?
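Here is a minimal sketch of that approach as a PhantomJS script. PhantomJS is WebKit-based, so document.caretRangeFromPoint is available inside page.evaluate(); the URL and the coordinates below are placeholders:

    // wordAt.js - run with: phantomjs wordAt.js
    // Loads a page and reports the word rendered at the given (x, y).
    var page = require('webpage').create();
    page.viewportSize = { width: 1024, height: 768 };

    page.open('http://example.com/', function (status) { // placeholder URL
        if (status !== 'success') {
            console.log('Failed to load page');
            phantom.exit(1);
        }

        var word = page.evaluate(function (x, y) {
            // Find the text position under the point, then expand outwards
            // to the nearest whitespace on either side to get the whole word.
            var range = document.caretRangeFromPoint(x, y);
            if (!range || range.startContainer.nodeType !== Node.TEXT_NODE) {
                return null; // the point is not over rendered text
            }
            var text = range.startContainer.textContent;
            var start = range.startOffset;
            var end = range.startOffset;
            while (start > 0 && /\S/.test(text.charAt(start - 1))) { start--; }
            while (end < text.length && /\S/.test(text.charAt(end))) { end++; }
            return text.substring(start, end) || null;
        }, 100, 200); // example coordinates: x = 100, y = 200

        console.log('Word at point: ' + word);
        phantom.exit();
    });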

Related

Power Automate: Is there an operation that can split PDFs based on shared text across pages?

Any advice on this would be appreciated! I'm a newbie to Power Automate and Flows, though I have watched a lot of tutorial content. I haven't seen a guide for exactly what I'm looking to do, so I was hoping an experienced user could provide some advice.
What I need to do is split a PDF into smaller PDFs grouped by the entity ID numbers that are on each page. I can't just split on a fixed increment because some entities have more pages of data than others. Generally the PDF will be about 700 pages and will be split into about 300 PDFs grouped by entity. Currently this is a labor-intensive process, and automating it would be incredible.
I'm looking into doing it with an Encodian split-PDF-by-text action, but that requires the text to be provided up front. What I need is a way to identify which pages share the same ID and group those into PDFs.
Does anyone have any experience doing something similar?
I have tried putting this together, but so far I have only found operations that split when a specific text string is provided during the operation. What I need is a way to find the entity ID on each page, then group the pages for each entity together and split each group into its own smaller PDF file.
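Outside of Power Automate, the grouping step itself is straightforward to express in code. Below is a hedged Node.js sketch (not a Power Automate flow) using the pdfjs-dist and pdf-lib packages: it reads each page's text, extracts an entity ID with a regex, and writes one PDF per entity. The ID pattern, the output file names, and the pdfjs-dist import path are assumptions; adjust them to your documents and package version.

    // splitByEntity.js - a rough sketch, not production code.
    // Assumes Node.js with `pdfjs-dist` and `pdf-lib` installed, and that each
    // page carries an entity ID matching the (hypothetical) pattern below.
    const fs = require('fs');
    const pdfjsLib = require('pdfjs-dist/legacy/build/pdf.js'); // path varies by version
    const { PDFDocument } = require('pdf-lib');

    const ID_PATTERN = /Entity ID:\s*(\d+)/; // hypothetical - match your documents

    async function splitByEntity(inputPath) {
        const bytes = fs.readFileSync(inputPath);

        // Pass 1: read each page's text and record which entity ID it carries.
        const doc = await pdfjsLib.getDocument({ data: new Uint8Array(bytes) }).promise;
        const groups = {}; // entity ID -> array of zero-based page indexes
        for (let i = 1; i <= doc.numPages; i++) {
            const page = await doc.getPage(i);
            const content = await page.getTextContent();
            const text = content.items.map(item => item.str).join(' ');
            const match = text.match(ID_PATTERN);
            if (match) {
                (groups[match[1]] = groups[match[1]] || []).push(i - 1);
            }
        }

        // Pass 2: copy each group's pages into its own output PDF.
        const source = await PDFDocument.load(bytes);
        for (const id of Object.keys(groups)) {
            const out = await PDFDocument.create();
            const pages = await out.copyPages(source, groups[id]);
            pages.forEach(p => out.addPage(p));
            fs.writeFileSync('entity-' + id + '.pdf', await out.save());
        }
    }

    splitByEntity('input.pdf').catch(console.error);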

Web-scraping dynamic pages in Java

I know this question was asked before, but none of the proposed solutions work in my case. I am trying to web-scrape a page of results, but the problem is that 95% of the div tags contain only class names that change dynamically. My code using XPath works for several hours, or a whole day, but at some point the class names change slightly and my program breaks. XPath based on those classes obviously does not work, since it changes with the content. I wanted to make sure I fully understand the limitations of web-scraping with Selenium when I have limited options for selecting tags. Is there any other solution that could work in a situation like mine, where I only have a page of results full of div tags with dynamically named classes?
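One common workaround is to anchor locators to things that do not change: visible text, stable attributes such as href or data-* values, or document structure, rather than generated class names. A minimal sketch using Selenium's JavaScript bindings follows (the same locators carry over directly to the Java API); the URL and the XPath expression are placeholders:

    // A sketch of class-name-independent locators (selenium-webdriver for Node).
    // The URL and the XPath below are placeholders - point them at whatever
    // stable text, attributes, or structure the target page actually has.
    const { Builder, By, until } = require('selenium-webdriver');

    (async function scrape() {
        const driver = await new Builder().forBrowser('chrome').build();
        try {
            await driver.get('https://example.com/results'); // placeholder URL

            // Select result blocks by what they contain (a link whose href has
            // a stable fragment), not by what their generated classes are called.
            const locator = By.xpath("//div[.//a[contains(@href, '/item/')]]");
            await driver.wait(until.elementLocated(locator), 10000);

            const rows = await driver.findElements(locator);
            for (const row of rows) {
                console.log(await row.getText());
            }
        } finally {
            await driver.quit();
        }
    })();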

How to recombine split sentences?

I am processing PDFs that have been converted to text. The problem? Sometimes a sentence gets split due to wonky PDF formatting and/or PDF-to-text conversion.
So I'm looking for tools that help "reassemble" sentences that got split apart. Page headers and footers are often the culprits. Other elements, such as figures and charts, can come into play as well, but they are not my primary concern right now.
This problem can be tackled in a few ways:
Removing headers and footers before doing NLP sentence detection would certainly help, but I don't know of tools that do this. Do you know of tools or methods? (The general idea for removing page numbers is "easy" in theory: find consecutive increasing numbers that occur about once per page; see the sketch after the next point.)
Using NLP parsers that can judge the likelihood that a sentence is grammatically correct would also help. That way I could compare the grammatical correctness of two sentences taken separately with the correctness of their amalgamation. (The Stanford Parser, as I understand it, does not evaluate grammatical correctness.) Do you know of tools that can help?
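For the first approach, here is a hedged sketch of the heuristic in plain JavaScript. It assumes the extracted text has already been split into pages (for example, on form-feed characters), and it simplifies the "consecutive increasing numbers" idea down to bare-number lines near the page's own position:

    // stripRepeatedLines - a rough sketch of the header/footer heuristic.
    // `pages` is an array of pages, each an array of text lines.
    function stripRepeatedLines(pages) {
        // Count how many pages each distinct (trimmed) line appears on.
        var counts = {};
        pages.forEach(function (lines) {
            var seen = {};
            lines.forEach(function (line) {
                var t = line.trim();
                if (!seen[t]) {
                    seen[t] = true;
                    counts[t] = (counts[t] || 0) + 1;
                }
            });
        });

        return pages.map(function (lines, pageIndex) {
            return lines.filter(function (line) {
                var t = line.trim();
                // Lines repeated verbatim on most pages: likely headers/footers.
                if (counts[t] > pages.length * 0.8) { return false; }
                // Bare numbers close to the page's position: likely page numbers.
                // (A simplification; real numbering may start at an offset.)
                if (/^\d+$/.test(t) && Math.abs(parseInt(t, 10) - (pageIndex + 1)) <= 1) {
                    return false;
                }
                return true;
            });
        });
    }

    // Example: two tiny "pages" sharing a header and carrying page numbers.
    console.log(stripRepeatedLines([
        ['ACME Quarterly Report', 'Sales rose sharply in', '1'],
        ['ACME Quarterly Report', 'the second quarter.', '2']
    ]));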
Please let me know if you have suggestions, answers, or other ways to approach the problem.
Use Apache Tika to extract the data from the PDF.

How can I present multiple pages with similar content (mostly images) without Google penalizing me?

I have a website that presents Q&As to mathematical problems, mostly for pupils aged approximately 16-18. Due to the difficulty of presenting formulas on webpages, the Q&As (formulas) are presented as images. At the moment each webpage contains one Q&A, and there are many questions and answers, so with little in the way of text, every page looks almost identical. Google might therefore very easily see this as duplicate content. What is my best solution to this problem? Should I try to put the Q&As in a database and present each one dynamically on the same page? Or should I keep things the way they are and prevent Google from seeing most of the Q&As? It is also difficult to write different titles, descriptions, etc., as for each topic only the question number changes.
Many thanks for your time.
You're basically a ghost to Google anyway if there is no text on each page. If you are worried about SEO, you need to worry about text.
You should at the very least look into tagging the formulas, or creating a relevant title for each question and putting it in a header tag above the question image.
Otherwise no one will find you by that content, and that's what it's all about.
You said it: you can hide the Q&A files/directory via the robots.txt file of your web server.
Disallow: /QAfolder
or
Disallow: /Q1.htm
Disallow: /Q2.htm
Disallow: /Q3.htm
or whatnot.
Normally, this would be a bad thing (preventing users from searching for question content) but as you said, they're images anyway.
Create descriptive, useful page titles and meta descriptions.
Create textual representations of what is in each image using alt attributes (see the markup sketch after this list).
Use different headers.
This could be a little hard in your context, but you could probably use the question-type description or the name of the chapter it's taken from: basically, a text description relevant to the question.
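A sketch of what those suggestions might look like on one of the question pages; every title, path, and description here is made up:

    <head>
      <title>Quadratic equations - Question 3: solving by factorising</title>
      <meta name="description"
            content="Worked answer to a quadratic equation solved by factorising, for pupils aged 16-18.">
    </head>
    <body>
      <h1>Solving x&#178; - 5x + 6 = 0 by factorising</h1>
      <!-- The alt text carries the content Google cannot read from the image. -->
      <img src="/images/q3-answer.png"
           alt="Worked solution: x squared minus 5x plus 6 factorises to (x - 2)(x - 3), so x = 2 or x = 3">
    </body>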
One more thing you can do: if you have empty space on your page, you can add some text that describes your website and at the same time uses the right keywords (the ones you are targeting) in the right proportions, higher up in the page. You might write up two or three different descriptions and alternate them between pages, if your design permits.

iTextSharp: solution on how to display a report

I have a report which looks like this (it will be in PDF format):
[Screenshot: http://img52.imageshack.us/img52/3324/fullscreencapture121420.png]
The user will input all the different foods, so every section (NONE, MODERATE, SEVERE) will be a different size, and I need to be able to expand the sections at run time. To do that I should probably slice up the image and add the different sections at run time, but I don't know the proper way to do it.
Please suggest how to go about fitting the text into the appropriate sections (keeping in mind that I have no control over how many foods are in each section; the user decides this at run time).
I would create an iTextSharp table for each of your result sections (None, Moderate, Severe) and write the tables out sequentially, in the order you want them to appear in your PDF. Each row in your tables would have four columns.
I found these articles useful for creating tables in iTextSharp:
iTextSharp - Introducing Tables
SourceForge Table Tutorial
Edit
Sorry, I didn't see the vb.net tag on your question. The pages I linked are in C# - I hope you can translate. I found that most of the iTextSharp samples you'll find are in C#.
It might be worth using a reporting tool rather than iTextSharp for formatted/tabular data?
We use Active Reports from http://www.datadynamics.com/ but I am sure there are others.
EDIT:
It looks like iTextSharp supports HTML-to-PDF conversion. Maybe that's easier to render?
Just did a search and found this: http://somewebguy.wordpress.com/2009/05/08/itextsharp-simplify-your-html-to-pdf-creation/