I'm trying to extract text from a PDF in Python, but I get the following warning, which limits the amount of text extracted from each page. Can anyone think of a way to resolve this? The warning and my code are below:
WARNING:pdfminer.layout:Too many boxes (106) to group, skipping.
import slate3k as slate
with open("mypdf.pdf", 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
I am trying to add new fonts to my PyQt5 app with the code below.
I am sure the file paths are correct and point to the right files.
The fonts load and are added to font_db, since the font family is printed successfully.
import os
from PyQt5 import QtGui
# package_dir is defined elsewhere in the app
font_db = QtGui.QFontDatabase()
for f in ['resources/fonts/GothamNarrowBold.otf', 'resources/fonts/GothamNarrowBook.otf']:
    font_files = os.path.join(package_dir, f)
    print(font_files)
    font_id = font_db.addApplicationFont(font_files)
    font_family = font_db.applicationFontFamilies(font_id)
    print(font_family)
But I still get this message:
qt.qpa.fonts: Populating font family aliases took 103 ms. Replace uses of missing font family "Gotham Narrow Bold" with one that exists to avoid this cost.
What have I missed? Can somebody help me solve this problem?
Thanks in advance.
Note: I'm experienced in Python but just starting out with Selenium and web scraping. Please excuse me if this is a bad question or if my Selenium fundamentals seem amiss. I could not find an answer in hours of searching, hence I ask here.
Goal: To extract the "About the Business" information found in Yelp pages of businesses
Some pages show their "About the Business" information in a pop-up opened by a Read More button (e.g. https://www.yelp.com/biz/and-pizza-bethesda-bethesda)
Some pages show their business information without a Read More pop-up (e.g. https://www.yelp.com/biz/pneuma-fashions-upper-marlboro-3 )
Problem: I am unable to navigate to the "About the Business" pop-up that appears after clicking the Read More button and extract the text in it.
Attempts so far: From googling I found explanations of how to handle alert pop-ups and window pop-ups. However, that code doesn't work here: the pop-up that appears after clicking the Read More button does not cause any change in window_handles.
import re
# getting all sections of the page
result = driver.find_elements_by_tag_name("section")
About = None
for sec in result:
    if sec.text.startswith("About the Business"):
        # this pertains only to the About the Business section
        main_page = driver.current_window_handle
        print(main_page)  # Returns the current handle
        sec.find_element_by_tag_name("button").click()
        popup = None
        for handle in driver.window_handles:  # is an iterable with only one handle
            # The only handle present is the main_page handle
            print(handle)
            if handle != main_page:
                popup = handle
        print(popup)  # returns None
        driver.switch_to.window(popup)  # Throws error because popup=None
        # THE FOLLOWING SECTION IS NOT EXECUTED BECAUSE OF THE ERROR ABOVE
        #////////////////////////////////////////////////////
        button_contents = driver.find_elements_by_tag_name("p")
        for b in button_contents:
            print(b.text)  # intended to print text contents
        close = driver.find_element_by_tag_name("button")
        close.click()
        driver.switch_to.window(main_page)
Please help
Thank you to everyone who reads this question and provides advice and answers
That is a custom pop-up, so you don't need to switch to it. I suggest reading up on relative XPath. Loop over your URLs and use the code below for each one:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver.get(your_URL)
readMoreBtnXpath = "//h4[text()='About the Business']/ancestor::section//button"
aboutTheBusinessSec = "//h4[text()='About the Business']/ancestor::section"
fromTheBusinessSec = "((//h2[text()='From the business']/parent::div/following-sibling::div//div)[5]/div)[last()]/preceding-sibling::div"
try:
    # pages with a Read More button: open the pop-up and read its contents
    driver.find_element(By.XPATH, readMoreBtnXpath).click()
    button_contents = driver.find_elements(By.XPATH, fromTheBusinessSec)
    for b in button_contents:
        print(b.text)
except NoSuchElementException:
    # no Read More button: the text is shown inline in the section
    print(driver.find_element(By.XPATH, aboutTheBusinessSec).text)
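To illustrate what a relative locator buys you, here is a minimal sketch that runs without a browser. It is not Selenium: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() or ancestor:: axis), so the child-text predicate [h4='...'] stands in for the expressions above. The PAGE markup is a hypothetical, heavily simplified stand-in for a real business page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: two sections, each with its own button.
PAGE = """
<html><body>
  <section><h4>Location</h4><p>Bethesda</p><button>Get directions</button></section>
  <section><h4>About the Business</h4><p>Craft pizza since 2012.</p><button>Read more</button></section>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Relative locator: pick the section by its heading text, not by its position,
# then search only inside that section.
section = root.find(".//section[h4='About the Business']")
button = section.find(".//button")
paragraphs = [p.text for p in section.findall(".//p")]

print(button.text)
print(paragraphs)
```

Because the section is found by its heading rather than by an index such as section[5], the lookup keeps working if sections are added or reordered around it.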
One thing you should know is that the pop-up is not displayed in a new window; it is displayed within the same page. Here is the complete code to extract the text from the pop-up:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.yelp.com/biz/and-pizza-bethesda-bethesda')
try:
    # click the Read More button, then read the two paragraphs inside the modal
    driver.find_element_by_xpath('//*[@id="wrap"]/div[3]/div/div[4]/div/div/div[2]/div/div/div[1]/div/div[1]/section[5]/div[2]/button').click()
    p1 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[1]/p').text
    p2 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[2]/p[2]').text
    print("Specialties --", p1)
    print("History --", p2)
except Exception:
    print('Read more button not found')
Output:
Specialties -- Award-winning pizza: Named one of Fast Company's "World's Most Innovative Companies" in 2018, third-place in the Washington Post Express's of "Best Fast Casual" in 2018, third place in the Washington City Paper's "Best Gluten-Free Menu" in 2018 and won its "Best Pizza in D.C." in 2017, 11th on TripAdvisor's "Best Fast Casual Restaurants -- United States" in 2018.
History -- Since 2012, we've built pizza shops with an edge to their craft pies, beverages and shop design, created an environment where ALL of our Tribe can thrive, supported our local communities and now we'll text you back, if you want. Started with a pizza shop. Became a culture. That's &pizza.
Edit:
Since this doesn't work with this website, replace the first find_element_by_xpath with:
driver.find_element_by_xpath("//div[@class='lemon--div__373c0__1mboc border-color--default__373c0__3-ifU']/button[.='Read more']").click()
This works for both websites.
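Because the pop-up is rendered inside the same page, its contents may only show up a moment after the click, so a lookup made immediately can come back empty. The sketch below shows one way to poll for them. It is an illustration under assumptions, not part of the code above: wait_for, FakeDriver and find_modal_paragraphs are hypothetical names, and FakeDriver merely simulates text that appears on the third lookup so the example runs without a browser.

```python
import time


def wait_for(probe, timeout=10.0, interval=0.25):
    """Call probe() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; the modal text appears on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_modal_paragraphs(self):
        self.calls += 1
        return ["Specialties", "History"] if self.calls >= 3 else []


driver = FakeDriver()
texts = wait_for(driver.find_modal_paragraphs, timeout=2.0, interval=0.01)
print(texts)
```

With a real driver, the probe could be a lambda that calls driver.find_elements with the modal's XPath, since find_elements returns an empty list when nothing matches. Selenium's own WebDriverWait class covers the same need.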
I'm using TestCafe for test automation of a web application based on the Wicket framework. I try to type text into a text input field ... well, actually it is a dropdown list, where a text input field appears, so that the user can search for certain codes. The HTML fragment is as follows:
[image: HTML fragment]
And here is the corresponding screenshot (text field above "001"):
[image: Text input field with dropdown]
The user can type some characters and the list below is automatically filtered (I did this manually):
[image: Text input field with some text]
My TestCafe test tries this:
.click( productcodeList )
.expect( productcodeInputField.visible ).ok()
.click( productcodeInputField )
.typeText( productcodeInputField, 'ABW' )
i.e.
1. Click on the drop-down list.
2. Assert that the text input field is now visible (works fine).
3. Click on the text input field (this should not be necessary, since typeText() is supposed to do this anyway).
4. Type the text "ABW" into the text input field ==> This does not work.
I'm sure that my Selector works, since the assertion (expect) is successful and when I debug the test run after the second click (on the text input field), I see the following:
[image: TestCafe screenshot]
I.e. the cursor is directly on the text field, but somehow TestCafe cannot write the text into the field.
Some additional information: The Selector for the input field is created as follows:
productcodeInputField = Selector('span').withAttribute('class', /select2-dropdown.*/ ).child('span').withAttribute('class', /select2-search.*/ ).child('input').withAttribute('class', 'select2-search__field' );
More information: I'm using the same logic on the same page:
kurzbezeichnungField = Selector('input').withAttribute('name', /.*aeAbbreviation.*/);
...
await t.click( kurzbezeichnungField )
.typeText( kurzbezeichnungField, 'xxxWWW' )
and this works fine.
Node.js version: v10.16.3
Testcafe version: 1.5.0
This looks like a bug. However, I cannot say for certain without an example that demonstrates the problem.
My team would really appreciate it if you shared your project or a sample that reproduces the issue.
Please create a separate issue in the TestCafe GitHub repository using the following template and provide as much additional information as possible.
I want to read text that comes from the API. When I query with query("*"), it does not appear in the calabash-android console.
wait_for_text(text, timeout: 10) does not work either.
query "all * marked:'Email field can not be empty'"
By default, Calabash doesn't return views that are not visible. So if the error message is on the screen but invisible, using the all operator should do the trick.
In Android, an edit text field can show two different kinds of message: hint text and error text.
If it's hint text, use this:
query("* id:'edit_text_id'", :hint)
If it's an error message, use this:
query("* id:'edit_text_id'", :error)
Normally these kinds of text messages won't show up in a plain query("*").
I'm using Docx4J to build an invoice template.
On the left side of the page, it's usual to show a legal sentence such as: Registered company in ... Book ... Page ...
I have inserted this in my template with a Word text frame.
My issue is this: when exporting to .docx, the legal text is shown perfectly, but when exporting to .pdf, it's shown as a horizontal table under the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile);
foSettings.setWmlPackage(template);
fos = new FileOutputStream(new File("/C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be much appreciated.
Thanks.
You'd need to extend the PDF via FO code; for more, see "How to correctly position a header image with docx4j?"
Floating it left may or may not be easy; the same goes for the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand-edit it into something which FOP can convert to a PDF you are happy with. If you can do that, then it's a matter of modifying docx4j to generate that FO.