BeautifulSoup downloading corrupt pdfs

I have some code that downloads PDF files from a website, but every file I download is corrupted: the PDFs appear to contain no data when I examine them in a hex editor. Any idea why?
EDIT: I have found that the PDF loads if I click the link on the page, but if I open the link in a new tab or paste the URL into a new tab, I get a blank page. The link has some JavaScript attached:
onclick="var win = window.open(this.href,'','');return false;"
Code:
from os.path import basename, join
import requests
# right_div and set_domain are defined earlier in the script (not shown)
pdf_links = []
box_2 = right_div.find_all("div", {"class": "right"})[2]  # Contains PDF links
for link in box_2.find_all('a'):
    current_link = link.get('href')
    if current_link.endswith('pdf'):
        pdf_links.append('http://' + set_domain + current_link)
for url in pdf_links:
    response = requests.get(url)
    with open(join('C:/Users/Ninja2k/Desktop', basename(url)), 'wb') as f:
        f.write(response.content)

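Close the file explicitly with f.close() inside the context manager. (The with block already closes the file when it exits, so the extra call is harmless but redundant.)
for url in pdf_links:
    response = requests.get(url)
    with open(join('C:/Users/Ninja2k/Desktop', basename(url)), 'wb') as f:
        f.write(response.content)
        f.close()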
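Since the question's EDIT says the PDF loads when the link is clicked on the page but not when the URL is opened directly, one possibility is that the server checks the Referer header or a session cookie. The sketch below assumes exactly that, which the thread does not confirm. download_pdf and looks_like_pdf are hypothetical helper names, session is assumed to be a requests.Session that has already loaded the page containing the links, and referer is that page's URL. It refuses to save anything that does not look like a PDF, so empty or HTML responses are caught early instead of being written to disk.

```python
from os.path import basename, join


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF magic bytes appear near the start of the payload."""
    return b"%PDF-" in data[:1024]


def download_pdf(session, url: str, referer: str, dest_dir: str) -> str:
    """Fetch url with a Referer header and save it only if it looks like a PDF.

    `session` is assumed to be a requests.Session (anything with a compatible
    .get method works). Sending Referer is a guess at what the site checks.
    """
    response = session.get(url, headers={"Referer": referer}, timeout=30)
    response.raise_for_status()
    if not looks_like_pdf(response.content):
        raise ValueError(f"{url} did not return PDF data")
    path = join(dest_dir, basename(url))
    with open(path, "wb") as f:  # the with block closes the file on exit
        f.write(response.content)
    return path
```

It could be called as download_pdf(requests.Session(), url, page_url, 'C:/Users/Ninja2k/Desktop') for each url in pdf_links, where page_url is a placeholder for the page the links were scraped from.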

Related

Playwright: Download via Print to PDF?

I'm seeking to scrape a web page using Playwright.
I load the page, and click the download button with Playwright successfully. This brings up a print dialog box with a printer selected.
I would like to select "Save as PDF" and then click the "Save" button.
Here's my current code:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    playwright_page = browser.new_page()
    got_error = False
    try:
        playwright_page.goto(url_to_start_from)
        print(playwright_page.title())
        html = playwright_page.content()
    except Exception as e:
        print(f"Playwright exception: {e}")
        got_error = True
    if not got_error:
        soup = BeautifulSoup(html, 'html.parser')
        # download pdf
        with playwright_page.expect_download() as download_info:
            playwright_page.locator("text=download").click()
        download = download_info.value
        path = download.path()
        download.save_as(DOWNLOADED_PDF_FOLDER)
    browser.close()
Is there a way to do this using Playwright?

Thanks very much to @KJ in the comments, who pointed out that with headless=True, Chromium won't even put up a print dialog box in the first place.
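Because no print dialog appears in headless mode, one alternative is to skip the dialog and ask Chromium to render the page to PDF directly with Playwright's page.pdf(). This is a sketch, not something confirmed in the thread: it assumes the site's download button simply prints the current page, so rendering that page yields an equivalent file. save_page_as_pdf is a hypothetical helper name.

```python
def save_page_as_pdf(url: str, out_path: str) -> str:
    """Render `url` in headless Chromium and write it to `out_path` as a PDF."""
    # Imported here so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # page.pdf() is a Chromium feature and is intended for headless mode.
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.pdf(path=out_path, format="Letter", print_background=True)
        browser.close()
    return out_path
```

If the button instead triggers a server-generated file, the expect_download() approach from the question is still the right tool, and this sketch does not apply.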

Splash return embedded response

I am looking to return an embedded response from a website. This website makes it very difficult to reach this embedded response without JavaScript, so I am hoping to use Splash. I am not interested in the rendered HTML, only in one embedded response (the original post showed it in a screenshot, which is not included here).
This response returns a JSON object for the site to render. I would like the raw JSON from this response. How do I do this in Lua?

Turns out this is a bit tricky. The following is the kludge I have found to do this:
Splash call with Lua script, called from Scrapy:
scrpitBusinessUnits = """
function main(splash, args)
    splash.request_body_enabled = true
    splash.response_body_enabled = true
    assert(splash:go(args.url))
    assert(splash:wait(18))
    splash:runjs('document.getElementById("RESP_INQA_WK_BUSINESS_UNIT$prompt").click();')
    assert(splash:wait(20))
    return {
        har = splash:har(),
    }
end
"""
yield SplashRequest(
    url=self.start_urls[0],
    callback=self.parse,
    endpoint='execute',
    magic_response=True,
    meta={'handle_httpstatus_all': True},
    args={'lua_source': scrpitBusinessUnits, 'timeout': 90, 'images': 0},
)
This script works by returning the HAR file of the whole page load. The key is to set splash.request_body_enabled = true and splash.response_body_enabled = true so that the actual response content ends up in the HAR file.
The HAR file is just a glorified JSON object with a different name, so:
def parse(self, response):
    harData = json.loads(response.text)
    responseData = harData['har']['log']['entries']
    ...
    # Splash appears to base64 encode large content fields,
    # you may have to decode the field to load it properly
    bisData = base64.b64decode(bisData['content']['text'])
From there you can search the JSON object for the exact embedded response.
I really don't think this is a very efficient method, but it works.
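The search step can be sketched as follows. This is only an illustration under stated assumptions: find_json_responses is a hypothetical helper, the url_fragment you pass is whatever identifies the request you care about, and the code assumes base64-encoded bodies are flagged with the HAR 1.2 "encoding" field, which may not hold for every Splash version.

```python
import base64
import json


def find_json_responses(har: dict, url_fragment: str) -> list:
    """Return decoded JSON bodies of HAR entries whose request URL contains url_fragment."""
    results = []
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue  # no body was captured for this entry
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            results.append(json.loads(text))
        except ValueError:
            pass  # the body is not JSON, so skip it
    return results
```

Inside parse, this could be called as find_json_responses(harData['har'], 'some/url/fragment'), where the fragment is a placeholder.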

Access data image URL when the data URL is only obtained upon rendering

I would like to automatically get images saved as browser's data after the page renders, using their corresponding data URLs.
For example:
You can go to the webpage: https://en.wikipedia.org/wiki/Truck
Using the WebInspector from Firefox pick the first thumbnail image on the right.
Now on the Inspector tab, right click over the img tag, go to Copy and press "Image Data-URL"
Open a new tab, paste and enter to see the image from the data URL.
Notice that the data URL is not available in the page source. On the website I want to scrape, the images are rendered after passing through a PHP script. The server returns a 404 response if the images are requested directly using the src attribute.
I believe it should be possible to list the data URLs of the images rendered by the website and download them, however I was unable to find a way to do it.
I normally scrape using selenium webdriver with Firefox coded in python, but any solution would be welcome.

I managed to work out a solution using the Chrome webdriver with CORS disabled, since I could not find a command-line argument to disable it in Firefox.
The solution executes some JavaScript to redraw the image on a new canvas element and then uses the toDataURL method to get the data URL. To save the image, I convert the base64 data to binary and save it as a PNG.
This apparently solved the issue in my use case.
Code to get the first truck image:
from binascii import a2b_base64
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://en.wikipedia.org/wiki/Truck")
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which recent Selenium 4 releases no longer provide
img = driver.find_element(By.XPATH, "/html/body/div[3]/div[3]"
                                    "/div[5]/div[1]/div[4]/div"
                                    "/a/img")
img_base64 = driver.execute_script(
    """
    const img = arguments[0];
    const canvas = document.createElement('canvas');
    const ctx = canvas.getContext('2d');
    canvas.width = img.width;
    canvas.height = img.height;
    ctx.drawImage(img, 0, 0);
    const data_url = canvas.toDataURL('image/png');
    return data_url;
    """,
    img)
# Drop the "data:image/png;base64," prefix and decode the rest
binary_data = a2b_base64(img_base64.split(',')[1])
with open('image.png', 'wb') as save_img:
    save_img.write(binary_data)
Also, I found that the data URL you get with the procedure described in my question is generated by the Firefox web inspector on request, so it is not possible to get a list of data URLs (that are not within the page source) as I first thought.
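The base64-to-binary step can be made a little more general than splitting on the comma. The following is a minimal sketch, assuming the string follows the usual data:[mime][;base64],payload layout; decode_data_url is a hypothetical helper, not part of Selenium.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(data_url: str):
    """Split a data URL into (mime_type, raw_bytes)."""
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")
    header, _, payload = data_url[len("data:"):].partition(",")
    is_base64 = header.endswith(";base64")
    mime_type = header[:-len(";base64")] if is_base64 else header
    # Non-base64 payloads are percent-encoded text
    raw = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return mime_type or "text/plain", raw
```

With the answer's code, mime_type, binary_data = decode_data_url(img_base64) would replace the a2b_base64 line and also report the image type.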

BeautifulSoup is a good library for this kind of problem. When you want to retrieve data from a website, you can use BeautifulSoup, as it is faster than Selenium. BeautifulSoup takes around 10 seconds to complete this task, whereas Selenium would take approximately 15-20 seconds for the same task. Here is how to do it using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import time
st = time.time()
src = requests.get('https://en.wikipedia.org/wiki/Truck').text
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div', class_="thumbinner")
count = 1
for x in divs:
    # Take the 2x entry from the srcset attribute
    url = x.a.img['srcset']
    url = url.split('1.5x,')[-1]
    url = url.split('2x')[0]
    url = "https:" + url
    url = url.replace(" ", "")
    path = f"D:\\Truck_Img_{count}.png"
    response = requests.get(url)
    with open(path, "wb") as file:
        file.write(response.content)
    count += 1
print(f"Execution Time = {time.time()-st} seconds")
Output:
Execution Time = 9.65831208229065 seconds
This downloads 29 images.
Hope that this helps!
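The string splitting in the answer above depends on the srcset value having exactly a 1.5x entry followed by a 2x entry. A slightly more tolerant version is sketched below, assuming only density descriptors such as 1.5x or 2x are used and that the URLs themselves contain no commas. pick_highest_density is a hypothetical helper name.

```python
def pick_highest_density(srcset: str) -> str:
    """Return the URL of the srcset candidate with the largest 'x' descriptor."""
    best_url, best_density = "", 0.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        density = 1.0  # a candidate without an 'x' descriptor counts as 1x
        if len(parts) > 1 and parts[1].endswith("x"):
            density = float(parts[1][:-1])
        if density > best_density:
            best_url, best_density = url, density
    # Protocol-relative URLs (//host/path) need a scheme before requests can fetch them
    if best_url.startswith("//"):
        best_url = "https:" + best_url
    return best_url
```

In the loop above, url = pick_highest_density(x.a.img['srcset']) would stand in for the five lines that build url.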

How to get the rendered template from django?-pdfkit

I have a template in my Django application, and I need to get it rendered into a variable or saved to an HTML file.
My goal is to convert the HTML rendering of the template to PDF. I am using pdfkit, since it is the best HTML-to-PDF converter I have seen; reportlab does not do what I want.
When I try to do something like this:
pdf = pdfkit.from_file('app/templates/app/table.html', 'table.pdf')
I get the PDF, but it prints something like this (screenshot not included).
I appreciate any help!

This is the solution for my case, using Django 2.0.1 and pdfkit 0.6.1.
To obtain the template:
template = get_template('plapp/person_list.html')
To render it with the data:
html = template.render({'persons': persons})
Below is the full view in views.py, which downloads the PDF directly in the browser:
import pdfkit
from django.http import HttpResponse
from django.template.loader import get_template
# Person is the app's model; its import is not shown in the original
def pdf(request):
    persons = Person.objects.all()
    template = get_template('plapp/person_list.html')
    html = template.render({'persons': persons})
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
    }
    # Passing False as the output path makes pdfkit return the PDF bytes
    pdf = pdfkit.from_string(html, False, options)
    response = HttpResponse(pdf, content_type='application/pdf')
    response['Content-Disposition'] = 'attachment; filename="pperson_list_pdf.pdf"'
    return response

from django.template.loader import get_template, render_to_string
Use the import above to get the functions that load the template: get_template returns the template object, while render_to_string returns the rendered template as a string. Here is how I do it, using weasyprint rather than pdfkit:
from django.http import HttpResponse
from weasyprint import CSS, HTML
def weasy_pdf_generation(request, id):
    # my data
    _, _, draft_details = get_draft_details('setup', request, id)
    radios_dict = {k: v[1] for k, v in draft_details.items()}
    # rendering to string
    html_template = render_to_string('tax/setupreview report.html', radios_dict)
    styles = CSS(url="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css")
    pdf_file = HTML(string=html_template).write_pdf(stylesheets=[styles])
    # response details
    response = HttpResponse(pdf_file, content_type='application/pdf')
    response['Content-Disposition'] = 'filename="home_page.pdf"'
    return response

How to upload an image in "read only" input field

I need to upload an image through a read-only input field. The code below works, but it takes the image URL as text.
FYI: swf_creative points to the config file entry where the image URL is saved. The URL is:
swf_creative=/Testing/creatives/Contemp.swf
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementById('fileName').value=\"" + swf_creative +"\"");