Getting the source of a webpage with Selenium

After the browser finishes loading a page, I can right-click on it and select 'Save Page As'. There I get an 'HTML Only' option as well as a 'Web Page, Complete' option. Of course, the second option creates a directory (to save all the .js files etc.), but interestingly, the main source file is also different. That is, 'HTML Only' creates a file named (e.g.) site.html, while the 'Complete' option creates site.html as well as a site/ directory. The two site.html files are different. Why is that?
Anyway, I'm trying to fetch (with Selenium) the second file; that is, I need a file identical to the site.html saved by the 'Complete' option. It doesn't work: Selenium's page_source method gives me yet another version of the HTML source.
If there's an automated way to get it without Selenium, I'm also interested.

There's a non-Selenium way of downloading HTML files using the requests library. Here's a function that downloads a file to a folder of your choice:
import os
import requests

def download_file_to(file_url, destination_folder, new_file_name=None):
    # Use the supplied name, or fall back to the last segment of the URL
    if new_file_name:
        file_name = new_file_name
    else:
        file_name = file_url.split("/")[-1]
    r = requests.get(file_url)
    file_path = os.path.join(destination_folder, file_name)
    # Write raw bytes so binary files survive intact
    with open(file_path, "wb") as code:
        code.write(r.content)
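For example, with a placeholder URL and destination folder:
download_file_to("https://example.com/site.html", "/tmp/downloads")
Keep in mind that requests fetches the HTML exactly as the server sends it, before any JavaScript runs, so the result will resemble the 'HTML Only' file rather than the 'Complete' one.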

The standard way of getting the page source with Selenium is:
driver.page_source
Is that the source you're looking for?
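If page_source isn't the version you want, a variant worth trying (a sketch, assuming an already-initialized WebDriver named driver) is to ask the browser for the serialized live DOM after JavaScript has run:
rendered_html = driver.execute_script("return document.documentElement.outerHTML")
with open("site.html", "w", encoding="utf-8") as f:
    f.write(rendered_html)
This is closer to what 'Web Page, Complete' captures, since that option saves the current DOM; note that it also rewrites resource links to point into the site/ directory, which is why the two saved site.html files differ.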

Related

Why does urllib3 fail to download a list of files if authentication is required and headers aren't re-created?

NOTE: I'm posting this for two reasons:
1. It took me maybe 3-4 hours to figure out a solution (this is my first urllib3 project), and hopefully this will help others who run into the same problem.
2. I'm curious why urllib3 behaves as described below, as it is (to me, anyway) very unintuitive.
I'm using urllib3 to first load a list of files and then to download the files on that list. The server the files are on requires authentication.
The behavior I ran into is that if I don't re-make the headers before each request to the PoolManager, only the first file downloads correctly. The contents of every subsequent file is an error message from the server saying that authentication failed.
However, if I add a line that regenerates the headers (see the commented line in the code snippet below), the downloads work as expected. Is this intended behavior, and if so, can anyone explain why the headers can't be re-used (all they contain is my username/password, which doesn't change)?
import shutil
import tqdm
import urllib3

# username, password, url, output_dir and get_output_filename are defined elsewhere
http = urllib3.PoolManager(num_pools=10, maxsize=10, block=True)
myHeaders = urllib3.util.make_headers(basic_auth=f'{username}:{password}')

# Fetch the listing page and pull the file URLs out of the anchor tags
files = http.request('GET', url, headers=myHeaders)
file_list = files.data.decode('utf-8')
file_list = file_list.split('<a href="')
file_list_a = [file.split('">')[0] for file in file_list if file.startswith('https://')]

for path in tqdm.tqdm(file_list_a, desc='Downloading'):
    output_fn = get_output_filename(path, output_dir)
    # ___---^^^ Re-make headers ^^^---___
    myHeaders = urllib3.util.make_headers(basic_auth=f'{username}:{password}')
    with open(output_fn, 'wb') as out:
        r = http.request('GET', path, headers=myHeaders, preload_content=False)
        shutil.copyfileobj(r, out)
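One thing I haven't ruled out (it may well be unrelated to the auth failures): with preload_content=False, urllib3 expects the caller to release the connection back to the pool after the body has been read. A sketch of the loop body with the documented release call added:
r = http.request('GET', path, headers=myHeaders, preload_content=False)
with open(output_fn, 'wb') as out:
    shutil.copyfileobj(r, out)
# Return the connection to the pool once the body has been consumed
r.release_conn()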
Thanks in advance,

Scrapy: upload files to dynamically created directories in S3 based on a field

I've been experimenting with Scrapy for some time now and have recently been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field in the extracted data and place the data and media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (kind of a crude way to segregate media types, as of now), with the following custom File Pipeline:
import os
from urllib.parse import urlparse

from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline

class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        # Crude media-type segregation: .mp4 means video, everything else is an image
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly: I have been able to download the images and videos into their separate directories in the S3 bucket, based on the account_id. My question is:
Is there a way to achieve the same result with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the first example on the Item Exporters page but couldn't make any headway. One thing that I thought might help is to use boto3 to establish a connection and then upload the files, though that would probably require me to segregate the files locally first and upload them together, using a combination of pipelines (to split the data) and signals (to upload the files to S3 once the spider closes). A sketch along those lines is below.
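To make the boto3 idea concrete, here is a minimal sketch of an item pipeline that writes each item's data straight to the dynamic S3 path (the bucket name, the JSON serialization, and the file naming are assumptions on my part):
import json
import boto3
from itemadapter import ItemAdapter

class S3DataFilePipeline:
    def open_spider(self, spider):
        # Credentials are picked up from the environment or the AWS config
        self.s3 = boto3.client("s3")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        # Hypothetical file name; any per-item unique name would do
        key = f"crawl_data/{account_id}/data/{account_id}.json"
        body = json.dumps(adapter.asdict()).encode("utf-8")
        self.s3.put_object(Bucket="<my_s3_bucket_name>", Key=key, Body=body)
        return item
This uploads as the items flow through the pipeline, so nothing needs to be staged locally or deferred to a spider_closed signal.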
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.

Error converting java.io.File to org.eclipse.core.resources.IFile?

I am dynamically creating a File in the workspace and trying to generate an IFile instance for it.
IPath location= Path.fromOSString(file.getAbsolutePath());
IFile iFile=ResourcesPlugin.getWorkspace().getRoot().getFile(location);
FileEditorInput input = new FileEditorInput(iFile);
But when I check whether the file exists (using iFile.exists()), it returns false.
I tried using the canonical path as well, but that did not help either.
Changes to the file system are not automatically detected by the Eclipse workspace; you need to tell the workspace to refresh its view of the local file system. You can do this with:
iFile.refreshLocal(IResource.DEPTH_ZERO, null);
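// For several changed files, refresh the containing folder instead, e.g.:
// iFile.getParent().refreshLocal(IResource.DEPTH_INFINITE, null);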
If more than one file has changed, you can do the refresh at the folder level by changing the depth, as the commented lines above sketch.

Error 3013 thrown when writing a file in Adobe AIR

I'm trying to write/create a JSON file from an AIR app, and I'm trying not to show a 'Save As' dialog box.
Here's the code I'm using:
var fileDetails:Object = CreativeMakerJSX.getFileDetails();
var fileName:String = String(fileDetails.data.filename);
var path:String = String(fileDetails.data.path);
var f:File = File.userDirectory.resolvePath( path );
var stream:FileStream = new FileStream();
stream.open(f, FileMode.WRITE );
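// Opening in FileMode.WRITE is synchronous; it can throw Error 3013
// ('File or directory in use') if the target file is locked by another process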
stream.writeUTFBytes( jsonToExport );
stream.close();
The problem I'm having is that I get 'Error 3013: File or directory in use'. The directory/path is gathered from a Creative Suite extension I'm building; the path is the same as that of the FLA open in CS that the extension is being used with.
So I'm not sure: is the problem that there are already files in the directory I'm writing the JSON file to?
Do I need to add a timer and close the stream after a slight delay, to give the file time to be written?
Can you set up some trace() commands? I would need to know the values of the String variables, and of f.url.
Can you read from the file that you are trying to write to, or does nothing work?
Where is CreativeMakerJSX.getFileDetails() coming from? Is it giving you data about a file that is in use?
And from Googling around, this seems like it may be a bug. Try setting up a listener for when you are finished, if you have had the file open previously.
I re-wrote how the file was written and am no longer running into this issue.

Rails 3 send_data issue; difference between production and development

I have a strange bug in my Rails 3 app. I am using this code to send images that are not public:
image = open(f, "rb") { |io| io.read }
send_data(image, :disposition => 'inline')
I am using this code to display images on both admin pages and user pages. In the development environment this code works fine and the images are displayed on both. But in the production environment the images are displayed only on the admin pages, not the user pages. I can click on an image that is not displayed and select 'properties'; under image type I see:
application/xhtml+xml
But other, public images have a type like JPG image or PNG image.
What difference between the environments could be causing the images not to work, and how can I fix this so that the images are properly displayed on all pages?
I had a very similar symptom. I know this is an old issue that has already been resolved, but I thought I would contribute the findings from my situation, which turned out to have a different cause.
I was building a CSV file and using send_file to send it to the browser. In development it worked great; in production the browser reported a page not found.
Here is the action from the controller.
def export
  @campaign = LiveEmailCampaign.find params[:id]
  @campaign.recipients_csv do |csv_file|
    send_file csv_file,
              filename: @campaign.name,
              type: Mime::CSV
  end
end
And the CSV is built by this code in a model.
def recipients_csv
  tempfile = Tempfile.new(self.name.downcase.dasherize)
  CSV.open tempfile, 'w' do |csv|
    recipients.each do |recipient|
      csv << [recipient]
    end
  end
  yield tempfile
end
After a few minutes of research I determined that the culprit was a conflict between the XSendFile directive in Apache on the production server and the temporary path the CSV data was written to. In my case XSendFile was scoped only to the app root, while the temp file was in /tmp on the server.
Instead of tampering with the XSendFile config at the server level, I just instructed Tempfile to use the tmp folder inside the Rails app.
So, I changed the call to Tempfile in the model method to this:
tempfile = Tempfile.new(self.name.downcase.dasherize, Rails.root.join('tmp'))
Now Rails and Apache are friends again. I still need to refactor this code because it doesn't explicitly unlink the created temporary file; best practice is to unlink temporary files explicitly.