Scraping a site using selenium and bs4 does not work

Scraping a site using selenium and bs4 does not work - selenium

I am trying to scrape the following site:
https://cve.mitre.org/cve/data_feeds
driver = webdriver.Chrome() # brew install chromedirver
driver.get(self._SCRAPE_WEBSITE_URL)
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
cve = soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
print(cve)
but my print returns an empty list.
any ideas?

The elements you are trying to access are inside an iframe. In order to access them you have to switch to that iframe. With Selenium this can be done as following:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome() # brew install chromedirver
wait = WebDriverWait(driver, 20)
driver.get(self._SCRAPE_WEBSITE_URL)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[#id='twitter-widget-0']")))
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")))
cve = driver.find_elements(By.CSS_SELECTOR, "li.timeline-TweetList-tweet.customisable-border")
I guess this can also be done with bs4, however I'm not familiar enough with bs4, so I don't know how to switch into iframe with bs4.
Also don't forget to switch to the default content when you finished dealing with iframe content.

Your print returns an empty list because the html dom under 2nd iframe and you need to switch to get data. Now it's working fine.
You can install manager: pip install webdriver-manager and run code
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
url = 'https://cve.mitre.org/cve/data_feeds'
cm = ChromeDriverManager().install()
driver = webdriver.Chrome(cm)
driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(5)
iframe = driver.find_elements_by_tag_name('iframe')[1]
driver.switch_to.frame(iframe)
soup = BeautifulSoup(driver.page_source, 'html.parser')
cves =soup.find_all("li", {"class": "timeline-TweetList-tweet customisable-border"})
for cve in cves:
tweet_text= cve.select_one('p').text
print(tweet_text)
Result:
CVE-2021-44833 The CLI 1.0.0 for Amazon AWS OpenSearch has weak permissions for the configuration file. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44833 …
CVE-2021-41805 HashiCorp Consul Enterprise before 1.8.17, 1.9.x before 1.9.11, and 1.10.x before 1.10.4 has Incorrect Access Control. An ACL token (with the default operator:write permissions) in one namespace can be used for unintended privilege ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41805 …
CVE-2021-44515 Zoho ManageEngine Desktop Central is vulnerable to authentication bypass, leading to remote code
execution on the server, as exploited in the wild in December 2021. For Enterprise builds 10.1.2127.17 and earlier, upgrade to 10.1.212... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44515 …
CVE-2021-4097 phpservermon is vulnerable to Improper Neutralization of CRLF Sequences https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4097 …
CVE-2021-4092 yetiforcecrm is vulnerable to Cross-Site Request Forgery (CSRF) https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4092 …
CVE-2021-41242 OpenOlat is a web-basedlearning management system. A path traversal vulnerability exists in OpenOlat prior to versions 15.5.12 and 16.0.5. By providing a filename that contains a relative path as a parameter in some REST methods, it... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41242 …
CVE-2021-26340 A malicious hypervisor in conjunction with an unprivileged attacker process inside an SEV/SEV-ES
guest VM may fail to flush the Translation Lookaside Buffer (TLB) resulting in unexpected behavior inside the virtual machine (VM). https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-26340 …
CVE-2020-12890 Improper handling of pointers in the System Management Mode (SMM) handling code may allow for a privileged attacker with physical or administrative access to potentially manipulate the AMD Generic Encapsulated Software Architecture ... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-12890 …
CVE-2021-43815 Grafana is an open-source platform for monitoring and observability. Grafana prior to versions 8.3.2 and 7.5.12 has a directory traversal for arbitrary .csv files. It only affects in... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43815 …
CVE-2021-4089 snipe-it is vulnerable to Improper Access Control https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-4089 …
CVE-2021-23700 All versions of package merge-deep2 are vulnerable to Prototype Pollution via the mergeDeep() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23700 …
CVE-2021-23663 All versions of package sey are vulnerable to Prototype Pollution via the deepmerge() function. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23663 …
CVE-2021-23639 The package md-to-pdf before 5.0.0 are vulnerable to Remote Code Execution (RCE) due to utilizing the library gray-matter to parse front matter content, without disabling the JS engine. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23639 …
CVE-2021-23561 All versions of package comb are vulnerable to Prototype Pollution via the deepMerge() function.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23561 …
CVE-2021-23463 The package com.h2database:h2 from 0 and before 2.0.202 are vulnerable to XML External Entity (XXE) Injection via the org.h2.jdbc.JdbcSQLXML class object, when it receives parsed str... https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23463 …
CVE-2021-27984 In Pluck-4.7.15 admin background a remote command execution vulnerability exists when uploading files. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-27984 …

Related

Can I install chromedriver on python anywhere and not use it headless?

I am trying to use this code that works on my local machine on python anywhere and i want to understand if it is even possible:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
# Initialize webdriver
driver = webdriver.Chrome(executable_path="/Users/matteo/Downloads/chromedriver")
# Navigate to website
driver.get("https://apnews.com/article/prince-harry-book-meghan-royals-4141be64bcd1521d1d5cf0f9b65e20b5")
time.sleep(5)
# Parse page source
soup = BeautifulSoup(driver.page_source, "html.parser")
# Find desired elements using Beautiful Soup
elements = soup.find_all("p")
# Print element text
for element in elements:
print(element.text)
# Close webdriver
driver.quit()
Do i need to have installed chrome to make that work or is chromium enough? Because when i run that code on my local machine a chrome page opens up. How does that work on python anywhere? Would it crush?
I am wondering if the code i am using only works if someone is on a GUI with Chrome installed or if it can work on python anywhere too.

The short answer is no. ChromeDriver must have chrome installed. You can run your tests headless for time save, but chrome still must be installed.

Selenium: get() not working with custom google profile

All what im trying to do is pretty much access whatsapp web where I have my whatsapp already linked, However when I use a custom profile the profile does open, however browser.get("https://web.whatsapp.com) doesn't seem to open. or any browser.get(). What could be the issue?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException
options = webdriver.ChromeOptions()
options.add_argument('--user-data-dir=/Users/omarassouma/Library/Application Support/Google/Chrome/')
options.add_experimental_option("deatch", True)
browser = webdriver.Chrome(executable_path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",chrome_options=options)
browser.get("https://web.whatsapp.com/")
this is the updated version, it now opens whatsapp web however not in a custom profile, moreover I cant really use webdriver.options(), is there anything extra I have to import?.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=/Users/omarassouma/Library/Application Support/Google/Chrome/User Data/Default")
browser = webdriver.Chrome(executable_path="/Users/omarassouma/Downloads/chromedriver",options=options)
browser.get("https://web.whatsapp.com/")

You need to take care of a couple of things as follows:
To use a Custome Chrome Profile you have to pass the absolute path as follows:
options.add_argument("user-data-dir=/Users/omarassouma/Library/Application Support/Google/Chrome/User Data/Default")
You can find a detailed discussion in How to use Chrome Profile in Selenium Webdriver Python 3
Instead of passing the absolute path of the google-chrome binary, you need to pass the absolute path of the ChromeDriver through the key executable_path.
Additionally, instead of chrome_options you need to use options as chrome_options is deprecated now.
You can find a detailed discussion in DeprecationWarning: use options instead of chrome_options error using Brave Browser With Python Selenium and Chromedriver on Windows
So effectively the line of code will be:
browser = webdriver.Chrome(executable_path="/path/to/chromedriver", options=options)

At 05.11.2022 I found the only way to pass through authorization for myself is using cookie - https://stackoverflow.com/a/15058521
Runing selenium driver with google account isn't working

Selenium and Chromedriver on Lambda - Inconsistency at scale

I have a basic selenium script running/deployed on aws lambda.
Chrome and chromedriver and installed as layers (and available on /opt) via serverless.
The script works ... but only some of the time and rarely at scale (invoking more than 5 instances asynchronously).
I invoke the function in a simple for loop (up to about 200 iterations)
response = client.invoke(
FunctionName='arn:aws:lambda:us-east-1:12345667:function:selenium-lambda-dev-hello',
InvocationType='Event', #|'RequestResponse'|'Event' (async)| DryRun'
LogType='Tail',
#ClientContext='string',
Payload=event_payload,
#Qualifier='24'
)
Other runs, the process hangs while initiating the selenium driver on this line
driver = webdriver.Chrome('/opt/chromedriver_89', chrome_options=options)
other iterations the drivers fail/throw a 'timeout waiting for renderer exception'
This I believe is often due to a mismatch of chromedriver/chrome. I have checked and verified my versions are matched up and compatible (and like i said they do work sometimes).
I guess i'm looking for some ideas/direction to even begin to troubleshoot this. I was under the impression that each invokation of a lambda function is in a separate environment, so why would increasing the volume of invokations have any adverse effect on how well my script runs?
Any ideas or thoughts would be greatly appreciated all!

Discussed in the comments.
There's no complete solution but what has helped improve the situation was increasing the memory of the Lambda service.
The alternatives to try/consider:
Don't use Chrome. Use requests and lxml to query the pages via the network level and remove the need for Chrome.
I did something similar to support another stack question recently. You can see it's similar but not quite the same.
Go to a url and get some text from an xpath:
from lxml import html
import requests
import json
url = "https://nonfungible.com/market/history"
response = requests.get(url)
page = html.fromstring(response.content)
datastring = page.xpath('//script[#id="__NEXT_DATA__"]/text()')
Don't use Chrome. Chrome is ideal for functional testing - but if that's not you objective consider using HtmlUnitDriver or phantomjs. Both of these are significantly lighter than chrome and won't require a browser to be installed (you can run it straight from libraries)
You only need to change the driver initialisation and the rest of the script (in theory) should work.
PhantomJS:
$ pip install selenium
$ brew install phantomjs
from selenium import webdriver
driver = webdriver.PhantomJS()
Unit Driver:
from selenium import webdriver
driver = webdriver.Remote(
desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)

Selenium - Edge - How to start a webdriver session with work profile?

My application does not have a login page to get authenticated. It uses my organizational email id (SSO) to authenticate my access to the application. I am using Version 80.0.361.66 (Official build) (64-bit) of Microsoft Edge.
driver = webdriver.Edge()
driver.maximize_window()
selenium version - selenium==3.141.0
This edge session does not use my work profile . It opens a new session because of which my work profile is not loaded and my access to the application is denied. However, I did try to update the version of selenium to use EdgeOptions. But, that didn't work as well. Below is the code :
options = webdriver.EdgeOptions()
options.add_argument("user-data-dir=C:\\Users\\Ajmal.Moideen\\AppData\\Local\\Microsoft\\Edge\\User Data")
driver = webdriver.Edge(options=options)
driver.maximize_window()
selenium version=4.0.0a3

Here's how I got it to work - I am using Chromium Edge 85.0.564.51 with Selenium 3.141.0.
Selenium 3.141.0 from pip doesn't seem to support new Chromium-based Edge Webdriver, but Microsoft supplies it in their msedge-selenium-tools package (better documentation here) as mentioned by Matthias' comment on your question.
First, grab the Chromium Edge webdriver here - get the version that matches your version of Edge (go to chrome:version in Edge to see what version you are running). Put the webdriver somewhere convenient, you need to set driverpath below to point to it.
Install the pip packages:
pip install msedge-selenium-tools selenium==3.141
In your code, import the msedge-selenium-tools Webdriver and Options modules and build the webdriver as shown:
from msedge.selenium_tools import Edge, EdgeOptions
...
options = EdgeOptions()
options.use_chromium = True
options.add_argument("--user-data-dir=C:\\Users\\YOUR-USERNAME\\AppData\\Local\\Microsoft\\Edge\\User Data")
options.add_argument("--start-maximized")
driverpath = 'msedgedriver.exe'
driver = Edge(driverpath, options=options)
Voilà, that should do the trick.
P.S.: Even though chrome:version will show your Profile path with a trailing \Default, don't include that in your --user-data-dir argument above, since the driver seems to append \Default to the end.

Selenium WebDriver.get(url) does not open the URL

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
import time
# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# go to the google home page
driver.get("http://www.google.com")
This opens a Firefox window but does not open a url.
I have a proxy server(but the address bar does not show the passed url)
I have two Firefox profiles.
Can 1 or 2 be an issue? if yes, then how can I resolve it?

It is a defect of Selenium.
I have the same problem in Ubuntu 12.04 behind the proxy.
Problem is in incorrect processing proxy exclusions. Default Ubuntu exclusions are located in no_proxy environment variable:
no_proxy=localhost,127.0.0.0/8
But it seems that /8 mask doesn't work for selenium. To workaround the problem it is enough to change no_proxy to the following:
no_proxy=localhost,127.0.0.1
Removing proxy settings before running python script also helps:
http_proxy= python script.py

I was facing exactly the same issue, after browsing for sometime,I came to know that it is basically version compatibility issue between FireFox and selenium. I have got the latest FireFox but my Selenium imported was older which is causing the issue. Issue got resolved after upgrading selenium
pip install -U selenium
OS: windows Python 2.7

I have resolved this issue.
If your jar files are older than the latest version and the browser has updated to latest version, then download:
the latest jar files from the selenium website http://www.seleniumhq.org/download/, and
the latest geckodriver.exe.

#Neeraj
I've resolved this problem, but i'm not sure if you are the same reason.
In general, my problem was caused by some permission issues.
I tried to move my whole project into ~/:
mv xxx/ ~/
and then i change give it the 777 permission:
chmod -R 777 xxx/
I'm not familiar with linux permission so i just do this to make sure i have permission to execute the program.
Even you don't have permission, the selenium program will not prompt you.
So, good luck.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
WebDriver driver = new FirefoxDriver();
driver.manage().timeouts().implicitlyWait(30,TimeUnit.SECONDS);
driver.get("http://www.google.com");
OR
import org.openqa.selenium.support.ui.ExpectedConditions;
WebDriverWait wait = new WebDriverWait(driver,30);
driver.get("http://www.google.com");
//hplogo is the id of Google logo on google.com
wait.until(ExpectedConditions.presenceOfElementLocated(By.id("hplogo")));

I had the similar problem. All I had to do was delete the existing geckodriver.exe and download the latest release of the same. You can find the latest release here https://github.com/mozilla/geckodriver/releases.

A spent a lot of time on this issue and finally found that selenium 2.44 not working with node version 0.12.
Use node version 0.10.38.

I got the same error when issuing a URL without the protocol (like localhost:4200) instead of a correct one also specifying the protocol (e.g. http://localhost:4200).
Google Chrome works fine without the protocol (it takes http as the default), but Firefox crashes with this error.

I was getting similar problem and Stating string for URL worked for me. :)
package Chrome_Example;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
public class Launch_Chrome {
public static void main(String[] args) {
System.setProperty("webdriver.chrome.driver", "C:\\Users\\doyes\\Downloads\\chromedriver_win324\\chromedriver.exe");
String URL = "http://www.google.com";
WebDriver driver = new ChromeDriver();
driver.get(URL);
}
}

I had the same problem but with Chrome.
Solved it using the following steps
Install Firefox/Chrome webdriver from Google
Put the webdriver in Chrome's directory.
Here's the code and it worked fine
from selenium import webdriver
class InstaBot(object):
def __init__(self):
self.driver=webdriver.Chrome("C:\Program
Files(x86)\Google\Chrome\Application\chromedriver.exe")# make sure
#it is chrome driver
self.driver.get("https://www.wikipedia.com")
InstaBot()

Check your browser version and do the following.
1. Download the Firefox/Chrome webdriver from Google
2. Put the webdriver in Chrome's directory.

I was having the save issue when trying with Chrome. I finally placed my chromedrivers.exe in the same location as my project. This fixed it for me.

Update your driver based on your browser.
In my case for chrome,
Download latest driver for your chrome from here : https://chromedriver.chromium.org/downloads
Check chrome version from your browser at : chrome://settings/help.
While initialising your driver,
use driver = webdriver.Chrome(executable_path="path/to/downloaded/driver")

Please have a look at this HowTo: http://www.qaautomation.net/?p=373
Have a close look at section "Instantiating WebDriver"
I think you are missing the following code line:
wait = new WebDriverWait(driver, 30);
Put it between
driver = webdriver.Firefox();
and
driver.getUrl("http://www.google.com");
Haven't tested it, because I'm not using Selenium at the moment. I'm familiar with Selenium 1.x.

I was having the save issue. I assume you made sure your java server was running before you started your python script? The java server can be downloaded from selenium's download list.
When I did a netstat to evaluate the open ports, i noticed that the java server wasn't running on the specific "localhost" host:
When I started the server, I found that the port number was 4444 :
$ java -jar selenium-server-standalone-2.35.0.jar
Sep 24, 2013 10:18:57 PM org.openqa.grid.selenium.GridLauncher main
INFO: Launching a standalone server
22:19:03.393 INFO - Java: Apple Inc. 20.51-b01-456
22:19:03.394 INFO - OS: Mac OS X 10.8.5 x86_64
22:19:03.418 INFO - v2.35.0, with Core v2.35.0. Built from revision c916b9d
22:19:03.681 INFO - RemoteWebDriver instances should connect to: http://127.0.0.1:4444/wd/hub
22:19:03.683 INFO - Version Jetty/5.1.x
22:19:03.683 INFO - Started HttpContext[/selenium-server/driver,/selenium-server/driver]
22:19:03.685 INFO - Started HttpContext[/selenium-server,/selenium-server]
22:19:03.685 INFO - Started HttpContext[/,/]
22:19:03.755 INFO - Started org.openqa.jetty.jetty.servlet.ServletHandler#21b64e6a
22:19:03.755 INFO - Started HttpContext[/wd,/wd]
22:19:03.765 INFO - Started SocketListener on 0.0.0.0:4444
I was able to view my listening ports and their port numbers(the -n option) by running the following command in the terminal:
$netstat -an | egrep 'Proto|LISTEN'
This got me the following output
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp46 0 0 *.4444 *.* LISTEN
I realized this may be a problem, because selenium's socket utils, found in: webdriver/common/utils.py are trying to connect via "localhost" or 127.0.0.1:
socket_.connect(("localhost", port))
once I changed the "localhost" to '' (empty single quotes to represent all local addresses), it started working. So now, the previous line from utils.py looks like this:
socket_.connect(('', port))
I am using MacOs and Firefox 22. The latest version of Firefox at the time of this post is 24, but I heard there are some security issues with the version that may block some of selenium's functionality (I have not verified this). Regardless, for this reason, I am using the older version of Firefox.

This worked for me (Tested on Ubuntu Desktop 11.04 with Python-2.7):
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.stackoverflow.com")

Since you mentioned you use a proxy, try setting up the firefox driver with a proxy by following the answer given here proxy selenium python firefox

Try the following code
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
WebDriver DRIVER = new FirefoxDriver();
DRIVER.get("http://www.google.com");

You need to first declare url as a sting as below:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
import time
# Create a new instance of the Firefox driver
String URL = "http://www.google.com";
driver = webdriver.Firefox()
# go to the google home page
driver.get(URL);

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Scraping a site using selenium and bs4 does not work - selenium

Related

Can I install chromedriver on python anywhere and not use it headless?

Selenium: get() not working with custom google profile

Selenium and Chromedriver on Lambda - Inconsistency at scale

Selenium - Edge - How to start a webdriver session with work profile?

Selenium WebDriver.get(url) does not open the URL

Categories

Resources