Python Selenium does not update HTML after JavaScript execution - selenium

I am testing the code below to scrape some options data from Marketchameleon.com. The original table is sorted by the ATM IV % Change column, but I would like to sort it by the Implied Straddle Premium column. As there is no button to click, I thought (after checking the HTML source) about doing it like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as BSoup

browser = webdriver.PhantomJS()
browser.get("https://marketchameleon.com/Screeners/Options")

bs_obj = BSoup(browser.page_source, 'html.parser')
with open("Market_Chameleon_Unsorted.html", "w", encoding="utf-8") as file:
    file.write(str(bs_obj))

element = browser.find_element_by_xpath("//th[@aria-label='Implied StraddlePremium %: activate to sort column ascending']")
browser.execute_script("arguments[0].setAttribute('aria-label','Implied StraddlePremium %: activate to sort column descending')", element)

bs_obj = BSoup(browser.page_source, 'html.parser')
with open("Market_Chameleon_Sorted.html", "w", encoding="utf-8") as file:
    file.write(str(bs_obj))
The code runs without any errors, but it does not sort the table, i.e. the unsorted and sorted files are identical (I parse the HTML files in R). It seems the page does not actually refresh after the HTML is modified by JavaScript. If I do a normal refresh, I again get the original HTML with the unsorted table. How can I tell Selenium to sort the table? Is there another way?
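One thing worth trying (a sketch, not verified against the live site): setting the aria-label only rewrites the label text, while the sort itself runs from the header's click handler, so clicking the th should trigger the table's own JavaScript sort. The contains() locator below is an assumption about the header markup; adjust it to the real page.
import time

element = browser.find_element_by_xpath("//th[contains(@aria-label, 'Implied StraddlePremium')]")
element.click()   # first click typically sorts ascending
element.click()   # second click typically sorts descending
time.sleep(2)     # give the page's JavaScript time to re-render the rows

bs_obj = BSoup(browser.page_source, 'html.parser')
with open("Market_Chameleon_Sorted.html", "w", encoding="utf-8") as file:
    file.write(str(bs_obj))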

Related

How to store values together after scraping

I am able to scrape individual fields off a website, but I would like to map the title to the time.
The fields each have their own class, so I am struggling with how to map the time to the title.
A dictionary would work, but how would I structure/format this dictionary so that it stores values on a line-by-line basis?
url for reference - https://ash.confex.com/ash/2021/webprogram/STUDIO.html
expected output:
9:00 AM-9:30 AM, Defining Race, Ethnicity, and Genetic Ancestry
11:00 AM-11:30 AM, Definitions of Structural Racism
etc
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://ash.confex.com/ash/2021/webprogram/STUDIO.html')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='itemtitle')
for item in productlist:
    for eachLine in item.find_all('a', href=True):
        title = eachLine.text
        print(title)
times = driver.find_elements_by_class_name("time")
for t in times:
    print(t.text)
Selenium is overkill here. The website doesn't use any dynamic content, so you can scrape it with Python requests and BeautifulSoup. Here is code showing how to achieve it. You need to query productlist and times separately and then iterate by index to get both items at once. I put len(productlist) in range() because I assume productlist and times have equal length.
import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/STUDIO.html'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
for iterator in range(len(productlist)):
    row = times[iterator].text + ", " + productlist[iterator].text
    print(row)
Note: soup.select() gathers items by CSS selector.
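As a side note, zip() pairs the two lists without manual indexing; a minimal equivalent sketch, assuming the two selectors return matching items in order:
for time_tag, title_tag in zip(times, productlist):
    # zip() stops at the shorter list, so a length mismatch won't raise IndexError
    print(time_tag.text + ", " + title_tag.text)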

Can't find the right ID Selection or Xpath for Google Search Input Box

I am looking for the Google search input box but can't find the right ID to select it; using Selenium, I want to send keys or search terms to the Google input box. Please suggest the right ID or XPath selection.
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/chromedriver/chromedriver.exe')
driver.get("https://www.google.com")
search_bar = driver.find_element_by_id('input')
search_bar.send_keys("I want to be a Genius")
search_bar.submit()
#sleep(5)
#end
driver.close()
Here is a screenshot of the current Google page source for the search input.
Try this code:
search_bar = driver.find_element_by_name('q')
search_bar.send_keys("I want to be a Genius")
search_bar.submit()
It worked for me. Inspecting the search box element with Chrome DevTools shows an attribute called name with the value q.
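Note that the find_element_by_* helpers are deprecated in Selenium 4 and removed in later releases; the equivalent By-based locator looks like this:
from selenium.webdriver.common.by import By

search_bar = driver.find_element(By.NAME, 'q')  # same locator, Selenium 4 style
search_bar.send_keys("I want to be a Genius")
search_bar.submit()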

How to use elements retrieved from selenium?

How do I effectively use elements retrieved from Selenium that are stored in variables? I am using Python. In the program below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

driver = webdriver.Firefox()
driver.get("https://boards.4chan.org/wg/archive")
matching_threads = []
key = "Pixel"
for i in driver.find_elements_by_class_name("teaser-col"):
    if key in i.text:
        matching_threads.append(i)
        matched_thread = i
print(matching_threads)
driver.quit()
I get the following from the printout of matching_threads:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="aa74a4a6-5bb2-4b48-92b6-50f5d51a9e5c", element="59b6076f-a5a2-4862-9c1f-028025e4b567")>]
How can I use that output to select said element in Selenium and interact with it? What I am trying to do is go to that element and then click on the element to the right of it. What I am failing to understand is how to retrieve the element in Selenium using the stored information in matching_threads.
If anyone can help me, I would very much appreciate it.
To click on the adjacent td that has an a tag with the class quotelink:
i.find_element_by_xpath(".//following::td[1]/a[#class='quotelink']").click()
Now, if the page navigates elsewhere, you could just grab the href values, insert them into an array, and then loop through them with driver.get(). If it opens a new tab instead, you should be fine.
.get_attribute('href')
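Putting that together, a sketch (untested against the live board) that collects the hrefs first and then visits them; collecting first matters because stored WebElements go stale once driver.get() loads a new page:
hrefs = []
for thread in matching_threads:
    link = thread.find_element_by_xpath(".//following::td[1]/a[@class='quotelink']")
    hrefs.append(link.get_attribute('href'))

for href in hrefs:
    driver.get(href)
    # ... scrape the thread page here ...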

I am not sure which of the elements I should be scraping, plus a formatting error (Jupyter + Selenium)

I finally got around to displaying the page I need as text/HTML and concluded that the data I need is included. For now I just have it printing the entire page, because I remain conflicted about which element I need to get what I want. Between the three highlighted elements 1, 2, and 3, I am having trouble first identifying which one to reference (I would go with the 'table' element, but it doesn't highlight the leftmost column with ticker names, which is literally half the point of getting this data, though the name is referenced as shown in the highlighted yellow part). Also, the class descriptions seem really long, and sometimes there appear to be two within the same element, so I was wondering how I would address that? And though this problem is not as immediate: if you take that code, print it, and scroll a bit down, the table data is in straight columns, so would that be addressed once I reference the proper element, or do I have to write something additional to fix it? Would the fact that I have multiple pages to scan also change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
Edit
read_html without bs4
You won't need BeautifulSoup to reach your goal; pandas selects all HTML tables from the page source and pushes them into a list of DataFrames.
In your case there is only one table in the page source, so you get your df by selecting the first element of the list, slicing with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
You're close to a solution. Let pandas take control: read the prettified, bs4-flavored HTML into pandas and modify it there to your needs:
pd.read_html(soup.select_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular HTML parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib', yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or the requests HTTP library), it downloads the original HTML source file.
Initially there's only a single tag with the class name 'ships-listing', because that tag ships with the source page. But once you scroll down, the page generates additional <ul class='ships-listing'> elements; these are generated by JavaScript.
So when you download a page using urllib, the downloaded content contains only the original source page (you can see this with the view-source option in the browser).
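If you need the JavaScript-generated listings, one option is to drive a real browser and scroll before parsing. A sketch, assuming the page injects the extra ul elements as you scroll:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://robertsspaceindustries.com/pledge/ships')

# scroll to the bottom a few times so the page's JavaScript can inject
# the remaining <ul class='ships-listing'> elements
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

page_soup = BeautifulSoup(driver.page_source, 'html.parser')
containers = page_soup.findAll("ul", {"class": "ships-listing"})
print(len(containers))
driver.quit()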