How to extract an HTML table generated by JavaScript into a data.frame using RSelenium

I need to get the data from an oil-price table generated by JavaScript so that I can load it into a data.frame in R.
With the code below I am able to open the URL in a Chrome browser using RSelenium, but I am unable to extract the table of historical oil prices at the XPath //*[@id='historic-price-list']/div/div[2]/table. The code doesn't seem to give me a table or the values I wanted.
https://markets.businessinsider.com/commodities/historical-prices/oil-price/usd?type=brent
library('RSelenium')
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]] #Start Chrome.
siteAdd <- "https://markets.businessinsider.com/commodities/historical-prices/oil-price/usd?type=brent"
remDr$navigate(siteAdd) #Open the site.
abc <- remDr$findElement("css selector", "//*[@id='historic-price-list']/div/div[2]/table > tbody > tr:nth-child(1) > th:nth-child(1)")$getElementText()
I hope to get it into a table that I can then put into a data.frame.

For anyone who needs the code: RSelenium examples more than a few years old can't really be relied on to work, and I was struggling with how to do the extraction.
Many thanks to this link too:
How to read an html table using Rselenium?
library('RSelenium')
library('XML')
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "https://markets.businessinsider.com/commodities/historical-prices/oil-price/usd?type=brent"
#Open the site.
remDr$navigate(siteAdd)
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
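readHTMLTable returns a list of data frames, one per <table> found in the page source, so the historical-price table can then be picked out by position or name and used as a data.frame directly.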

Extracting information from same page popup in python Selenium webscraping

Note: I'm experienced in Python but just starting out with Selenium and web scraping, so please excuse me if this is a bad question or my Selenium fundamentals seem amiss. I could not find an answer in hours of searching, hence I ask here.
Goal: To extract the "About the Business" information found on the Yelp pages of businesses.
Some pages have their About the Business information inside a pop-up opened by a Read More button (e.g. https://www.yelp.com/biz/and-pizza-bethesda-bethesda).
Some pages do not put their business information in a Read More pop-up (e.g. https://www.yelp.com/biz/pneuma-fashions-upper-marlboro-3).
Problem: I am unable to navigate to the About the Business pop-up that appears after clicking the Read More button and extract the text inside it.
Attempts so far: From googling I found explanations of how to handle alert pop-ups and window pop-ups, but the code doesn't work: the pop-up that appears when clicking the Read More button does not cause any change in window_handles.
import re

# getting all sections of the page
result = driver.find_elements_by_tag_name("section")
About = None
for sec in result:
    if sec.text.startswith("About the Business"):
        # this pertains only to the About the Business section
        main_page = driver.current_window_handle
        print(main_page)  # returns the current handle
        sec.find_element_by_tag_name("button").click()
        popup = None
        for handle in driver.window_handles:  # an iterable with only one handle
            # the only handle present is the main_page handle
            print(handle)
            if handle != main_page:
                popup = handle
        print(popup)  # returns None
        driver.switch_to.window(popup)  # throws an error because popup is None
        # THE FOLLOWING SECTION IS NOT EXECUTED BECAUSE OF THE ERROR ABOVE
        button_contents = driver.find_elements_by_tag_name("p")
        for b in button_contents:
            print(b.text)  # intended to print the text contents
        close = driver.find_element_by_tag_name("button")
        close.click()
        driver.switch_to.window(main_page)
Please help
Thank you to everyone who reads this question and provides advice and answers
That is a custom pop-up, so you won't need to switch to it. I suggest studying how to build relative XPaths. Use a loop to navigate to your URLs and include the code below:
from selenium.webdriver.common.by import By  # needed for By.XPATH

driver.get(your_URL)
readMoreBtnXpath = "//h4[text()='About the Business']/ancestor::section//button"
aboutTheBusinessSec = "//h4[text()='About the Business']/ancestor::section"
fromTheBusinessSec = "((//h2[text()='From the business']/parent::div/following-sibling::div//div)[5]/div)[last()]/preceding-sibling::div"
try:
    driver.find_element(By.XPATH, readMoreBtnXpath).click()
    button_contents = driver.find_elements(By.XPATH, fromTheBusinessSec)
    for b in button_contents:
        print(b.text)
except:
    print(driver.find_element(By.XPATH, aboutTheBusinessSec).text)
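For completeness, a minimal self-contained sketch of the loop this answer describes; the urls list and the Chrome driver setup are assumptions, and the XPaths are the ones given above:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical list of Yelp business pages to visit.
urls = [
    "https://www.yelp.com/biz/and-pizza-bethesda-bethesda",
    "https://www.yelp.com/biz/pneuma-fashions-upper-marlboro-3",
]

readMoreBtnXpath = "//h4[text()='About the Business']/ancestor::section//button"
aboutTheBusinessSec = "//h4[text()='About the Business']/ancestor::section"
fromTheBusinessSec = "((//h2[text()='From the business']/parent::div/following-sibling::div//div)[5]/div)[last()]/preceding-sibling::div"

driver = webdriver.Chrome()
for url in urls:
    driver.get(url)
    try:
        # Pages with a pop-up: click "Read more", then read the modal's text.
        driver.find_element(By.XPATH, readMoreBtnXpath).click()
        for b in driver.find_elements(By.XPATH, fromTheBusinessSec):
            print(b.text)
    except Exception:
        # Pages without a pop-up: read the section text directly.
        print(driver.find_element(By.XPATH, aboutTheBusinessSec).text)
driver.quit()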
One thing that you should know is that the pop-up is not displayed in a new window; it is displayed on the same page itself. Here is the complete code to extract the text from the pop-up:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.yelp.com/biz/and-pizza-bethesda-bethesda')
try:
    driver.find_element_by_xpath('//*[@id="wrap"]/div[3]/div/div[4]/div/div/div[2]/div/div/div[1]/div/div[1]/section[5]/div[2]/button').click()
    p1 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[1]/p').text
    p2 = driver.find_element_by_xpath('//*[@id="modal-portal-container"]/div[2]/div/div/div/div[2]/div/div[2]/div/div[2]/div/div/div[2]/p[2]').text
    print("Specialties --", p1)
    print("History --", p2)
except:
    print('Read more button not found')
Output:
Specialties -- Award-winning pizza: Named one of Fast Company's "World's Most Innovative Companies" in 2018, third-place in the Washington Post Express's of "Best Fast Casual" in 2018, third place in the Washington City Paper's "Best Gluten-Free Menu" in 2018 and won its "Best Pizza in D.C." in 2017, 11th on TripAdvisor's "Best Fast Casual Restaurants -- United States" in 2018.
History -- Since 2012, we've built pizza shops with an edge to their craft pies, beverages and shop design, created an environment where ALL of our Tribe can thrive, supported our local communities and now we'll text you back, if you want. Started with a pizza shop. Became a culture. That's &pizza.
Edit:
Since this doesn't work with the second website, replace the first find_element_by_xpath with:
driver.find_element_by_xpath("//div[@class='lemon--div__373c0__1mboc border-color--default__373c0__3-ifU']/button[.='Read more']").click()
This works for both websites.

Scraping a page every time it changes

Hi, I am currently looking to scrape a page such as "https://www.tennis24.com/match/ABiALWlt/#match-statistics;0" every time the score changes. Currently I am able to scrape it using Selenium and BS with the code below:
from selenium import webdriver

Chrom_path = r"C:\Users\Dan1\Desktop\chromedriver.exe"
driver = webdriver.Chrome(Chrom_path)
driver.get("https://www.tennis24.com/match/zVrM3ySQ/#match-statistics;0")
data = driver.find_elements_by_class_name("statTextGroup")
for d in data:
    sub_data = d.find_elements_by_xpath(".//*")
    assert len(sub_data) == 3
    for s_d in sub_data:
        print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))
But I have no idea how to automate it so that once the score at the top of the page (shown as "Medical timeout 6 : 6 ( 0 : 0 )") changes, the scraper scrapes the new data. The change to monitor, though, is only visible while the match is in play and is not always there.
If you need any more info, please let me know and I'll be happy to add it.
You can scrape the "scoreboard" class in a while loop; whenever its value differs from the previously stored value, the score has changed and you can scrape the other things you wanted.
Hope it helps.
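A minimal sketch of that polling idea, assuming the score sits in an element with class scoreboard (the class name and the five-second interval are assumptions to adjust against the live page):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.tennis24.com/match/zVrM3ySQ/#match-statistics;0")

last_score = None
while True:
    # Hypothetical selector: point this at the element that holds the score.
    score = driver.find_element_by_class_name("scoreboard").text
    if score != last_score:
        last_score = score
        # The score changed, so re-scrape the statistics rows.
        for d in driver.find_elements_by_class_name("statTextGroup"):
            print(d.text)
    time.sleep(5)  # poll every few seconds rather than hammering the page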

There's no input tag where there should be one [duplicate]

I'd like to modify part of the text in a textarea with Selenium. The textarea seems almost as if it were read-only.
In this very simple example using a sample algo, it would be great to be able to change the stock id on this line:
context.aapl = sid(24)
... to something like:
context.aapl = sid(39840)
... which is the Tesla stock id. The variable name will no longer make sense; that doesn't matter, it's just a start.
This Selenium code is able to open the sample for me with no login required.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
t = webdriver.Firefox() # t stands for tab as in browser tab in my mind
t.implicitly_wait(10)
t.get('https://www.quantopian.com/algorithms/')
o = t.find_element_by_xpath("//body") # o stands for object
o.send_keys(Keys.ESCAPE) # clearing the popup
o = t.find_element_by_link_text("Hello World Algorithm")
o.click()
''' for the fun of it if you want to run the backtest
o = t.find_element_by_xpath('//body')
o.send_keys(Keys.CONTROL + 'b')
o.send_keys(Keys.ESCAPE)
'''
print t.find_element_by_id('code-area').text
Here's the output from that
1
# Called once at the start of the simulation.
2
def initialize(context):
3
# Reference to the AAPL security.
4
context.aapl = sid(24)
5
6
# Rebalance every day, one hour and a half after market open.
7
schedule_function(my_rebalance,
8
date_rules.every_day(),
9
time_rules.market_open(hours=1, minutes=30))
10
11
# This function was scheduled to run once per day at 11AM ET.
12
def my_rebalance(context, data):
13
14
# Take a 100% long position in AAPL. Readjusts each day to
15
# account for price fluctuations.
16
if data.can_trade(context.aapl):
17
order_target_percent(context.aapl, 1.00)
That id is 'code-area'. The content includes the margin line numbers, which might be a problem.
The next nested area is 'code-area-internal'; it seems to be the same.
It is followed by these two:
<div class='ide-container' id='ide-container'>
<textarea class='width_100pct' id='codebox'>
When trying to obtain the content of the algorithm via 'codebox', the content doesn't appear to be present, just u'':
>>> p = t.find_element_by_id('codebox').text
>>> p
u''
Attempting CTRL-A on it results in this exception:
>>> o = t.find_element_by_id('codebox')
>>> o.send_keys(Keys.CONTROL + 'a')
ElementNotInteractableException: Message: Element is not reachable by keyboard
If the text can be completely cut, the replace can be done in Python and the result pasted back; that would be fine.
I wouldn't expect Selenium to be able to find and replace text; I'm just surprised that a visible user-input area is off limits to interaction.
That textarea does have its own Find, and I'm hoping I won't have to resort to using it as a workaround.
(The environment is an online IDE for stock market algorithms called Quantopian)
This is the one other thing I tried, with no apparent effect:
>>> t.execute_script("arguments[0].value = arguments[1]", t.find_element_by_id("ide-container"), "_new_")
Appreciate any pointers.
The textarea has a style="display: none" attribute, which means you cannot get its content through the text property. In this case you can use:
p = t.find_element_by_id('codebox').get_attribute("textContent")
To set a new value on the code field you can use:
field = driver.find_element_by_css_selector('div[role="presentation"]')
driver.execute_script("arguments[0].textContent = 'New value';", field)
But note that initially each code line in the code field is displayed as a separate div node with its own value and styles. So to make the new value look exactly like code (with the same formatting) you can prepare an HTML sample, e.g.
value = """<div style="position: relative;"><div class="CodeMirror-gutter-wrapper" style="left: -48px;"><div class="CodeMirror-linenumber CodeMirror-gutter-elt" style="left: 15px; width: 21px;">1</div></div><pre class=" CodeMirror-line " role="presentation"><span role="presentation" style="padding-right: 0.1px;"><span class="cm-comment"># Comment for new code.</span></span></pre></div>"""
and do
driver.execute_script("arguments[0].innerHTML = arguments[1];", field, value)
The content of the algorithm in codebox which you are trying to extract has its style attribute set to display: none;. To extract the text you can use the following lines of code:
p = t.find_element_by_xpath("//div[@class='ide-container']/textarea[@id='codebox']")
t.execute_script("arguments[0].removeAttribute('style')", p)
print(p.get_attribute("innerHTML"))
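Putting the two answers together, a minimal sketch of the read-modify-write round trip the question asks for; the codebox id comes from the question, and treating a plain value assignment as sufficient for this editor is an assumption (a CodeMirror-style editor may ignore it, as noted above):

# Read the hidden textarea's content (works despite display: none).
box = t.find_element_by_id('codebox')
code = box.get_attribute('textContent')

# Do the find-and-replace in Python.
new_code = code.replace('sid(24)', 'sid(39840)')

# Write the result back via JavaScript; the editor picking up a plain
# value change is an assumption -- see the formatting caveats above.
t.execute_script("arguments[0].value = arguments[1];", box, new_code)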

Scrape table data using pandas/beautiful soup (instead of selenium which is slow?), BS implementation not working

I am trying to scrape web data from the website below, and the only way I was able to access the data was by iterating through the rows of the table, adding them to a list (then adding them to a pandas data frame / writing to a CSV), then clicking to the next page and repeating the process (there are about 50 pages per search, and my program does 100+ searches). It's super slow/inefficient, and I was wondering whether there is a way to efficiently add all the data using pandas or Beautiful Soup instead of iterating through each line/column.
url = "https://claimittexas.org/app/claim-search"
rows = driver.find_elements_by_xpath("//tbody/tr")
try:
    for row in rows[1:]:
        row_array = []
        # print(row.text)  # prints the whole row
        for col in row.find_elements_by_xpath('td')[1:]:
            row_array.append(col.text.strip())
        table_array.append(row_array)
    df = pd.DataFrame(table_array)
    df.to_csv('my_csv.csv', mode='a', header=False)
except:
    print(letters + "no table exists")
EDIT: I tried to scrape using Beautiful Soup, something I attempted earlier in the week and posted about, but I can't seem to access the table without using Selenium.
In the BS version I put a bunch of print statements in to see what was wrong, and it turns out the rows value is just an empty list:
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
for row in rows[1:]:
    cells = row.find_all('td')
    for cell in cells[1:]:
        print(cell.get_text())
Use this line in the BS4 implementation:
rows = soup.find('table').find('tbody').find_all('tr')[1:]
instead of
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
In the original, ('tr')[1:] slices the string 'tr' down to 'r', so find_all searches for <r> tags and returns an empty list.
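Since the question also asks about pandas: a minimal sketch using pandas.read_html on the Selenium page source, which parses every <table> in one call (it assumes lxml or html5lib is installed and that the results table is the first one on the page):

import pandas as pd

# Parse every <table> in the rendered page source in one call.
tables = pd.read_html(driver.page_source)
df = tables[0]  # assumption: the claim-search results are the first table
df.to_csv('my_csv.csv', mode='a', header=False, index=False)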

How to find the element within element in selenium

I am creating a framework for data validation using Selenium. The issue I am struggling with is that I want to locate the td elements (HTML tag) within a tr element (HTML tag). This is the code I have written:
Iterator<WebElement> i = rows.iterator();
while (i.hasNext()) {
    WebElement row = i.next(); // advance the iterator to the current row
    List<WebElement> columns = row.findElements(By.tagName("td"));
    for (WebElement s : columns) {
        System.out.println("columnDetails : " + s.getText());
    }
    if (columns.isEmpty()) {
        ElementNotFoundException e = new ElementNotFoundException("No data in table");
        throw e;
    }
    Iterator<WebElement> j = columns.iterator(); // does some other work
    ClusterData c = new ClusterData(); // does some other work
    ClusterDataInitializer.initUI(c, j, lheaders); // does some other work
    CUIData.put(c.getCN(), c); // does some other work
}
Now the issue with this is:
I am trying to fetch the data from the rows (see table data) into an ArrayList and use that ArrayList further. Currently what happens is that the column-header data is fetched first, which is of no use to me; I only want the rows' data. I am not able to determine the proper way to collect the data of the table rows only.
If the XPath of the table helps you understand it properly, here are the details:
Table header xPath of cluster name column:
/html/body/table/tbody/tr[2]/td[2]/div[2]/div/div/div[2]/div/div/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/table/tbody/tr/td[2]/div/div[2]
Table row (Table Data) xPath of test cluster 01:
/html/body/table/tbody/tr[2]/td[2]/div[2]/div/div/div[2]/div/div/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div/div/div[2]/div/div/div[3]/div[2]/div/table/tbody/tr/td[2]/div/div/a
Please let me know if you need anything else.
I am using the following code to extract row data from the table:
List<WebElement> rows = getElement(driver, sBy, "table_div_id").findElements(By.tagName("tr"));
where sBy = By.id and table_div_id is the id of the div in which the table is present. This extracts all the rows into an ArrayList, and then I am using code to extract the row data into another ArrayList. That is where I am stuck.
Each row of the table is in its own table tag, so the following did not work:
List<WebElement> rows = driver.findElements(By.xpath("//div[@id = 'table_div_id']//tr"));
List<WebElement> columns = row.findElements(By.xpath("./td"));
nor did the approach I used for the previous release of the product, i.e.
List<WebElement> columns = row.findElements(By.tagName("td"));
So I used the following approach, which enabled me to capture all of the visible rows of the table:
List<WebElement> columns = row.findElements(By.xpath(".//table[@class='gridxRowTable']/tbody/tr"));
But after that I faced another issue: since this table was implemented using dojo, scrolling was impossible and Selenium could only capture the visible rows. To overcome this I zoomed out in the browser using Selenium, and this is how I achieved my goal of getting the data. I believe others might have provided an answer if I had shared some more details; sorry about that, and I hope my answer helps you all.
instead of
List<WebElement> columns = row.findElements(By.tagName("td"));
try using
List<WebElement> columns = row.findElements(By.xpath("./td"));
Check if this helps. This should give you the td elements. If I have not understood your issue, let me know.
You can use this approach:
driver.findElement(By.xpath("//table[@id=\"table1\"]/tbody/tr[2]/td[1]"));
Regards,
Anuja
Do you have Selenium IDE installed? Perform a storeText operation on the row you want to retrieve; the XPath will then be populated in the IDE. There will be multiple XPaths; the most reliable is xpath:position, so use that to capture your rows.
And use Firebug for better visibility of your AUT.
Firebug and Selenium IDE are the most basic components of Selenium framework development.
You can manipulate the XPath as you want.