I'm trying to scrape a page. Everything was going fine until I noticed there is a "pseudo element" (::before); let me show you.
By the way, the inspector is pointing just above the "Magnifier Glass".
So the problem comes with detalle_causa = driver.find_element(By.CSS_SELECTOR,'i.fa.fa-search.fa-lg').click()
if t_rol == rol_yr:
    rol_c0.append(t_rol)
    materias_e_c0.append(t_tiporecurso)
    carat_c0.append(t_caratulado)
    fecha_c0.append(t_fecha)
    estado_c0.append(t_estado)
    data = {"ROL": rol_c0, "TIPO RECURSO": materias_e_c0, "CARATULADO": carat_c0, "FECHA": fecha_c0, "ESTADO": estado_c0}
    sleep(3)
    df = pd.DataFrame(data)
    print(df)
    sleep(3)
    detalle_causa = driver.find_element(By.CSS_SELECTOR,'i.fa.fa-search.fa-lg').click()
    print(detalle_causa)
    df.to_csv('2022_cs_p2.csv', index=False, encoding='utf-8')
    sleep(2)
else:
    print("No era")
EDIT: I need to get access to this.
If you could give me a hand here, I would really appreciate it.
detalle_causa = driver.find_element(By.CSS_SELECTOR,'i.fa.fa-search.fa-lg')
detalle_causa.click()
print(detalle_causa)
You are trying to store a click: .click() returns None, so your detalle_causa ends up holding None. Find the element and store it first, then call .click() on it, as in the snippet above.
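If the icon is not yet clickable when the script reaches it, an explicit wait can also help. A minimal sketch, assuming the standard Selenium WebDriverWait/expected_conditions helpers and the selector from your question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the magnifier icon to become clickable, then click it.
detalle_causa = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, 'i.fa.fa-search.fa-lg'))
)
detalle_causa.click()
print(detalle_causa)  # a WebElement, not None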
I wrote a script to scrape data from SubGraph APIs. It simply clicks the run button and gets some output code. The problem is that I have to scroll to the end of the page to get the full output code; otherwise it gets cut off. This is the way I tried:
def find_datasets():
    datasets_url = []
    s = Service(ChromeDriverManager().install())
    options = Options()
    options.headless = False
    options.add_argument('window-size=800,600')
    driver = webdriver.Chrome(service=s, options=options)
    driver.get("https://v4.subgraph.polygon.oceanprotocol.com/subgraphs/name/oceanprotocol/ocean-subgraph/graphql?query=%7B%0A%20%20pools(orderBy%3A%20createdTimestamp%2C%20orderDirection%3A%20desc)%20%7B%0A%20%20%20%20id%0A%20%20%20%20datatoken%20%7B%0A%20%20%20%20%20%20address%0A%20%20%20%20%7D%0A%20%20%20%20publishMarketSwapFee%0A%20%20%20%20liquidityProviderSwapFee%0A%20%20%7D%0A%7D%0A")
    sleep(15)
    driver.find_element(by=By.XPATH, value="//button[contains(@class, 'execute-button')]").click()
    sleep(8)
    element = driver.find_elements(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    print(driver.find_element(by=By.XPATH, value="//div[contains(@class, 'result-window')]").text)
    driver.get_screenshot_as_file("screenshot.png")
What am I missing? Thank you for your patience.
Have you tried the ActionChains class?
Example:
menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")
actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()
Or, I think, you could simply try
driver.execute_script("arguments[0].scrollIntoView();", element)
Before acting on the target element, you need to invoke scrollIntoView() first.
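Putting both suggestions together for the SubGraph page, a rough sketch (reusing the XPaths from your own code; untested against the live page):

# Scroll the editor area into view first, then read the result window text.
element = driver.find_element(by=By.XPATH, value="//div[contains(@class, 'CodeMirror-lines')]")
driver.execute_script("arguments[0].scrollIntoView(true);", element)
print(driver.find_element(by=By.XPATH, value="//div[contains(@class, 'result-window')]").text)

Note that find_element (singular) returns a single WebElement, which is what execute_script expects as arguments[0]; passing the list returned by find_elements will not work.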
I'm trying to scrape data that only appears on mouseover (Selenium). It's a concert map, and this is my entire code. I keep getting TypeError: 'ActionChains' object is not iterable.
The idea would be to hover over the whole map and scrape the code whenever the HTML changes. I'm pretty sure I need two for loops for that, but I don't know yet how to combine them. Also, I know I'll have to use bs4; could someone share ideas on how I could go about this?
driver = webdriver.Chrome()
driver.maximize_window()
driver.get('https://www.ticketcorner.ch/event/simple-minds-hallenstadion-12035377/')
#accept shadow-root cookie banner
time.sleep(5)
driver.execute_script('return document.querySelector("#cmpwrapper").shadowRoot.querySelector("#cmpbntyestxt")').click()
time.sleep(5)
# Click on the saalplan so we get to the concert map
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#tickets > div.seat-switch-wrapper.js-tab-switch-group > a:nth-child(3) > div > div.seat-switch-radio.styled-checkbox.theme-switch-bg.theme-text-color.theme-link-color-hover.theme-switch-border > label'))).click()
time.sleep(5)
#Scroll to the concert map, which will be the element to hover over
element_map = driver.find_element(By.CLASS_NAME, 'js-tickettype-cta')
actions = ActionChains(driver)
actions.move_to_element(element_map).perform()
# Close the drop-down which is partially hiding the concert map
driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[1]/div/div/div[1]/div').click()
# Mouse Hover over the concert map and find the empty seats to extract the data.
actions = ActionChains(driver)
data = actions.move_to_element_with_offset(driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[2]/canvas'),0,0)
for i in data:
    actions.move_by_offset(50, 50).perform()
    time.sleep(2)
    # print content of each box
    hover_data = driver.find_element(By.XPATH, '//*[@id="tooltipster-533522"]/div[1]').get_attribute('tooltipster-content')
    print(hover_data)
# The code I would use to hover over the element
#actions.move_by_offset(100, 50).perform()
# time.sleep(5)
# actions.move_by_offset(150, 50).perform()
# time.sleep(5)
# actions.move_by_offset(200, 50).perform()
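A rough sketch of the nested-loop hover idea described above; the offsets, step sizes, and the tooltip selector are placeholders rather than values taken from the page:

# Sweep the canvas in a coarse grid and read whatever tooltip is visible at each step.
canvas = driver.find_element(By.XPATH, '//*[@id="seatmap-tab"]//canvas')
for x in range(0, 400, 50):
    for y in range(0, 300, 50):
        ActionChains(driver).move_to_element_with_offset(canvas, x, y).perform()
        time.sleep(1)
        for tip in driver.find_elements(By.CSS_SELECTOR, 'div[id^="tooltipster-"]'):
            print(tip.text)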
When I run this code I got from here, nothing happens:
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/Machine_learning')
soup = bs4.BeautifulSoup(res.text, 'lxml')
foo = soup.select('.mw-headline')
for i in soup.select('.mw-header'):
    print(i.text)
Everything is installed (lxml, requests, bs4).
I cannot continue the tutorial if I'm stuck here.
soup.select('.mw-header') returns [], an empty list, because .mw-header cannot be found on the source page. The class you already selected into foo, .mw-headline, is the one you should loop over.
I recommend using a Jupyter notebook; you can inspect the intermediate results visually as you go.
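For reference, a minimal corrected loop, reusing the .mw-headline selection that is already stored in foo (assuming the page still uses that class, as in the tutorial):

import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/Machine_learning')
soup = bs4.BeautifulSoup(res.text, 'lxml')
# Iterate over the headline spans that actually exist in the markup.
for headline in soup.select('.mw-headline'):
    print(headline.text)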
I'm trying to extract a keyword/string from a website's source code using this python 2.7 script:
from selenium import webdriver
keyword = ['googleadservices']
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
driver.get('https://www.vacatures.nl/')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')
Fortunately the browser opens when the script runs, but I'm not able to extract the desired keywords from its source code. Any help?
As others have said, the issue isn't your code; googleadservices simply isn't present in the source code.
What I want to add is that your code is a bit over-engineered, since all you seem to do is check whether a certain string is present in the source code.
You can achieve that much more easily with a better XPath like //script[contains(text(),'googletagmanager')], then use find_element_by_xpath and catch the possible NoSuchElementException. That saves you time, and you don't need the for loop.
There are other possibilities as well, such as using ExpectedConditions, or find_elements_by_xpath and then checking whether the returned list is non-empty.
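A small sketch of that idea, using the keyword from the question and Selenium's NoSuchElementException (the XPath is adapted from the example above):

from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath("//script[contains(text(),'googleadservices')]")
    print ('googleadservices', 'found')
except NoSuchElementException:
    print ('googleadservices', 'not found')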
I observed that googleadservices is NOT present in the web page source code.
There is NO issue with the code.
I tried with GoogleAnalyticsObject, and it is found.
from selenium import webdriver
keyword = ['googleadservices', 'GoogleAnalyticsObject']
driver = webdriver.Chrome()
driver.get('https://www.vacatures.nl/')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')
Instead of using //* to find the source code
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
Use the following code:
source_code = driver.page_source
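With page_source, the keyword check from the question reduces to something like this sketch:

# keyword is the list defined in the question; page_source is the rendered HTML.
source_code = driver.page_source
for searchstring in keyword:
    if searchstring.lower() in source_code.lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')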
Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re

class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        headings = []
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for td in enumerate(tds):
                if headings[td[0]].lower() == 'comp.':
                    content = None
                elif headings[td[0]].lower() == 'view':
                    content = None
                elif headings[td[0]].lower() == 'name':
                    content = td[1].xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td[1].xpath('span/text()').extract()[0]
                    except:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the url in a browser with JavaScript disabled, you won't be able to move to the next page. As you can see inside the li tag, there is some JavaScript that has to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request generated by that JavaScript. In your case it should be easy: just analyze the JavaScript code and replicate the request with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package with JavaScript/browser emulation or something similar, such as ScrapyJS or Scrapy + Selenium.
You're going to need to perform a callback. Generate the URL from the XPath of the 'next page' button, so url = response.xpath(xpath to next_page_button), and then, when you're finished scraping that page, yield scrapy.Request(url, callback=self.parse_next_page). Finally, create a new function, def parse_next_page(self, response):.
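A minimal sketch of that callback pattern; the next-page XPath is a placeholder, and since the real link here is driven by JavaScript (see above), this only illustrates the yield/callback structure:

def parse(self, response):
    # ... extract the rows of the current page as above ...
    next_url = response.xpath('//a[contains(@class, "fa-angle-right")]/@href').extract_first()
    if next_url:
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse_next_page)

def parse_next_page(self, response):
    # Reuse the same extraction logic as parse(), or delegate back to it.
    return self.parse(response)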
A final note: if the content happens to be rendered by JavaScript (and you can't scrape it even though you're sure you're using the correct XPath), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape