I can't manage to put line breaks between the URLs I get.
The URLs are all written out in a row, when I would like to have one URL per line.
Could you help me with this problem?
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.twitch.tv/directory/game/League%20of%20Legends/clips?range=7d")
sleep(3)

i = 1
while i <= 20:
    links = driver.find_elements_by_xpath("//a[@data-a-target='preview-card-image-link']")
    driver.execute_script('arguments[0].scrollIntoView(true);', links[len(links)-1])
    print("=> i :", i)
    i += 20
    sleep(1)

links = driver.find_elements_by_xpath("//a[@data-a-target='preview-card-image-link']")
for link in links:
    print(link.get_attribute('href'))
    f = link.get_attribute('href')
    c = open('proxy_list.txt', 'a')
    c.write(f)
A few things...
Your first while loop only runs once... I'm not sure if that's intentional? You set i to 1 and loop until i > 20, but after the first pass you increment i by 20, which fails the while condition and falls out of the loop. If that's what you intend, you can drop most of that code and keep just two lines. The first block then becomes:
links = driver.find_elements_by_xpath("//a[@data-a-target='preview-card-image-link']")
driver.execute_script('arguments[0].scrollIntoView(true);', links[len(links)-1])
The second block of code loops through the returned links, gets the href of each, and writes it to file. Writing to disk is slow, so you want to minimize it as much as possible. Since you aren't writing millions of lines at once, you can build one final string containing all the hrefs inside the loop and then, once the loop is done, write that string to the file. That way you only write to disk once.
Adding the line break is as simple as appending "\n" (the newline character) to the end of each href.
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.twitch.tv/directory/game/League%20of%20Legends/clips?range=7d")
sleep(3)  # give the page time to load before looking for links

links = driver.find_elements_by_xpath("//a[@data-a-target='preview-card-image-link']")
driver.execute_script('arguments[0].scrollIntoView(true);', links[len(links)-1])

# build one string with a newline after each href, then write it in a single call
output = ""
links = driver.find_elements_by_xpath("//a[@data-a-target='preview-card-image-link']")
for link in links:
    print(link.get_attribute('href'))
    output += link.get_attribute('href') + "\n"

c = open('proxy_list.txt', 'a')
c.write(output)
c.close()
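As a side note, a with block closes the file for you even if something fails mid-write; the same write-once idea as a minimal sketch:

# same write-once idea with a context manager, so the file is closed automatically
with open('proxy_list.txt', 'a') as c:
    c.write("".join(link.get_attribute('href') + "\n" for link in links))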
Related
Right now my code is set up to scrape through all the last names that start with 'Z'
query = ['Z']

for letter in query:
    url = "https://hsba.org/HSBA_2020/For_the_Public/Find_a_Lawyer/HSBA_2020/Public/Find_a_Lawyer.aspx"
    driver.get(url)
    # input from query
    element = driver.find_element(By.CSS_SELECTOR, '#txtDirectorySearchLastName')
    element.send_keys(letter)
I need it to loop through the whole alphabet and get all the text info on each page.
I tried searching for help and am not sure where to start...
(Disclaimer: very new to coding)
import string

for alph in string.ascii_lowercase:
    url = "https://hsba.org/HSBA_2020/For_the_Public/Find_a_Lawyer/HSBA_2020/Public/Find_a_Lawyer.aspx"
    driver.get(url)
    # input from query
    element = driver.find_element(By.CSS_SELECTOR, '#txtDirectorySearchLastName')
    element.send_keys(alph)
If you need to loop over all the letters, just get them from string.ascii_lowercase and iterate over them.
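To also pull the text info from each results page (the part of the question the snippet above doesn't cover), the general shape would be something like the sketch below; the crude wait and the row selector are assumptions, since the site's result markup isn't shown here (sleep comes from time, as in the other snippets in this thread):

    # hypothetical continuation inside the loop above:
    # submit the search form, then dump the text of the result rows
    element.submit()
    sleep(3)  # crude wait for the results to render
    for row in driver.find_elements(By.CSS_SELECTOR, 'table tr'):  # assumed selector
        print(row.text)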
I'm trying to print the search results of DuckDuckGo using a headless WebDriver and Selenium. However, I cannot locate the DOM elements corresponding to the search results, no matter what ID or class name I search for and no matter how long I wait for the page to load.
Here's the code:
opts = Options()
opts.headless = False
browser = Firefox(options=opts)
browser.get('https://duckduckgo.com')
search = browser.find_element_by_id('search_form_input_homepage')
search.send_keys("testing")
search.submit()
# wait for URL to change with 15 seconds timeout
WebDriverWait(browser, 15).until(EC.url_changes(browser.current_url))
print(browser.current_url)
results = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "links")))
time.sleep(10)
results = browser.find_elements_by_class_name('result results_links_deep highlight_d result--url-above-snippet') # I tried many other ID's and class names
print(results) # prints []
I'm starting to suspect there is some trickery in DuckDuckGo to prevent web scraping. Does anyone have a clue?
I changed to a CSS selector and then it worked. I use Java, not Python.
List<WebElement> elements = driver.findElements(
        By.cssSelector(".result.results_links_deep.highlight_d.result--url-above-snippet"));
System.out.println(elements.size());
// 10
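For anyone following along in Python (the language the question used), the same fix would look like this; find_elements_by_class_name accepts only a single class name, which is why the original call returned an empty list:

# compound class names must go through a CSS selector;
# find_elements_by_class_name only accepts a single class name
results = browser.find_elements_by_css_selector(
    ".result.results_links_deep.highlight_d.result--url-above-snippet")
print(len(results))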
Hi, I am currently looking to scrape a page such as "https://www.tennis24.com/match/ABiALWlt/#match-statistics;0" every time the score changes. Currently I have the ability to scrape it using Selenium and BS with the code below:
from selenium import webdriver

Chrom_path = r"C:\Users\Dan1\Desktop\chromedriver.exe"
driver = webdriver.Chrome(Chrom_path)
driver.get("https://www.tennis24.com/match/zVrM3ySQ/#match-statistics;0")

data = driver.find_elements_by_class_name("statTextGroup")
for d in data:
    sub_data = d.find_elements_by_xpath(".//*")
    assert len(sub_data) == 3
    for s_d in sub_data:
        print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))
But I have no idea how to automate it so that once the score at the top of the page (shown here as "Medical timeout 6 : 6 ( 0 : 0 )") changes, the scraper scrapes the new data. The change to monitor, though, is only visible while the match is in play, and is not always there.
If you need any more info, please let me know and I'll be happy to add it.
You can scrape the "scoreboard" class in a while loop; when its value is no longer the same as the old value, the score has changed and you can scrape the other things you wanted.
Hope it helped.
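A minimal sketch of that polling loop (the "scoreboard" class name is taken from this answer; the 5-second interval and the missing-element handling are assumptions):

from time import sleep
from selenium.common.exceptions import NoSuchElementException

old_score = None
while True:
    try:
        score = driver.find_element_by_class_name("scoreboard").text
    except NoSuchElementException:
        # the scoreboard only exists while the match is in play
        sleep(5)
        continue
    if score != old_score:
        old_score = score
        # re-run the stat scrape from the question here
        for d in driver.find_elements_by_class_name("statTextGroup"):
            print(d.text)
    sleep(5)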
I'd like to modify part of the text in a textarea with Selenium. The textarea seems almost as if it were read-only.
In this very simple example using a sample algo, it would be great to be able to change the stock id on this line:
context.aapl = sid(24)
... to something like:
context.aapl = sid(39840)
... which is the Tesla stock id. The variable name will no longer make sense, but that doesn't matter; it's just a start.
This Selenium code, for me, is able to open the sample with no login required.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
t = webdriver.Firefox() # t stands for tab as in browser tab in my mind
t.implicitly_wait(10)
t.get('https://www.quantopian.com/algorithms/')
o = t.find_element_by_xpath("//body") # o stands for object
o.send_keys(Keys.ESCAPE) # clearing the popup
o = t.find_element_by_link_text("Hello World Algorithm")
o.click()
''' for the fun of it if you want to run the backtest
o = t.find_element_by_xpath('//body')
o.send_keys(Keys.CONTROL + 'b')
o.send_keys(Keys.ESCAPE)
'''
print t.find_element_by_id('code-area').text
Here's the output from that
1
# Called once at the start of the simulation.
2
def initialize(context):
3
# Reference to the AAPL security.
4
context.aapl = sid(24)
5
6
# Rebalance every day, one hour and a half after market open.
7
schedule_function(my_rebalance,
8
date_rules.every_day(),
9
time_rules.market_open(hours=1, minutes=30))
10
11
# This function was scheduled to run once per day at 11AM ET.
12
def my_rebalance(context, data):
13
14
# Take a 100% long position in AAPL. Readjusts each day to
15
# account for price fluctuations.
16
if data.can_trade(context.aapl):
17
order_target_percent(context.aapl, 1.00)
That id is 'code-area'. The content includes margin numbers which might be a problem.
Next nested area is 'code-area-internal', seems the same.
Followed by these two.
<div class='ide-container' id='ide-container'>
<textarea class='width_100pct' id='codebox'>
In trying to obtain the content of the algorithm with 'codebox', content doesn't appear to be present, just u'' ...
>>> p = t.find_element_by_id('codebox').text
>>> p
u''
An attempt to do CTRL-A on it results in this exception...
>>> o = t.find_element_by_id('codebox')
>>> o.send_keys(Keys.CONTROL + 'a')
ElementNotInteractableException: Message: Element is not reachable by keyboard
If the text can be completely cut, then the replace can be done in Python and the result pasted back; that would be fine.
I wouldn't expect Selenium to be able to find and replace text; I'm just surprised it finds a visible area meant for user input to be off limits to interactivity.
That textarea does have its own Find, and I'm hoping I won't have to resort to using that as a workaround.
(The environment is an online IDE for stock market algorithms called Quantopian)
This is the one other thing I tried, with no apparent effect:
>>> t.execute_script("arguments[0].value = arguments[1]", t.find_element_by_id("ide-container"), "_new_")
Appreciate any pointers.
The textarea has a style="display: none" attribute, which means you cannot get its content with the text property. In this case you can use:
p = t.find_element_by_id('codebox').get_attribute("textContent")
To set a new value on the code field you can use:
field = driver.find_element_by_css_selector('div[role="presentation"]')
driver.execute_script("arguments[0].textContent = 'New value';", field)
But note that initially each code line in the code field is displayed as a separate div node with specific values and styles. So to make the new value look exactly like code (with the same formatting), you can prepare an HTML sample, e.g.
value = """<div style="position: relative;"><div class="CodeMirror-gutter-wrapper" style="left: -48px;"><div class="CodeMirror-linenumber CodeMirror-gutter-elt" style="left: 15px; width: 21px;">1</div></div><pre class=" CodeMirror-line " role="presentation"><span role="presentation" style="padding-right: 0.1px;"><span class="cm-comment"># Comment for new code.</span></span></pre></div>"""
and do
driver.execute_script("arguments[0].innerHTML = arguments[1];", field, value)
The content of the algorithm in codebox which you are trying to extract has its style attribute set to display: none;. So to extract the text you can use the following lines of code:
p = t.find_element_by_xpath("//div[@class='ide-container']/textarea[@id='codebox']")
t.execute_script("arguments[0].removeAttribute('style')", p)
print(p.get_attribute("innerHTML"))
I am trying to scrape web data from this website, and the only way I was able to access the data was by iterating through the rows of the table, adding them to a list (then adding them to a pandas data frame / writing to a CSV), then clicking to the next page and repeating the process (there are about 50 pages per search, and my program does 100+ searches). It's super slow/inefficient, and I was wondering if there is a way to efficiently add all the data using pandas or Beautiful Soup instead of iterating through each line/column.
url = "https://claimittexas.org/app/claim-search"
rows = driver.find_elements_by_xpath("//tbody/tr")
try:
for row in rows[1:]:
row_array = []
#print(row.text) # prints the whole row
for col in row.find_elements_by_xpath('td')[1:]:
row_array.append(col.text.strip())
table_array.append(row_array)
df = pd.DataFrame(table_array)
df.to_csv('my_csv.csv', mode='a', header=False)
except:
print(letters + "no table exists")
EDIT: I tried to scrape using Beautiful Soup, something I tried earlier in the week and posted about, but I can't seem to access the table without using Selenium.
With the BS version, I put a bunch of print statements in to see what was wrong, and the rows value is just an empty list:
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
for row in rows[1:]:
    cells = row.find_all('td')
    for cell in cells[1:]:
        print(cell.get_text())
Use this line in the BS4 implementation:
rows = soup.find('table').find('tbody').find_all('tr')[1:]
instead of
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
In the original line, ('tr')[1:] slices the string 'tr' down to 'r' before find_all ever runs, so BeautifulSoup searches for <r> tags and returns an empty list; the [1:] needs to apply to the list that find_all returns.
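As for the original efficiency question, pandas can also parse the whole rendered table in one shot instead of walking rows and cells; a minimal sketch, assuming the claim-search results are the first table on the page and that lxml is installed:

import pandas as pd

# parse every <table> in the rendered page at once; returns a list of DataFrames
tables = pd.read_html(driver.page_source)
df = tables[0]  # assumed: the results are the first table
df.to_csv('my_csv.csv', mode='a', header=False)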