When I run this code, which I got from here, nothing happens:
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/Machine_learning')
soup = bs4.BeautifulSoup(res.text, 'lxml')
foo = soup.select('.mw-headline')
for i in soup.select('.mw-header'):
    print(i.text)
Everything is installed (lxml, requests, bs4).
I cannot continue the tutorial if I'm stuck here.
soup.select('.mw-header') returns [], an empty list, because the class .mw-header does not exist anywhere on the source website. The headings use the class .mw-headline, which is exactly what you already select into foo two lines earlier, so loop over that selector (or over foo) instead.
I also recommend using a Jupyter notebook; you get visual feedback for each step as you work through the tutorial.
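For completeness, a minimal corrected sketch, assuming the section headings on that page still carry the .mw-headline class:

import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/Machine_learning')
soup = bs4.BeautifulSoup(res.text, 'lxml')

# Loop over the selector that actually matches the headings
for heading in soup.select('.mw-headline'):
    print(heading.text)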
I have been trying to rotate some IPs with this piece of code. It didn't work. It still gave me my own IP. Could anyone help me check if there is anything wrong with it?
This is my code:
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
ips = ["185.199.228.220:7300", "185.199.231.45:8382"]
def rand_proxy():
    proxy = random.choice(ips)
    return proxy

def myip_now():
    chrome_options = webdriver.ChromeOptions()
    proxy = rand_proxy()
    chrome_options.add_argument(f'--proxy-server = {proxy}')
    driver = webdriver.Chrome(options = chrome_options)
    driver.get("https://myexternalip.com/raw")
    print(proxy)
    time.sleep(10)
    driver.quit()
myip_now()
What I expected was that on https://myexternalip.com/raw, opened by my bot, I would see either 185.199.228.220:7300 or 185.199.231.45:8382.
There seem to be some minor issues with the blank spaces and the single quotes. You can tweak your code a bit, removing the extra spaces and replacing the single quotes with double quotes, as follows:
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
ips = ["185.199.228.220:7300", "185.199.231.45:8382"]
def rand_proxy():
    proxy = random.choice(ips)
    return proxy

def myip_now():
    chrome_options = webdriver.ChromeOptions()
    proxy = rand_proxy()
    chrome_options.add_argument(f"--proxy-server={proxy}")
    driver = webdriver.Chrome(options=chrome_options)
    driver.get("https://myexternalip.com/raw")
    print(proxy)
    time.sleep(10)
    driver.quit()
myip_now()
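To double-check that the proxy is actually applied, you could also read the body text that https://myexternalip.com/raw returns and compare it with the proxy host. A minimal sketch, assuming a Selenium 4 style locator (the By import is already in your script):

from selenium.webdriver.common.by import By

def reported_ip(driver):
    # myexternalip.com/raw returns nothing but the caller's IP as plain text
    return driver.find_element(By.TAG_NAME, "body").text.strip()

# inside myip_now(), after driver.get(...):
# print("proxy host:", proxy.split(":")[0], "| IP seen by the site:", reported_ip(driver))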
Reference
You can find a couple of relevant detailed discussions in:
How to rotate Selenium webrowser IP address
f-strings in Python
Python 3's f-Strings: An Improved String Formatting Syntax (Guide)
I'm trying to extract a keyword/string from a website's source code using this Python 2.7 script:
from selenium import webdriver
keyword = ['googleadservices']
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
driver.get('https://www.vacatures.nl/')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')
The browser fortunately opens when the script is running, but I'm not able to extract the desired keywords from its source code. Any help?
As others have said, the issue isn't your code but simply that googleadservices isn't present in the source code.
What I want to add is that your code is a bit over-engineered, since all you seem to do is report whether a certain string is present in the source code.
You can achieve that much more easily with a better XPath such as //script[contains(text(),'googletagmanager')], then use find_element_by_xpath and catch the possible NoSuchElementException. That saves time and you don't need the for loop.
There are other possibilities as well, such as using ExpectedConditions, or find_elements_by_xpath and then checking whether the returned list is empty.
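A minimal sketch of both approaches, using the same legacy find_element_by_xpath API as the question (the keyword googletagmanager is just an illustrative string to search for):

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://www.vacatures.nl/')

# Option 1: single lookup, catch the exception when the keyword is absent
try:
    driver.find_element_by_xpath("//script[contains(text(),'googletagmanager')]")
    print('googletagmanager found')
except NoSuchElementException:
    print('googletagmanager not found')

# Option 2: find_elements_* returns a list, so just check whether it is empty
matches = driver.find_elements_by_xpath("//script[contains(text(),'googletagmanager')]")
print('googletagmanager found' if matches else 'googletagmanager not found')

driver.quit()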
I observed that googleadservices is NOT present in the web page source code.
There is NO issue with the code.
I tried with GoogleAnalyticsObject, and it is found.
from selenium import webdriver
keyword = ['googleadservices', 'GoogleAnalyticsObject']
driver = webdriver.Chrome()
driver.get('https://www.vacatures.nl/')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')
Instead of using //* to find the source code
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
Use the following code:
source_code = driver.page_source
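Putting both suggestions together, a minimal sketch of the simplified script (same URL and keywords as above):

from selenium import webdriver

keyword = ['googleadservices', 'GoogleAnalyticsObject']

driver = webdriver.Chrome()
driver.get('https://www.vacatures.nl/')

# page_source already holds the full HTML, so no element lookup is needed
source_code = driver.page_source.lower()

for searchstring in keyword:
    if searchstring.lower() in source_code:
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')

driver.quit()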
Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re
class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        headings = []
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for td in enumerate(tds):
                if headings[td[0]].lower() == 'comp.':
                    content = None
                elif headings[td[0]].lower() == 'view':
                    content = None
                elif headings[td[0]].lower() == 'name':
                    content = td[1].xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td[1].xpath('span/text()').extract()[0]
                    except:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the URL in a browser with JavaScript disabled, you won't be able to move to the next page. As you can see inside the li tag, there is some JavaScript to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request generated by that JavaScript. In your case it should be easy: analyze the JavaScript code and replicate the request with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package that provides JavaScript/browser emulation, something like ScrapyJS (Splash) or Scrapy + Selenium.
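A minimal sketch of that first option, assuming the pagination is a PrimeFaces AJAX POST back to the same URL; the extra form fields below are illustrative guesses based on the onclick snippet, so check the real request in your browser's network tab first:

import scrapy

class SportstatsSpider(scrapy.Spider):
    name = 'sportstats_pages'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        # ... scrape the rows on the current page here ...

        # Replicate the AJAX request the "next page" arrow triggers.
        # from_response() keeps the hidden JSF fields (e.g. the ViewState) and
        # assumes the pagination form is the first form on the page.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'javax.faces.source': 'mainForm:j_idt369',  # guessed from the onclick code
                'javax.faces.partial.ajax': 'true',         # guessed; verify in dev tools
            },
            callback=self.parse,
        )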
You're going to need to perform a callback. Generate the URL from the XPath of the 'next page' button, so url = response.xpath(xpath_to_next_page_button), and then, when you're finished scraping that page, do yield scrapy.Request(url, callback=self.parse_next_page). Finally, create a new function called def parse_next_page(self, response):.
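A minimal sketch of that callback pattern (the XPath below is a placeholder; on this particular site the arrow is JavaScript-driven, so a plain href may not exist and the approach in the previous answer may be needed instead):

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        # ... scrape the current page here ...

        # Placeholder XPath: point it at whatever anchor holds the next-page link
        next_href = response.xpath('//a[contains(@class, "fa-angle-right")]/@href').get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse_next_page)

    def parse_next_page(self, response):
        # Handle the following page here (or simply reuse self.parse)
        pass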
A final note: if the content happens to be rendered by JavaScript (and you can't scrape it even when you're sure you're using the correct XPath), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape
I want to get the text in the span. I have checked it, but I don't see the problem
from bs4 import BeautifulSoup
import urllib.request
import socket
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html)
print(soup.findAll('span',attrs={'class': 'b'}))
The result was [], why?
Looking at the site in question, your search turns up an empty list because there are no spans with a class value of b. BeautifulSoup does not apply the CSS cascade the way a browser would. In addition, your urllib request looks incorrect. Looking at the site, I think you want to grab all the spans with a class of label, though it's hard to tell when the site isn't in my native language. Here is how you would go about it:
from bs4 import BeautifulSoup
import urllib2 # Note urllib2
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib2.urlopen(searchurl) # Note no need for request
html = f.read()
soup = BeautifulSoup(html)
for s in soup.findAll('span', attrs={"class":"label"}):
    print s.text
This gives, for the URL listed:
Farbe:
Kraftstoffverbr. komb.:
Kraftstoffverbr. innerorts:
Kraftstoffverbr. außerorts:
CO²-Emissionen komb.:
Zugr.-lgd. Treibstoffart:
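Since the question uses Python 3's urllib.request, here is the same idea as a Python 3 sketch, assuming the spans still carry the label class on the current version of the site:

from bs4 import BeautifulSoup
import urllib.request

searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"

html = urllib.request.urlopen(searchurl).read()
soup = BeautifulSoup(html, 'html.parser')  # naming the parser avoids the BeautifulSoup warning

for s in soup.find_all('span', attrs={'class': 'label'}):
    print(s.text)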
I get an error when I try to mix mechanize and BeautifulSoup in the following code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
import mechanize
br=mechanize.Browser()
br.set_handle_robots(True)
br.open('http://tel.search.ch/')
br.select_form(nr=0)
br.form["was"] = "siemens"
br.submit()
content = br.response
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
    if re.findall('title', a['href']):
        print "URL:", a['href']
br.close()
The code from the beginning up to br.submit() works fine with mechanize, and the for loop with BeautifulSoup does too. But I don't know how to pass the result of br.submit() into BeautifulSoup. These two lines:
content = br.response
soup = BeautifulSoup(content)
are apparently wrong. I get an error for soup = BeautifulSoup(content):
TypeError: expected string or buffer
Can anyone help?
Try changing
content = br.response
to
content = br.response().read()
This way, content holds the HTML string that can be passed to BeautifulSoup.
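For reference, a minimal sketch of how the corrected lines slot into the end of the script (same mechanize session and Python 2 style as in the question):

br.submit()

# br.response() is a method: call it, then read the body to get an HTML string
content = br.response().read()
soup = BeautifulSoup(content)

for a in soup.findAll('a', href=True):
    if re.findall('title', a['href']):
        print "URL:", a['href']

br.close()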