BeautifulSoup: How to get only one part of <p> output? - beautifulsoup

I am trying to work out how to get only the price, without the surrounding text, from the paragraph.
from bs4 import BeautifulSoup
import requests
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
p = requests.get(url = 'https://www.tia-mobiteli.hr/detaljan-prikaz.aspx?gid=11-appise_64wheu', headers = headers)
soup = BeautifulSoup(p.content,'lxml')
price = soup.find('div', class_='widget widget-info widget-price').p.text
price2 = price.strip()
print(price2)
My output is:
Naša najniža cijena za gotovinsko/virmansko plaćanje: 3.649,00 kn
I want to get only:
3.649,00 kn
Or if it is possible:
3649.00

The price is inside a <b> tag:
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}
p = requests.get(
    url="https://www.tia-mobiteli.hr/detaljan-prikaz.aspx?gid=11-appise_64wheu",
    headers=headers,
)
soup = BeautifulSoup(p.content, "lxml")
price = soup.find("div", class_="widget widget-info widget-price").b.text
price = float(price.split()[0].replace(".", "").replace(",", "."))
print(price)
Prints:
3649.0
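If the price ever stops being wrapped in its own <b> tag, a regular expression over the stripped paragraph text is another option. This is only a sketch, assuming the amount keeps the Croatian format and the trailing "kn":
import re

# price2 holds the stripped paragraph text from the question, e.g.
# "Naša najniža cijena za gotovinsko/virmansko plaćanje: 3.649,00 kn"
match = re.search(r"([\d.]+,\d{2})\s*kn", price2)
if match:
    # "3.649,00" -> 3649.0 (thousands dot removed, decimal comma replaced)
    print(float(match.group(1).replace(".", "").replace(",", ".")))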

You can use the parse module, which acts like a reverse format().
Usage:
import parse
...
float(parse.parse('Naša najniža cijena za gotovinsko/virmansko plaćanje: {} kn',price2)[0].replace('.','').replace(',','.'))
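Put together, a minimal runnable sketch (assuming price2 is the stripped paragraph text from the question, and that the parse package is installed with pip install parse):
import parse

price2 = "Naša najniža cijena za gotovinsko/virmansko plaćanje: 3.649,00 kn"
result = parse.parse("Naša najniža cijena za gotovinsko/virmansko plaćanje: {} kn", price2)
if result is not None:
    # result[0] is "3.649,00"; normalize it into a float
    print(float(result[0].replace(".", "").replace(",", ".")))  # 3649.0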

Related

How to change the header just for a specific request in a Scrapy spider?

I am trying to build a web crawler using Scrapy. I want to change the user agent for a single request in the spider. I tried the code below, but the user agent is not updated during the crawl.
def start_requests(self):
    request = Request(
        "url",
        callback=self.parse_search,
        meta={'xpaths': self.xpaths},
        headers={
            "User-Agent": "Googlebot-Image/1.0"
        }
    )
    return [request]
Your code works perfectly (see my code below), but some middleware on your side may be affecting your User-Agent header:
class UserAgentSpider(scrapy.Spider):
    name = 'useragent_spider'

    user_agents = [
        {'title': 'Galaxy S9', 'value': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36'},
        {'title': 'iPhone', 'value': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/69.0.3497.105 Mobile/15E148 Safari/605.1'},
        {'title': 'Edge', 'value': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'},
    ]

    def start_requests(self):
        for user_agent in self.user_agents:
            yield scrapy.Request(
                url="https://www.myip.com/",
                headers={
                    'user-agent': user_agent['value'],
                },
                cb_kwargs={
                    'user_agent': user_agent['title']
                },
                callback=self.parse,
                dont_filter=True,
            )

    def parse(self, response, user_agent):
        with open(f"Samples/{user_agent}.htm", 'wb') as f:
            f.write(response.body)
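To check whether a middleware is rewriting the header, you can inspect the request attached to the response inside the callback; response.request reflects the request after all downloader middlewares have run. A small sketch (a variant of the parse method above):
def parse(self, response, user_agent):
    # the User-Agent that was actually sent for this response
    sent_ua = response.request.headers.get('User-Agent')
    self.logger.info("Sent User-Agent: %s", sent_ua)
    with open(f"Samples/{user_agent}.htm", 'wb') as f:
        f.write(response.body)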

I send a POST request with Scrapy and the response is 'too frequently', but when I send the same request with Postman the response is what I want

This is the code from my Scrapy spider. I also send the same request with Postman, and no matter how many times I send it, I receive the data I want. But when I send it with Scrapy, the response is always 'too frequently, forbid visit'. There may be many causes, but I want to know what the possible causes are.
class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false']

    def start_requests(self):
        yield FormRequest(
            self.start_urls[0],
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)
You need to show the website that you are an actual user, not a bot. Try sending a user agent in the headers:
yield FormRequest(
    url=self.start_urls[0],
    callback=self.parse,
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
    }
)
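If every request from the spider should carry the same user agent, a sketch of setting it once instead of per request (either project-wide in settings.py or per spider via custom_settings):
# settings.py (applies to the whole project)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'

# or per spider
class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
    }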

Scrapy | How to get a response from a request without urllib?

I believe there is a better way to get the response using scrapy.Request than what I do now:
...
import urllib.request
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
...
class MatchResultsSpider(scrapy.Spider):
    name = 'match_results'
    allowed_domains = ['site.com']
    start_urls = ['url.com']

    def get_detail_page_data(self, detail_url):
        req = urllib.request.Request(
            detail_url,
            data=None,
            headers={
                'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
                'accept': 'application/json, text/javascript, */*; q=0.01',
                'referer': 'site.com',
            }
        )
        page = urllib.request.urlopen(req)
        response = HtmlResponse(url=detail_url, body=page.read())
        target = Selector(response=response)
        return target.xpath('//dd[@data-first_name]/text()').extract_first()
I get all the information inside the parse function, but in one place I need to get a little piece of data from a detail page.
# Lineups
lineup_team_tables = lineups_container.xpath('.//tbody')
for i, table in enumerate(lineup_team_tables):
    # lineup players
    line_up = []
    lineup_players = table.xpath('./tr[not(contains(string(), "Coach"))]')
    for lineup_player in lineup_players:
        line_up_entries = {}
        lineup_player_url = lineup_player.xpath('.//a/@href').extract_first()
        line_up_entries['player_id'] = get_id(lineup_player_url)
        line_up_entries['jersey_num'] = lineup_player.xpath('./td[@class="shirtnumber"]/text()').extract_first()
        abs_lineup_player_url = response.urljoin(lineup_player_url)
        line_up_entries['position_id_detail'] = self.get_detail_page_data(abs_lineup_player_url)
        line_up.append(line_up_entries)
    # team_lineup['line_up'] = line_up
    self.write_to_scuard(i, 'line_up', line_up)
Can I get data from another page using scrapy.Request(detail_url, callback_func)?
Thanks for your help!
Too much extra code. Use the simple Scrapy parsing scheme:
class ********(scrapy.Spider):
    name = '*******'
    domain = '****'
    allowed_domains = ['****']
    start_urls = ['https://******']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'ACCEPT_ENCODING': 'gzip, deflate, br',
            'ACCEPT_LANGUAGE': 'en-US,en;q=0.9',
            'CONNECTION': 'keep-alive',
        }
    }
    def parse(self, response):
        # at this point you already have the response HTML for start_urls = ['https://******']
        yield scrapy.Request(url, callback=self.parse_details)
Then you can parse further (nested) and return to the parse callback:
    def parse_details(self, response):
        ************
        yield scrapy.Request(url_2, callback=self.parse)
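Applied to the question, a minimal sketch of how the detail-page lookup could be done with scrapy.Request instead of urllib, carrying the partially built entry to the next callback via cb_kwargs (Scrapy 1.7+). The method name parse_player_detail and the simplified loop are illustrative, not from the original code:
# inside MatchResultsSpider (scrapy is already imported at module level)
def parse(self, response):
    for lineup_player in response.xpath('.//tbody/tr[not(contains(string(), "Coach"))]'):
        entry = {
            'jersey_num': lineup_player.xpath('./td[@class="shirtnumber"]/text()').get(),
        }
        detail_url = response.urljoin(lineup_player.xpath('.//a/@href').get())
        # fetch the detail page asynchronously and finish the entry there
        yield scrapy.Request(
            detail_url,
            callback=self.parse_player_detail,
            cb_kwargs={'entry': entry},
        )

def parse_player_detail(self, response, entry):
    entry['position_id_detail'] = response.xpath('//dd[@data-first_name]/text()').get()
    yield entry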

Error 403 Forbidden not User-Agent

I've tried looking at previous posts on the same subject but none of the solutions seem to be working and I'd like to confirm that there is indeed nothing I can do to get around this.
I'm a journalist trying to download permit data from off the planning authority's website. I could do this no problem up till a few months ago but the website has been changed and after adapting my code to the new site, I now seem to be getting an Error 403 every time I try to follow links on the site.
Any help would be greatly appreciated.
My code (not the best looking or most efficient, but I'm self-taught and use coding mainly to scrape data for work) scrapes stats from the page: http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM
In the bit of code I have pasted below, I am trying to access each permit link (the first one on the page: http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17%27) in order to scrape the permit details.
While I can generate the link addresses without a problem (they are accessible by clicking the link), sending a request to the address returns:
b'\r\nForbidden\r\n\r\nForbidden URL\r\nHTTP Error 403. The request URL is forbidden.\r\n\r\n'
I've tried changing the User-Agent, and I've also tried putting a timer between requests, but nothing seems to have any effect.
Any suggestions would be very welcome.
My code:
import requests
import pandas as pd
import csv
from bs4 import BeautifulSoup
from datetime import date, timedelta as td
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import urllib
with requests.Session() as s:
    #s.headers.update(head)
    r = s.get("http://www.pa.org.mt", data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"})
    page = (s.get("http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM", data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
    soup = BeautifulSoup(page, 'html.parser')
    search_1 = soup.find_all('table')
    for item in search_1:
        item1 = item.find_all('tr')
        for item2 in item1:
            item3 = item2.find_all('td', class_='fieldData')
            for element in item3:
                list2.append(element.text)
                zejt_number = (len(list2)/6)
                zi = element.find_all('a')
                if len(zi) == 0 and ((len(list2)-1)%5 == 0 or len(list2) == 1):
                    case_status.append("")
                    applicant.append("")
                    architect.append("")
                    application_type.append("")
                    case_category.append("")
                    case_officer.append("")
                    case_officer2.append("")
                    date_approved.append("")
                    application_link.append("")
                elif len(zi) != 0:
                    for li in zi:
                        hyperlink = "http://www.pa.org.mt/" + li.get('href')
                        application_link.append(hyperlink)
                        print(hyperlink)
                        z = (s.get(hyperlink, data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
                        print(z)
First of all, your code is a bit messy. Is it all your code or just a part of it? For example, you are importing pandas twice. Nevertheless, the main reason this is not working is the hyperlinks you are generating:
for li in zi:
    hyperlink = "http://www.pa.org.mt/" + li.get('href')
    print(hyperlink)
The result looks like this:
http://www.pa.org.mt/../PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
This link won't work. A quick workaround is to edit the hyperlink before you make the request:
for li in zi:
    hyperlink = "http://www.pa.org.mt/" + li.get('href')
    hyperlink = hyperlink.replace('../', '')
    print(hyperlink)
    z = (s.get(hyperlink, data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
    print(z)
The hyperlinks should now look like this:
http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
and the request should pass through.
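A slightly more robust alternative (a sketch, not from the original answer) is to let urllib.parse.urljoin resolve the relative ../ in the href against the page URL instead of concatenating strings:
from urllib.parse import urljoin

base_url = "http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM"
for li in zi:
    # '../PACaseDetails?...' is resolved to 'http://www.pa.org.mt/PACaseDetails?...'
    hyperlink = urljoin(base_url, li.get('href'))
    print(hyperlink)
    z = s.get(hyperlink, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content
    print(z)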

Why is BeautifulSoup find_all not returning complete results?

I am trying to parse an Amazon search results page. I want to access the data contained in an <li> tag with <id=result_0>, <id=result_1>, <id=result_2>, etc. The find_all('li') function only returns 4 results (up to result_3), which I thought was odd, since when viewing the webpage in my browser, I see 12 results.
When I print parsed_html, I see it contains all the way to result_23. Why isn't find_all returning all 24 objects? A snippet of my code is below.
import requests
try:
    from BeautifulSoup import bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Dstripbooks&field-keywords=data+analytics'
response = requests.get(search_url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
parsed_html = bsoup(response.text)
results_tags = parsed_html.find_all('div', attrs={'id': 'atfResults'})
results_html = bsoup(str(results_tags[0]))
results_html.find_all('li')
For what it's worth, the results_tags object also only contains the 4 results, which is why I think the issue is in the find_all step rather than with the BeautifulSoup object.
If anyone can help me figure out what is happening here and how I can access all of the search results on this webpage, I will really appreciate it!!
import requests, re
try:
    from BeautifulSoup import bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = 'https://www.amazon.com/s/?url=search-%20alias%3Dstripbooks&field-keywords=data+analytics'  # delete the irrelevant part from the url
response = requests.get(search_url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"})  # add 'Accept' header
parsed_html = bsoup(response.text, 'lxml')
lis = parsed_html.find_all('li', class_='s-result-item')  # use the class to find the li tags
len(lis)
out:
25
You can access the li elements directly through their class instead of id. This will print the text from each li element:
results_tags = parsed_html.find_all('li', attrs={'class': 's-result-item'})
for r in results_tags:
    print(r.text)
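As a small follow-up, a sketch for pulling just the titles instead of the full text of each result; this assumes the result layout at the time still placed the book title in an h2 inside each li:
for r in results_tags:
    title_tag = r.find('h2')  # assumed location of the title within each result
    if title_tag is not None:
        print(title_tag.get_text(strip=True))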