I am following this guide to scrape movie titles from my local cinema website. I am using a Scrapy Spider with CSS selectors to get this done. Within the HTML for the site, each movie title is marked up like this:
<div class="col-md-12 movie-description">
    <h2>Minions: The Rise of Gru</h2>
    ...
Here is my code that attempts to scrape this info:
import scrapy


class CinemaSpider(scrapy.Spider):
    name = "cinema"
    allowed_domains = ["cannonvalleycinema10.com"]
    start_urls = ["https://cannonvalleycinema10.com/"]

    def parse(self, response):
        movie_names = response.css(".col-md-12.movie-description h2::text").extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }
The cinema's website is here. I have tried all sorts of different selector combinations to get the titles I'm looking for added to my JSON file, but I can't figure it out.
If it helps, this is the command I am running:
scrapy runspider .\cinema_scrape.py -o movies.json
I am in the proper directory, too.
The page is dynamically loaded, so you have to call the AJAX endpoint directly with Scrapy and parse the JSON response:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    url = 'https://cabbtheatres.intensify-solutions.com/embed/ajaxGetRepertoire'

    cookies = {
        'PHPSESSID': 'i8l12572hvd3a702d4nfj3vbg0',
    }
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        # 'Cookie': 'PHPSESSID=i8l12572hvd3a702d4nfj3vbg0',
        'Origin': 'https://cabbtheatres.intensify-solutions.com',
        'Referer': 'https://cabbtheatres.intensify-solutions.com/embed?location=3663456',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }
    data = {
        'location': '3663456',
        'date': '2022-07-30',
        'lang': 'en',
        'soon': '',
    }

    def start_requests(self):
        # POST the same form the page's JavaScript sends to the AJAX endpoint
        yield scrapy.FormRequest(
            url=self.url,
            method='POST',
            formdata=self.data,
            headers=self.headers,
            callback=self.parse_item,
        )

    def parse_item(self, response):
        detail = response.json()
        titles = detail['data']
        for name in titles:
            title = name['title']
            print(title)
Output:
Minions: The Rise of Gru
Thor Love and Thunder
DC League of Super-Pets
Elvis(2022)
Mrs. Harris Goes to Paris
Where the Crawdads Sing
Top Gun: Maverick
Nope
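Note that print only writes the titles to the console, so the original scrapy runspider ... -o movies.json run would still produce an empty file. A minimal tweak (a sketch, reusing the same 'data'/'title' keys the endpoint returns above) is to yield each title as an item so the feed export captures it:

    def parse_item(self, response):
        detail = response.json()
        # yield items instead of printing so `scrapy runspider ... -o movies.json` collects them
        for movie in detail['data']:
            yield {'name': movie['title']}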
For research purposes, we analyzed and extracted large-scale traffic data from 511nj cameras. Starting from 2020, the site adds an OTP (one-time password) to the video link, which is requested from the website API: https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd=202205281151, as observed from the network traffic. However, the website itself gets a 200 response with a valid token, while accessing the API myself gets a 401 response, which indicates an unauthorized request. What confuses me is why I can get the camera feed with the OTP through the website but can't access the API myself. Is there a solution?
Failed attempt:
requests.get('https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd='+datetime.datetime.now().strftime('%Y%m%d%H%M')).json()
2022-06-01 update: I modified the request based on @Brad's comment. However, the error message now turns into "An error has occurred". Please help!
def GetOTP():
    import requests
    import datetime

    Now = datetime.datetime.now()
    DT = Now.strftime('%Y%m%d%I%M%S')
    DT24 = Now.strftime('%Y%m%d%H%M')
    # digit-substitution map used to obfuscate the timestamp in the responsetype header
    Enigma = dict(zip(map(str, range(0, 10)), ('2', '1', '*', '3', '4', '5', '6', '7', '8', '9')))

    cookies = {
        'ReloadVersion': '112',
        '_ga': 'GA1.2.335422978.1653768841',
        '_gid': 'GA1.2.721252567.1653768841',
    }
    headers = {
        'authority': '511nj.org',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'referer': 'https://511nj.org/Scripts/Directives/HLSPlayer.html?v=V223',
        'responsetype': "".join([Enigma[key] for key in DT[4:6]])+'$'.join([Enigma[key] for key in DT[6:8]])+'$'.join([Enigma[key] for key in DT[:4]])+'#'+Enigma[DT[9]]+"!".join([Enigma[key] for key in DT[10:12]])+"!"+DT[-2:]+"#PK",
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36',
    }
    params = {
        'Id': '2',
        'rnd': DT24,
    }
    # params already carries Id and rnd, so they are not repeated in the URL
    Response = requests.get('https://511nj.org/api/client/camera/getHlsToken', headers=headers, cookies=cookies, params=params).json()
    print("".join([Enigma[key] for key in DT[4:6]])+'$'+''.join([Enigma[key] for key in DT[6:8]])+'$'+''.join([Enigma[key] for key in DT[:4]])+'#'+Enigma[DT[9]]+"!"+''.join([Enigma[key] for key in DT[10:12]])+"!"+DT[-2:]+"#PK")
    print(params['rnd'])
    print(Response)
    return Response['Data']
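One detail that stands out: the responsetype header and the debug print do not build the same string. In the header, '$'.join(...) and "!".join(...) are applied between the encoded characters of each two-character chunk, while the print inserts literal '$' and '!' separators between already-joined chunks. A small illustrative refactor, assuming the print's format is the intended one (that is an assumption, not something confirmed by the site), would build the value once and use it in both places:

    def build_responsetype(now, enigma):
        # hypothetical helper: encodes MM$DD$YYYY#h!mm!ss#PK with the digit map,
        # following the format of the debug print above (assumed, not confirmed)
        dt = now.strftime('%Y%m%d%I%M%S')

        def enc(chunk):
            return ''.join(enigma[c] for c in chunk)

        return (enc(dt[4:6]) + '$' + enc(dt[6:8]) + '$' + enc(dt[:4])
                + '#' + enigma[dt[9]] + '!' + enc(dt[10:12]) + '!' + dt[-2:] + '#PK')

Both the header and the print could then call build_responsetype(Now, Enigma), which rules out a mismatch between what is sent and what is logged.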
I am trying to scrape wind speed data for different UK weather stations using the site wunderground. I assume they have an API; I just have a hard time connecting to it.
Here's the XHR link I use:
https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
This is the data I would like: the wind speed table at the bottom of this page:
https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1
My code is pretty simple: I first define the headers, and my function get_data returns the response as JSON.
In my main I append the data to a dataframe and print it.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib
from urllib.request import urlopen

headers = {
    ':authority': 'api.weather.com',
    #':path': '/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131',
    ':scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,da;q=0.7',
    'origin': 'https://www.wunderground.com',
    #'apiKey': '6532d6454b8aa370768e63d6ba5a832e',
    'referer': 'https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'cross-site',
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}


def get_data(response):
    df = response.json()
    return df


if __name__ == "__main__":
    date = pd.datetime.now().strftime("%d-%m-%Y")
    api_key = "6532d6454b8aa370768e63d6ba5a832e"
    start_date = "20150101"
    end_date = "20150131"
    urls = [
        "https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=" + api_key + "&units=e&startDate=" + start_date + "&endDate=" + end_date
    ]

    df = pd.DataFrame()
    for url in urls:
        res = requests.get(url, headers=headers)
        data = get_data(res)
        df = df.append(data)

    print(df)
The error I get:
SSLError: HTTPSConnectionPool(host='api.weather.com', port=443): Max retries exceeded with url: /v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
(Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Update:
Even without trying to connect to the API, just scraping the page with BS4, I still get denied access. I'm not sure why, or how they can detect my scraper.
I solved it.
If I add verify=False to my requests.get() call, I manage to get around the error.
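For reference, a minimal sketch of that workaround, reusing the url and headers from the question (verify=False disables TLS certificate verification entirely, so it trades security for convenience, and requests will emit an InsecureRequestWarning unless it is silenced):

    import requests
    import urllib3

    # skipping certificate validation works around the handshake error,
    # but removes protection against man-in-the-middle interception
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    res = requests.get(url, headers=headers, verify=False)
    data = res.json()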
start_urls = ['https://www.qichacha.com/search?key=北京证大向上']
def parse(self, response):
    # start_url is a list page; company_url is a detail-page URL taken from that list page
    yield scrapy.Request(url=company_url, meta={"infos": info}, callback=self.parse_basic_info, dont_filter=True)
When I request the company_url this way, the response is 405.
But if I use
response = requests.get(company_url, headers=headers)
print(response.status_code)
print(response.text)
then the response is 200 and I can parse the HTML page. Or, with
start_urls = [company_url]

def parse(self, response):
    print(response.status)
    print(response.text)
the response is also 200. I don't know why the first request gets a 405.
When the response is 405, I print the request and it looks like this:
{'_encoding': 'utf-8', 'method': 'GET', '_url': 'https://www.qichacha.com/firm_b18bf42ee07d7961e91a0edaf1649287.html', '_body': b'', 'priority': 0, 'callback': None, 'errback': None, 'cookies': {}, 'headers': {b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20']}, 'dont_filter': False, '_meta': {'depth': 1}, 'flags': []}
What's wrong with it?
It seems that the page blocks Scrapy using the default user-agent string. Running the spider like this works for me:
scrapy runspider -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36" spider.py
Alternatively, you can set USER_AGENT in your project's settings.py. Or, use something like scrapy-fake-useragent to handle this automatically.
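If you prefer to keep the setting in the project rather than on the command line, the equivalent (a sketch of the two standard Scrapy options mentioned above; the spider name here is just illustrative) is:

    # settings.py
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"

    # or per spider, via custom_settings
    import scrapy

    class QichachaSpider(scrapy.Spider):
        name = "qichacha"
        custom_settings = {
            "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36",
        }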
This website responds with:
DEBUG: Crawled (520) <GET https://ddlfr.pw/> (referer: None)
How can I resolve this?
I am posting my code to explain:
from scrapy import Spider, FormRequest


class LoginSpider(Spider):
    name = 'ddlfr.pw'
    start_urls = ['https://ddlfr.pw/index.php?do=search']
    numero = 0

    def parse(self, response):
        return FormRequest.from_response(
            response,
            headers={'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
            formdata={'dosearch': 'Rechercher', 'story': 'musso', 'do': 'search', 'subaction': 'search', 'search_start': str(self.numero), 'full_search': '0', 'result_form': '1'},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        for title in response.xpath('//div[@class="short nl nl2"]'):
            yield {'roman': title.extract()}
Yes, that's because the website requires valid browser headers, while Scrapy by default sends headers that identify it as a bot.
Try using these headers:
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
You should then see a successful crawled status for the website.
I suggest that you monitor what your web browser does when you send the form from the web browser (Network tab of the developer tools), and try to reproduce the request with Scrapy.
In Firefox, for example, you can copy the successful request from the Network tab as a curl command, which is a clear representation of the request.
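Recent Scrapy versions can even build the request object for you from a copied curl command via Request.from_curl. A sketch under that assumption, with a placeholder curl string and an illustrative spider name:

    from scrapy import Spider, Request

    class DdlfrSearchSpider(Spider):
        name = 'ddlfr_search'

        def start_requests(self):
            # paste the real command copied from the browser's Network tab here;
            # from_curl translates its method, headers and body into a Scrapy request
            curl = "curl 'https://ddlfr.pw/index.php?do=search' -H 'user-agent: Mozilla/5.0 ...' --data 'do=search&subaction=search&story=musso'"
            yield Request.from_curl(curl, callback=self.parse_results)

        def parse_results(self, response):
            for title in response.xpath('//div[@class="short nl nl2"]'):
                yield {'roman': title.extract()}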
I believe there is a better way to get a response using scrapy.Request than what I currently do:
...
import urllib.request

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
...


class MatchResultsSpider(scrapy.Spider):
    name = 'match_results'
    allowed_domains = ['site.com']
    start_urls = ['url.com']

    def get_detail_page_data(self, detail_url):
        req = urllib.request.Request(
            detail_url,
            data=None,
            headers={
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
                'accept': 'application/json, text/javascript, */*; q=0.01',
                'referer': 'site.com',
            }
        )
        page = urllib.request.urlopen(req)
        response = HtmlResponse(url=detail_url, body=page.read())
        target = Selector(response=response)
        return target.xpath('//dd[@data-first_name]/text()').extract_first()
I get all the information inside the parse function, but in one place I need to grab a small piece of data from inside a detail page.
# Lineups
lineup_team_tables = lineups_container.xpath('.//tbody')
for i, table in enumerate(lineup_team_tables):
    # lineup players
    line_up = []
    lineup_players = table.xpath('./tr[not(contains(string(), "Coach"))]')
    for lineup_player in lineup_players:
        line_up_entries = {}
        lineup_player_url = lineup_player.xpath('.//a/@href').extract_first()
        line_up_entries['player_id'] = get_id(lineup_player_url)
        line_up_entries['jersey_num'] = lineup_player.xpath('./td[@class="shirtnumber"]/text()').extract_first()

        abs_lineup_player_url = response.urljoin(lineup_player_url)
        line_up_entries['position_id_detail'] = self.get_detail_page_data(abs_lineup_player_url)
        line_up.append(line_up_entries)

    # team_lineup['line_up'] = line_up
    self.write_to_scuard(i, 'line_up', line_up)
Can I get data from another page using scrapy.Request(detail_url, callback_func)?
Thanks for your help!
That is too much extra code. Use the usual Scrapy parsing scheme:
class ********(scrapy.Spider):
    name = '*******'
    domain = '****'
    allowed_domains = ['****']
    start_urls = ['https://******']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'ACCEPT_ENCODING': 'gzip, deflate, br',
            'ACCEPT_LANGUAGE': 'en-US,en;q=0.9',
            'CONNECTION': 'keep-alive',
        },
    }

    def parse(self, response):
        # response here already holds the HTML of start_urls
        yield scrapy.Request(url, callback=self.parse_details)
Then you can parse further (nested pages) and hand control back to the parse callback:
def parse_details(self, response):
    ************
    yield scrapy.Request(url_2, callback=self.parse)
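To carry the partially built item from the listing callback into the detail callback (the part the original question is really about), the usual Scrapy pattern is to pass it along with the request via cb_kwargs (Scrapy 1.7+; request.meta on older versions) instead of fetching the detail page synchronously with urllib. A sketch of two spider methods reusing the field names and XPaths from the question:

    def parse(self, response):
        for lineup_player in response.xpath('//tbody/tr[not(contains(string(), "Coach"))]'):
            entry = {
                'jersey_num': lineup_player.xpath('./td[@class="shirtnumber"]/text()').get(),
            }
            detail_url = response.urljoin(lineup_player.xpath('.//a/@href').get())
            # hand the partially built dict to the detail callback
            yield scrapy.Request(detail_url, callback=self.parse_player, cb_kwargs={'entry': entry})

    def parse_player(self, response, entry):
        entry['position_id_detail'] = response.xpath('//dd[@data-first_name]/text()').get()
        yield entry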