Scrapy Project Review/ Link Rules - scrapy

This is my second project, and I was wondering if someone could review it and suggest best practices for using the Scrapy framework. I also have a specific issue: not all courses are scraped from the site.
Goal: scrape information on all golf courses from the Golf Advisor website. Link: https://www.golfadvisor.com/course-directory/1-world/
Approach: I used CrawlSpider to include rules for links to explore.
Result: Only 19,821 courses out of 36,587 were scraped from the site.
Code:
import scrapy
from urllib.parse import urljoin
from collections import defaultdict
# adding rules with crawlspider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class GolfCourseSpider(CrawlSpider):
    name = 'golfadvisor'
    allowed_domains = ['golfadvisor.com']
    start_urls = ['https://www.golfadvisor.com/course-directory/1-world/']
    base_url = 'https://www.golfadvisor.com/course-directory/1-world/'
    # use rules to visit only pages with 'courses/' in the path and exclude pages with 'page=1, page=2, etc'
    # since those are duplicate links to the same course
    rules = [
        Rule(LinkExtractor(allow=('courses/'), deny=('page=')), callback='parse_filter_course', follow=True),
    ]
    def parse_filter_course(self, response):
        # checking if it is an actual course page. excluded it for final ran, didnt fully
        # exists = response.css('.CoursePageSidebar-map').get()
        # if exists:
        # the page is split in multiple sections with different amount of details specified on each.
        # I decided to use nested for loop (for section in sections, for detail in section) to retrieve data.
        about_section = response.css('.CourseAbout-information-item')
        details_section = response.css('.CourseAbout-details-item')
        rental_section = response.css('.CourseAbout-rentalsServices-item')
        practice_section = response.css('.CourseAbout-practiceInstruction-item')
        policies_section = response.css('.CourseAbout-policies-item')
        sections = [
            about_section,
            details_section,
            rental_section,
            practice_section,
            policies_section
        ]
        # created a default list dict to add new details from for loops
        dict = defaultdict(list)
        # also have details added NOT from for loop sections, but hard coded using css and xpath selectors.
        dict = {
            'link': response.url,
            'Name': response.css('.CoursePage-pageLeadHeading::text').get().strip(),
            'Review Rating': response.css('.CoursePage-stars .RatingStarItem-stars-value::text').get('').strip(),
            'Number of Reviews': response.css('.CoursePage-stars .desktop::text').get('').strip().replace(' Reviews',''),
            '% Recommend this course': response.css('.RatingRecommendation-percentValue::text').get('').strip().replace('%',''),
            'Address': response.css('.CoursePageSidebar-addressFirst::text').get('').strip(),
            'Phone Number': response.css('.CoursePageSidebar-phoneNumber::text').get('').strip(),
            # website has a redirecting link, did not figure out how to get the main during scraping process
            'Website': urljoin('https://www.golfadvisor.com/', response.css('.CoursePageSidebar-courseWebsite .Link::attr(href)').get()),
            'Latitude': response.css('.CoursePageSidebar-map::attr(data-latitude)').get('').strip(),
            'Longitude': response.css('.CoursePageSidebar-map::attr(data-longitude)').get('').strip(),
            'Description': response.css('.CourseAbout-description p::text').get('').strip(),
            # here, I was suggested to use xpath to retrieve text. should it be used for the fields above and why?
            'Food & Beverage': response.xpath('//h3[.="Food & Beverage"]/following-sibling::text()[1]').get('').strip(),
            'Available Facilities': response.xpath('//h3[.="Available Facilities"]/following-sibling::text()[1]').get('').strip(),
            # another example of using xpath for microdata
            'Country': response.xpath("(//meta[@itemprop='addressCountry'])/@content").get('')
        }
        # nested for loop I mentioned above
        for section in sections:
            for item in section:
                dict[item.css('.CourseValue-label::text').get().strip()] = item.css('.CourseValue-value::text').get('').strip()
        yield dict
For example, it discovered only two golf courses in Mexico:
Club Campestre de Tijuana
Real del Mar Golf Resort
I ran the code specifically on the pages it didn't pick up and was able to scrape those pages individually. Therefore my link extraction rules must be wrong.
This is the output file with ~20k courses: https://drive.google.com/file/d/1izg2gZ87qbmMtg4S_VKQmkzlKON3poIs/view?usp=sharing
Thank you,
Yours Data Enthusiast
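One way to test the suspicion about the link rules without crawling: LinkExtractor treats `allow` and `deny` as regular expressions searched against each absolute URL, so the same check can be reproduced with `re`. The sketch below is only an approximation of that filtering, and the three candidate URLs are made up for illustration (only the start URL and the two patterns come from the question). If the real country or region listings live under `course-directory/`, the pattern `courses/` never matches them, so they are never followed.

```python
import re

# Patterns copied from the Rule in the question.
ALLOW = [re.compile(r"courses/")]
DENY = [re.compile(r"page=")]


def would_follow(url):
    """Approximate the rule: some allow pattern must match and no deny pattern may match."""
    allowed = any(p.search(url) for p in ALLOW)
    denied = any(p.search(url) for p in DENY)
    return allowed and not denied


# Hypothetical URLs, invented for this example only.
candidates = [
    "https://www.golfadvisor.com/courses/12345-example-golf-club/",
    "https://www.golfadvisor.com/course-directory/8-mexico/",
    "https://www.golfadvisor.com/course-directory/8-mexico/?page=2",
]

for url in candidates:
    print(would_follow(url), url)
```

If the directory pages do look like this, a second rule without a callback, for example `Rule(LinkExtractor(allow=r'course-directory/'), follow=True)`, would let the crawler walk the listings (including their pagination) while the existing rule keeps handling course pages.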

Related

Can I do any analysis on spacy display using NER?

When accessing this display in spaCy NER, can you add the found entities (in this case, from any tweets with GPE or LOC) to a new dataframe, or do any further analysis on them? I thought that once I got them into a list I could possibly use geopy to visualize them. Any thoughts?
colors = {'LOC': 'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'GPE' : 'radial-gradient(white, blue)'}
options = {'ents' : ['LOC', 'GPE'],'colors':colors}
spacy.displacy.render(doc, style='ent',jupyter=True, options=options, )
The entities are accessible on the doc object. If you want all the entities found in the doc, simply use doc.ents (it is a tuple; wrap it in list() if you need a list). For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
If you want both the text (or mention) of each entity and its label (say, PERSON, GPE, LOC, NORP, etc.), you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.
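As a rough sketch of the "add them to a dataframe" step: once each tweet has been reduced to (text, label) pairs, keeping only GPE and LOC and turning them into one row per entity is plain Python. The function name `location_rows` and the sample data are made up for illustration; with spaCy the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

LOCATION_LABELS = {"GPE", "LOC"}


def location_rows(tweet_entities):
    """tweet_entities holds one list of (text, label) pairs per tweet."""
    rows = []
    for tweet_id, pairs in enumerate(tweet_entities):
        for text, label in pairs:
            if label in LOCATION_LABELS:
                rows.append({"tweet": tweet_id, "entity": text, "label": label})
    return rows


# Hypothetical entity pairs standing in for two processed tweets.
sample = [
    [("Narendra Modi", "PERSON"), ("India", "GPE")],
    [("India", "GPE"), ("Himalayas", "LOC")],
]

rows = location_rows(sample)
counts = Counter(row["entity"] for row in rows)
print(rows)
print(counts.most_common())
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)`, and the distinct entity strings are what you would hand to a geocoder.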

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Setup a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Setter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late nowadays but, complementing Sofie VL's answer a little bit, and to anyone who might be still interested in it, what I (another spaCy newbie, lol) have done to get rid of overlapping spans, goes as follows:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list of Span objects, e.g. the spans covering
# "Carolina" and "North Carolina"
pat_orig = len(entity)
filtered = filter_spans(entity) # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
Important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will work only if you actually do not mind about overlapping spans being removed, but if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spacy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
Use doc.spans instead of doc.ents if that works for your use-case.
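To make the first option concrete, here is a minimal sketch of one conflict-resolution policy: prefer the longest match, break ties by earliest start, and drop anything that overlaps a span already kept. This is the same general idea as `filter_spans`, but it is not spaCy's code; it works on plain (start, end, label) tuples with an exclusive end, and the function name `filter_overlaps` is made up for illustration.

```python
def filter_overlaps(spans):
    """Keep non-overlapping spans, preferring longer ones, then earlier ones."""
    # Longest first; among equal lengths, the smaller start index comes first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


# Two identical matches for tokens 0-1 (as with the duplicate 'tissue culture'
# entries), one shorter match nested inside them, and one unrelated match.
matches = [(0, 2, "ENVO"), (0, 2, "ENVO"), (1, 2, "ENVO"), (3, 4, "ENVO")]
print(filter_overlaps(matches))
```

Note that this policy only decides which span survives. Choosing between two CURIEs for the same phrase still needs extra information, since both entries produce an identical span.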

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml") # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk":True})) # re.compile("data-jk")
# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0])) # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs))) # DEBUG
print("Example JobID: " + str(jobIDs[1])) # DEBUG
The examples I've seen online generally try to get the information contained between an opening and a closing tag, but I am not sure how to get the info from inside of the (first) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
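For the question as asked: "data-jk" is an attribute of the tag, not a child element, which is why `eachJob.find("data-jk")` finds nothing. In BeautifulSoup, attributes are read like dictionary entries, e.g. `eachJob["data-jk"]` or `eachJob.get("data-jk")`. The sketch below shows the same attribute lookup using only the standard library's `html.parser`, on a made-up HTML snippet (the IDs and markup are invented, not taken from Indeed).

```python
from html.parser import HTMLParser


class DataJkCollector(HTMLParser):
    """Collect the value of every data-jk attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.job_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        attr_map = dict(attrs)
        if "data-jk" in attr_map:
            self.job_ids.append(attr_map["data-jk"])


# Hypothetical markup for illustration only.
sample_html = """
<div class="jobsearch-SerpJobCard" data-jk="abc123def4567890"><h2>Data Analyst</h2></div>
<div class="jobsearch-SerpJobCard" data-jk="0987654321fedcba"><h2>Analyst II</h2></div>
<div class="unrelated">no id here</div>
"""

collector = DataJkCollector()
collector.feed(sample_html)
print(collector.job_ids)
```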

Generate set of Nouns and verbs from n different descriptions, list out descriptions that match a noun and verb

I'm new to NLP. I have a dataset with two columns: the app name and its description. The data looks like this:
app1, description1 (some information of app1, how it works)
app2, description2
.
.
app(n), description(n)
From these descriptions I need to generate a limited set of nouns and verbs. In the final application, when we pair a noun and a verb from this list, the output should be the list of apps that match that noun + verb pair.
I don't have any idea where to start. Can you please point me in the right direction? Thank you.
The task of finding the morpho-syntactic category of words in a sentence is called part-of-speech (or PoS) tagging.
In your case, you probably need also to tokenize your text first.
To do so, you can use nltk, spacy, or the Stanford NLP tagger (among other tools).
Note that depending on the model you use, there can be several labels for nouns (singular nouns, plural nouns, proper nouns) and verbs (depending on the tense and person).
Example with NLTK:
import nltk
description = "This description describes apps with words."
tokenized_description = nltk.word_tokenize(description)
tagged_description = nltk.pos_tag(tokenized_description)
#tagged_description:
# [('This', 'DT'), ('description', 'NN'), ('describes', 'VBZ'), ('apps', 'RP'), ('with', 'IN'), ('words', 'NNS'), ('.', '.')]
# map the tags to a smaller set of tags
universal_tags_description = [(word, nltk.map_tag("wsj", "universal", tag)) for word, tag in tagged_description]
# universal_tags_description:
# [('This', 'DET'), ('description', 'NOUN'), ('describes', 'VERB'), ('apps', 'PRT'), ('with', 'ADP'), ('words', 'NOUN'), ('.', '.')]
filtered = [(word, tag) for word, tag in universal_tags_description if tag in {'NOUN', 'VERB'}]
# filtered:
# [('description', 'NOUN'), ('describes', 'VERB'), ('words', 'NOUN')]
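Once every description is tagged this way, the lookup part of the question can be sketched as an inverted index from (noun, verb) pairs to app names. This is only one possible design: the function name `build_index` and the sample tagged data are made up for illustration, and no lemmatization is applied, so "edit" and "edits" would count as different verbs.

```python
from collections import defaultdict
from itertools import product


def build_index(tagged_apps):
    """Map each (noun, verb) pair found in a description to the apps containing it."""
    index = defaultdict(set)
    for app, tagged in tagged_apps.items():
        nouns = {word.lower() for word, tag in tagged if tag == "NOUN"}
        verbs = {word.lower() for word, tag in tagged if tag == "VERB"}
        for noun, verb in product(nouns, verbs):
            index[(noun, verb)].add(app)
    return index


# Hypothetical, already-tagged descriptions (universal tagset).
tagged_apps = {
    "app1": [("Edit", "VERB"), ("photos", "NOUN"), ("quickly", "ADV")],
    "app2": [("Share", "VERB"), ("photos", "NOUN"), ("and", "CONJ"), ("videos", "NOUN")],
    "app3": [("Edit", "VERB"), ("videos", "NOUN")],
}

index = build_index(tagged_apps)
print(sorted(index[("photos", "edit")]))
print(sorted(index[("videos", "share")]))
```

To keep the set of nouns and verbs limited, one option is to count how often each word occurs across descriptions and keep only the most frequent ones before building the index.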

Scrapy pull data from table rows

I'm trying to pull data from this page using Scrapy: https://www.interpol.int/notice/search/woa/1192802
The spider will crawl multiple pages but I have excluded the pagination code here to keep things simple. The problem is that the number of table rows that I want to scrape on each page can change each time.
So I need a way of scraping all the table data from the page no matter how many table rows it has.
First, I extracted all the table rows on the page. Then, I created a blank dictionary. Next, I tried to loop through each row and put it's cell data into the dictionary.
But it does not work and it is returning a blank file.
Any idea what's wrong?
# -*- coding: utf-8 -*-
import scrapy
class Test1Spider(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['interpol.int']
    start_urls = ['https://www.interpol.int/notice/search/woa/1192802']
    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr').extract()
        data = {}
        for table_row in table_rows:
            data.update({response.xpath('//td[contains(@class, "col1")]/text()').extract(): response.css('//td[contains(@class, "col2")]/text()').extract()})
        yield data
What is this?
response.css('//td[contains(@class, "col2")]/text()').extract()
You are calling the css() method but passing it an XPath expression.
Anyways, here is the 100% working code, I have tested it.
table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
data = {}
for table_row in table_rows:
    data[table_row.xpath('td[@class="col1"]/text()').extract_first().strip()] = table_row.xpath('td[@class="col2 strong"]/text()').extract_first().strip()
yield data
EDIT:
To remove the characters like \t\n\r etc, use regex.
import re
your_string = re.sub('\\t|\\n|\\r', '', your_string)
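A further problem in the question's loop, separate from the css()/XPath mix-up: `.extract()` always returns a list, and a list cannot be used as a dictionary key, so the `data.update({...})` line would still fail even with `xpath()` on both sides. The sketch below reproduces that with plain Python lists standing in for the extracted values (the labels and values are made up), then shows the per-row pairing that the corrected code relies on.

```python
# Stand-ins for what .extract() returns: always a list of strings.
labels = ["Family name:", "Forename:"]
values = ["DOE", "JOHN"]

error = None
try:
    bad = {}
    bad.update({labels: values})  # same shape as the question's data.update(...)
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Pairing one label string with one value string per row works fine.
data = {label.strip(": "): value for label, value in zip(labels, values)}
print(data)
```

This is why the working version uses `extract_first()` on each row: it yields a single string per cell rather than a list.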
Try this.
I Hope it will help you.
# -*- coding: utf-8 -*-
import scrapy
class Test(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['interpol.int']
    start_urls = ['https://www.interpol.int/notice/search/woa/1192802']
    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
        for table_row in table_rows:
            current_row = table_row.xpath('.//td/text()').extract()
            print(current_row[0] + current_row[1].strip())