How do I find a specific tag's value (which could be anything) with BeautifulSoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml") # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk":True})) # re.compile("data-jk")
# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0])) # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs))) # DEBUG
print("Example JobID: " + str(jobIDs[1])) # DEBUG
The examples I've seen online generally try to get the information contained between an opening and a closing tag, but I am not sure how to get the info from inside the (opening) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!

Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
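If you want to stay with BeautifulSoup, the value of an attribute can be read straight off each Tag by indexing it, e.g. tag["data-jk"], or with tag.get("data-jk"), which returns None when the attribute is missing. A minimal sketch along the lines of the question's own find_all call (variable names are just illustrative):
import requests
import bs4
html = requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0").text
soup = bs4.BeautifulSoup(html, "lxml")
job_ids = []
for card in soup.find_all(attrs={"data-jk": True}):  # every element that carries a data-jk attribute
    job_ids.append(card["data-jk"])                  # card.get("data-jk") also works
print(job_ids)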

How do I get the "request url" part of the get request?

The number part is the time in milliseconds, but the part before the ".dat" in the URL changes for every game, so I need a way to get the whole URL using requests and BeautifulSoup4.
link to page https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/
This was an interesting challenge so I decided to have a look.
You can construct the URL from various parts of the initial response, with the inclusion of a tab mapping for football (shown in the dictionary). It may be possible to derive the mappings for the dictionary dynamically from the onmousedown arguments and the associated uid function. I started looking into it and may carry on if time permits. Hardcoding for football, for the full/1st half/2nd half tabs, seems to be ok for now.
import requests
import re, urllib.parse, time

time_lkup = {
    'full_time': '1-2',
    'first_half': '1-3',
    'second_half': '1-4'
}

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0',
                 'referer': 'https://www.oddsportal.com'}
    r = s.get('https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/')
    # pull the pieces of the feed url out of the initial page source
    version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
    sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
    xeid = re.search(r'"id":"(.*?)"', r.text).group(1)
    xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))
    unix = int(time.time())  # current unix time, appended as the ?_= query parameter
    url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{time_lkup["full_time"]}-{xhash}.dat?_={unix}'
    print(url)
    r = s.get(url)
    print(r.text)

How to make antispam function discord.py?

I need an antispam function on my Discord server. Please help me. I tried this:
import datetime
import time
import os

import discord
from discord.ext import commands

# assumed setup -- the original snippet does not show how `client` was created;
# newer discord.py versions also require intents to be passed here
client = commands.Bot(command_prefix="!")

time_window_milliseconds = 5000
max_msg_per_window = 5
author_msg_times = {}

@client.event
async def on_ready():
    print('logged in as {0.user}'.format(client))
    await client.change_presence(activity=discord.Activity(type=discord.ActivityType.playing, name="stack overflow"))

@client.event
async def on_message(message):
    global author_msg_times
    ctx = await client.get_context(message)
    author_id = ctx.author.id
    # Get current epoch time in milliseconds
    curr_time = datetime.datetime.now().timestamp() * 1000
    # Make an empty list for the author id, if it does not exist
    if not author_msg_times.get(author_id, False):
        author_msg_times[author_id] = []
    # Append the time of this message to the user's list of message times
    author_msg_times[author_id].append(curr_time)
    # Find the beginning of our time window.
    expr_time = curr_time - time_window_milliseconds
    # Find message times which occurred before the start of our window
    expired_msgs = [
        msg_time for msg_time in author_msg_times[author_id]
        if msg_time < expr_time
    ]
    # Remove all the expired message times from our list
    for msg_time in expired_msgs:
        author_msg_times[author_id].remove(msg_time)
    # ^ note: we probably need to use a mutex here. Multiple threads
    # might be trying to update this at the same time. Not sure though.
    if len(author_msg_times[author_id]) > max_msg_per_window:
        await ctx.send("Stop Spamming")
        ping()  # ping() is not defined anywhere in the original snippet

client.run(os.getenv('token'))
And it doesn't seem to work when I type the same message over and over again. Can you guys please help me? I need a good antispam function that will work inside on_message.
I think the best thing you can do is to make an on_member_join event, which will be called every time a user joins. Then in this event, instead of separate variables, you can build a list that saves each user's id and their current currency:
users_currency = ["user's id", "5$", "another user's id", "7$"] and so on. Next, I would recommend saving it to a text file.
Example code:
users_currency = []

@client.event
async def on_member_join(member): #on_member_join event
    global users_currency
    user = str(member.id) #gets the user's id and changes it to a string
    users_currency.append(user) #adds the user's id to your list
    users_currency.append("0") #sets the currency to 0
Now if someone joins, their id will appear in the list and their currency will be set to 0.
How can you use the assigned values in the list?
If you keep the code close to the example above, then users_currency[0], users_currency[2], [...] will give you users' ids, and users_currency[1], users_currency[3], etc. will give you their currency. Then you can use the on_message event or @client.command to make a command that looks for a user's id in the list and changes the next value, i.e. their currency, as sketched below.
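A minimal sketch of that lookup (the helper name set_currency and the flat [id, currency, id, currency, ...] layout are assumptions based on the description above):
def set_currency(user_id, new_value):
    # users_currency stores [id, currency, id, currency, ...], so ids sit at even indices
    for i in range(0, len(users_currency), 2):
        if users_currency[i] == str(user_id):
            users_currency[i + 1] = new_value  # the element right after the id is that user's currency
            return True
    return False  # user id not found in the list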
Saving it to a text file
You have to save it in a text file (Writing a list to a file with Python) and then make a function that runs at the start of the bot, reads everything from the file, and assigns it to your list.
Example code:
with open("users_currency.txt") as f:
    rd = f.read()
changed_to_a_list = rd.split()
users_currency = changed_to_a_list
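For the saving side, a minimal counterpart sketch (assuming the same file name and whitespace-separated layout as the reading example above):
def save_currency():
    # writes the [id, currency, id, currency, ...] list back out as whitespace-separated tokens
    with open("users_currency.txt", "w") as f:
        f.write(" ".join(users_currency))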

Scrapy Project Review/ Link Rules

This is my second project and I was wondering if someone could review it and give me best practices in applying the Scrapy framework. I also have a specific issue: not all courses are scraped from the site.
Goal: scrape all golf courses info from golf advisor website. Link: https://www.golfadvisor.com/course-directory/1-world/
Approach: I used CrawlSpider to include rules for links to explore.
Result: Only 19,821 courses out of 36,587 were scraped from the site.
Code:
import scrapy
from urllib.parse import urljoin
from collections import defaultdict
# adding rules with crawlspider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GolfCourseSpider(CrawlSpider):
    name = 'golfadvisor'
    allowed_domains = ['golfadvisor.com']
    start_urls = ['https://www.golfadvisor.com/course-directory/1-world/']
    base_url = 'https://www.golfadvisor.com/course-directory/1-world/'

    # use rules to visit only pages with 'courses/' in the path and exclude pages with 'page=1, page=2, etc.'
    # since those are duplicate links to the same course
    rules = [
        Rule(LinkExtractor(allow=('courses/'), deny=('page=')), callback='parse_filter_course', follow=True),
    ]

    def parse_filter_course(self, response):
        # checking if it is an actual course page. excluded it for the final run, didn't fully
        # exists = response.css('.CoursePageSidebar-map').get()
        # if exists:

        # the page is split into multiple sections with a different amount of detail specified in each.
        # I decided to use a nested for loop (for section in sections, for detail in section) to retrieve data.
        about_section = response.css('.CourseAbout-information-item')
        details_section = response.css('.CourseAbout-details-item')
        rental_section = response.css('.CourseAbout-rentalsServices-item')
        practice_section = response.css('.CourseAbout-practiceInstruction-item')
        policies_section = response.css('.CourseAbout-policies-item')
        sections = [
            about_section,
            details_section,
            rental_section,
            practice_section,
            policies_section
        ]
        # created a default list dict to add new details from the for loops
        dict = defaultdict(list)
        # also have details added NOT from the for loop sections, but hard coded using css and xpath selectors.
        dict = {
            'link': response.url,
            'Name': response.css('.CoursePage-pageLeadHeading::text').get().strip(),
            'Review Rating': response.css('.CoursePage-stars .RatingStarItem-stars-value::text').get('').strip(),
            'Number of Reviews': response.css('.CoursePage-stars .desktop::text').get('').strip().replace(' Reviews', ''),
            '% Recommend this course': response.css('.RatingRecommendation-percentValue::text').get('').strip().replace('%', ''),
            'Address': response.css('.CoursePageSidebar-addressFirst::text').get('').strip(),
            'Phone Number': response.css('.CoursePageSidebar-phoneNumber::text').get('').strip(),
            # the website field is a redirecting link; did not figure out how to get the real one during the scraping process
            'Website': urljoin('https://www.golfadvisor.com/', response.css('.CoursePageSidebar-courseWebsite .Link::attr(href)').get()),
            'Latitude': response.css('.CoursePageSidebar-map::attr(data-latitude)').get('').strip(),
            'Longitude': response.css('.CoursePageSidebar-map::attr(data-longitude)').get('').strip(),
            'Description': response.css('.CourseAbout-description p::text').get('').strip(),
            # here, I was suggested to use xpath to retrieve text. should it be used for the fields above and why?
            'Food & Beverage': response.xpath('//h3[.="Available Facilities"]/following-sibling::text()[1]').get('').strip(),
            'Available Facilities': response.xpath('//h3[.="Food & Beverage"]/following-sibling::text()[1]').get('').strip(),
            # another example of using xpath for microdata
            'Country': response.xpath("(//meta[@itemprop='addressCountry'])/@content").get('')
        }
        # nested for loop I mentioned above
        for section in sections:
            for item in section:
                dict[item.css('.CourseValue-label::text').get().strip()] = item.css('.CourseValue-value::text').get('').strip()
        yield dict
E.g., it discovered only two golf courses in Mexico:
Club Campestre de Tijuana
Real del Mar Golf Resort
I've run the code specifically on the pages it didn't pick up and was able to scrape those pages individually, so my link extraction rules must be wrong; a quick way to check what a rule actually matches is sketched below.
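A diagnostic sketch (my addition, not part of the original post; the directory URL is just an example) that runs the same allow/deny settings as the spider's rule against one page and prints what gets extracted:
import requests
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

url = 'https://www.golfadvisor.com/course-directory/1-world/'  # or any directory page that seems under-crawled
response = HtmlResponse(url=url, body=requests.get(url).content, encoding='utf-8')
links = LinkExtractor(allow=('courses/'), deny=('page=')).extract_links(response)
print(len(links))
for link in links[:20]:
    print(link.url)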
This is the output file with ~20k courses: https://drive.google.com/file/d/1izg2gZ87qbmMtg4S_VKQmkzlKON3poIs/view?usp=sharing
Thank you,
Yours Data Enthusiast

Is there a function to return the options of a dropdown list in HTML using MechanicalSoup or BeautifulSoup?

As the title says, I'm working on a project using MechanicalSoup and I am wondering how I can write a function to return the possible options for a DropDown list. Is it possible to search an element by its name/id and then have it return the options?
import mechanicalsoup
from bs4 import BeautifulSoup
# Sets the StatefulBrowser object to winnet, then grabs the form
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()
#Selects submit button and has filter options listed.
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$TextBox_keyword', input()) #Keyword Searches by Class Title. Inputting string will search by that string ignoring any stored nonsense in the page.
#ACxxx Course Codes have 3 spaces after them, THIS IS REQUIRED. Except the All value for not searching by a Department does not.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Department", 'CS ') #For Department List, it takes the CourseCodes as inputs and displays as the Full Name
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Term", "2020 Winter Term") # Term Dropdown takes a value that is a string. String is Exactly the Term date.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_MeetingTime', 'all') #Takes the Week Class Time as a String. Need to Retrieve list of options from pages
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_EssentialEd', 'none') #takes a small string signalling the EE req or 'all' or 'none'. None doesn't select an option and all selects all courses w/ an EE
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_CulturalDiversity', 'none')# Cultural Diversity, Takes none, C, D or all
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_WritingIntensive', 'none') # options are none or WI
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_PassFail', 'none')# Pass/Fail takes 'none' or 'PF'
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$CheckBox_OpenCourses', False) #Check Box, It's True or False
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_Instructor', '0')# 0 is for None Selected otherwise it is a string of numbers (Instructor ID?)
#Submits Page, Grabs results and then launches a browser for test purposes.
browser.submit_selected()# Submits Form. Retrieves Results.
table = browser.get_current_page().find('table') #Finds Result Table
print(type(table))
rows = table.get_text().split('\n') # List of all Class Rows split by \n.
print(type(rows))
browser.launch_browser()
I figured out if I want to post the options I can retrieve a list of them by doing:
options_list = browser.get_current_page().findAll('option') # Finds all option elements on the current page
Then I was able to use a for-loop to extract the text and the underlying values:
vlist = []
tlist = []
for option in options_list:
    value = str(option).split('"') # Splits option into chunks, value[1] is the value
    vlist.append(value[1])
    tlist.append(option.get_text())
Essentially I was able to make two separate lists one containing the option's text and one containing the underlying value. This can be modified to instead add to a dictionary and create a set of Key:Value pairs which would be more useful in some applications.
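As a side note (my own suggestion, not part of the original answer): BeautifulSoup Tags expose attributes directly, so the underlying value can be read with option.get('value') instead of splitting the string representation, which also makes the Key:Value version a one-liner per option:
options_dict = {}
for option in options_list:
    # visible option text -> underlying value that the form submits
    options_dict[option.get_text(strip=True)] = option.get('value')
print(options_dict)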

Python: Accessing indices of iterable passed into `Pool.map(function(), iterable)`

I have code that looks something like this:
def downloadImages(self, path):
    for link in self.images: #self.images=list of links opened via requests & written to file
        if stuff in link:
            #parse link a certain way to get `name`
        else:
            #parse link a different way to get `name`
        r = requests.get(link)
        with open(path+name,'wb') as f:
            f.write(r.content)

pool = Pool(2)
scraper.prepPage(url)
scraper.downloadImages('path/to/directory')
I want to change downloadImages to parallelize the function. Here's my attempt:
def downloadImage(self, path, index of self.images currently being processed):
    if stuff in link: #because of multiprocessing, there's no longer a loop providing a reference to each index...the index is necessary to parse and complete the function.
        #parse link a certain way to get `name`
    else:
        #parse link a different way to get `name`
    r = requests.get(link)
    with open(path+name,'wb') as f:
        f.write(r.content)

pool = Pool(2)
scraper.prepPage(url)
pool.map(scraper.downloadImages('path/to/directory', ***some way to grab index of self.images****), self.images)
How can I refer to the index of the item currently being worked on in the iterable passed into pool.map()?
I'm completely new to multiprocessing and couldn't find what I was looking for in the documentation... I also couldn't find a similar question via Google or Stack Overflow.