How do I get the correct path to a JSON file from a URL? - beautifulsoup

How do I get the "request url" part of the get request?
The number part is the time in milliseconds, but the part before the ".dat" in the URL changes for every game, so I need a way to get the whole URL using requests and BeautifulSoup4.
Link to the page: https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/

This was an interesting challenge so I decided to have a look.
You can construct the url from various parts of the initial response, with the inclusion of a tab mapping for football (shown in the dictionary below). It may be possible to derive the mappings for the dictionary dynamically from the onmousedown arguments and the associated uid function. I started looking into it and may carry on if time permits. Hardcoding for football, for the full/1st half/2nd half tabs, seems to be OK for now.
import requests
import re, time, urllib.parse

time_lkup = {
    'full_time': '1-2',
    'first_half': '1-3',
    'second_half': '1-4'
}

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0',
                 'referer': 'https://www.oddsportal.com'}
    r = s.get('https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/')
    # pull the pieces of the feed url out of the initial page source
    version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
    sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
    xeid = re.search(r'"id":"(.*?)"', r.text).group(1)
    xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))
    unix = int(time.time())  # cache-busting timestamp appended as ?_=
    url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{time_lkup["full_time"]}-{xhash}.dat?_={unix}'
    print(url)
    r = s.get(url)
    print(r.text)
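In my case the .dat response was not plain JSON but looked like JSON wrapped in a JavaScript callback, so something along these lines may be needed to get at the payload. This is only a sketch; it assumes the payload is the outermost {...} object in the response body:
import json
import re

# Assumption: the feed wraps a JSON object in a JS callback,
# e.g. someCallback('...', {...}); -- grab the outermost {...} and parse it.
m = re.search(r'\{.*\}', r.text, re.S)
if m:
    data = json.loads(m.group(0))
    print(list(data.keys()))
else:
    print('No JSON object found in the response')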

Related

How to get the contributors you've coincided with the most when editing Wikipedia

I'm doing a gamification web app to help Wikimedia's community health.
I want to find what editors have edited the same pages as 'Jake' the most in the last week or 100 last edits or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like:
Username | Occurrences | Pages
Mikey    | 13          | Obama,..
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of those pages in the last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing something simpler in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake).
I only get the edited pages, get the contributors of each page and count them, and it's still very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than the Wikimedia API or Pywikibot.
You can filter the data to be retrieved with timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Setup the Generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages, and for each page it shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above.
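To go from total to the Username / Occurrences / Pages table in the question, you mainly need to remember which pages each editor appeared on. Here is a minimal sketch building on the variables from the snippet above, assuming "Occurrences" means the number of your pages the editor also edited (pages_by_editor and the print format are just illustrative):
from collections import Counter, defaultdict

# Re-create the generator; the one above has already been consumed
contribs = user.contributions(end=stamp)

pages_by_editor = defaultdict(set)  # editor -> set of shared page titles
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    for editor in page.contributors(endtime=stamp):
        pages_by_editor[editor].add(page.title())

# Occurrences = number of distinct pages shared with the user
occurrences = Counter({editor: len(titles)
                       for editor, titles in pages_by_editor.items()})

print('{:<20} {:>11}  {}'.format('Username', 'Occurrences', 'Pages'))
for editor, count in occurrences.most_common():
    print('{:<20} {:>11}  {}'.format(
        editor, count, ', '.join(sorted(pages_by_editor[editor]))))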

How to make an antispam function in discord.py?

I need an antispam function on my Discord server. Please help me. I tried this:
import datetime
import os
import time

import discord
from discord.ext import commands

time_window_milliseconds = 5000
max_msg_per_window = 5
author_msg_times = {}

# Assumption: a commands.Bot is used, so that client.get_context() below works
client = commands.Bot(command_prefix="!")

@client.event
async def on_ready():
    print('logged in as {0.user}'.format(client))
    await client.change_presence(activity=discord.Activity(type=discord.ActivityType.playing, name="stack overflow"))

@client.event
async def on_message(message):
    global author_msg_times
    ctx = await client.get_context(message)
    author_id = ctx.author.id
    # Get current epoch time in milliseconds
    curr_time = datetime.datetime.now().timestamp() * 1000
    # Make an empty list for the author id, if it does not exist
    if not author_msg_times.get(author_id, False):
        author_msg_times[author_id] = []
    # Append the time of this message to the user's list of message times
    author_msg_times[author_id].append(curr_time)
    # Find the beginning of our time window.
    expr_time = curr_time - time_window_milliseconds
    # Find message times which occurred before the start of our window
    expired_msgs = [
        msg_time for msg_time in author_msg_times[author_id]
        if msg_time < expr_time
    ]
    # Remove all the expired message times from our list
    for msg_time in expired_msgs:
        author_msg_times[author_id].remove(msg_time)
    # ^ note: we probably need to use a mutex here. Multiple threads
    # might be trying to update this at the same time. Not sure though.
    if len(author_msg_times[author_id]) > max_msg_per_window:
        await ctx.send("Stop Spamming")
        ping()  # note: ping() is not defined anywhere in this snippet

client.run(os.getenv('token'))
And it doesn't seem to work when I type the same message over and over again. Can you guys please help me? I need a good antispam function that will work inside on_message.
I think the best thing you can do is to make an on_member_join event, which will be called every time a user joins. Then in this event, you can make a list (instead of separate variables) that will save each user's id and their current currency.
users_currency = ["user's id", "5$", "another user's id", "7$"] and so on. Next, I would recommend saving it to a text file.
Example code
users_currency = []

@client.event
async def on_member_join(member):  # on_member_join event
    global users_currency
    user = str(member.id)  # gets user's id and changes it to a string
    users_currency.append(user)  # adds user's id to your list
    users_currency.append("0")  # sets the currency to 0
Now if someone joins, their id will appear in the list and their currency will be set to 0.
How can you use the assigned values in the list
If you keep the code close to the example above, then users_currency[0], users_currency[2], [...] will give you the users' ids, and users_currency[1], users_currency[3], etc. will give you their currency. Then you can use the on_message event or @client.command to make a command that looks for a user's id in the list and changes the next value, their currency (see the sketch below).
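A minimal sketch of that lookup, assuming the flat id/currency layout described above (the helper names get_currency and set_currency are just illustrative):
def get_currency(users_currency, user_id):
    """Return the value stored right after the user's id, or None if absent."""
    for i in range(0, len(users_currency), 2):
        if users_currency[i] == str(user_id):
            return users_currency[i + 1]
    return None

def set_currency(users_currency, user_id, amount):
    """Overwrite the value stored right after the user's id."""
    for i in range(0, len(users_currency), 2):
        if users_currency[i] == str(user_id):
            users_currency[i + 1] = str(amount)
            return True
    return False

# Example: with users_currency = ["111", "0", "222", "7"],
# get_currency(users_currency, 222) returns "7".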
Saving it to a text file
You have to save it in a text file (Writing a list to a file with Python) and then make a function that runs at the start of the bot, reads everything from the file, and assigns it to your list. A sketch of the write side follows the read example below.
Example code:
with open("users_currency.txt") as f:
    rd = f.read()
changed_to_a_list = rd.split()
users_currency = changed_to_a_list
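The example above covers reading the file back in; a minimal sketch of the matching write side (run it whenever the list changes; the users_currency.txt filename simply mirrors the read example) might look like this:
def save_currency(users_currency, path="users_currency.txt"):
    # One whitespace-separated token per entry, so that .split() on read
    # restores the same flat id/currency list
    with open(path, "w") as f:
        f.write(" ".join(users_currency))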

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml")  # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk": True}))  # re.compile("data-jk")
# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0])) # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs))) # DEBUG
print("Example JobID: " + str(jobIDs[1])) # DEBUG
The examples I've seen online generally try to get the information contained between the opening and closing tags, but I am not sure how to get the info from inside the (opening) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
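Alternatively, since the question was specifically about BeautifulSoup: each item in justjobs is a bs4.element.Tag, and an attribute like data-jk is read with dictionary-style access (tag["data-jk"] or tag.get("data-jk")), not with .find(), which searches for a tag named data-jk and therefore returns None. A sketch building on the justjobs list from the question:
# Read the data-jk attribute directly off each Tag object
jobIDs = [eachJob.get("data-jk") for eachJob in justjobs]
print("Length: " + str(len(jobIDs)))
if jobIDs:
    print("Example JobID: " + str(jobIDs[0]))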

Play reverse routing - getting absolute url

How can I get the absolute URL in play 2.2 scala when doing the following:
val promoLink = routes.Promotions.promotionsCategory(DOCID, slug)
//routes file
GET /promotions/:DOCID:/slug controllers.Promotions.promoCat(DOCID, slug)
As it stands I get a "found: play.api.mvc.Call" type mismatch where a String is expected.
Thanks
I suppose your promoLink should be a String containing a URL? Your question sounds a bit unclear.
If so then you probably need this:
val promoLink = routes.Promotions.promotionsCategory(DOCID, slug).absoluteURL(false)(request)
The false in .absoluteURL(false) stands for the isSecure parameter, which determines whether you get an http or https URL.
If you have an implicit request in scope, you may omit the last (request) part.

How to get all URLs in a Wikipedia page

It seems like the Wikipedia API's definition of a link is different from a URL? I'm trying to use the API to return all the URLs in a specific wiki page.
I have been playing around with this query that I found on this page under generators and redirects.
I'm not sure exactly why you are confused (it would help if you explained that), but I'm quite sure that query is not what you want. It lists links (prop=links) on pages that are linked (generator=links) from the page “Title” (titles=Title). It also lists only the first page of links on the first page of links (with the page size at the tiny default value of 10).
If you want to get all the links on the page “Title”:
Use just prop=links, you don't want the generator.
Increase the limit to the maximum possible by adding pllimit=max (pl is the “prefix” for links)
Use the value given in the query-continue element to get to the second (and following) page of results.
So, the query for the first page would be:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max
And the second (and in this case, final) page:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
Another thing that might be confusing you is that links returns only internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
Here's a Python solution that gets (and prints) all the pages linked to from a particular page. It gets the maximum number of links in the first request, then looks to see if the returned JSON object has a "continue" property. If it does, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results returned will not have this property.)
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    for link in val["links"]:
        print(link["title"])
        page_titles.append(link["title"])

while "continue" in data:
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1
    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val["links"]:
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))
This code was adapted from the code in the MediaWiki API:Links example.
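The code above only collects internal links (prop=links). To also pick up the external links mentioned in the first answer (prop=extlinks), the same continuation pattern applies; this is a sketch that assumes the default response format, where each external link sits under the "*" key and continuation uses elcontinue:
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "extlinks",
    "ellimit": "max"
}

ext_urls = []
while True:
    data = session.get(url=url, params=params).json()
    for page in data["query"]["pages"].values():
        # pages may have no "extlinks" key in a given batch
        for link in page.get("extlinks", []):
            ext_urls.append(link["*"])
    if "continue" not in data:
        break
    params["elcontinue"] = data["continue"]["elcontinue"]

print("%d external links found." % len(ext_urls))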