Send a request to WordNet - BeautifulSoup

I need to send a request to WordNet, knowing the tar_id (taken from ImageNet), in order to get the lemma assigned to that tar (e.g., I have a tar with houses, and I need to send the request and obtain the lemma written on WordNet, "living accommodation").
I first used requests.get() with the URL, and then BeautifulSoup's parser.
I get the parsed HTML back, but there is no reference to the "body", meaning the part with the noun and its hypernyms/hyponyms.
Can you tell me how to get that part of WordNet parsed along with the rest of the page?
This is the URL I'm working on: http://wordnet-rdf.princeton.edu/pwn30/03546340-n

Just use the JSON endpoint.
For example:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
url = "http://wordnet-rdf.princeton.edu/json/pwn30/03546340-n"
data = requests.get(url, headers=headers).json()
print(data[0]["definition"])
Output:
structures collectively in which people are housed
And if you switch the endpoint to
url = "http://wordnet-rdf.princeton.edu/json_rel/03551520-n"
You'll get all the word relation data.
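For example, a quick way to look at that relation payload is to dump it raw first (a minimal sketch; the exact structure of the json_rel response isn't documented, so inspect it before indexing into it):
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}

rel_url = "http://wordnet-rdf.princeton.edu/json_rel/03551520-n"
relations = requests.get(rel_url, headers=headers).json()

# print the raw structure first, then pick out the relation types you need from it
print(json.dumps(relations, indent=2)[:2000])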

Related

Scrapy is not scraping the whole page but only some part of it

I am trying to scrape job profiles from the British Petroleum website. Initially the bot was not allowed to scrape, but after I set ROBOTSTXT_OBEY = False it started working; however, now it is not scraping the whole page. Below is my code:
import scrapy

class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ['https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+scientist']

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))
That XPath matches the h3 tag when tested in the browser, but when I run the code I get an empty list. Later I cross-checked by printing all the li or div tags and counting them, and I found that only half, or some, of the tags were getting scraped. Does anyone have any idea why Scrapy is scraping only part of the page and not the full page?
In the browser the total number of li tags is 55, but the length of the response variable "name" is much smaller.
In the hope that the OP will include a minimal reproducible example in their next question, here is a way of getting those jobs. Bear in mind that the jobs are being pulled from an API by JavaScript in the page, so you need to either use splash/scrapy-playwright or scrape the API directly. We will do the latter. The API URL is obtained from the browser's Dev tools - Network tab.
import scrapy

class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }
        api_url = 'https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST")

    def parse(self, response):
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x
Run with scrapy crawl bpscrape -o bpdsjobs.json to get a json file with all 26 jobs.
You will need to do some data cleaning, as the JSON response is quite comprehensive and contains a lot of HTML tags, etc.
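As a rough sketch of that clean-up (the output file name comes from the command above; the tag-stripping regex and the choice to leave non-string fields untouched are my own assumptions):
import json
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag stripper, good enough for a first pass

with open("bpdsjobs.json") as f:
    jobs = json.load(f)

# strip markup from every string field of every hit
cleaned = [
    {key: TAG_RE.sub("", value) if isinstance(value, str) else value
     for key, value in job.items()}
    for job in jobs
]

with open("bpdsjobs_clean.json", "w") as f:
    json.dump(cleaned, f, indent=2)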
For Scrapy documentation, please see https://docs.scrapy.org/en/latest/

Is anyone using Karate for Salesforce API testing? [duplicate]

Karate has been super helpful for validating our REST APIs that return JSON responses. Now we have APIs that return responses in Avro format, and we may also need to send the payload in Avro format. How can I test REST endpoints that return Avro responses using Karate? Is there any easy way I can tweak things to get it done? Thanks!
Here's my suggestion, and in my opinion this will work very well.
Write a simple Java utility, maybe a static method that can take JSON and convert it to AVRO and vice versa.
Now you can define all your request data as JSON - but just before you make the request to the server, convert it to AVRO. I am not sure if the server call is HTTP or not. If it is HTTP - then you know what to do in Karate, just send binary as the request body etc.
Otherwise, you may not even use the Karate HTTP DSL keywords like method, request, etc. You would have to write one more Java helper that takes your JSON (or AVRO), makes the call to the server (specific to your project), and returns the response, converted back to JSON. For example:
* def Utils = Java.type('com.myco.avro.Utils')
* def json = { hello: 'world' }
* def req = Utils.toAvro(json)
* def res = Utils.send(req)
# you can combine this with the above
* def response = Utils.fromAvro(res)
* match response == { success: true }
Yes, you might be using Karate mostly for matching, reporting, environments, etc., which is still valuable! Many people don't realize that HTTP is just 10% of what Karate can do for you.

Reading url using bs4 from yahoo finance

I am trying the following code to read the historical CSV data from yahoo finance:
import datetime
import time
from datetime import timedelta as td  # td was missing from the original snippet
from bs4 import BeautifulSoup

per1 = str(int(time.mktime((datetime.datetime.today() - td(days=365)).timetuple())))
per2 = str(int(time.mktime((datetime.datetime.today()).timetuple())))
url = 'https://query1.finance.yahoo.com/v7/finance/download/MSFT?period1=' + per1 + '&period2=' + per2 + '&interval=1d&events=history&crumb=OQg/YFV3fvh'
The url variable can be seen when you go to Yahoo Finance, type a ticker, and hover over the "Download Data" button.
I get an authentication error, which I believe is due to a missing cookie, so I tried the following:
import requests

ses = requests.Session()
url1 = 'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'
ses.get(url1)  # visit the quote page first so the session picks up the cookies
soup = BeautifulSoup(ses.get(url).content, 'lxml')
print(soup.prettify())
I get an incorrect cookie error this time.
Can someone suggest how to work around this?
The crumb parameter of the query string keeps changing, perhaps with each browser session. So, when you copy its value from the browser, close it and then use it in another instance of the browser, it expires by then.
So, it should come as no surprise that by the time you use it in your requests session, it doesn't recognize the cookie value and generates an error.
Step 1
Studying the network tab in any browser will help. In this particular case, this crumb part is generated probably when you click on a ticker in the main page. So you'll have to fetch that URL first.
import requests

s = requests.Session()
tickers = ('000001.SS', 'NKE', 'MA', 'SBUX')
url = 'https://finance.yahoo.com/quote/{0}?p={0}'.format(tickers[0])
r = s.get(url, headers=req_headers)  # req_headers is defined in the Note below
This URL needs to be fetched only once. So it doesn't matter which ticker you use for this.
Step 2
The response returned by the server contains the value passed to the crumb parameter in the query string when you download the CSV file.
However, it's contained in the script tag of the page returned by the previous request. This means you can't use BeautifulSoup alone to extract the crumb value.
I initially tried re to extract that out of the script tag's text. But for some reason, I wasn't able to. So I moved to json for parsing it.
import re
import json

soup = BeautifulSoup(r.content, 'lxml')
script_tag = soup.find(text=re.compile('crumb'))
response_dict = json.loads(script_tag[script_tag.find('{"context":'):script_tag.find('}}}};') + 4])
crumb = response_dict['context']['dispatcher']['stores']['CrumbStore']['crumb']
Note that BeautifulSoup is required to extract the script element's contents to be later passed to json to parse it into a Python dict object.
I had to use pprint to print the resulting dict to a file to see exactly where the crumb value was stored.
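Something along these lines (the file name is arbitrary; response_dict comes from the snippet above):
import pprint

# dump the parsed dict to a file so the nested path to CrumbStore is easy to locate
with open('yahoo_page_store.txt', 'w') as f:
    pprint.pprint(response_dict, stream=f)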
Step 3
The final URL that fetches the CSV file looks like this:
for ticker in tickers:
    csv_url = 'https://query1.finance.yahoo.com/v7/finance/download/{0}?period1=1506656676&period2=1509248676&interval=1d&events=history&crumb={1}'.format(ticker, crumb)
    r = s.get(csv_url, headers=req_headers)
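To actually keep each file on disk, you can write the response text out inside that loop; a minimal sketch (the file-naming pattern is my own choice):
    # still inside the same for-loop: save each ticker's CSV to disk
    with open('{}.csv'.format(ticker), 'w') as f:
        f.write(r.text)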
Result
Here are the first few lines of one of the files downloaded:
Date,Open,High,Low,Close,Adj Close,Volume
2017-09-29,3340.311035,3357.014893,3340.311035,3348.943115,3348.943115,144900
2017-10-09,3403.246094,3410.169922,3366.965088,3374.377930,3374.377930,191700
2017-10-10,3373.344971,3384.025879,3358.794922,3382.988037,3382.988037,179400
Note:
I used appropriate headers in both the requests. So if you skip that part and don't get the desired results, you may have to include them as well.
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

Why is my Scrapy response body empty?

I've been learning Scrapy recently, and I tried to use its simplest way to fetch a response body, but I got an empty string.
Here is my code:
>>> from scrapy.http import Response
>>> r = Response('http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
>>> r.body
''
>>> r.headers
{}
>>> r.status
200
With no difficulty, I can visit the URL I used above for the Scrapy Response through a browser, and it has rich content.
What mistake have I made here?
Another reason for your problem can be that the site requires a User-Agent header. Try it like this:
scrapy shell http://www.to.somewhere -s USER_AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0'
You can read more here
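If the same problem shows up in a spider rather than the shell, the override can also be set per spider; a minimal sketch (the spider name is a placeholder, reusing the URL from the question):
import scrapy

class ZenSpider(scrapy.Spider):
    name = "zen"
    start_urls = ["http://zenofpython.blog.163.com/blog/static/23531705420146124552782"]
    custom_settings = {
        "USER_AGENT": ("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) "
                       "Gecko/20080705 Firefox/3.0 Kapiko/3.0"),
    }

    def parse(self, response):
        # the body should no longer be empty once the header is accepted
        self.logger.info("body length: %d", len(response.body))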
You're supposed to fetch a Request and get a Response object in return.
Try doing:
r = Request(url='http://zenofpython.blog.163.com/blog/static/23531705420146124552782')
fetch(r)
in the scrapy shell, and you'll be able to get the result as a Response object:
print(response.body)

Get app-details from Google Play

I am wondering how the various app statistics sites get app details from Google Play, since Google Play does not have a public API. An example is Appaware.com - they have full details for Google Play apps.
A possible solution is scraping; however, it doesn't work, because Google will block you when you start sending hundreds of requests to them.
Any ideas?
P.S. The "Google Play Developer API" is not an option, as it lets you access app details only for your own apps.
They use either the mobile API used by Android devices (i.e. with this library) or scrape the Google Play website. Both methods are subject to rate limiting, so they put pauses in between requests.
The mobile device API is completely undocumented and very difficult to program against. I would recommend scraping.
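For example, the pause-between-requests approach is usually as simple as the sketch below (the package names and the five-second delay are just illustrative choices):
import time
import requests

# illustrative package ids; in practice this list would come from your own crawl queue
package_ids = ["com.nintendo.zara", "com.example.someapp"]

for package_id in package_ids:
    page = requests.get(
        "https://play.google.com/store/apps/details",
        params={"id": package_id, "hl": "en", "gl": "US"},
        timeout=30,
    )
    # ... parse page.text with your HTML parser of choice ...
    time.sleep(5)  # pause between requests to avoid being rate limited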
There is no official API or feed that you can use.
The android-market-api project can be used to get app details from the Google store. You can check it out here: https://code.google.com/p/android-market-api/
Unfortunately Google Play (previously known as Android Market) does not expose an official API.
To get the data you need, you could develop your own HTML crawler, parse the page and extract the app meta-data you need. This topic has been covered in other questions, for instance here.
If you don't want to implement all that by yourself (as you mentioned it's a complex project to do), you could use a third-party service to access Android apps meta-data through a JSON-based API.
For instance, 42matters.com (the company I work for) offers an API for both Android and iOS; you can see more details here.
The endpoints range from "lookup" (to get one app's meta-data, probably what you need) to "search", but we also expose "rank history" and other stats from the leading app stores. We have extensive documentation for all supported features; you can find it in the left panel: 42matters docs
I hope this helps, otherwise feel free to get in touch with me. I know this industry quite well and can point you in the right direction.
Regards,
Andrea
The request might be blocked when using requests, because the default user-agent in the requests library is python-requests.
An additional step could be to rotate the user-agent, for example to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on. User-agent rotation can be used in combination with proxy rotation (ideally residential) and a CAPTCHA solver.
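A minimal sketch of what that rotation can look like with requests (the user-agent strings below are just examples):
import random
import requests

# a small pool of desktop/mobile user-agents to rotate through (example strings)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Mobile Safari/537.36",
]

headers = {"user-agent": random.choice(user_agents)}  # pick a different UA per request
html = requests.get(
    "https://play.google.com/store/apps/details",
    params={"id": "com.nintendo.zara", "gl": "US", "hl": "en_GB"},
    headers=headers,
    timeout=30,
)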
At the moment, the Google Play Store has been heavily redesigned and is now almost completely dynamic. However, all the data can be extracted from the inline JSON.
For scraping dynamic sites, a Selenium or Playwright webdriver is great. However, in our case, using BeautifulSoup and regular expressions is faster for extracting data from the page source.
We must extract a certain <script> element from all the <script> elements in the HTML by using a regular expression, and transform it into a dict with json.loads():
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, re, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}
# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "id": "com.nintendo.zara",  # app name
    "gl": "US",                 # country of the search
    "hl": "en_GB"               # language of the search
}
html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
# where all app data will be stored
app_data = {
    "basic_info": {
        "developer": {},
        "downloads_info": {}
    }
}
# [11] index is a basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])
# https://regex101.com/r/6Reb0M/1
additional_basic_info = re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>",
                                  str(soup.select("script")), re.M|re.DOTALL).group(1)
app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("#type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "") # replace new line character to nothing
app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1) # 4.287856 -> 4.3
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]
app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")
# https://regex101.com/r/C1WnuO/1
app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)
# https://regex101.com/r/Y2mWEX/1 (a few matches but re.search always matches the first occurence)
app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)
# https://regex101.com/r/7yxDJM/1
app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)
# https://regex101.com/r/jjsdUP/1
# [2:] skips 2 PEGI logo thumbnails and extracts only app images
app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]
try:
    # https://regex101.com/r/C1WnuO/1
    app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
except:
    app_data["basic_info"]["video_trailer"] = None
print(json.dumps(app_data, indent=2, ensure_ascii=False))
Example output:
{
  "basic_info": {
    "developer": {
      "name": "Nintendo Co., Ltd.",
      "url": "https://supermariorun.com/",
      "email": "supermariorun-support@nintendo.co.jp"
    },
    "downloads_info": {
      "long_form_not_formatted": "100,000,000+",
      "long_form_formatted": "100000000",
      "as_displayed_short_form": "100M+",
      "actual_downloads": "213064462"
    },
    "name": "Super Mario Run",
    "type": "SoftwareApplication",
    "url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_GB&gl=US",
    "description": "Control Mario with just a tap!",
    "application_category": "GAME_ACTION",
    "operating_system": "ANDROID",
    "thumbnail": "https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
    "content_rating": "Everyone",
    "rating": 4.0,
    "reviews": "1645926",
    "price": "0",
    "release_date": "22 Mar 2017",
    "images": [
      "https://play-lh.googleusercontent.com/yT8ZCQHNB_MGT9Oc6mC5_mQS5vZ-5A4fvKQHHOl9NBy8yWGbM5-EFG_uISOXmypBYQ6G",
      "https://play-lh.googleusercontent.com/AvRrlEpV8TCryInAnA__FcXqDu5d3i-XrUp8acW2LNmzkU-rFXkAKgmJPA_4AHbNjyY",
      "https://play-lh.googleusercontent.com/AESbAa4QFa9-lVJY0vmAWyq2GXysv5VYtpPuDizOQn40jS9Z_ji8HXHA5hnOIzaf_w",
      "https://play-lh.googleusercontent.com/KOCWy63UI2p7Fc65_X5gnIHsErEt7gpuKoD-KcvpGfRSHp-4k8YBGyPPopnrNQpdiQ",
      "https://play-lh.googleusercontent.com/iDJagD2rKMJ92hNUi5WS2S_mQ6IrKkz6-G8c_zHNU9Ck8XMrZZP-1S_KkDsA6KDJ9No",
      # ...
    ]
  }
}
A possible solution, with shorter and simpler code, could be the Google Play Store API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create a parser and maintain it.
A simple SerpApi code example:
from serpapi import GoogleSearch
import os, json
params = {
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google_play_product",    # parsing engine
    "store": "apps",                    # app page
    "gl": "us",                         # country of the search
    "product_id": "com.nintendo.zara",  # app id
    "all_reviews": "true"               # shows all reviews
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict()
print(json.dumps(results["product_info"], indent=2, ensure_ascii=False))
print(json.dumps(results["media"], indent=2, ensure_ascii=False))
# other data
The output is exactly the same as in the previous solution.
There's a Scrape Google Play Store App in Python blog post if you need a little bit more code explanation.
Disclaimer: I work for SerpApi.