Scraping AngelList start-up data - Selenium

I want to scrape data into a spreadsheet from this site: the Angel.co startup list. I have tried many ways, but each shows an error. I used IMPORTXML and IMPORTHTML in the spreadsheet, and neither works.
Format: startup name, location, category.
Thanks in advance for any help.
I tried the requests approach below to scrape the data, but it produces no output.
import re
import requests

URL = 'https://angel.co/social-network-2'
headers = {
    "Host": "www.angel.co",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://angel.co/social-network-2",
    "X-Requested-With": "XMLHttpRequest",
    "via": "1.1 google"
}

# expects a JSON payload whose 'data' entries contain HTML fragments
datas = requests.get(URL, headers=headers).json()

for i in datas['data']:
    for j in re.findall('class="uni-link">(.*)</a>', i['title']):
        print(j)

I am afraid you will not be able to scrape this webpage.
The problem is that it sits behind Cloudflare protection, which is specifically designed to prevent this kind of automated bot scraping...
My only suggestion would be to accept this fact and not waste your time...
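One way to confirm the block is to inspect the raw response instead of calling .json() on it. A minimal sketch (the exact status code and markers vary with the site's Cloudflare configuration):

import requests

res = requests.get(
    'https://angel.co/social-network-2',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0'},
)
# a Cloudflare challenge typically answers with 403 or 503 instead of the page
print(res.status_code)
print(res.headers.get('Server', ''))  # usually reports "cloudflare"
print(res.text[:200])                 # the interstitial HTML, not the startup list

That also explains why the snippet above produced no output: the response body is the challenge page, not the JSON it expects.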

Related

HLS Stream OneTimePassword (GetHLSToken from 511nj.org)

For research purposes, we analyze and extract large-scale traffic data from 511nj cameras. Starting in 2020, the site adds an OTP (one-time password) to the video link, which is requested through the website API: https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd=202205281151, as observed in the network traffic. However, while the website itself gets a 200 response with a valid token, accessing the endpoint myself gets a 401 response, which indicates an unauthorized request. What confuses me is why I can get the camera feed with the OTP through the website, but cannot access the API myself. Is there a solution?
Failed attempt:
import datetime
import requests

requests.get('https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd='
             + datetime.datetime.now().strftime('%Y%m%d%H%M')).json()
20220601 Update: I have modified the code based on @Brad's comment. However, the error message now turns into "An error has occurred". Please help!
def GetOTP():
    import requests
    import json
    import datetime

    Now = datetime.datetime.now()
    DT = Now.strftime('%Y%m%d%I%M%S')   # 12-hour clock
    DT24 = Now.strftime('%Y%m%d%H%M')   # 24-hour clock

    # digit-substitution table used to build the obfuscated 'responsetype' header
    Enigma = dict(zip(map(str, range(0, 10)), ('2', '1', '*', '3', '4', '5', '6', '7', '8', '9')))

    cookies = {
        'ReloadVersion': '112',
        '_ga': 'GA1.2.335422978.1653768841',
        '_gid': 'GA1.2.721252567.1653768841',
    }
    headers = {
        'authority': '511nj.org',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'referer': 'https://511nj.org/Scripts/Directives/HLSPlayer.html?v=V223',
        'responsetype': "".join([Enigma[key] for key in DT[4:6]]) + '$'.join([Enigma[key] for key in DT[6:8]]) + '$'.join([Enigma[key] for key in DT[:4]]) + '#' + Enigma[DT[9]] + "!".join([Enigma[key] for key in DT[10:12]]) + "!" + DT[-2:] + "#PK",
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"macOS"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36',
    }
    params = {
        'Id': '2',
        'rnd': DT24,
    }

    Response = requests.get('https://511nj.org/api/client/camera/getHlsToken?Id=2&rnd=' + DT24,
                            headers=headers, cookies=cookies, params=params).json()
    print("".join([Enigma[key] for key in DT[4:6]]) + '$' + ''.join([Enigma[key] for key in DT[6:8]]) + '$' + ''.join([Enigma[key] for key in DT[:4]]) + '#' + Enigma[DT[9]] + "!" + ''.join([Enigma[key] for key in DT[10:12]]) + "!" + DT[-2:] + "#PK")
    print(params['rnd'])
    print(Response)
    return Response['Data']
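For illustration, here is what that digit substitution produces for a fixed timestamp. This is a minimal sketch mirroring the print statement in GetOTP (the sample timestamp is made up). Note, incidentally, that the responsetype header expression above uses '$'.join(...) between digit lists, which yields a slightly different string from this print; that mismatch may or may not be intended:

import datetime

Enigma = dict(zip(map(str, range(0, 10)), ('2', '1', '*', '3', '4', '5', '6', '7', '8', '9')))
DT = datetime.datetime(2022, 5, 28, 11, 45, 32).strftime('%Y%m%d%I%M%S')  # '20220528114532'

enc = lambda s: ''.join(Enigma[c] for c in s)
# layout: month $ day $ year # hour-digit ! minute ! seconds #PK
print(enc(DT[4:6]) + '$' + enc(DT[6:8]) + '$' + enc(DT[:4])
      + '#' + Enigma[DT[9]] + '!' + enc(DT[10:12]) + '!' + DT[-2:] + '#PK')
# -> 25$*8$*2**#1!45!32#PK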

reformat HTTP request from json file to raw

I have multiple files, each containing an HTTP request in JSON format. Every file's content looks like the following:
{
    "http://testphp.vulnweb.com/search.php": {
        "headers": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.5",
            "Connection": "close",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
        },
        "method": "POST",
        "params": [
            "goButton",
            "searchFor"
        ]
    }
}
The desired output is a raw request like the following:
POST /search.php HTTP/1.1
Host: testphp.vulnweb.com
Content-Length: 23
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30
Connection: close

searchFor=a&goButton=go
By the way, the line terminator is \r\n, as that is the standard format for a raw HTTP request, and there is an extra \r\n (a blank line) before the parameters.
I really don't know whether awk can handle this, but if it can, how hard would it be?
Please also recommend any online tool if I have to go that way, because I am struggling to find anything that reformats JSON into a raw request.
Thanks
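Since the rest of this page uses Python, here is a minimal Python sketch of the conversion instead of awk. It assumes exactly the JSON layout shown above; json_to_raw and the placeholder parameter values are illustrative, not from the original post (the JSON lists only parameter names, so values must be supplied separately):

import json
from urllib.parse import urlsplit

def json_to_raw(path, param_values=None):
    # each file maps a single URL to its request description
    with open(path) as f:
        data = json.load(f)
    (url, req), = data.items()
    parts = urlsplit(url)

    # values are not stored in the JSON, so fall back to a dummy value
    param_values = param_values or {}
    body = '&'.join('{}={}'.format(p, param_values.get(p, 'x'))
                    for p in req.get('params', []))

    lines = ['{} {} HTTP/1.1'.format(req['method'], parts.path or '/'),
             'Host: ' + parts.netloc]
    if body:
        lines.append('Content-Length: ' + str(len(body)))
        lines.append('Content-Type: application/x-www-form-urlencoded')
    lines += ['{}: {}'.format(k, v) for k, v in req.get('headers', {}).items()]

    # the extra \r\n separates the headers from the body
    return '\r\n'.join(lines) + '\r\n\r\n' + body

print(json_to_raw('request.json', {'searchFor': 'a', 'goButton': 'go'}))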

Scraping Lazada data

I have used Selenium to get data like item name, price, reviews and so on from the Lazada website. However, it blocks me after the first scrape. Is there any way to solve this? Could you give a detailed solution? Thank you.
Lazada has strong anti-bot protection, so to get data without being blocked you must use a proxy. You can even get the data with a plain Python request; try the code below:
import requests

cookies = {
    "user": "en"
}
req_headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "x-requested-with": "XMLHttpRequest",
}
proxies = {"https": "http://000.0.0.0:0000"}  # replace with a working proxy

# product_url is the Lazada product page you want to fetch
response_data = requests.get(product_url, headers=req_headers, cookies=cookies,
                             proxies=proxies, verify=False)
You can get the product data from the response text.
For getting reviews you can use this URL:

host = "lazada.sg"  # you can use any region here
"https://my.{}/pdp/review/getReviewList?itemId={}&pageSize=100&filter=0&sort=1&pageNo={}".format(host, item_id, page_no)
If you want to use Selenium, you need to set the proxy in Selenium as well.
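A minimal sketch of proxy configuration for Chrome via Selenium (the proxy address is a placeholder, as above):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://000.0.0.0:0000")  # placeholder proxy
driver = webdriver.Chrome(options=options)
driver.get(product_url)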

Error when connecting to wunderground API

I am trying to scrape wind speed data for different UK weather stations from the Wunderground site. I assume they have an API; I just have a hard time connecting to it.
Here's the XHR link I use:
https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
This is the data I would like: the wind speed table at the bottom of this page:
https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1
My code is pretty simple: I first load the headers; my get_data function returns the response in JSON format.
In my main block I append the data to a DataFrame and print it.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib
from urllib.request import urlopen

# note: keys beginning with ':' are HTTP/2 pseudo-headers copied from the
# browser's dev tools; they are not normal HTTP/1.1 request headers
headers = {
    ':authority': 'api.weather.com',
    # ':path': '/v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131',
    ':scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,da;q=0.7',
    'origin': 'https://www.wunderground.com',
    # 'apiKey': '6532d6454b8aa370768e63d6ba5a832e',
    'referer': 'https://www.wunderground.com/history/monthly/gb/darlington/EGNV/date/2015-1',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'cross-site',
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
}

def get_data(response):
    df = response.json()
    return df

if __name__ == "__main__":
    date = pd.datetime.now().strftime("%d-%m-%Y")
    api_key = "6532d6454b8aa370768e63d6ba5a832e"
    start_date = "20150101"
    end_date = "20150131"
    urls = [
        "https://api.weather.com/v1/location/EGNV:9:GB/observations/historical.json?apiKey=" + api_key + "&units=e&startDate=" + start_date + "&endDate=" + end_date
    ]
    df = pd.DataFrame()
    for url in urls:
        res = requests.get(url, headers=headers)
        data = get_data(res)
        df = df.append(data)
    print(df)
The error I get:
SSLError: HTTPSConnectionPool(host='api.weather.com', port=443): Max retries exceeded with url: /v1/location/EGNV:9:GB/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e&startDate=20150101&endDate=20150131
(Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
Update:
Even without trying to connect to the API, just scraping the page with BS4, I still get denied access. I am not sure why, or how they can detect my scraper.
I solved it.
If I add verify=False to my requests.get() call, I manage to get around the error.
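For reference, the fix is just the extra keyword argument inside the loop. Note that verify=False disables TLS certificate verification entirely, which is what was failing; it works around the handshake error at the cost of security, and urllib3 can silence the resulting warnings:

import urllib3

# verify=False skips certificate verification, so requests no longer
# rejects the server's certificate - but the connection is unverified
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

res = requests.get(url, headers=headers, verify=False)
data = get_data(res)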

ember-simple-auth oauth2 authorizer issue

I am trying to set up authorization in an Ember app running on a Node.js server.
I am using the OAuth2 authenticator, which requests a token from the server. This is working fine: I am able to provide the app with a token, which it saves in local storage.
However, when I make subsequent requests, the authorizer is not adding the token to the header. I have initialized the authorizer using the method described in the documentation (http://ember-simple-auth.simplabs.com/ember-simple-auth-oauth2-api-docs.html):
Ember.Application.initializer({
    name: 'authentication',
    initialize: function(container, application) {
        Ember.SimpleAuth.setup(container, application, {
            authorizerFactory: 'authorizer:oauth2-bearer'
        });
    }
});

var App = Ember.Application.create();
I have also added an init method to the authorizer to log a message to the server when it is initialized, so I know it is being loaded. The only thing is, the authorize method of the authorizer is never called.
It feels like I am missing a fundamental concept of the library.
I have a users route which I have protected using the AuthenticatedRouteMixin like so:
App.UsersRoute = Ember.Route.extend(Ember.SimpleAuth.AuthenticatedRouteMixin, {
    model: function() {
        return this.get('store').find('user');
    }
});
This fetches the data fine, and it redirects to /login if no token is in the session, but the request headers do not include the token:
GET /users HTTP/1.1
Host: *****
Connection: keep-alive
Cache-Control: no-cache
Pragma: no-cache
Accept: application/json, text/javascript, */*; q=0.01
Origin: *****
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36
Referer: *****
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Any help you could give me would be greatly appreciated.
Is your REST API maybe served from a different origin than the one the app is loaded from? Ember.SimpleAuth does not authorize cross-origin requests by default (see here: https://github.com/simplabs/ember-simple-auth#cross-origin-authorization).