I found an XHR request that contains all the street address info I want to scrape.
However, I do not know how to extract it into a pandas DataFrame or a Python list. Any ideas? Thank you very much!
Since it's GraphQL, you can formulate the query string however you like, but here I've written it the same way it's sent when the browser makes the request:
def main():
    import requests

    url = "https://api-endpoint.cons-prod-us-central1.kw.com/graphql"
    headers = {
        "x-shared-secret": "MjFydHQ0dndjM3ZAI0ZHQCQkI0BHIyM="
    }
    query = """{
      ListOfficeQuery {
        id
        name
        address
        subAddress
        phone
        fax
        lat
        lng
        url
        contacts {
          name
          email
          phone
          __typename
        }
        __typename
      }
    }
    """
    payload = {
        "operationName": None,
        "variables": {},
        "query": query
    }
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    offices = response.json()["data"]["ListOfficeQuery"]
    print(f"There are {len(offices)} offices, and the first one's address is \"{offices[0]['address']}\"")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
There are 1173 offices, and the first one's address is "1801 South Mo-Pac Expressway, Suite 100"
You can replicate the XHR with Python's requests library, using the information shown in the Headers tab of the browser's developer tools. Then parse the response with the json library and extract the fields you need.
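To get from the parsed JSON to a Python list (and from there to a DataFrame), a minimal sketch might look like the following. It assumes the response has the `data` / `ListOfficeQuery` shape used in the answer above; the sample payload, its ids and names, and the `office_rows` helper are made up for illustration.

```python
import json

# Hand-written sample shaped like the GraphQL response; ids, names and
# the second office are invented purely for illustration.
SAMPLE_RESPONSE = """
{"data": {"ListOfficeQuery": [
  {"id": 1, "name": "Office A",
   "address": "1801 South Mo-Pac Expressway, Suite 100",
   "contacts": [{"name": "Jane", "email": "jane@example.com"}]},
  {"id": 2, "name": "Office B", "address": "1 Example Street", "contacts": []}
]}}
"""


def office_rows(payload):
    """Flatten the response into a list of plain dicts, one per office."""
    rows = []
    for office in payload["data"]["ListOfficeQuery"]:
        contacts = office.get("contacts") or []
        rows.append({
            "id": office.get("id"),
            "name": office.get("name"),
            "address": office.get("address"),
            # Keep only the first contact's e-mail, if there is one.
            "first_contact_email": contacts[0].get("email") if contacts else None,
        })
    return rows


rows = office_rows(json.loads(SAMPLE_RESPONSE))
addresses = [row["address"] for row in rows]
print(addresses)
```

If pandas is installed, `pandas.DataFrame(rows)` should turn the same list of dicts into a DataFrame with one column per key.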
I need Binance data to build a mobile app. Only USDT pairs are needed. The link below returns all trading pairs, but I only want USDT pairs. Which endpoint should I use for this?
https://api.binance.com/api/v3/ticker/price
You can use the Binance Exchange API. There is no need for registering.
The API call used is this: https://api.binance.com/api/v3/exchangeInfo
I recommend you use Google Colab and Python, or any other Python environment:
import requests

def get_response(url):
    response = requests.get(url)
    response.raise_for_status()  # raises exception when not a 2xx response
    if response.status_code != 204:
        return response.json()

def get_exchange_info():
    base_url = 'https://api.binance.com'
    endpoint = '/api/v3/exchangeInfo'
    return get_response(base_url + endpoint)

def create_symbols_list(filter='USDT'):
    rows = []
    info = get_exchange_info()
    pairs_data = info['symbols']
    full_data_dic = {s['symbol']: s for s in pairs_data if filter in s['symbol']}
    return full_data_dic.keys()

create_symbols_list('USDT')
Result:
['BTCUSDT', 'ETHUSDT', 'BNBUSDT', 'BCCUSDT', 'NEOUSDT', 'LTCUSDT',...
The API call returns a very large response filled with interesting data about the exchange. In the function create_symbols_list, all the data for the matching pairs ends up in the full_data_dic dictionary.
There is also a Python Binance client library, which you can use to list the tickers that are quoted in USDT (and whose status is TRADING):
from binance.client import Client

client = Client()
info = client.get_exchange_info()
for c in info['symbols']:
    if c['quoteAsset'] == 'USDT' and c['status'] == "TRADING":
        print(c['symbol'])
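The difference between the two filters above (substring match on the symbol versus checking quoteAsset) can be sketched offline. The sample below is hand-written in the shape of the `symbols` list from exchangeInfo; its entries are illustrative rather than live data, and `usdt_quoted_symbols` is a hypothetical helper name.

```python
# Hand-written sample shaped like info['symbols']; not live exchange data.
SAMPLE_INFO = {
    "symbols": [
        {"symbol": "BTCUSDT", "baseAsset": "BTC", "quoteAsset": "USDT", "status": "TRADING"},
        {"symbol": "ETHBTC", "baseAsset": "ETH", "quoteAsset": "BTC", "status": "TRADING"},
        {"symbol": "USDTTRY", "baseAsset": "USDT", "quoteAsset": "TRY", "status": "TRADING"},
        {"symbol": "BCCUSDT", "baseAsset": "BCC", "quoteAsset": "USDT", "status": "BREAK"},
    ]
}


def usdt_quoted_symbols(info, only_trading=True):
    """Return symbols whose quote asset is USDT, optionally only tradable ones."""
    result = []
    for entry in info["symbols"]:
        if entry["quoteAsset"] != "USDT":
            continue
        if only_trading and entry["status"] != "TRADING":
            continue
        result.append(entry["symbol"])
    return result


# The substring test also matches pairs where USDT is the base asset.
by_substring = [s["symbol"] for s in SAMPLE_INFO["symbols"] if "USDT" in s["symbol"]]
by_quote_asset = usdt_quoted_symbols(SAMPLE_INFO)
print(by_substring)
print(by_quote_asset)
```

Checking quoteAsset is stricter, which is probably what "only USDT pairs" means for a price list; pass `only_trading=False` to keep pairs that are currently not trading.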
I want to get a channel's member count, but I don't know which method I should use.
I am not an admin in that channel; I just want to get the count.
EDIT: I am using the main Telegram API, not the Telegram Bot API.
You can use getChatMembersCount method.
Use this method to get the number of members in a chat.
It worked for me :)
from telethon import TelegramClient, sync
from telethon.tl.functions.channels import GetFullChannelRequest

api_id = 0  # replace with your integer API ID
api_hash = 'API HASH'

client = TelegramClient('session_name', api_id, api_hash)
client.start()
if not client.is_user_authorized():
    phone_number = 'PHONE NUMBER'
    client.send_code_request(phone_number)
    myself = client.sign_in(phone_number, input('Enter code: '))

channel = client.get_entity('CHANNEL LINK')
members = client.get_participants(channel)
print(len(members))
It is also possible to do this through GetFullChannelRequest in Telethon (client_to_manage is your TelegramClient instance):
async def main():
    async with client_to_manage as client:
        full_info = await client(GetFullChannelRequest(channel="moscowproc"))
        print(f"count: {full_info.full_chat.participants_count}")

if __name__ == '__main__':
    client_to_manage.loop.run_until_complete(main())
Or, written without async/await:
def main():
    with client_to_manage as client:
        full_info = client.loop.run_until_complete(client(GetFullChannelRequest(channel="moscowproc")))
        print(f"count: {full_info.full_chat.participants_count}")

if __name__ == '__main__':
    main()
As mentioned above, this is also feasible through the Bot API with the
getChatMembersCount method. You can curl it or query the URL from Python.
The Python code can look like this:
import json
from urllib.request import urlopen

url = "https://api.telegram.org/bot<your-bot-api-token>/getChatMembersCount?chat_id=@<channel-name>"
with urlopen(url) as f:
    resp = json.load(f)
print(resp['result'])
where <your-bot-api-token> is the token provided by BotFather, and <channel-name> is the username of the channel whose subscriber count you want to know, given in the @channelusername format (of course, everything without "<>").
To check first, simply curl it:
curl "https://api.telegram.org/bot<your-bot-api-token>/getChatMembersCount?chat_id=@<channel-name>"
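The request URL can also be assembled in code so that the chat_id is percent-encoded. This is only a sketch: `member_count_url` is a hypothetical helper, it assumes the target is a public channel addressed by username (numeric chat IDs are not handled), and the token is the placeholder from the answer.

```python
from urllib.parse import urlencode


def member_count_url(token, channel):
    """Build the getChatMembersCount URL for a public channel username."""
    # Public channels are addressed as "@username" in the Bot API.
    chat_id = channel if channel.startswith("@") else "@" + channel
    # urlencode percent-encodes "@" as %40, so it cannot clash with URL syntax.
    query = urlencode({"chat_id": chat_id})
    return f"https://api.telegram.org/bot{token}/getChatMembersCount?{query}"


url = member_count_url("<your-bot-api-token>", "moscowproc")
print(url)
```

The resulting string can be passed to `urlopen` exactly as in the snippet above once the placeholder token is replaced.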
I'm experiencing a problem with GCP Pub/Sub where a small percentage of data is lost when publishing thousands of messages within a couple of seconds.
I'm logging both the message_id from Pub/Sub and a session_id unique to each message on the publishing end as well as on the receiving end. The result I'm seeing is that some messages on the receiving end have the same session_id but different message_ids. Also, some messages are missing.
For example, in one test I sent 5,000 messages to Pub/Sub and exactly 5,000 messages were received, but 8 of the original messages were lost, with duplicates making up the difference. The log entries for a lost message look like this:
MISSING sessionId:sessionId: 731 (missing in log from pull request, but present in log from Flask API)
messageId FOUND: messageId:108562396466545
API: 200 **** sessionId: 731, messageId:108562396466545 ******(Log from Flask API)
Pubsub: sessionId: 730, messageId:108562396466545(Log from pull request)
And the duplicates look like:
======= Duplicates FOUND on sessionId: 730=======
sessionId: 730, messageId:108562396466545
sessionId: 730, messageId:108561339282318
(both are logs from pull request)
All missing data and duplicates look like this.
From the above example, it is clear that some messages have taken the message_id of another message, and have been sent twice with two different message_ids.
I wonder if anyone could help me figure out what is going on? Thanks in advance.
Code
I have an API that sends messages to Pub/Sub, which looks like this:
from flask import Flask, request, jsonify, render_template
from flask_cors import CORS, cross_origin
import simplejson as json
from google.cloud import pubsub
from functools import wraps
import re
import json

app = Flask(__name__)
ps = pubsub.Client()  # one client shared by all requests
...
@app.route('/publish', methods=['POST'])
@cross_origin()
@json_validator
def publish_test_topic():
    pubsub_topic = 'test_topic'
    data = request.data
    topic = ps.topic(pubsub_topic)
    event = json.loads(data)
    messageId = topic.publish(data)
    return '200 **** sessionId: ' + str(event["sessionId"]) + ", messageId:" + messageId + " ******"
And this is the code I used to read from pubsub:
from google.cloud import pubsub
import re
import json

ps = pubsub.Client()
topic = ps.topic('test-xiu')
sub = topic.subscription('TEST-xiu')
max_messages = 1
stop = False
messages = []

class Message(object):
    """docstring for Message."""
    def __init__(self, sessionId, messageId):
        super(Message, self).__init__()
        self.sessionId = sessionId
        self.messageId = messageId

def pull_all():
    while stop == False:
        m = sub.pull(max_messages = max_messages, return_immediately = False)
        for data in m:
            # each pulled item is an (ack_id, message) pair
            ack_id = data[0]
            message = data[1]
            messageId = message.message_id
            data = message.data
            event = json.loads(data)
            sessionId = str(event["sessionId"])
            messages.append(Message(sessionId = sessionId, messageId = messageId))
            print '200 **** sessionId: ' + sessionId + ", messageId:" + messageId + " ******"
            sub.acknowledge(ack_ids = [ack_id])

pull_all()
For generating session_id, sending request & logging response from API:
// generate trackable sessionId
var sessionId = 0
var increment_session_id = function () {
    sessionId++;
    return sessionId;
}

var generate_data = function () {
    var data = {};
    // data.sessionId = faker.random.uuid();
    data.sessionId = increment_session_id();
    data.user = get_rand(userList);
    data.device = get_rand(deviceList);
    data.visitTime = new Date;
    data.location = get_rand(locationList);
    data.content = get_rand(contentList);
    return data;
}

var sendData = function (url, payload) {
    var request = $.ajax({
        url: url,
        contentType: 'application/json',
        method: 'POST',
        data: JSON.stringify(payload),
        error: function (xhr, status, errorThrown) {
            console.log(xhr, status, errorThrown);
            $('.result').prepend("<pre id='json'>" + JSON.stringify(xhr, null, 2) + "</pre>")
            $('.result').prepend("<div>errorThrown: " + errorThrown + "</div>")
            $('.result').prepend("<div>======FAIL=======</div><div>status: " + status + "</div>")
        }
    }).done(function (xhr) {
        console.log(xhr);
        $('.result').prepend("<div>======SUCCESS=======</div><pre id='json'>" + JSON.stringify(payload, null, 2) + "</pre>")
    })
}

$(submit_button).click(function () {
    var request_num = get_request_num();
    var request_url = get_url();
    for (var i = 0; i < request_num; i++) {
        var data = generate_data();
        var loadData = changeVerb(data, 'load');
        sendData(request_url, loadData);
    }
})
UPDATE
I made a change to the API, and the issue seems to have gone away. Instead of using one pubsub.Client() for all requests, I now initialize a client for every incoming request. The new API looks like:
from flask import Flask, request, jsonify, render_template
from flask_cors import CORS, cross_origin
import simplejson as json
from google.cloud import pubsub
from functools import wraps
import re
import json

app = Flask(__name__)
...
@app.route('/publish', methods=['POST'])
@cross_origin()
@json_validator
def publish_test_topic():
    ps = pubsub.Client()  # a new client for every request
    pubsub_topic = 'test_topic'
    data = request.data
    topic = ps.topic(pubsub_topic)
    event = json.loads(data)
    messageId = topic.publish(data)
    return '200 **** sessionId: ' + str(event["sessionId"]) + ", messageId:" + messageId + " ******"
Talked with some guy from Google, and it seems to be an issue with the Python Client:
The consensus on our side is that there is a thread-safety problem in the current python client. The client library is being rewritten almost from scratch as we speak, so I don't want to pursue any fixes in the current version. We expect the new version to become available by end of June.
Running the current code with thread_safe: false in app.yaml or, better yet, just instantiating the client in every call is the workaround -- the solution you found.
For detailed solution, please see the Update in the question
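A middle ground between one shared client and one client per request is one client per thread. The sketch below shows that pattern with `threading.local`; it is not what the Google engineer suggested, it is untested against the old Pub/Sub client, and `FakeClient` is a stand-in for `pubsub.Client` so the example runs without credentials.

```python
import threading


class FakeClient:
    """Stand-in for pubsub.Client so the sketch runs without credentials."""


_local = threading.local()


def get_client(factory=FakeClient):
    """Return the calling thread's client, creating it on first use."""
    client = getattr(_local, "client", None)
    if client is None:
        client = factory()
        _local.client = client
    return client


def _worker(results, index):
    # Each worker thread stores the client it was handed.
    results[index] = get_client()


results = [None, None]
threads = [threading.Thread(target=_worker, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[0] is results[1])      # compares clients from two different threads
print(get_client() is get_client())  # compares two lookups in the same thread
```

In the Flask handler, `ps = get_client(pubsub.Client)` would then replace the per-request `pubsub.Client()` call, assuming the thread-safety problem is limited to sharing one client across threads.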
Google Cloud Pub/Sub message IDs are unique. It should not be possible for "some messages [to have] taken the message_id of another message." The fact that message ID 108562396466545 was seemingly received means that Pub/Sub did deliver the message to the subscriber and that it was not lost.
I recommend you check how your session_ids are generated to ensure that they are indeed unique and that there is exactly one per message. Searching for the sessionId in your JSON via a regular expression search seems a little strange. You would be better off parsing this JSON into an actual object and accessing fields that way.
In general, duplicate messages in Cloud Pub/Sub are always possible; the system guarantees at-least-once delivery. Those messages can be delivered with the same message ID if the duplication happens on the subscribe side (e.g., the ack is not processed in time) or with a different message ID (e.g., if the publish of the message is retried after an error like a deadline exceeded).
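Because a publish retry produces a new message ID, deduplicating on the subscriber side usually needs an application-level key rather than the Pub/Sub message ID. A minimal sketch, assuming each payload is JSON with a unique `sessionId` as in the question; `IdempotentHandler` is a hypothetical name, and the in-memory set only works within a single process.

```python
import json


class IdempotentHandler:
    """Processes each logical message once, keyed by an application-level ID."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, raw_data):
        """Return True if the message was processed, False if it was a duplicate."""
        event = json.loads(raw_data)  # parse the payload instead of regex-searching it
        key = event["sessionId"]
        if key in self.seen:
            return False
        self.seen.add(key)
        self.processed.append(event)
        return True


handler = IdempotentHandler()
deliveries = [
    '{"sessionId": 730, "user": "a"}',
    '{"sessionId": 731, "user": "b"}',
    '{"sessionId": 730, "user": "a"}',  # redelivery or publish retry
]
outcomes = [handler.handle(d) for d in deliveries]
print(outcomes)
```

A duplicate should still be acknowledged after it is skipped; otherwise Pub/Sub keeps redelivering it.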
You shouldn't need to create a new client for every publish operation. I'm betting that the reason that that "fixed the problem" is because it mitigated a race that exists in the publisher client side. I'm also not convinced that the log line you've shown on the publisher side:
API: 200 **** sessionId: 731, messageId:108562396466545 ******
corresponds to a successful publish of sessionId 731 by publish_test_topic(). Under what conditions is that log line printed? The code that has been presented so far does not show this.
I wrote the following code to extract tweets with specific hashtags.
import json
import oauth2
import time
import io

Consumer_Key = ""
Consumer_Secret = ""
access_token = ""
access_token_secret = ""

def oauth_req(url, key, secret, http_method="GET", post_body="", http_headers=None):
    consumer = oauth2.Consumer(key="", secret="")
    token = oauth2.Token(key=key, secret=secret)
    client = oauth2.Client(consumer, token)
    content = client.request( url, method=http_method, body=post_body, headers=http_headers )
    return content

tweet_url = 'https://twitter.com/search.json?q=%23IPv4%20OR%20%23ISP%20OR%20%23WiFi%20OR%20%23Modem%20OR%20%23Internet%20OR%20%23IPV6'
jsn = oauth_req( tweet_url, access_token, access_token_secret )
print jsn
My hashtags are: IPv4, IPv6, ISP, Internet, Modem. If a tweet has at least one of these hashtags, it should be written to my file.
Unfortunately, the request returns HTML instead.
The output is as follows:
({'content-length': '338352', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff',........................
.............................-post-iframe" name="tweet-post-iframe"></iframe>\n <iframe aria-hidden="true" class="dm-post-iframe" name="dm-post-iframe"></iframe>\n\n</div>\n\n </body>\n</html>\n')
Any lead in this regard will be appreciated.
Take a look at your tweet URL:
tweet_url = 'https://twitter.com/search.json?q=%23IPv4%20OR%20%23ISP%20OR%20%23WiFi%20OR%20%23Modem%20OR%20%23Internet%20OR%20%23IPV6'
That is the URL of the Twitter website, not of the API, which is why you get HTML back.
If you are trying to extract tweets through the Twitter API, replace it with this URL:
tweet_url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23IPv4%20OR%20%23ISP%20OR%20%23WiFi%20OR%20%23Modem%20OR%20%23Internet%20OR%20%23IPV6'
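Rather than hand-encoding the query, the same URL can be generated from a list of hashtags. A small sketch, where `build_search_url` is a hypothetical helper and the endpoint is the one given above:

```python
from urllib.parse import quote

SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(hashtags):
    """Join the hashtags with OR and percent-encode them as the q parameter."""
    query = " OR ".join("#" + tag for tag in hashtags)
    # quote() encodes "#" as %23 and each space as %20.
    return SEARCH_ENDPOINT + "?q=" + quote(query, safe="")


tweet_url = build_search_url(["IPv4", "ISP", "WiFi", "Modem", "Internet", "IPV6"])
print(tweet_url)
```

Adding or removing a hashtag then only means editing the list.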
I'm trying to replicate the following successful cURL operation with Grinder.
curl -X PUT -d "title=Here%27s+the+title&content=Here%27s+the+content&signature=myusername%3A3ad1117dab0ade17bdbd47cc8efd5b08" http://www.mysite.com/api
Here's my script:
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder
from net.grinder.plugin.http import HTTPRequest
from HTTPClient import NVPair
import hashlib

test1 = Test(1, "Request resource")
request1 = HTTPRequest(url="http://www.mysite.com/api")
test1.record(request1)
log = grinder.logger.info
test1.record(log)
m = hashlib.md5()

class TestRunner:
    def __call__(self):
        params = [NVPair("title","Here's the title"),NVPair("content", "Here's the content")]
        params.sort(key=lambda param: param.getName())
        ps = ""
        for param in params:
            ps = ps + param.getValue() + ":"
        ps = ps + "myapikey"
        m.update(ps)
        params.append(NVPair("signature", ("myusername:" + m.hexdigest())))
        request1.setFormData(tuple(params))
        result = request1.PUT()
The test runs okay, but it seems that my script doesn't actually send any of the params data to the API, and I can't work out why. There are no errors generated, but I get a 401 Unauthorized response from the API, indicating that a successful PUT request reached it, but obviously without a signature the request was rejected.
This isn't exactly an answer, more of a workaround that I came up with, that I've decided to post since this question hasn't yet received any responses, and it may help anyone else trying to achieve the same thing.
The workaround is basically to use the httplib and urllib modules to build and make the PUT request instead of the HTTPClient module.
import hashlib
import httplib, urllib
....
params = [("title", "Here's the title"),("content", "Here's the content")]
params.sort(key=lambda param: param[0])
ps = ""
for param in params:
    ps = ps + param[1] + ":"
ps = ps + "myapikey"
m = hashlib.md5()
m.update(ps)
params.append(("signature", "myusername:" + m.hexdigest()))
params = urllib.urlencode(params)
print params
headers = {"Content-type": "application/x-www-form-urlencoded"}
conn = httplib.HTTPConnection("www.mysite.com:80")
conn.request("PUT", "/api", params, headers)
response = conn.getresponse()
print response.status, response.reason
print response.read()
conn.close()
(Based on the example at the bottom of this documentation page.)
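The signing step on its own can be sketched in Python 3 as a pure function. One detail differs from the question's script: there the `hashlib.md5()` object is created once at module level, so every `update()` call adds to the same hash and later runs sign all earlier input as well; the sketch creates a fresh digest per call. `sign_params` is a hypothetical helper, and the UTF-8 encoding is an assumption (Python 3 hashes bytes, not str).

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params, api_key, username):
    """Return the name-sorted params plus a signature pair."""
    ordered = sorted(params, key=lambda pair: pair[0])
    # Concatenate the values in name order, each followed by ":", then the key.
    to_hash = "".join(value + ":" for _, value in ordered) + api_key
    # A new md5 object per call, so nothing carries over between requests.
    digest = hashlib.md5(to_hash.encode("utf-8")).hexdigest()
    return ordered + [("signature", username + ":" + digest)]


params = [("title", "Here's the title"), ("content", "Here's the content")]
signed = sign_params(params, "myapikey", "myusername")
body = urlencode(signed)
print(body)
```

The resulting `body` has the same form-encoded shape as the `-d` argument of the cURL command, with the parameters in sorted order.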
Refer to the multi-part form posting example in the Grinder script gallery, but change the POST to a PUT. It works for me.
from HTTPClient import Codecs, NVPair
from jarray import zeros

files = ( NVPair("self", "form.py"), )
parameters = ( NVPair("run number", str(grinder.runNumber)), )
# This is the Jython way of creating an NVPair[] Java array
# with one element.
headers = zeros(1, NVPair)
# Create a multi-part form encoded byte array.
data = Codecs.mpFormDataEncode(parameters, files, headers)
grinder.logger.output("Content type set to %s" % headers[0].value)
# Call the version of PUT that takes a byte array.
result = request1.PUT("/upload", data, headers)