What is the right way to post an image to a REST API and receive the data with the Falcon library?

I'm trying to post an image and process it through my REST API. I use Falcon for the backend but could not figure out how to post and receive the data.
This is how I currently send my file:
import requests

img = open('img.png', 'rb')
r = requests.post("http://localhost:8000/rec",
                  files={'file': img},
                  data={'apikey': 'bla'})
However, the Falcon repo says that Falcon does not support HTML forms for sending data; instead it targets the full scope of POSTed and PUTed data, and I don't see how POSTed image data differs from what I am sending above.
So, eventually, I would like to learn the right way to send an image and receive it in a REST API written with Falcon. Could you give me some pointers?

For this you can use the following approach:
Falcon API Code:
import falcon
import base64
import json

class GetImage:

    def on_post(self, req, res):
        # read the raw request body and parse it as JSON
        json_data = json.loads(req.stream.read().decode('utf-8'))
        image_name = json_data['image_name']
        base64encoded_image = json_data['image_data']
        # decode the base64 payload and write it to disk
        with open(image_name, "wb") as fh:
            fh.write(base64.b64decode(base64encoded_image))
        res.status = falcon.HTTP_203
        res.body = json.dumps({'status': 1, 'message': 'success'})

# the route must be registered after the class is defined
app = falcon.API()
app.add_route("/rec/", GetImage())
For the API call:
import requests
import base64

with open("yourfile.png", "rb") as image_file:
    # b64encode returns bytes; decode so the value can be JSON-serialized
    encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

# send JSON (not form data), since the handler above reads the raw body as JSON
r = requests.post("http://localhost:8000/rec/",
                  json={'image_name': 'yourfile.png',
                        'image_data': encoded_image})
print(r.status_code, r.reason)
I hope this will help.
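As an aside, not from the original answer: newer Falcon releases (3.x and later) ship a multipart/form-data media handler, so an upload sent with requests' files= argument, like the one in the question, can be received directly without base64. A minimal sketch, assuming Falcon 3.x:
# Hedged sketch, assuming Falcon 3.x with its built-in multipart handler;
# the client can keep using files={'file': img} instead of base64-in-JSON.
import falcon

class RecResource:

    def on_post(self, req, resp):
        # req.get_media() yields the parts of a multipart/form-data body
        for part in req.get_media():
            if part.name == 'file':  # matches files={'file': img} on the client
                with open(part.secure_filename, 'wb') as fh:
                    fh.write(part.stream.read())
        resp.media = {'status': 1, 'message': 'success'}

app = falcon.App()
app.add_route('/rec', RecResource())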

Related

How can I quickly scrape a list of URLs from dynamically rendered websites with scrapy-playwright using parallel processing?

Here is my spider; it works well, but it is not very fast for larger numbers of pages (tens of thousands):
import scrapy
import csv
from scrapy_playwright.page import PageMethod

class ImmoSpider(scrapy.Spider):
    name = "immo"

    def start_requests(self):
        with open("urls.csv", "r") as f:
            reader = csv.DictReader(f)
            urls = [item['Url-scraper'] for item in reader][0:1]
        for elem in urls:
            yield scrapy.Request(elem, meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_methods': [
                    PageMethod("wait_for_load_state", 'networkidle')
                ],
            })

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        # parse stuff
        yield {
            # yield stuff
        }
My scraper project is set up like the official Scrapy getting-started tutorial.
I'm still a beginner at scraping, so maybe I missed a simple solution.
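As a hedged pointer only (not from the thread): throughput with scrapy-playwright is usually bounded by Scrapy's scheduler settings and by how many Playwright contexts and pages it is allowed to keep open. Note also that the [0:1] slice in start_requests above only yields the first URL, so the timing of the snippet as posted isn't representative. A sketch of the relevant spider settings; the setting names are real Scrapy / scrapy-playwright options, but the values are illustrative and need tuning per site:
# Hedged sketch of concurrency-related settings for this spider.
custom_settings = {
    "CONCURRENT_REQUESTS": 32,              # Scrapy-level parallelism
    "PLAYWRIGHT_MAX_CONTEXTS": 8,           # how many browser contexts stay open
    "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,  # pages per context
}
Waiting for 'networkidle' on every page also adds overhead; a more specific wait condition and closing pages promptly (as parse already does) tend to help.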

await in a python for loop never finishing

I'm trying something new for me: using Playwright in Google Colab.
This combination requires/forces async programming.
I've got a context manager called "Login" which handles the login and logout. That works great!
The internal page I'm trying to get to has datasets, with no links, just divs to click on.
The locator (I believe) is working fine and, I'm assuming, should return multiple elements when combined with .element_handles().
from playwright.async_api import async_playwright
import asyncio
from IPython.display import Image
import nest_asyncio

nest_asyncio.apply()

# browser is set to webkit in the Login() context manager
...

async def loop_over_datasets(browser=None, page=None):
    print("starting")
    datasets = page.locator("div.horizontal.clickable")
    print("continuing")
    datasets = await asyncio.gather(datasets.element_handles())
    for ds in datasets:
        print(f'inside the loop, ds is {ds}')
    print("doesn't get here in tact")
    # for each dataset I want to launch a new page where the dataset is clicked,
    # but I'll settle for sync programming at this point.
    # new_page = await ds.click()
    # ds_page = await browser.new_page(new_page)
    # ds_page.click()

async def get_all_new_info():
    async with Login() as (b, l):
        await loop_over_datasets(browser=b, page=l)

asyncio.run(get_all_new_info())  # has to be killed manually or it will run forever
In the line datasets = await asyncio.gather(datasets.element_handles()), gather() doesn't actually work without await, and the await never returns,
which means I never get to "inside the loop...".
Without await I get the "ds" variable, but it isn't anything I can do something with.
How is this supposed to be used?
Without the full code it's a little hard to test, but I wanted to share a few things that may help:
datasets = await asyncio.gather(datasets.element_handles())
As far as I can see in the Playwright documentation, element_handles() gives you a List[ElementHandle], and you are trying to pass that to asyncio.gather, which needs awaitable objects (coroutines, Tasks, and Futures); that's probably why it's not working. So I would just do the following (in the async API the call itself is still awaited, just not wrapped in gather):
datasets = await datasets.element_handles()
Now, I assume you'd like to go through those datasets in an asynchronous manner. You should be able to put the content of the for loop into a coroutine and, based on that, create tasks that will be executed by gather.
async def process_dataset(ds):
    new_page = await ds.click()
    ds_page = await browser.new_page(new_page)
    ds_page.click()

tasks = []
for ds in datasets:
    tasks.append(asyncio.create_task(process_dataset(ds)))

await asyncio.gather(*tasks)
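Putting both pieces together, a hedged sketch (not verbatim from the answer) of how the asker's loop_over_datasets could look with this pattern; process_dataset is the coroutine defined just above and Login() is the asker's own context manager:
# Hedged sketch: await element_handles() directly, then fan the handles out to tasks.
async def loop_over_datasets(browser=None, page=None):
    handles = await page.locator("div.horizontal.clickable").element_handles()
    tasks = [asyncio.create_task(process_dataset(h)) for h in handles]
    await asyncio.gather(*tasks)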

How can I get a list of all bots registered with BotFather?

My task is to get a list of all of a user's bots after authorizing in the Telegram client through the API. I looked in the documentation for a specific method but did not find one. Can someone tell me how this can be done, and whether it is possible at all?
I don't think there's a direct API for that, unfortunately, but consider automating the interaction with BotFather to gather the list programmatically.
Here is a sample script using Telethon:
from telethon import TelegramClient, events

API_ID = ...
API_HASH = "..."

client = TelegramClient('session', api_id=API_ID, api_hash=API_HASH)
bots_list = []

@client.on(events.MessageEdited(chats="botfather"))
@client.on(events.NewMessage(chats="botfather"))
async def message_handler(event):
    if 'Choose a bot from the list below:' in event.message.message:
        last_page = True
        for row in event.buttons:
            for button in row:
                if button.text == '»':
                    # there is another page of bots; click through to it
                    last_page = False
                    await button.click()
                elif button.text != '«':
                    bots_list.append(button.text)
        if last_page:
            print(bots_list)
            await client.disconnect()
            exit(0)

async def main():
    await client.send_message('botfather', '/mybots')

with client:
    client.loop.run_until_complete(main())
    client.run_until_disconnected()
A sample run would print all the bots registered with BotFather:
['@xxxxxBot', '@xxxxxBot', … ]

scrapy-splash does not crawl recursively with CrawlSpider

I have integrated scrapy-splash in my CrawlSpider's process_request in the rules, like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl renders only the URLs at the first depth.
I also wonder how I can get the response even with a bad HTTP code or a redirected response.
Thanks in advance.
Your problem may be related to this: https://github.com/scrapy-plugins/scrapy-splash/issues/92
In short, try adding this to your parsing callback function:
def parse_item(self, response):
    """Parse response into item; also create new requests."""
    page = RescrapItem()
    ...
    yield page
    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
In case you wonder why this can return both items and new requests, here is the relevant part of the docs: https://doc.scrapy.org/en/latest/topics/spiders.html
In the callback function, you parse the response (web page) and return
either dicts with extracted data, Item objects, Request objects, or an
iterable of these objects. Those Requests will also contain a callback
(maybe the same) and will then be downloaded by Scrapy and then their
response handled by the specified callback.
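For completeness, a hedged sketch (not from the thread) of how a process_request method like the one in the question is usually attached to a CrawlSpider rule; the spider name and start URL here are made up:
# Hedged sketch: wiring process_request into a CrawlSpider rule.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySplashSpider(CrawlSpider):
    name = "splash_crawl"                # made-up name
    start_urls = ["http://example.com"]  # placeholder URL

    rules = (
        # process_request refers to the spider method shown in the question
        Rule(LinkExtractor(), callback="parse_item",
             follow=True, process_request="process_request"),
    )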

Grab the resource contents in CasperJS or PhantomJS

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.
I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?
This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I found out where PhantomJS's QtNetwork core puts resources when it caches them.
Long story short, here is my gist. You need the cache.js and mimetype.js files:
https://gist.github.com/bshamric/4717583
//for this to work, you have to call phantomjs with the cache enabled:
//usage: phantomjs --disk-cache=true test.js
var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that the QtNetwork classes use for caching files for their http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when a resource is received, include a reference to it in the cache object
page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    if (response.contentType.indexOf('image') >= 0) {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cached resource and do something with it;
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    for (var index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("." + cache.cacheExtension, "." + ext);
        fs.write('saved/' + finalFile, cache.cachedResources[index].getContents(), 'b');
    }
};

page.open(url, function() {
    page.render('saved/google.pdf');
    phantom.exit();
});
Then when you call phantomjs, just make sure the cache is enabled:
phantomjs --disk-cache=true test.js
Some notes:
I wrote this for the purpose of getting the images on a page without using the proxy or taking a low-res snapshot. Qt uses compression on certain text file resources, and you will have to deal with the decompression if you use this for text files. Also, I ran a quick test to pull in HTML resources and it didn't parse the HTTP headers out of the result. But this is useful to me; hopefully someone else will find it useful too. Modify it if you have problems with a specific content type.
I've found that until PhantomJS matures a bit, according to issue 158 (http://code.google.com/p/phantomjs/issues/detail?id=158), this is a bit of a headache for them.
So you want to do it anyway? I've opted to go a bit higher up the stack to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded and installed it, set it up, took their example code, and made this proxy.py:
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

    def do_request(self, data):
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        request_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        print "Content type: %s" % (headers['content-type'])
        if headers['content-type'] == 'text/x-comma-separated-values':
            f = open('data.csv', 'w')
            f.write(data)
        print ''
        return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()
Then I fire it up
python proxy.py
Next I execute phantomjs with the proxy specified...
phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js
You may want to turn your security back on and such; it was needless for me at the moment since I'm scraping just one source. You should now see a bunch of text flowing through your proxy console, and if it lands on something with the MIME type "text/x-comma-separated-values" it'll save it as data.csv. This will also save all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.
One other detail: I found that I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own Apache web server, but if it comes out of IIS or the like the decompression gets errors, and I'm not sure about that part of it.
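On that gzip point, a hedged sketch (not from the answer) of decompressing a gzip-encoded body with zlib; passing 16 + zlib.MAX_WBITS as the wbits argument tells zlib to expect a gzip wrapper:
import zlib

def gunzip_body(body):
    # 16 + MAX_WBITS selects the gzip container format
    return zlib.decompress(body, 16 + zlib.MAX_WBITS)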
So my power company won't offer me an API? Fine! We do it the hard way!
I did not realize I could grab the source from the document object like this:
casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].outerHTML);
});
You can use Casper.debugHTML() to print out the contents of an HTML resource:
var casper = require('casper').create();
casper.start('http://google.com/', function() {
    this.debugHTML();
});
casper.run();
You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in the latest master).