Why can't scrapy-plugins/scrapy-jsonrpc get the spider's stats? - scrapy

I just want to monitor my running spider's stats. I got the latest scrapy-plugins/scrapy-jsonrpc and configured the spider as follows:
EXTENSIONS = {
'scrapy_jsonrpc.webservice.WebService': 500,
}
JSONRPC_ENABLED = True
JSONRPC_PORT = [60853]
but when I browse http://localhost:60853/, it just returns
{"resources": ["crawler"]}
and I can only get the running spiders' names, without the stats.
Can anyone tell me which part of my setup is wrong? Thanks!

http://localhost:60853/ returns the resources available, /crawler being the only top-level one.
If you want to get stats for a spider, you'll need to query the /crawler/stats endpoint and call get_stats().
Here's an example using python-jsonrpc (here I configured the web service to listen on localhost, port 6024):
>>> import pyjsonrpc
>>> http_client = pyjsonrpc.HttpClient('http://localhost:6024/crawler/stats')
>>> http_client.call('get_stats', 'httpbin')
{u'log_count/DEBUG': 4, u'scheduler/dequeued': 4, u'log_count/INFO': 9, u'downloader/response_count': 2, u'downloader/response_status_count/200': 2, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 639, u'start_time': u'2016-09-28 08:49:57', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 2, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}
>>> http_client.call('get_stats')
{u'log_count/DEBUG': 4, u'scheduler/dequeued': 4, u'log_count/INFO': 9, u'downloader/response_count': 2, u'downloader/response_status_count/200': 2, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 639, u'start_time': u'2016-09-28 08:49:57', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 2, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}
>>> from pprint import pprint
>>> pprint(http_client.call('get_stats'))
{u'downloader/request_bytes': 862,
u'downloader/request_count': 4,
u'downloader/request_method_count/GET': 4,
u'downloader/response_bytes': 639,
u'downloader/response_count': 2,
u'downloader/response_status_count/200': 2,
u'log_count/DEBUG': 4,
u'log_count/INFO': 9,
u'log_count/WARNING': 1,
u'response_received_count': 2,
u'scheduler/dequeued': 4,
u'scheduler/dequeued/memory': 4,
u'scheduler/enqueued': 4,
u'scheduler/enqueued/memory': 4,
u'start_time': u'2016-09-28 08:49:57'}
>>>
You can also use jsonrpc_client_call from scrapy_jsonrpc.jsonrpc.
>>> from scrapy_jsonrpc.jsonrpc import jsonrpc_client_call
>>> jsonrpc_client_call('http://localhost:6024/crawler/stats', 'get_stats', 'httpbin')
{u'log_count/DEBUG': 5, u'scheduler/dequeued': 4, u'log_count/INFO': 11, u'downloader/response_count': 3, u'downloader/response_status_count/200': 3, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 870, u'start_time': u'2016-09-28 09:01:47', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 3, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}
This is what you get "on the wire" for a request made with a modified example-client.py (see the code a bit below; the example client in https://github.com/scrapy-plugins/scrapy-jsonrpc is outdated as I write these lines):
POST /crawler/stats HTTP/1.1
Accept-Encoding: identity
Content-Length: 73
Host: localhost:6024
Content-Type: application/x-www-form-urlencoded
Connection: close
User-Agent: Python-urllib/2.7
{"params": ["httpbin"], "jsonrpc": "2.0", "method": "get_stats", "id": 1}
And the response
HTTP/1.1 200 OK
Content-Length: 504
Access-Control-Allow-Headers: X-Requested-With
Server: TwistedWeb/16.4.1
Connection: close
Date: Tue, 27 Sep 2016 11:21:43 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PATCH, PUT, DELETE
Content-Type: application/json
{"jsonrpc": "2.0", "result": {"log_count/DEBUG": 5, "scheduler/dequeued": 4, "log_count/INFO": 11, "downloader/response_count": 3, "downloader/response_status_count/200": 3, "log_count/WARNING": 3, "scheduler/enqueued/memory": 4, "downloader/response_bytes": 870, "start_time": "2016-09-27 11:16:25", "scheduler/dequeued/memory": 4, "scheduler/enqueued": 4, "downloader/request_bytes": 862, "response_received_count": 3, "downloader/request_method_count/GET": 4, "downloader/request_count": 4}, "id": 1}
Here's the modified client to query /crawler/stats, which I called with ./example-client.py -H localhost -P 6024 get-spider-stats httpbin (for a running "httpbin" spider, JSONRPC_PORT being 6024 for me)
#!/usr/bin/env python
"""
Example script to control a Scrapy server using its JSON-RPC web service.

It only provides a reduced functionality as its main purpose is to illustrate
how to write a web service client. Feel free to improve or write your own.

Also, keep in mind that the JSON-RPC API is not stable. The recommended way for
controlling a Scrapy server is through the execution queue (see the "queue"
command).
"""

from __future__ import print_function

import sys, optparse, urllib, json
from six.moves.urllib.parse import urljoin

from scrapy_jsonrpc.jsonrpc import jsonrpc_client_call, JsonRpcError


def get_commands():
    return {
        'help': cmd_help,
        'stop': cmd_stop,
        'list-available': cmd_list_available,
        'list-running': cmd_list_running,
        'list-resources': cmd_list_resources,
        'get-global-stats': cmd_get_global_stats,
        'get-spider-stats': cmd_get_spider_stats,
    }


def cmd_help(args, opts):
    """help - list available commands"""
    print("Available commands:")
    for _, func in sorted(get_commands().items()):
        print("  ", func.__doc__)


def cmd_stop(args, opts):
    """stop <spider> - stop a running spider"""
    jsonrpc_call(opts, 'crawler/engine', 'close_spider', args[0])


def cmd_list_running(args, opts):
    """list-running - list running spiders"""
    for x in json_get(opts, 'crawler/engine/open_spiders'):
        print(x)


def cmd_list_available(args, opts):
    """list-available - list name of available spiders"""
    for x in jsonrpc_call(opts, 'crawler/spiders', 'list'):
        print(x)


def cmd_list_resources(args, opts):
    """list-resources - list available web service resources"""
    for x in json_get(opts, '')['resources']:
        print(x)


def cmd_get_spider_stats(args, opts):
    """get-spider-stats <spider> - get stats of a running spider"""
    stats = jsonrpc_call(opts, 'crawler/stats', 'get_stats', args[0])
    for name, value in stats.items():
        print("%-40s %s" % (name, value))


def cmd_get_global_stats(args, opts):
    """get-global-stats - get global stats"""
    stats = jsonrpc_call(opts, 'crawler/stats', 'get_stats')
    for name, value in stats.items():
        print("%-40s %s" % (name, value))


def get_wsurl(opts, path):
    return urljoin("http://%s:%s/" % (opts.host, opts.port), path)


def jsonrpc_call(opts, path, method, *args, **kwargs):
    url = get_wsurl(opts, path)
    return jsonrpc_client_call(url, method, *args, **kwargs)


def json_get(opts, path):
    url = get_wsurl(opts, path)
    return json.loads(urllib.urlopen(url).read())


def parse_opts():
    usage = "%prog [options] <command> [arg] ..."
    description = "Scrapy web service control script. Use '%prog help' " \
                  "to see the list of available commands."
    op = optparse.OptionParser(usage=usage, description=description)
    op.add_option("-H", dest="host", default="localhost",
                  help="Scrapy host to connect to")
    op.add_option("-P", dest="port", type="int", default=6080,
                  help="Scrapy port to connect to")
    opts, args = op.parse_args()
    if not args:
        op.print_help()
        sys.exit(2)
    cmdname, cmdargs, opts = args[0], args[1:], opts
    commands = get_commands()
    if cmdname not in commands:
        sys.stderr.write("Unknown command: %s\n\n" % cmdname)
        cmd_help(None, None)
        sys.exit(1)
    return commands[cmdname], cmdargs, opts


def main():
    cmd, args, opts = parse_opts()
    try:
        cmd(args, opts)
    except IndexError:
        print(cmd.__doc__)
    except JsonRpcError as e:
        print(str(e))
        if e.data:
            print("Server Traceback below:")
            print(e.data)


if __name__ == '__main__':
    main()
In the example command above, I got this:
log_count/DEBUG 5
scheduler/dequeued 4
log_count/INFO 11
downloader/response_count 3
downloader/response_status_count/200 3
log_count/WARNING 3
scheduler/enqueued/memory 4
downloader/response_bytes 870
start_time 2016-09-27 11:16:25
scheduler/dequeued/memory 4
scheduler/enqueued 4
downloader/request_bytes 862
response_received_count 3
downloader/request_method_count/GET 4
downloader/request_count 4

Related

Populating a numpy array with the best performance

I have a few arrays a, b, c and d as shown below and would like to populate a matrix by evaluating a function f(...) which consumes a, b, c and d.
With a nested for loop this is obviously possible, but I'm looking for a more Pythonic and faster way to do this.
So far I tried np.fromfunction with no luck.
Thanks
PS: This function f has a conditional. I can still consider approaches that do not support conditionals, but if the solution supports conditionals that would be fantastic.
Example function in case it is helpful:
def fun(a,b,c,d): return a+b+c+d if a==b else a*b*c*d
Also, why fromfunction failed is shown below:
>>> a = np.array([1,2,3,4,5])
>>> b = np.array([10,20,30])
>>> def fun(i,j): return a[i] * b[j]
>>> np.fromfunction(fun, (3,5))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 1853, in fromfunction
return function(*args, **kwargs)
File "<stdin>", line 1, in fun
IndexError: arrays used as indices must be of integer (or boolean) type
The reason the function fails is that np.fromfunction passes floating-point values, which are not valid as indices. You can modify your function like this to make it work:
def fun(i,j):
    return a[j.astype(int)] * b[i.astype(int)]

print(np.fromfunction(fun, (3,5)))
[[ 10 20 30 40 50]
[ 20 40 60 80 100]
[ 30 60 90 120 150]]
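Alternatively (a small variation on the same idea, not from the original answer), np.fromfunction accepts a dtype argument for the index arrays, so you can keep the function free of casts:
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30])

# With dtype=int, fromfunction passes integer index arrays, so no astype is needed
def fun(i, j):
    return a[j] * b[i]

print(np.fromfunction(fun, (3, 5), dtype=int))
# [[ 10  20  30  40  50]
#  [ 20  40  60  80 100]
#  [ 30  60  90 120 150]]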
Jake has explained why your fromfunction approach fails. However, you don't need fromfunction for your example. You could simply add an axis to b and have numpy broadcast the shapes:
a = np.array([1,2,3,4,5])
b = np.array([10,20,30])
def fun(i,j): return a[j.astype(int)] * b[i.astype(int)]
f1 = np.fromfunction(fun, (3, 5))
f2 = b[:, None] * a
(f1 == f2).all() # True
Extending this to the function you showed that contains an if condition, you could just split the if into two operations in sequence: creating an array from one branch of the conditional, and then overwriting the relevant parts with the other branch.
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
c = np.array([100, 200, 300, 400, 500])
d = np.array([0, 1, 2, 3])
# Calculate the values at all indices as the product
result = d[:, None] * (a * b * c)
# array([[ 0, 0, 0, 0, 0],
# [ 500, 1600, 2700, 3200, 2500],
# [1000, 3200, 5400, 6400, 5000],
# [1500, 4800, 8100, 9600, 7500]])
# Calculate sum
sum_arr = d[:, None] + (a + b + c)
# array([[106, 206, 306, 406, 506],
# [107, 207, 307, 407, 507],
# [108, 208, 308, 408, 508],
# [109, 209, 309, 409, 509]])
# Set diagonal elements (i==j) to sum:
np.fill_diagonal(result, np.diag(sum_arr))
which gives the following result:
array([[ 106, 0, 0, 0, 0],
[ 500, 207, 2700, 3200, 2500],
[1000, 3200, 308, 6400, 5000],
[1500, 4800, 8100, 409, 7500]])
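If the condition in your real f is evaluated elementwise on the broadcast values (rather than only on the diagonal as in the example above), a hedged alternative is np.where, which computes both branches and then selects per element. A minimal sketch with the same arrays:
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
c = np.array([100, 200, 300, 400, 500])
d = np.array([0, 1, 2, 3])

# Broadcast d as a column against a, b, c (same layout as above)
D = d[:, None]

# Elementwise: sum where a == b, product everywhere else.
# Note that both branches are evaluated for every element before selection.
result = np.where(a == b, D + (a + b + c), D * (a * b * c))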

How to store all scraped stats moment before spider closes?

I want to store all the stats collected from the spider in a single output file in JSON format. However, I get this error:
'MemoryStatsCollector' object has no attribute 'get_all'
The documentation mentions that stats.get_all is how you get all the stored stats. What is the correct way to implement this?
import scrapy
from scrapy import signals
from scrapy import crawler
import jsonlines


class TestSpider(scrapy.Spider):
    name = 'stats'
    start_urls = ['http://quotes.toscrape.com']

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # spider = super(TestSpider, cls).from_crawler(crawler, *args, **kwargs)
        stat = cls(crawler.stats)
        crawler.signals.connect(stat.spider_closed, signals.spider_closed)
        return stat

    def spider_closed(self):
        # self.stats = stat
        txt_file = 'some_text.jl'
        with jsonlines.open(txt_file, 'w') as f:
            f.write(self.stats.get_all())

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        content = response.xpath('//div[@class = "row"]')
        for items in content:
            yield {
                'some_items_links': items.xpath(".//a//@href").get()
            }
It turns out there is no get_all method; instead I had to use get_stats(). The documentation provides examples of a few of the available methods:
stats.get_value()
stats.get_stats()
stats.max_value()/stats.min_value()
stats.inc_value()
stats.set_value()
Further information is provided in the documentation for stats.
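As a rough illustration of those methods, here is a hedged sketch with a made-up spider name and made-up stat keys (it relies on self.crawler.stats, which Scrapy attaches to every spider):
import scrapy

class StatsDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to show the stats-collector calls in context
    name = 'stats_demo'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        stats = self.crawler.stats                      # the running stats collector
        stats.set_value('custom/run_label', 'demo')     # set a value outright
        stats.inc_value('custom/pages_seen')            # increment a counter
        stats.max_value('custom/largest_response', len(response.body))
        seen = stats.get_value('custom/pages_seen', 0)  # read one stat back
        self.logger.info('pages seen so far: %s, all stats: %s',
                         seen, stats.get_stats())       # the whole stats dict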
The working part:
def spider_closed(self):
    # self.stats = stat
    txt_file = 'some_text.jl'
    with jsonlines.open(txt_file, 'w') as f:
        # f.write(f'{self.stats.get_all()}') --- Changed
        f.write(f'{self.stats.get_stats()}')
Output:
{
"log_count/INFO": 10,
"log_count/DEBUG": 3,
"start_time": datetime.datetime(2022, 7, 6, 16, 16, 30, 553373),
"memusage/startup": 59895808,
"memusage/max": 59895808,
"scheduler/enqueued/memory": 1,
"scheduler/enqueued": 1,
"scheduler/dequeued/memory": 1,
"scheduler/dequeued": 1,
"downloader/request_count": 1,
"downloader/request_method_count/GET": 1,
"downloader/request_bytes": 223,
"downloader/response_count": 1,
"downloader/response_status_count/200": 1,
"downloader/response_bytes": 2086,
"httpcompression/response_bytes": 11053,
"httpcompression/response_count": 1,
"response_received_count": 1,
"item_scraped_count": 1,
"elapsed_time_seconds": 0.34008,
"finish_time": datetime.datetime(2022, 7, 6, 16, 16, 30, 893453),
"finish_reason": "finished",
}
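One caveat (my own observation, not part of the original answer): the output above contains datetime.datetime values, which the f-string turns into a Python repr rather than valid JSON. If you need genuinely parseable JSON lines, a minimal sketch using the standard json module with a string fallback looks like this:
import json

def spider_closed(self):
    stats = self.stats.get_stats()
    with open('some_text.jl', 'a') as f:
        # default=str stringifies non-JSON types such as the datetime values
        f.write(json.dumps(stats, default=str) + '\n')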

Scrapy finishing process before all pages are scraped

I set up a test Scrapy scraper which looks like this:
import scrapy


class testSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
        for nr in range(1, 40):
            req_url = f'?page={nr}'
            self.item = {}
            self.item['page'] = nr
            yield scrapy.Request(url=response.urljoin(req_url), callback=self.parse_page, meta={'item': self.item})

    def parse_page(self, response):
        page = response.meta['item']['page']
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        for url in ads:
            absolute_url = response.urljoin(url.extract())
            self.item = {}
            self.item['page'] = page
            yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})

    def parse_ad(self, response):
        page = response.meta['item']['page']
        # DO THINGS
        yield {
            'page': page
        }
It goes through and loads each https://www.realestate.com.kh/buy/?page=NR, where NR is every number between 1 and 40.
On each of these pages, it gets all the ads.
On each ad of each page, it scrapes things and yields them.
It works fine for the first 26 items (the first two pages, plus 2 or 3 items from the 3rd one, out of 40) and then finishes the scraping without an error.
Here are the stats:
{'downloader/request_bytes': 23163,
'downloader/request_count': 66,
'downloader/request_method_count/GET': 66,
'downloader/response_bytes': 3801022,
'downloader/response_count': 66,
'downloader/response_status_count/200': 66,
'elapsed_time_seconds': 5.420036,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 22, 19, 48, 37, 549853),
'item_scraped_count': 26,
'log_count/INFO': 9,
'memusage/max': 49963008,
'memusage/startup': 49963008,
'request_depth_max': 2,
'response_received_count': 66,
'scheduler/dequeued': 66,
'scheduler/dequeued/memory': 66,
'scheduler/enqueued': 66,
'scheduler/enqueued/memory': 66,
'start_time': datetime.datetime(2020, 10, 22, 19, 48, 32, 129817)}
What could be ending the scraping so early?
Your spider is actually going through all the pages; the problem is that in parse_page the selector for ads only works on the earlier pages, because on later pages the class name changes. The class name seems to be dynamically generated, so you need an XPath that doesn't select by class.
The XPath '//div/header/parent::div' returns the same div element as '//*[@class="featured css-ineky e1jqslr40"]', so replacing this line should allow you to select all ads from all pages:
ads = response.xpath('//div/header/parent::div/article/a/@href')
Unrelated note:
This isn't causing any problems yet, but it's a recipe for future problems.
for url in ads:
    absolute_url = response.urljoin(url.extract())
    self.item = {}
    self.item['page'] = page
    yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})
Scrapy works in an asynchronous way, so most of the time using an instance variable (like self.item) gives the wrong intuition, as you don't really control the order in which the requests are parsed. That's why, when you need to pass information between methods, you use meta (or cb_kwargs) rather than storing it in an instance variable.
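For illustration, here is a hedged sketch of the same spider passing the page number through cb_kwargs instead of an instance variable (it requires Scrapy 1.7+ and reuses the fixed ad XPath from above):
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        for nr in range(1, 40):
            yield scrapy.Request(
                url=response.urljoin(f'?page={nr}'),
                callback=self.parse_page,
                cb_kwargs={'page': nr},   # travels with this request only
            )

    def parse_page(self, response, page):
        for url in response.xpath('//div/header/parent::div/article/a/@href'):
            yield scrapy.Request(
                response.urljoin(url.extract()),
                callback=self.parse_ad,
                cb_kwargs={'page': page},
            )

    def parse_ad(self, response, page):
        # DO THINGS
        yield {'page': page}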
You are going about getting the pages the wrong way; there are 50 pages on the site right now. You should walk through them via the next-page link instead. Look at this code:
import scrapy
from scrapy.shell import inspect_response


class testSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        page = response.xpath('//div[@class="list"]//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/span/text()').get()
        ads = response.xpath('//div[@class="list"]/div/header/a/@href').getall()
        for url in ads:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_ad, meta={'page': page})

        # next page
        url = response.xpath('//div[@class="list"]//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
        if url:
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

    def parse_ad(self, response):
        page = response.meta['page']
        # DO THINGS
        yield {
            'page': page
        }
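A small optional simplification of the next-page handling (not part of the original answer): response.follow joins relative URLs for you, so the urljoin call can be dropped. The snippet below would replace the "next page" lines inside parse:
# next page (same XPath as above, just followed with response.follow)
next_href = response.xpath(
    '//div[@class="list"]//div[@class="desktop-buttons"]'
    '/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
if next_href:
    yield response.follow(next_href, callback=self.parse)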

How to make a model ready for the TensorFlow Serving REST interface with a base64 encoded image?

My understanding is that I should be able to grab a TensorFlow model from Google's AI Hub, deploy it to TensorFlow Serving and use it to make predictions by POSTing images via REST requests using curl.
I could not find any bbox predictors on AI Hub at this time but I did find one on the TensorFlow model zoo:
http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
I have the model deployed to TensorFlow serving, but the documentation is unclear with respect to exactly what should be included in the JSON of the REST request.
My understanding is that
The SignatureDefinition of the model determines what the JSON should look like
I should base64 encode the images
I was able to get the signature definition of the model like so:
>python tensorflow/tensorflow/python/tools/saved_model_cli.py show --dir /Users/alexryan/alpine/git/tfserving-tutorial3/model-volume/models/bbox/1/ --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['in'] tensor_info:
        dtype: DT_UINT8
        shape: (-1, -1, -1, 3)
        name: image_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['out'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: detection_boxes:0
  Method name is: tensorflow/serving/predict
I think the shape info here is telling me that the model can handle images of any dimensions?
The input layer looks like this in Tensorboard:
But how do I convert this SignatureDefinition to a valid JSON request?
I'm assuming that I'm supposed to use the predict API ...
and Google's doc says ...
URL
POST
http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]:predict
/versions/${MODEL_VERSION} is optional. If omitted the latest version is used.
Request format
The request body for the predict API must be a JSON object formatted as follows:
{
  // (Optional) Serving signature to use.
  // If unspecified, the default serving signature is used.
  "signature_name": <string>,

  // Input Tensors in row ("instances") or columnar ("inputs") format.
  // A request can have either of them but NOT both.
  "instances": <value>|<(nested)list>|<list-of-objects>
  "inputs": <value>|<(nested)list>|<object>
}
Encoding binary values
JSON uses UTF-8 encoding. If you have input
feature or tensor values that need to be binary (like image bytes),
you must Base64 encode the data and encapsulate it in a JSON object
having b64 as the key as follows:
{ "b64": "base64 encoded string" }
You can specify this object as a value for an input feature or tensor.
The same format is used to encode output response as well.
A classification request with image (binary data) and caption features
is shown below:
{ "signature_name": "classify_objects", "examples": [
{
"image": { "b64": "aW1hZ2UgYnl0ZXM=" },
"caption": "seaside"
},
{
"image": { "b64": "YXdlc29tZSBpbWFnZSBieXRlcw==" },
"caption": "mountains"
} ] }
Uncertainties include:
should I use "instances" in my JSON?
should I base64 encode a JPG or PNG or something else?
should the image be of a particular width and height?
In Serving Image-Based Deep Learning Models with TensorFlow-Serving’s RESTful API this format is suggested:
{
  "instances": [
    {"b64": "iVBORw"},
    {"b64": "pT4rmN"},
    {"b64": "w0KGg2"}
  ]
}
I used this image:
https://tensorflow.org/images/blogs/serving/cat.jpg
and base64 encoded it like so:
# Download the image
dl_request = requests.get(IMAGE_URL, stream=True)
dl_request.raise_for_status()
# Compose a JSON Predict request (send JPEG image in base64).
jpeg_bytes = base64.b64encode(dl_request.content).decode('utf-8')
predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes
But when I use curl to POST the base64 encoded image like so:
{"instances" : [{"b64": "/9j/4AAQSkZJRgABAQAASABIAAD/4QBYRXhpZgAATU0AKgAA
...
KACiiigAooooAKKKKACiiigAooooA//Z"}]}
I get a response like this:
>./test_local_tfs.sh
HEADER=|Content-Type:application/json;charset=UTF-8|
URL=|http://127.0.0.1:8501/v1/models/saved_model/versions/1:predict|
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8501 (#0)
> POST /v1/models/saved_model/versions/1:predict HTTP/1.1
> Host: 127.0.0.1:8501
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Type:application/json;charset=UTF-8
> Content-Length: 85033
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 400 Bad Request
< Content-Type: application/json
< Date: Tue, 17 Sep 2019 10:47:18 GMT
< Content-Length: 85175
<
{ "error": "Failed to process element: 0 of \'instances\' list. Error: Invalid argument: JSON Value: {\n \"b64\": \"/9j/4AAQSkZJRgABAQAAS
...
ooooA//Z\"\n} Type: Object is not of expected type: uint8" }
I've tried converting a local version of the same file to base64 like so (confirming that the dtype is uint8) ...
img = cv2.imread('cat.jpg')
print('dtype: ' + str(img.dtype))
_, buf = cv2.imencode('.jpg', img)
jpeg_bytes = base64.b64encode(buf).decode('utf-8')
predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes
But posting this JSON generates the same error.
However, when the JSON is formatted like so ...
{'instances': [[[[112, 71, 48], [104, 63, 40], [107, 70, 20], [108, 72, 21], [109, 77, 0], [106, 75, 0], [92, 66, 0], [106, 80, 0], [101, 80, 0], [98, 77, 0], [100, 75, 0], [104, 80, 0], [114, 88, 17], [94, 68, 0], [85, 54, 0], [103, 72, 11], [93, 62, 0], [120, 89, 25], [131, 101, 37], [125, 95, 31], [119, 91, 27], [121, 93, 29], [133, 105, 40], [119, 91, 27], [119, 96, 56], [120, 97, 57], [119, 96, 53], [102, 78, 36], [132, 103, 44], [117, 88, 28], [125, 89, 4], [128, 93, 8], [133, 94, 0], [126, 87, 0], [110, 74, 0], [123, 87, 2], [120, 92, 30], [124, 95, 33], [114, 90, 32],
...
, [43, 24, 33], [30, 17, 36], [24, 11, 30], [29, 20, 38], [37, 28, 46]]]]}
... it works.
The problem is this json file is >11 MB in size.
How do I make the base64 encoded version of the json work?
UPDATE: It seems that we have to edit the pretrained model to accept base64 images at the input layer
This article describes how to edit the model ...
Medium: Serving Image-Based Deep Learning Models with TensorFlow-Serving’s RESTful API
... unfortunately, it assumes that we have access to the code which generated the model.
user260826's solution provides a work-around using an estimator but it assumes the model is a keras model. Not true in this case.
Is there a generic method to make a model ready for TensorFlow Serving REST interface with a base64 encoded image that works with any of the TensorFlow model formats?
The first step is to export the trained model in the appropriate format. Use export_inference_graph.py like this
python export_inference_graph \
--input_type encoded_image_string_tensor \
--pipeline_config_path path/to/ssd_inception_v2.config \
--trained_checkpoint_prefix path/to/model.ckpt \
--output_directory path/to/exported_model_directory
In the above code snippet, it is important to specify
--input_type encoded_image_string_tensor
After exporting the model, run the TensorFlow server as usual with the newly exported model.
The inference code will look like this:
from __future__ import print_function
import base64
import requests

SERVER_URL = 'http://localhost:8501/v1/models/vedNet:predict'
IMAGE_URL = 'test_images/19_inp.jpg'


def main():
    with open(IMAGE_URL, "rb") as image_file:
        jpeg_bytes = base64.b64encode(image_file.read()).decode('utf-8')
        predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes
        response = requests.post(SERVER_URL, predict_request)
        response.raise_for_status()
        prediction = response.json()['predictions'][0]


if __name__ == '__main__':
    main()
As you mentioned, JSON is a very inefficient approach, as the payload normally exceeds the original file size. You need to convert the model so that it can process image bytes written to a string using Base64 encoding:
{"b64": base64_encoded_string}
This conversion will reduce the prediction time and the bandwidth used to transfer the image from the prediction client to your infrastructure.
I recently used a transfer learning model with TF Hub and Keras which was using JSON as input; as you mentioned, this is not optimal for prediction.
I used the following approach to overwrite it: with the code below we add a new serving function that is able to process Base64 encoded images.
Using a TF Estimator model:
h5_model_path = os.path.join('models/h5/best_model.h5')
tf_model_path = os.path.join('models/tf')
estimator = keras.estimator.model_to_estimator(
    keras_model_path=h5_model_path,
    model_dir=tf_model_path)

def image_preprocessing(image):
    """
    This implements the standard preprocessing that needs to be applied to the
    image tensors before passing them to the model. This is used for all input
    types.
    """
    image = tf.expand_dims(image, 0)
    image = tf.image.resize_bilinear(image, [HEIGHT, WIDTH], align_corners=False)
    image = tf.squeeze(image, axis=[0])
    image = tf.cast(image, dtype=tf.uint8)
    return image

def serving_input_receiver_fn():
    def prepare_image(image_str_tensor):
        image = tf.image.decode_jpeg(image_str_tensor, channels=CHANNELS)
        return image_preprocessing(image)

    input_ph = tf.placeholder(tf.string, shape=[None])
    images_tensor = tf.map_fn(
        prepare_image, input_ph, back_prop=False, dtype=tf.uint8)
    images_tensor = tf.image.convert_image_dtype(images_tensor, dtype=tf.float32)

    return tf.estimator.export.ServingInputReceiver(
        {'input': images_tensor},
        {'image_bytes': input_ph})

export_path = os.path.join('/tmp/models/json_b64', version)
if os.path.exists(export_path):  # clean up old exports with this version
    shutil.rmtree(export_path)
estimator.export_savedmodel(
    export_path,
    serving_input_receiver_fn=serving_input_receiver_fn)
A good example here
I have been struggling with the same problem. Finally I was able to make it work. I just had to add a new signature to the model:
import tensorflow as tf

model = tf.saved_model.load("/path/to/the/original/model")

# This is the current signature, that only accepts image tensors as input
signature = model.signatures["default"]

@tf.function()
def my_predict(image_b64):
    # Model doesn't support batch!!
    img_dec = tf.image.decode_png(image_b64[0], channels=3)
    img_tensor = tf.image.convert_image_dtype(img_dec, tf.float32)[tf.newaxis, ...]
    prediction = signature(img_tensor)
    return prediction

# Create new signature, to read b64 images
new_signature = my_predict.get_concrete_function(
    image_b64=tf.TensorSpec([None], dtype=tf.string, name="image_b64")
)

tf.saved_model.save(
    model,
    export_dir="/path/to/the/saved/model",
    signatures=new_signature
)
Finally, after serving I can make predictions passing an input like this:
{
  "instances": [
    {
      "b64": "youBase64ImageHere"
    }
  ]
}
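For completeness, here is a minimal client sketch that builds that request body in Python. The model name, port, and image path are placeholders for whatever your serving setup uses, and a PNG is sent to match the decode_png call in the new signature above:
import base64
import json
import requests

SERVER_URL = 'http://localhost:8501/v1/models/mymodel:predict'  # placeholder name/port

with open('example.png', 'rb') as f:             # PNG, to match decode_png above
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

payload = {"instances": [{"b64": image_b64}]}
response = requests.post(SERVER_URL, data=json.dumps(payload))
response.raise_for_status()
print(response.json()['predictions'][0])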

Plotly and JupyterLab: graphs not showing from remote server

I am using a JupyterLab server that I am connecting to via SSH port forwarding. For some reason, Plotly graphs don't show up. Instead of the image, all I get is the textual description of the FigureWidget:
import plotly.graph_objs as go
x = [1,2,3,4,5]
y = [2,5,6,1,-5]
go.FigureWidget(data=[{'x':x, 'y':y}])
output:
Figure({
    'data': [{'type': 'scatter',
              'uid': '2aae0f73-7b85-4dc7-8551-7a1393d1e3c8',
              'x': [1, 2, 3, 4, 5],
              'y': [2, 5, 6, 1, -5]}],
    'layout': {}
})
This is how I connect to the server:
ssh -N -f -L 8888:localhost:9000 user@server
And this is how I started the server:
jupyter lab --port=9000 --no-browser
Edit:
It seems to run fine with Firefox but not with Chrome.