Hi, I have an NLP spaCy phrase-matching task (in Python) over 100,000 documents that returns a list of matched document IDs for the client to display. I have an RQ worker that performs the search function as a background job. This works perfectly on a standalone machine but throws an H12 error when deployed to Heroku. The front end is very simple: it takes the user query and waits to render the document IDs. Please help.
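For context, a minimal sketch of the kind of RQ setup described here, using the standard redis/rq API; the module, function, and variable names are illustrative only:

import spacy  # only needed inside the search function itself
from redis import Redis
from rq import Queue

from search_tasks import phrase_match_search  # hypothetical module holding the spaCy search function

redis_conn = Redis()
queue = Queue(connection=redis_conn)

# enqueue the long-running phrase-matching search as a background job
user_query = "example phrase"
job = queue.enqueue(phrase_match_search, user_query)

# the web request can return job.id immediately; the matched document ids
# become available in job.result once the worker finishes
print(job.id, job.get_status(), job.result)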
Can anyone help solve this assessment?
Using the JMeter framework (https://jmeter.apache.org), please implement a load test script:
The script should send 10 concurrent requests to the Capital API: https://restcountries.eu/rest/v2/capital/?fields=name;capital;currencies;latlng;regionalBlocs
The script should read the capital values from a CSV file (contains 10 capital names)
The script should perform a status code verification for the transaction response
The script should run for 2 minutes.
The script should contain at least 2 listeners.
My suggestion is: create the script on your own. It's the best way to learn any subject. Contributors to this forum will be more than happy to answer any specific questions if you get stuck in your quest.
The script should send 10 concurrent requests - concurrency is defined at the Thread Group level
Requests are configured using an HTTP Request sampler
Values can be read from the CSV file using a CSV Data Set Config
Status code verification is done more or less automatically by JMeter: it treats status codes below 400 as successful. Additionally, you can use a Response Assertion for this
It's not recommended to use Listeners at all
I was running a Python script responsible for web scraping some pages using multiprocessing techniques.
In order to wait for a page to get fully loaded, I used the method set_page_load_timeout() set to 30 s.
I put driver.get() inside a try-except structure.
However, I observed that in my case, when all 5 Chrome instances are busy with 5 different pages, the next page pulled from the task queue goes to my exception handler and is stored in my error file. I presume this occurs because the Chrome instances are still trying to load the 5 previous pages and are not able to redirect to this new page.
How could I get over this issue?
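For reference, a rough sketch of one common way such a multiprocessing scraper is structured, with each worker process owning its own Chrome driver and pulling URLs from a shared queue; the pool size, sentinel handling, and error collection below are assumptions, not the original script:

import multiprocessing

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def worker(url_queue, errors):
    # each process owns one Chrome instance for its whole lifetime
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(30)  # wait up to 30 s for a page to load
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work for this worker
            break
        try:
            driver.get(url)
            # ... scrape the loaded page here ...
        except TimeoutException:
            errors.append(url)  # the original script writes these to an error file
    driver.quit()

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    errors = manager.list()
    url_queue = multiprocessing.Queue()
    # ... put URLs on url_queue, plus one None sentinel per worker ...
    workers = [multiprocessing.Process(target=worker, args=(url_queue, errors))
               for _ in range(5)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()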
This code is part of my Scrapy spider:
# scraping data from page has been done before this line
publish_date_datetime_object = (datetime.strptime(publish_date, '%d.%m.%Y.')).date()
yesterday = (datetime.now() - timedelta(days=1)).date()
if publish_date_datetime_object > yesterday:
    continue
if publish_date_datetime_object < yesterday:
    raise scrapy.exceptions.CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
# after this is ItemLoader and yield
This is working fine.
My question is: is the Scrapy spider the best place to have this code/logic?
I do not know how to implement it in another place.
Maybe it could be implemented in a pipeline, but AFAIK the pipeline is evaluated after the scraping has been done, so that means I would need to scrape all ads, even those that I do not need.
The scale is 5 ads from yesterday versus 500 ads on the whole page.
I do not see any benefit in moving the code to a pipeline if that means processing (downloading and scraping) 500 ads when I only need 5 of them.
The spider is the right place if you need it to stop crawling once something indicates there is no more useful data to collect.
It is also the right way to do it: raising a CloseSpider exception with a verbose closing-reason message.
A pipeline would be more suitable only if there were items still worth collecting after the threshold is detected, but if they are ALL disposable this would be a waste of resources.
I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it is not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and done everything else I can. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all points in time. At some points it is downloading only 3 to 4 links, and the queue is not empty at that time.
When I checked the core usage, what I found was that out of 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that Twisted, the library on top of which Scrapy is built, is single-threaded, and that is why it only uses a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing, not multiprocessing. That is why your Scrapy crawl runs in just one process. However, you can technically start two spiders using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
And there is nothing stopping you from using the same class for both spiders.
The process.crawl method takes *args and **kwargs that are passed on to your spider, so you can parametrize your spiders this way. Let's say your spider is supposed to crawl 100 pages; you can add start and end parameters to your crawler class and do something like below:
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
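A minimal sketch of how such a parametrized spider itself might look, assuming a simple paginated URL scheme; the URL pattern and attribute names here are illustrative, not part of the original answer:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, start=0, end=100, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # process.crawl() forwards start/end as keyword arguments;
        # cast defensively in case they arrive as strings
        self.start_page = int(start)
        self.end_page = int(end)

    def start_requests(self):
        # hypothetical URL pattern, only to illustrate splitting the page range
        for page in range(self.start_page, self.end_page):
            yield scrapy.Request(f"https://example.com/page/{page}", callback=self.parse)

    def parse(self, response):
        ...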
Note that both crawlers will have their own settings, so if you have CONCURRENT_REQUESTS set to 16 for your spider, then the two combined will effectively have 32.
In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted anyway, so I am not sure this would give you a huge advantage over simply setting CONCURRENT_REQUESTS to 32 in a single spider.
PS: Consider reading this page to understand more: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
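For reference, those limits live in the scrapyd.conf file; a minimal sketch, with illustrative values rather than recommendations:

[scrapyd]
max_proc         = 0    # 0 means: derive the limit from max_proc_per_cpu and the CPU count
max_proc_per_cpu = 4    # maximum Scrapy processes started per CPU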
I am uploading multiple files using JavaScript.
After I upload the files, I need to run several processing functions.
Because of the processing time required, I need a UI on the front end telling the user the estimated time left for the entire process.
Basically I have 3 functions:
/upload - this is an endpoint for uploading the files
/generate/metadata - this is the next endpoint that should be triggered after /upload
/process - this is the last endpoint. Should be triggered after /generate/metadata
This is basically how I expect the screen to look.
Information such as percentage remaining and time left should be displayed.
However, I am unsure whether to have the server supply this information or to do a hackish estimate solely in JavaScript.
I would also need to update the screen to tell the user messages such as
"Currently uploading" if I am at function 1,
"Generating metadata" if I am at function 2,
"Processing..." if I am at function 3.
Function 2 only occurs after the successful completion of 1.
Function 3 only occurs after the successful completion of 2.
I am already using q.js promises to handle some parts of this, but the code has gotten scarily messy.
I recently came across Backbone, and it allows structured ways to handle single-page app behavior, which is what I wanted.
I have no problems with the server-side returning back json responses for success or failure of the endpoints.
I was wondering what would be a good way to implement this function using Backbone.js
You can use a "progress" file or DB entry which stores the state of the backend process. Have your backend process periodically update this file. For example, write this to the file:
{"status": "Generating metadata", "time": "3 mins left"}
After the user submits the files, have the frontend start pinging a backend progress function using a simple AJAX call and setTimeout. The progress function will simply open this file, grab the JSON-formatted status info, and return it so the frontend can update its progress bar.
You'll probably want the ajax call to be attached to your model(s). Have your frontend view watch for changes to the status and update accordingly (e.g. a progress bar).
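A minimal backend-side sketch of this idea in Python, assuming a plain JSON file on disk; the file path, helper names, and status strings are illustrative, not part of the original answer:

import json

PROGRESS_FILE = "progress.json"  # hypothetical path shared by the worker and the web app

def write_progress(status, time_left):
    # called periodically by the backend process (upload / metadata / processing steps)
    with open(PROGRESS_FILE, "w") as f:
        json.dump({"status": status, "time": time_left}, f)

def read_progress():
    # called by the endpoint that the frontend polls via AJAX
    with open(PROGRESS_FILE) as f:
        return json.load(f)

# example: the metadata step reports its state
write_progress("Generating metadata", "3 mins left")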
Long Polling request:
Polling request for updating Backbone Models/Views
Basically, when you upload a file you will assign a "FileModel" to each given file. The FileModel will start a long-polling request every N seconds until it gets the status "complete".