Delay between when assignment is submitted and when HIT becomes 'reviewable' - mechanicalturk

I have a pipeline written using boto3 and Python that automatically creates HITs and fetches the responses as they are completed. As part of my pipeline, I periodically poll for 'reviewable' HITs that I can fetch the answers from. However, I have discovered that there is often a significant delay (over 10 minutes) between the time when all assignments have been submitted, to the time when the HIT state transitions to 'reviewable'. Has anyone else experienced this delay? Is this just something I need to learn to live with?

unfortunately, yes. mechanical turk is, in my experience, pretty slow at updating a lot of things.

Related

Bigquery streaming inserts taking time

During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.

Google BigQuery: Slow streaming inserts performance

We are using BigQuery as event logging platform.
The problem we faced was very slow insertAll post requests (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/insertAll).
It does not matter where they are fired - from server or client side.
Minimum is 900ms, average is 1500s, where nearly 1000ms is connection time.
Even if there is 1 request per second (so no throttling here).
We use Google Analytics measurement protocol and timings from the same machines are 50-150ms.
The solution described in BigQuery streaming 'insertAll' performance with PHP suugested to use queues, but it seems to be overkill because we send no more than 10 requests per second.
The question is if 1500ms is normal for streaming inserts and if not, how to make them faster.
Addtional information:
If we send malformed JSON, response arrives in 50-100ms.
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
Also the failure rate fits the 99.9% uptime we have in the SLA, so there is no reason for objection.
There's something to keep in mind in regards to the SLA, it's a very strictly defined structure, the details are here. The 99.9% is uptime not directly translated into fail rate. What this means is that if BQ has a 30 minute downtime one month, and then you do 10,000 inserts within that period but didn't do any inserts in other times of the month, it will cause the numbers to be skewered. This is why we suggest a exponential backoff algorithm. The SLA is explicitly based on uptime and not error rate, but logically the two correlates closely if you do streaming inserts throughout the month at different times with backoff-retry setup. Technically, you should experience on average about 1/1000 failed insert if you are doing inserts through out the month if you have setup the proper retry mechanism.
You can check out this chart about your project health:
https://console.developers.google.com/project/YOUR-APP-ID/apiui/apiview/bigquery?tabId=usage&duration=P1D
It happens that my response is on the linked other article, and I proposed the queues, because it made our exponential-backoff with retry very easy, and working with queues is very easy. We use Beanstalkd.
To my experience any request to bigquery will take long. We've tried using it as a database for performance data but eventually are moving out due to slow response times. As far as I can see. BQ is built for handling big requests within a 1 - 10 second response time. These are the requests BQ categorizes as interactive. BQ doesn't get faster by doing less. We stream quite some records to BQ but always make sure we batch them up (per table). And run all requests asynchronously (or if you have to in another theat).
PS. I can confirm what Pentium10 sais about faillures in BQ. Make sure you retry the stuff that fails and if it fails again log it to file for retrying it another time.

Abort a table import stuck in 'pending'

Similar questions have been asked but not exactly what I am looking for.
The problem: on some occasions importing a table from Google Cloud to Big Query gets stuck in a 'pending' state for hours if not days. Tables that get stuck in this state never seem to come out of it, or at least we didn't bother waiting that long. I know it's not a queue issue since in the mean time we can import other tables just fine. No errors are returned by Big Query.
My question: in this situation, and in general, how can we safely abort/cancel an import to Big Query without having the table quietly import on us without us knowing. This would actually apply to any table regardless of its state, as long as it hasn't finished importing.
Thanks.
You may be hitting job load rate limits. For example, if you try to start more than two load jobs per minute for the same table, the load jobs against that table will be defferred, while other load jobs against other tables may continue at normal speed.
There are per-project limits on rate at which load jobs will be started and limits on the number of load jobs that can be running per project at any one time. If you send jobs faster than this, we'll queue, but as you've noticed, our queueing is not a fair queue, and can start newer jobs before older ones.
Aborting pending jobs is a commonly requested feature. If you file a feature request here that will help us prioritize it.

having a script autorun on a web server

sorry if my question is a bit ambiguous, I'll explain what i want to do.
i want to run a game on a webserver. its a turn based game, some of you people might have come across it.
Its a game called mafia: http://mafiascum.net/wiki/index.php?title=Newbie_Guide.
I know how it needs to work in terms of a mysql database a server side scripting language etc etc.
What i am not sure about is whats the best way to get a script to activate when the game starts, and be able to run a script every 3 minutes to update the game status:
once 10 people join the game starts
people vote during a 3 minute period. (votes would be stored in a database)
after 3 minutes a script needs to run to calculate the votes and remove a player
then 1 and a half minutes later the script needs to run again.
This cycle of 3 minutes, 1 and a half minutes need to repeat until a certain condition is met, i.e all players but 2 are dead or something.
when players refresh the page they need to be updated on the games status.
Ive read about sockets, and wonder if this might be a good path to take. would sockets be able to send json back to the clients? so that jquery can then update the client with game results.
Ideally i would like the the front end to be done in jquery and the backend script processing to be done by php or something.
How open would this be? in terms of people trying to cheat by sending attacks such as post variables sqli attacks etc etc.
Its quite a broad question, and i am sure there is more than one approcah so is more than one correct answer, but i would be intrested on peoples thoughts on how they would go about developing it.
Thanks for your time :)
I would simply use a CRON job or similar on the backend to update the status every x seconds as you have suggested.
To trigger a game start, simply fire off a PHP command to set your CRON job running.
This way the timing is controlled behind the scenes on the server, and you are free to update the status of the game using jQuery to your actual players.

Worker process reached its allowed processing time

We are experiencing this issue approximately once a month. It is very hard to pinpoint the cause so any help would be appreciated. This causes the App pool to stop and brings the site down. We have gone through all log files and have concluded nothing. We are using the 2.0.3 version on IIS 6.
I've noticed IIS defaults web apps on a 29-hour recycle schedule, which can be troublesome since it may recycle at times your users do not expect it to.
For example: web app starts at 12 am, which means the next day it recycles at 5am, the day after that at 10am, the day after that at 3pm, etc. (this is assuming there is enough request activity against your app to keep it alive so it does not shutdown due to inactivity)
If your web app relies heavily on in-memory session state this is especially bad because the recycle will kill sessions and possibly force users to re-authenticate and lose any unsaved work. (if you don't design your app to work seamlessly with recycling)
Check the recycle schedule and make sure it recycles at a time that you expect. See this for screenshots: http://remy.supertext.ch/2010/08/iis7-worker-process-reached-its-allowed-processing-time-limit/
Not sure about the infinite loop suggestion... sounds like you just have a recycling configuration issue to resolve.
This likely indicates an infinite loop in your application code.
Basically, every time a request comes into the web server, IIS hands the request off to a worker process. You can configure in IIS how many of those workers there are, and what the timeout value is. The timeout is to keep things moving in case the application code hangs -- it gets killed so the thread can go back in the pool to keep servicing new requests.
So look through your code for likely infinite loops. Or alternatively, it could be an extremely long-running database query that could have eventually finished but exceeded the timeout value. Perhaps your web application offers the end user an opportunity to make too broad of a query that returns too much data or requires too much DB processing time.
It's hard to give a specific cause for you, of course, but try to think along these lines.
If you're experiencing a crash as a result (sounds like you are) then you might want to grab a copy of Debugging Tools for Windows and spend some time reading Tess Ferrandez' blog--she offers great advice on performing post mortem crash analysis and makes WinDbg a whole lot more approachable.