How to reduce time allotted for a batch of HITs? - mechanicalturk

today I created a small batch of 20 categorization HITs with the name Grammatical or Ungrammatical using the web UI. Can you tell me the easiest way to manage this batch so that I can reduce its time allotted to 15 minutes from 1 hour and remove also remove the categorization of masters. This is a very simple task that's set to auto-approve within 1 hour, and I am fine with that. I just need to make it more lucrative for people to attempt this at the penny rate.

You need to register a new HITType with the relevant properties (reduced time and no masters qualification) and then perform a ChangeHITTypeOfHIT operation on all of the HITs in the batch.
API documentation here: http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ChangeHITTypeOfHITOperation.html

Related

Snakemake: Combine cluster profile with resources: attempt

Let's say I have a rule that for 90% of my data needs 1h, but occasionally needs 3h. In a busy cluster environment, I however do not want to submit all jobs with a time limit of 3h to be save, as this would slow down the scheduling of my jobs.
Hence, I played around with the attempt variable:
resources:
# Increase time limit in factors of 1h, if the job fails due to time limit.
time = lambda wildcards, input, threads, attempt: int(60 * int(attempt))
(one could be even smarter and use powers of 2 to amortize better...).
But this approach forces me to put the base time (1h) direclty into the rule. How can I combine this approach with cluster profiles, where the base time is in some cluster_config.yaml file?
Thanks and so long
Lucas

Collect statistics on current traffic with Bro

I want to collect statistics on traffic every 10 seconds and the only tool that I found is connection_state_remove event,
event connection_state_remove(c: connection)
{
SumStats::observe( "traffic", [$str="all"] [$num=c$orig$num_bytes_ip] );
}
how to deal with those connections that did not removed by the end of this period. How to get statistics from them?
The events you're processing are independent of the time interval at which the SumStats framework reports statistics. First, you need to define what exactly are the statistics you care about — for example, you may want to count the number of connections for which Bro completes processing in a given time interval. Second, you need to define the time interval (in your case, 10 seconds) and how to process the statistical observations in the SumStats framework. This latter part is missing in your snippet: you're only making an observation but not telling the framework what to do with it.
The examples in the SumStats documentation are very close to what you're looking for.

Waiting time of SUMO

I am using sumo for traffic signal control, and want to optimize the phase to reduce some objectives. During the process, I use the traci module as an output of states in traffic junction. The confusing part is traci.lane.getWaitingTime.
I don't know how the waiting time is calculated and also after I use two detectors as an output to observe, I think it is too large.
Can someone explain how the waiting time is calculated in SUMO?
The waiting time essentially counts the number of seconds a vehicle has a speed of less than 0.1 m/s. In the case of traci.lane this means it is the number of (nearly) standing vehicles multiplied with the time step length (since traci.lane returns the values for the last step).

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using command line/console. You'd probably save some dollars of instance's uptime. This is what I've been doing after running into issues with Dataflow/BigQuery interface. Also from my experience there is some overhead bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table OR 6M/per minute. At 31M rows of input that would take ~ 5 minutes of just flat out writes. When you add back the discrete processing time per element & then the synchronization time (read from GCS->dispatch->...) of the graph this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?

GAE Java API Channel

In http://code.google.com/intl/es-ES/appengine/docs/quotas.html#Channel you
can read that with billing enabled the maximum channel created rate is 60
creations/minute. Does it mean that we can created only 86,400
channels/day. It's very low rate, isn't it? And if i have estimated that I
could have peaks of for example: 4,000 creations/minute... What i can do?
60 creations/minute are few creations if the channels are 1to1... Is this
correct?
My interpretation of that section is that you will NOT be able to create 4k connections per minute. Here is how I would think about it: over ANY 1-minute period, no more than 60 channels can be created. For example, you can create 60 channels at time T. Then, for the next 60 seconds you won't be able to create any. Or, you can create 30 at time T. Then, every 2 seconds, create a channel.
I believe another way to think about this is in terms of the token bucket algorithm.
Anyway, I believe you can fill out this form to request a higher limit. There is a link to that form from the docs that you linked to in your question.