QueueDequeue step is a significant bottleneck in TensorFlow code

I've seen a few other questions about the dequeue step causing a bottleneck, but I have tried many of those suggestions without success:
- I am using multiple threads, based on the CPU count
- I have tried small batches (of 100) and larger batches (of 1000)
- I have tried switching to shuffle_batch_join and batch_join
None of these changes seems to help the overall time. Also, the extent to which the Dequeue step bottlenecks my code seems much worse than what others have experienced: in the timeline, all of the other steps practically disappear in comparison. I wonder whether this is partly from using BigQuery and the BigQueryReader as my source, although since other people also experience a slowdown, I assume it isn't the only cause.
I'm not entirely sure how to properly interpret this chart, but it doesn't seem like the problem is caused by a completely empty queue.
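For context, a timeline like this is typically captured in TF 1.x as follows (a sketch only; train_op is a hypothetical stand-in for whatever op is being profiled):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # train_op is a hypothetical stand-in for the profiled step
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())

The resulting timeline.json can be opened in chrome://tracing to see how long each op, including QueueDequeue, actually takes.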
Additional info (the sketch below shows how these fit together):
- capacity is set to batch size * 10
- min_after_dequeue is set to batch size * 2 + 1
- enqueue_many is set to True
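A minimal sketch of the queue setup these numbers describe (example is a hypothetical stand-in for the parsed record tensors; the BigQueryReader plumbing is omitted):

import tensorflow as tf

batch_size = 1000
# example is a hypothetical stand-in for the tensors read from BigQuery;
# with enqueue_many=True it must carry a leading batch dimension.
batch = tf.train.shuffle_batch(
    [example],
    batch_size=batch_size,
    capacity=batch_size * 10,
    min_after_dequeue=batch_size * 2 + 1,
    num_threads=8,  # e.g. based on CPU count
    enqueue_many=True)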
Does anyone have any thoughts on what else I might try to speed things up while still using BigQuery as the source of my data?

Related

GUROBI only uses a single core to set up a problem with cvxpy (python)

I have a large MILP that I build with cvxpy and want to solve with GUROBI. When I use cvxpy's solve() function, it takes a really, really long time to set up and does not start solving for hours. While doing that, only one core of my cluster is being used, at 100%. I would like to use multiple cores to build the model so that the build does not take so long. Running grbprobe also shows that Gurobi knows about the other cores, and for solving the problem it does use multiple cores.
I have tried running with different flags, i.e. turning presolve off and on, or specifying the number of Threads to use (this did not seem to have an effect, even for the solving).
I have also reduced the number of constraints in the problem, and it then starts solving much faster, which means this is definitely not a problem with the model itself.
The problem in its normal state has about 2200 constraints; when I reduced it to 150, it took only a couple of seconds until it started to search for a solution.
The problem is that I can't see anything, since it takes so long even to reach the "set username parameters" flag, and I get no information on what the computer is doing in the meantime.
Is there a way to tell GUROBI or CVXPY that it can use more CPUs for the build-up?
Is there another way to solve this problem?
Sorry. The first part of the solve (cvxpy model generation, setup, presolving, scaling, solving the root, preprocessing) is almost completely serial. The parallel part is when it really starts working on the branch-and-bound tree. For many problems, the parallel part is by far the most expensive, but not for all.
This is not only the case for Gurobi. Other high-end solvers have the same behavior.
There are options to do less presolving and preprocessing. That may get you into the branch-and-bound phase earlier. However, it is usually better not to touch these options.
Running things with verbose=True may give you more information. If you have more detailed questions, you may want to share the log.
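As a concrete example, a minimal sketch of running the solve with verbose output and explicit Gurobi parameters (the toy problem here is hypothetical; cvxpy forwards unrecognized keyword arguments to Gurobi as solver parameters):

import cvxpy as cp

# Hypothetical toy MILP standing in for the real 2200-constraint model.
x = cp.Variable(10, integer=True)
prob = cp.Problem(cp.Minimize(cp.sum(x)), [x >= 0, cp.sum(x) >= 5])

# verbose=True prints the full Gurobi log; Threads and Presolve are
# forwarded to Gurobi. Note that Threads only affects the solve itself,
# not the serial cvxpy model-generation phase.
prob.solve(solver=cp.GUROBI, verbose=True, Threads=8, Presolve=0)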

AnyLogic freezes after 36 replications

I'm running a stochastic experiment and would therefore like to do N=500 (or some reasonably large N) replications of the simulation before collecting averaged results.
I've set up a Monte Carlo experiment to do this, and because I was told AnyLogic doesn't naturally average outputs over replications, I cumulatively add the output of each replication and then, once all replications are finished, divide by the number of replications I ran. I don't store the outputs of each replication, just the cumulative value.
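For what it's worth, the accumulation described above is just a running sum divided at the end. A tiny sketch of the pattern, in Python for brevity (the AnyLogic version would live in the experiment's Java code; run_replication is a hypothetical stand-in for one simulation run):

def run_replication(seed):
    # hypothetical stand-in returning one replication's output
    return float(seed % 7)

N = 500
total = 0.0
for i in range(N):
    total += run_replication(i)  # keep only the cumulative value
mean_output = total / N          # divide once all replications finish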
My problem is that the experiment seems to freeze after 36 replications, and I'm not sure why. Note that each replication takes around 5 seconds to run (and they are not taking progressively longer each time).
Has anyone else experienced something like this, or can anyone suggest a way to diagnose the problem?
Yes, I have had this many times. Two options:
- Too little memory: increase the experiment's memory.
- It is a fault in your model and has nothing to do with AnyLogic :) . You need to do some investigating yourself; probably some special infinite loop is triggered in that iteration.

How to disable summary for Tensorflow Estimator?

I'm using Tensorflow-GPU 1.8 API on Windows 10. For many projects I use tf.Estimator, which works really well. It takes care of a bunch of steps, including writing summaries for TensorBoard. But right now the 'events.out.tfevents' file is getting way too big and I am running into "out of space" errors. For that reason I want to disable summary writing, or at least reduce the number of summaries written.
Going along with that mission, I found out about the RunConfig you can pass in when constructing the tf.Estimator. Apparently the parameter 'save_summary_steps' (which by default is 200) controls how summaries are written out. Unfortunately, changing this parameter seems to have no effect at all. It neither disables the summaries (using a None value) nor reduces (choosing higher values, e.g. 3000) the size of 'events.out.tfevents'.
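For reference, a minimal sketch of wiring save_summary_steps through RunConfig (my_model_fn and the model_dir are placeholders):

import tensorflow as tf

config = tf.estimator.RunConfig(
    save_summary_steps=3000,    # write summaries less frequently
    log_step_count_steps=3000)  # the step counter writes its own summaries

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,       # placeholder for your model function
    model_dir='/tmp/model',
    config=config)

One thing that may explain the "no effect" observation: log_step_count_steps (default 100) triggers summary writes of its own, so raising save_summary_steps alone does not necessarily shrink the events file.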
I hope you guys can help me out here. Any help is appreciated.
Cheers,
Tobs.
I've observed the following behavior. It doesn't make sense to me, so I hope we get a better answer:
- When the input_fn gets data from tf.data.TFRecordDataset, the number of steps between saved events is the minimum of save_summary_steps and (number of training examples divided by batch size). That means summaries are saved at least once per epoch.
- When the input_fn gets data from tf.TextLineReader, it follows save_summary_steps as you'd expect, and I can give it a large value for infrequent updates.
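For concreteness, a sketch of the first kind of input_fn (the file name and feature layout are hypothetical):

import tensorflow as tf

def input_fn():
    def parse(record):
        # hypothetical record layout: one float feature, one int64 label
        parsed = tf.parse_single_example(record, features={
            'x': tf.FixedLenFeature([1], tf.float32),
            'y': tf.FixedLenFeature([], tf.int64)})
        return {'x': parsed['x']}, parsed['y']

    dataset = tf.data.TFRecordDataset('train.tfrecords')
    return dataset.map(parse).batch(100).repeat()

With an input_fn like this, watch whether events are written once per epoch rather than every save_summary_steps steps.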

BigQuery interactive query response time degradation since 2/19/16

For the Google BigQuery infrastructure folks: we've been running a set of short-running interactive queries for many months now, averaging about 5 seconds to complete. Starting Friday 2/19, these response times have been rising steadily (the SQL has not changed, and we're dealing with a steady stream of data that we query using a sliding window).
Is this a global BigQuery issue you are aware of?
edit: more granular response times: [chart omitted]
There is good news and bad news; the good news is that the query took only 0.5 seconds to execute. The bad news is that it took 191 seconds to find the files where the data was stored.
We have a couple of performance regressions that cause high tail latency for resolving paths. Tables (like yours) where the data is stored in many paths will see worse performance.
This performance issue is exacerbated by the fact that you're using time-range decorators, which means that our efforts to optimize the file layout don't work as well.
We are starting to roll out a fix for the underlying performance problem this afternoon; it will likely take at least a week for it to take effect everywhere. I'll update this answer once it is complete (if I forget, please remind me).
In the meantime, you may get faster results by removing the time-range decorators from your queries. You are already filtering by time, so the queries should still be correct. Of course, this may mean the queries cost a bit more to run.
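For illustration, a sketch of the suggested change against a hypothetical mydataset.events table with a ts timestamp column (legacy SQL table decorators take millisecond timestamps; 1455840000000 is 2016-02-19 00:00 UTC):

# Query WITH a time-range decorator (the current approach):
with_decorator = """
SELECT COUNT(*) FROM [mydataset.events@1455840000000-1455926400000]
WHERE ts >= TIMESTAMP('2016-02-19') AND ts < TIMESTAMP('2016-02-20')
"""

# The same query WITHOUT the decorator; the WHERE clause already bounds
# the window, so results are unchanged, though more bytes may be scanned.
without_decorator = """
SELECT COUNT(*) FROM [mydataset.events]
WHERE ts >= TIMESTAMP('2016-02-19') AND ts < TIMESTAMP('2016-02-20')
"""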

boto dynamodb: is there a way to optimize batch writing?

I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):
def do_batch_write(items, conn, table):
    batch_list = conn.new_batch_write_list()
    batch_list.add_batch(table, puts=items)
    while True:
        response = conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        # identify unprocessed items and retry batch writing
I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? Also, I noticed that sometimes, even if items is shorter, it cannot process all of them in a single try. But there does not seem to be a correlation between how often this happens, or how many elements are left unprocessed after a try, and the original length of items. Is there a way to avoid this and write everything in one try? Now, the ultimate goal is to make processing faster, not just to avoid repeats, so sleeping for a long period between successive tries is not an option.
Thx
From the documentation:
"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."
The reason some items are not processed is probably that you are exceeding the provisioned write throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed?
I'm not aware of any way to increase the limit of 25 items per request, but you could try asking on the AWS Forums or through your support channel.
I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes (see the sketch below).
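A minimal sketch of that suggestion, reusing do_batch_write from the question (the chunk size of 25 matches the BatchWriteItem limit; the worker count is an arbitrary starting point):

from concurrent.futures import ThreadPoolExecutor

def parallel_batch_write(items, conn, table, workers=8):
    # split items into chunks of at most 25, the BatchWriteItem maximum
    chunks = [items[i:i + 25] for i in range(0, len(items), 25)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(do_batch_write, chunk, conn, table)
                   for chunk in chunks]
        for f in futures:
            f.result()  # surface any exception raised in a worker

Whether this helps depends on the table's provisioned write capacity; once DynamoDB starts throttling, extra threads mostly generate retries.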
From my experience, there is little to be gained by trying to optimize your write throughput using either batch writing or multithreading. Batch writing saves a little network time, and multithreading saves close to nothing, since the item size limit is quite low and the bottleneck is very often DDB throttling your requests.
So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
Ah, as garnaat said, latency inside the region is often really different (on the order of 15 ms versus 250 ms) from inter-region or outside AWS.
Increasing the Write Capacity alone will not necessarily make it faster.
If your HASH KEY diversity is poor, then even if you increase your write capacity, you can still get throughput errors.
Throughput errors depend on your hit map.
Example: if your hash key is a number between 1 and 10, and you have 10 records for each of the hash values 1 through 9 but 10k records with value 10, then you will get many throughput errors even while increasing your write capacity.
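A standard mitigation for such a hot key (not mentioned above, but widely used) is write sharding: append a random suffix to the hot hash key so writes spread across partitions. A sketch:

import random

NUM_SHARDS = 10  # arbitrary; size it to how hot the key is

def sharded_key(hot_key):
    # '10' becomes '10#0' .. '10#9', spreading writes across partitions
    return '%s#%d' % (hot_key, random.randrange(NUM_SHARDS))

Readers then have to query all NUM_SHARDS suffixes and merge the results, so this trades read complexity for write throughput.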