How to get CNTK Parallel Speedup on SequenceToSequence Example

I cloned the CNTK repository last week and built it using Nvidia-docker running on a p2.8xlarge instance on AWS. Everything appears to be working, except that I'm not getting a speedup from running multiple GPUs when enabling 1-bit SGD. I'm running the CMUDict Sequence2Sequence_distributed.py example. Here is my transcript when I run it on one GPU:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# python Sequence2Sequence_Distributed.py
Selected GPU[0] Tesla K80 as the process wide default device.
ping [requestnodes (before change)]: 1 nodes pinging each other
ping [requestnodes (after change)]: 1 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
ping [mpihelper]: 1 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Finished Epoch[1 of 160]: [Training] loss = 4.234002 * 64, metric = 98.44% * 64 3.014s ( 21.2 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.231473 * 71, metric = 85.92% * 71 1.013s ( 70.1 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.227827 * 61, metric = 81.97% * 61 0.953s ( 64.0 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.227088 * 68, metric = 86.76% * 68 0.970s ( 70.1 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.222957 * 62, metric = 88.71% * 62 0.922s ( 67.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.221479 * 63, metric = 84.13% * 63 0.950s ( 66.3 samples/s);
Here's the transcript when I run two GPUs:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1
Selected GPU[0] Tesla K80 as the process wide default device.
Selected CPU as the process wide default device.
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (0) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (1) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Here's an error message -- does this mean the GPUs are not being utilized when I run the job as two MPI processes? How would I fix this?
NcclComm: disabled, at least one rank using CPU device
NcclComm: disabled, at least one rank using CPU device
You can see the number of samples/s is down:
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.377s ( 13.6 samples/s);
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.877s ( 12.4 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.196s ( 30.5 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.197s ( 30.5 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.667s ( 27.6 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.666s ( 27.6 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);

I was able to fix the initial issue by getting nvidia-docker to expose multiple GPUs to the container, i.e. by telling nvidia-docker which GPUs to use. For example, if you want to use 2 GPUs:
NV_GPU=0,1 nvidia-docker run ....

By default, Sequence2Sequence_Distributed.py runs data-parallel SGD with 32-bit (full-precision) gradient aggregation. Can you try other distributed training algorithms, such as block momentum with a warm start?
Also, consider increasing your minibatch size (the default is 16) if you want better parallelism across multiple GPUs. You can use nvidia-smi to check GPU utilization and memory usage.
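For reference, here is a minimal sketch of how a distributed learner can be created with the CNTK Python API. It assumes a model, loss, and metric already exist, and the learning rate, block_size, and warm-start values below are illustrative rather than tuned:
import cntk as C

# Base (per-worker) learner; the schedules here are illustrative values.
lr = C.learning_rate_schedule(0.001, C.UnitType.sample)
mom = C.momentum_schedule(0.9)
local_learner = C.momentum_sgd(model.parameters, lr, mom)

# Option 1: data-parallel SGD with 1-bit quantization after a warm start
# (requires a build with 1-bit SGD enabled).
dist_learner = C.train.distributed.data_parallel_distributed_learner(
    local_learner,
    num_quantization_bits=1,
    distributed_after=64000)   # process this many samples on one worker first

# Option 2: block-momentum SGD, which often scales better for sequence models.
# dist_learner = C.train.distributed.block_momentum_distributed_learner(
#     local_learner, block_size=3200)

trainer = C.Trainer(model, (loss, metric), [dist_learner])
# ... run the usual training loop with trainer.train_minibatch(...) ...
C.train.distributed.Communicator.finalize()
As in your transcript, launch the script with mpiexec so that each MPI rank drives one GPU.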

Related

Is there a way to implement equations as Dymos path constraints?

For example, if I have a function h_max(mach) and I want the altitude to always respect this predefined altitude-Mach relationship throughout the flight envelope, how could I implement this?
I have tried calculating the limit quantity (in this case, h_max) as its own state, calculating another state as h_max - h, and then constraining that state to be greater than 0 through a path constraint. This approach worked, but it involved two explicit components, a group, and a lot of extra coding just to get one constraint working. I was wondering if there is a better way?
Thanks so much in advance.
The next version of Dymos, 1.7.0, will be released soon and will support this.
In the meantime, you can install the latest development version of Dymos directly from GitHub to get access to this capability:
python -m pip install git+https://github.com/OpenMDAO/dymos.git
Then, you can define boundary and path constraints with an equation. Note the equation must have an equals sign in it, and then lower, upper, or equals will apply to the result of the equation.
In reality, dymos is just inserting an OpenMDAO ExecComp for you under the hood, so the one caveat to this is that your expression must be compatible with complex-step differentiation.
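For intuition, the constraint expression used below (path_y = y - 5) behaves roughly like an ExecComp you could have built yourself against the timeseries outputs. This is only an illustrative sketch; the shapes, names, and wiring of the component Dymos actually generates are internal details:
import numpy as np
import openmdao.api as om

# Stand-alone sketch of the kind of component Dymos inserts for 'path_y = y - 5'.
nn = 30  # assumed number of timeseries nodes
p = om.Problem()
p.model.add_subsystem('exec', om.ExecComp('path_y = y - 5.0',
                                          path_y={'shape': (nn,)},
                                          y={'shape': (nn,), 'units': 'm'}))
p.setup()
p.set_val('exec.y', np.linspace(10.0, 5.0, nn))
p.run_model()
print(p.get_val('exec.path_y'))  # values that the path constraint bounds from below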
Here's an example of the brachistochrone that uses constraint expressions to set the final y value to a specific value while satisfying a path constraint defined with a second equation.
import openmdao.api as om
import dymos as dm
from dymos.examples.plotting import plot_results
from dymos.examples.brachistochrone import BrachistochroneODE
import matplotlib.pyplot as plt
#
# Initialize the Problem and the optimization driver
#
p = om.Problem(model=om.Group())
p.driver = om.ScipyOptimizeDriver()
p.driver.declare_coloring()
#
# Create a trajectory and add a phase to it
#
traj = p.model.add_subsystem('traj', dm.Trajectory())
phase = traj.add_phase('phase0',
                       dm.Phase(ode_class=BrachistochroneODE,
                                transcription=dm.GaussLobatto(num_segments=10)))
#
# Set the variables
#
phase.set_time_options(fix_initial=True, duration_bounds=(.5, 10))
phase.add_state('x', fix_initial=True, fix_final=True)
phase.add_state('y', fix_initial=True, fix_final=False)
phase.add_state('v', fix_initial=True, fix_final=False)
phase.add_control('theta', continuity=True, rate_continuity=True,
                  units='deg', lower=0.01, upper=179.9)
phase.add_parameter('g', units='m/s**2', val=9.80665)
Y_FINAL = 5.0
Y_MIN = 5.0
phase.add_boundary_constraint(f'bcf_y = y - {Y_FINAL}', loc='final', equals=0.0)
phase.add_path_constraint(f'path_y = y - {Y_MIN}', lower=0.0)
#
# Minimize time at the end of the phase
#
phase.add_objective('time', loc='final', scaler=10)
p.model.linear_solver = om.DirectSolver()
#
# Setup the Problem
#
p.setup()
#
# Set the initial values
#
p['traj.phase0.t_initial'] = 0.0
p['traj.phase0.t_duration'] = 2.0
p.set_val('traj.phase0.states:x', phase.interp('x', ys=[0, 10]))
p.set_val('traj.phase0.states:y', phase.interp('y', ys=[10, 5]))
p.set_val('traj.phase0.states:v', phase.interp('v', ys=[0, 9.9]))
p.set_val('traj.phase0.controls:theta', phase.interp('theta', ys=[5, 100.5]))
#
# Solve for the optimal trajectory
#
dm.run_problem(p)
# Check the results
print('final time')
print(p.get_val('traj.phase0.timeseries.time')[-1])
p.list_problem_vars()
Note the constraints from the list_problem_vars() call that come from timeseries_exec_comp - this is the OpenMDAO ExecComp that Dymos automatically inserts for you.
--- Constraint Report [traj] ---
--- phase0 ---
[final] 0.0000e+00 == bcf_y [None]
[path] 0.0000e+00 <= path_y [None]
/usr/local/lib/python3.8/dist-packages/openmdao/recorders/sqlite_recorder.py:227: UserWarning:The existing case recorder file, dymos_solution.db, is being overwritten.
Model viewer data has already been recorded for Driver.
Full total jacobian was computed 3 times, taking 0.057485 seconds.
Total jacobian shape: (71, 51)
Jacobian shape: (71, 51) (12.51% nonzero)
FWD solves: 12 REV solves: 0
Total colors vs. total size: 12 vs 51 (76.5% improvement)
Sparsity computed using tolerance: 1e-25
Time to compute sparsity: 0.057485 sec.
Time to compute coloring: 0.054118 sec.
Memory to compute coloring: 0.000000 MB.
/usr/local/lib/python3.8/dist-packages/openmdao/core/total_jac.py:1585: DerivativesWarning:Constraints or objectives [('traj.phases.phase0.timeseries.timeseries_exec_comp.path_y', inds=[(0, 0)])] cannot be impacted by the design variables of the problem.
Optimization terminated successfully (Exit mode 0)
Current function value: [18.02999766]
Iterations: 14
Function evaluations: 14
Gradient evaluations: 14
Optimization Complete
-----------------------------------
final time
[1.80299977]
----------------
Design Variables
----------------
name val size indices
-------------------------- -------------- ---- ---------------------------------------------
traj.phase0.t_duration [1.80299977] 1 None
traj.phase0.states:x |12.14992234| 9 [1 2 3 4 5 6 7 8 9]
traj.phase0.states:y |22.69124774| 10 [ 1 2 3 4 5 6 7 8 9 10]
traj.phase0.states:v |24.46289861| 10 [ 1 2 3 4 5 6 7 8 9 10]
traj.phase0.controls:theta |266.48489386| 21 [ 0 1 2 3 4 5 ... 4 15 16 17 18 19 20]
-----------
Constraints
-----------
name val size indices alias
----------------------------------------------------------- ------------- ---- --------------------------------------------- ----------------------------------------------------
timeseries.timeseries_exec_comp.bcf_y [0.] 1 [29] traj.phases.phase0->final_boundary_constraint->bcf_y
timeseries.timeseries_exec_comp.path_y |15.73297378| 30 [ 0 1 2 3 4 5 ... 3 24 25 26 27 28 29] traj.phases.phase0->path_constraint->path_y
traj.phase0.collocation_constraint.defects:x |6e-08| 10 None None
traj.phase0.collocation_constraint.defects:y |7e-08| 10 None None
traj.phase0.collocation_constraint.defects:v |3e-08| 10 None None
traj.phase0.continuity_comp.defect_control_rates:theta_rate |0.0| 9 None None
----------
Objectives
----------
name val size indices
------------- ------------- ---- -------
traj.phase0.t [18.02999766] 1 -1

Unable to load large pandas dataframe to pyspark

I've been trying to join two large pandas dataframes with PySpark using the following code. I'm varying the executor cores allocated to the application and measuring the scalability of PySpark (strong scaling).
import gc
import math
import time

import pandas as pd
from numpy.random import default_rng
from pyspark.sql import SparkSession

r = 1000000000  # 1Bn rows
it = 10
w = 256
unique = 0.9
TOTAL_MEM = 240
TOTAL_NODES = 14

max_val = r * unique
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2))
frame_data1 = rng.integers(0, max_val, size=(r, 2))
print("data generated", flush=True)

df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print("data loaded", flush=True)

procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM * 0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)

spark = SparkSession \
    .builder \
    .appName(f'join {r} {w}') \
    .master('spark://node:7077') \
    .config('spark.executor.memory', f'{int(mem * 0.6)}g') \
    .config('spark.executor.pyspark.memory', f'{int(mem * 0.4)}g') \
    .config('spark.cores.max', w) \
    .config('spark.driver.memory', '100g') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .getOrCreate()

sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print("data loaded to spark", flush=True)

try:
    for i in range(it):
        t1 = time.time()
        out = sdf0.join(sdf1, on='col0', how='inner')
        count = out.count()
        t2 = time.time()
        print(f"timings {r} {w} {i} {(t2 - t1) * 1000:.0f} ms, {count}", flush=True)
        del out
        del count
        gc.collect()
finally:
    spark.stop()
Cluster:
I am using a standalone Spark cluster on 15 nodes, each with 48 cores and 240 GB of RAM. I've spawned the master and the driver code on node1, while the other 14 nodes run workers with the maximum memory allocated.
In the Spark session, I am reserving 90% of the total memory for the executors, splitting it 60% for the JVM and 40% for PySpark.
Issue:
When I run the above program, I can see that the executors are being assigned to the app, but it doesn't move forward even after 60 minutes. For a smaller row count (10M), this worked without a problem.
Driver output
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/cylonflow/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Negative initial size: -589934400
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
Any help on this is much appreciated.

Odoo14 service keep crash (Dumping stacktrace of limit exceeding threads before reloading)

I am facing the following error:
2022-05-30 15:00:26,943 1940 WARNING ? odoo.service.server: Server memory limit (4934283264) reached.
2022-05-30 15:00:26,954 1940 INFO ? odoo.service.server: Dumping stacktrace of limit exceeding threads before reloading
2022-05-30 15:00:26,997 1940 INFO ? odoo.tools.misc:
# Thread: <_MainThread(MainThread, started 140592199739200)> (db:n/a) (uid:n/a) (url:n/a)
File: "/opt/odoo/odoo14/odoo-bin", line 8, in <module>
odoo.cli.main()
File: "/opt/odoo/odoo14/odoo/cli/command.py", line 61, in main
o.run(args)
File: "/opt/odoo/odoo14/odoo/cli/server.py", line 178, in run
main(args)
File: "/opt/odoo/odoo14/odoo/cli/server.py", line 172, in main
rc = odoo.service.server.start(preload=preload, stop=stop)
File: "/opt/odoo/odoo14/odoo/service/server.py", line 1298, in start
rc = server.run(preload, stop)
File: "/opt/odoo/odoo14/odoo/service/server.py", line 546, in run
dumpstacks(thread_idents=[thread.ident for thread in self.limits_reached_threads])
File: "/opt/odoo/odoo14/odoo/tools/misc.py", line 957, in dumpstacks
for line in extract_stack(stack):
2022-05-30 15:00:27,007 1940 INFO ? odoo.service.server: Initiating server reload
and I tried several solutions, like increasing the following settings:
limit_request = 8192
limit_time_cpu = 600
limit_time_real = 1200
max_cron_threads = 1
limit_memory_hard = 536870637100
limit_memory_soft = 483183573400
but I am still facing the same issue shown in the error log; even when I restart the server, it hits the same error again within 30 minutes at most.
Best Regards.
Brother, take a look at this link: Configuration suggestions for Odoo server.
If you have a VPS with 4 CPU cores and 16 GB of RAM, the number of workers should be 9 (CPU cores * 2 + 1); the total limit-memory-soft value will be 640 MB x 9 = 5760 MB, and the total limit-memory-hard will be 768 MB x 9 = 6912 MB,
so Odoo will use at most about 5.6 GB of RAM.
Your server has 4 vCPUs, so try the below in your config file:
limit_memory_hard = 768 * 9 * 1024 * 1024 = 7247757312
limit_memory_soft = 640 * 9 * 1024 * 1024 = 6039797760
max_cron_threads = 1
workers = 8
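As a quick sanity check on those numbers, here is a small sketch (a hypothetical helper, not part of Odoo) that derives the worker count and byte values from the rule of thumb above:
# Hypothetical helper: derive Odoo worker/memory settings from the rule of thumb
# workers = cores * 2 + 1, with per-worker soft/hard limits given in MB.
def odoo_memory_settings(cpu_cores, soft_mb_per_worker=640, hard_mb_per_worker=768):
    workers = cpu_cores * 2 + 1
    mb = 1024 * 1024
    return {
        'workers': workers,
        'limit_memory_soft': soft_mb_per_worker * workers * mb,
        'limit_memory_hard': hard_mb_per_worker * workers * mb,
    }

print(odoo_memory_settings(4))
# {'workers': 9, 'limit_memory_soft': 6039797760, 'limit_memory_hard': 7247757312}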

How to deal with the error when using Gurobi with cvxpy: Unable to retrieve attribute 'BarIterCount'

How do I deal with the following error when using Gurobi with cvxpy: AttributeError: Unable to retrieve attribute 'BarIterCount'?
I have an integer programming problem, modeled in cvxpy with Gurobi set as the solver.
When the number of variables is small, the result is fine. Once the number of variables reaches a scale of about 43*13*6, the error occurs. I suppose it may be caused by the scale of the problem, such that the Gurobi solver cannot retrieve BarIterCount, which I took to be the maximum number of iterations needed.
Thus, I wonder: is there any way to manually set the BarIterCount attribute of Gurobi through the cvxpy interface? Or is there another way to solve this problem?
Thanks for any suggestions you may have.
The trace log is as follows:
If my model is small, for example when I set the number that controls the scale of the model to 3, then the program runs fine. The trace is:
Using license file D:\software\lib\site-packages\gurobipy\gurobi.lic
Restricted license - for non-production use only - expires 2022-01-13
Parameter OutputFlag unchanged
Value: 1 Min: 0 Max: 1 Default: 1
D:\software\lib\site-packages\cvxpy\reductions\solvers\solving_chain.py:326: DeprecationWarning: Deprecated, use Model.addMConstr() instead
solver_opts, problem._solver_cache)
Changed value of parameter QCPDual to 1
Prev: 0 Min: 0 Max: 1 Default: 0
Gurobi Optimizer version 9.1.0 build v9.1.0rc0 (win64)
Thread count: 16 physical cores, 32 logical processors, using up to 32 threads
Optimize a model with 126 rows, 370 columns and 2689 nonzeros
Model fingerprint: 0x70d49530
Variable types: 0 continuous, 370 integer (369 binary)
Coefficient statistics:
Matrix range [1e+00, 7e+00]
Objective range [1e+00, 1e+00]
Bounds range [1e+00, 1e+00]
RHS range [1e+00, 6e+00]
Found heuristic solution: objective 7.0000000
Presolve removed 4 rows and 90 columns
Presolve time: 0.01s
Presolved: 122 rows, 280 columns, 1882 nonzeros
Variable types: 0 continuous, 280 integer (279 binary)
Root relaxation: objective 4.307692e+00, 216 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 4.30769 0 49 7.00000 4.30769 38.5% - 0s
H 0 0 6.0000000 4.30769 28.2% - 0s
0 0 5.00000 0 35 6.00000 5.00000 16.7% - 0s
0 0 5.00000 0 37 6.00000 5.00000 16.7% - 0s
0 0 5.00000 0 7 6.00000 5.00000 16.7% - 0s
Cutting planes:
Gomory: 4
Cover: 9
MIR: 4
StrongCG: 1
GUB cover: 9
Zero half: 1
RLT: 1
Explored 1 nodes (849 simplex iterations) in 0.12 seconds
Thread count was 32 (of 32 available processors)
Solution count 2: 6 7
Optimal solution found (tolerance 1.00e-04)
Best objective 6.000000000000e+00, best bound 6.000000000000e+00, gap 0.0000%
If that number is 6, then the error occurs:
-------------------------------------------------------
Using license file D:\software\lib\site-packages\gurobipy\gurobi.lic
Restricted license - for non-production use only - expires 2022-01-13
Parameter OutputFlag unchanged
Value: 1 Min: 0 Max: 1 Default: 1
D:\software\lib\site-packages\cvxpy\reductions\solvers\solving_chain.py:326: DeprecationWarning: Deprecated, use Model.addMConstr() instead
solver_opts, problem._solver_cache)
Changed value of parameter QCPDual to 1
Prev: 0 Min: 0 Max: 1 Default: 0
Gurobi Optimizer version 9.1.0 build v9.1.0rc0 (win64)
Thread count: 16 physical cores, 32 logical processors, using up to 32 threads
Traceback (most recent call last):
File "model.py", line 274, in <module>
problem.solve(solver=cp.GUROBI,verbose=True)
File "D:\software\lib\site-packages\cvxpy\problems\problem.py", line 396, in solve
return solve_func(self, *args, **kwargs)
File "D:\software\lib\site-packages\cvxpy\problems\problem.py", line 754, in _solve
self.unpack_results(solution, solving_chain, inverse_data)
File "D:\software\lib\site-packages\cvxpy\problems\problem.py", line 1058, in unpack_results
solution = chain.invert(solution, inverse_data)
File "D:\software\lib\site-packages\cvxpy\reductions\chain.py", line 79, in invert
solution = r.invert(solution, inv)
File "D:\software\lib\site-packages\cvxpy\reductions\solvers\qp_solvers\gurobi_qpif.py", line 59, in invert
s.NUM_ITERS: model.BarIterCount,
File "src\gurobipy\model.pxi", line 343, in gurobipy.gurobipy.Model.__getattr__
File "src\gurobipy\model.pxi", line 1842, in gurobipy.gurobipy.Model.getAttr
File "src\gurobipy\attrutil.pxi", line 100, in gurobipy.gurobipy.__getattr
AttributeError: Unable to retrieve attribute 'BarIterCount'
Hopefully this provides more hints toward a solution.
BarIterCount is the number of barrier iterations performed to solve an LP. This is not a limit on the number of iterations and it should only be queried when the current optimization process has been finished. You cannot set this attribute either, of course.
To actually limit the number of iterations the barrier algorithm is allowed to take, you can use the parameter BarIterLimit.
Please inspect your log file for further information about the solver's behavior.
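If you do want to cap the number of barrier iterations, here is a minimal sketch; it assumes your cvxpy version forwards extra solve() keyword arguments to Gurobi as solver parameters, and the toy problem is made up purely to show the call:
import cvxpy as cp

# Toy integer program; the point is only how a Gurobi parameter is passed through.
x = cp.Variable(10, integer=True)
problem = cp.Problem(cp.Minimize(cp.sum(x)), [x >= 0, cp.sum(x) >= 5])

# Extra keyword arguments are handed to Gurobi as solver parameters, so BarIterLimit
# caps barrier iterations (assumption: this pass-through exists in your cvxpy version).
problem.solve(solver=cp.GUROBI, verbose=True, BarIterLimit=1000)
print(problem.value)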

Accessing epoch value across multiple threads using input_producer/limit_epochs/epochs:0 local variable

I tried to extract the current epoch number while reading data using multiple CPU threads. However, during a trial run I observed output which did not make any sense. Consider the code below:
import threading
import tensorflow as tf

# trainimgs is a list of image filename strings (10 files in this trial)
with tf.Session() as sess:
    train_filename_queue = tf.train.string_input_producer(trainimgs, num_epochs=4, shuffle=True)
    value = train_filename_queue.dequeue()
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)
    coord = tf.train.Coordinator()
    tf.train.start_queue_runners(coord=coord)
    collections = [v.name for v in tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES,
                                                     scope='input_producer/limit_epochs/epochs:0')]
    print(collections)
    threads = [threading.Thread(target=work, args=(coord, value, sess, collections))
               for i in range(20)]
    for t in threads:
        t.start()
    coord.join(threads)
    coord.request_stop()
The work function is defined as below:
def work(coord, val, sess, collections):
    counter = 0
    while not coord.should_stop():
        try:
            epoch = sess.run(collections[0])
            filename = sess.run(val).decode(encoding='UTF-8')
            print(filename + ' ' + str(epoch))
        except tf.errors.OutOfRangeError:
            coord.request_stop()
            return None
The output I obtain is the following:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:84:00.0
Total memory: 11.92GiB
Free memory: 11.80GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:84:00.0)
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 20 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 20 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): GeForce GTX TITAN X, Compute Capability 5.2
['input_producer/limit_epochs/epochs:0']
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4760.JPEG 0 2
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_703.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_11768.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3271.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1015.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_730.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1945.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3149.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4209.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_40.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_11768.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4760.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_703.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4209.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_40.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_730.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3271.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1015.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3149.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1945.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_40.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4209.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_730.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1945.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4760.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3271.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_703.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1015.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_11768.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3149.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4209.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_11768.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_4760.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_730.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_703.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3149.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_3271.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1945.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_1015.JPEG 0 4
/local/ujjwal/ILSVRC2015/Data/CLS-LOC/train/n01768244/n01768244_40.JPEG 0 4
The last number in each line corresponds to the value of the input_producer/limit_epochs/epochs:0 local variable.
For a first trial, I kept only 10 images in the queue, meaning I should get a total of 40 lines of output, which I do.
However, I should see equal numbers of 1, 2, 3, and 4 as the last number in each line, since each filename should be dequeued once in each of the 4 epochs.
Why am I getting the same number 4 in almost all the lines?
Further Information
I tried using range(1) (a single thread) and observed the same behavior.
Don't bother with the digit '0' after each filename; it is simply the label of the corresponding file, which is how I saved the image file names.
I did a lot of experiments and finally concluded the following.
I used to believe that tf.train.string_input_producer() enqueues filenames epoch-wise, meaning that first one complete epoch is enqueued (in multiple stages if the capacity is less than the number of filenames) and only then are further epochs enqueued.
That is not really the case.
When tf.train.start_queue_runners() is executed, all the epochs are enqueued together (in multiple stages if the capacity is less than the number of filenames). The local variable epochs:0 is used by tf.train.string_input_producer to track the epoch that is currently being enqueued. Once epochs:0 reaches num_epochs, it stays constant and does not change, no matter how many threads are dequeuing from the queue.
When you read epochs:0, you get the instantaneous value of that enqueue counter, i.e. it tells you which epoch of the dataset is being enqueued at that moment. It does not tell you which epoch of the dataset you are dequeuing.
So it is a bad idea to get the value of the current epoch from the epochs:0 local variable.
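If you only need a rough notion of which epoch a dequeued filename belongs to, one workaround is to derive it from a shared dequeue counter instead of epochs:0. The sketch below is a hypothetical helper and assumes each epoch contains exactly num_files filenames (10 in the trial above):
import threading

# Hypothetical helper: infer the dequeue epoch from a shared dequeue counter.
class EpochCounter:
    def __init__(self, num_files):
        self.num_files = num_files
        self.count = 0
        self.lock = threading.Lock()

    def next_epoch(self):
        with self.lock:
            epoch = self.count // self.num_files  # 0-based epoch of this dequeue
            self.count += 1
        return epoch

# In each worker thread, instead of reading epochs:0:
#   epoch = epoch_counter.next_epoch()
#   filename = sess.run(val).decode('UTF-8')
#   print(filename, epoch)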