concatenate results after multiprocessing - pandas

I have a function which is creating a data frame by doing multiprocessing on a df:-
Suppose if I am having 10 rows in my df so the function processor will process all 10 rows separately. what I want is to concatenate all the output of the function processor and make one data frame.
def processor(dff):
"""
reading data from a data frame and doing all sorts of data manipulation
for multiprocessing
"""
return df
def main(infile, mdebug):
global debug
debug = mdebug
try:
lines = sum(1 for line in open(infile))
except Exception as err:
print("Error {} opening file: {}").format(err, infile)
sys.exit(2000)
if debug >= 2:
print(infile)
try:
dff = pd.read_csv(infile)
except Exception as err:
print("Error {}, opening file: {}").format(err, infile)
sys.exit(2000)
df_split = np.array_split(dff, (lines+1))
cores = multiprocessing.cpu_count()
cores = 64
# pool = Pool(cores)
pool = Pool(lines-1)
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
if frame is not None:
frame.to_csv('{}'.format(n))
pool.close()
pool.join()
if __name__ == "__main__":
args = parse_args()
"""
print "Debug is: {}".format(args.debug)
"""
if args.debug >= 1:
print("Running in debug mode: "), args.debug
main(infile=args.infile, mdebug=args.debug)

you can use either the data frame constructor or concat to solve your problem. the appropriate one to use depends on details of your code that you haven't included
here's a more complete example:
import numpy as np
import pandas as pd
# create dummy dataset
dff = pd.DataFrame(np.random.rand(101, 5), columns=list('abcde'))
# process data
with Pool() as pool:
result = pool.map(processor, np.array_split(dff, 7))
# put it all back together in one dataframe
result = np.concat(result)

Related

passing panda dataframe data to functions and its not outputting the results

In my code, I am trying to extract data from csv file to use in the function, but it doesnt output anything, and gives no error. My code works because I tried it with just numpy array as inputs. not sure why it doesnt work with panda.
import numpy as np
import pandas as pd
import os
# change the current directory to the directory where the running script file is
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# finding best fit line for y=mx+b by iteration
def gradient_descent(x,y):
m_iter = b_iter = 1 #starting point
iteration = 10000
n = len(x)
learning_rate = 0.05
last_mse = 10000
#take baby steps to reach global minima
for i in range(iteration):
y_predicted = m_iter*x + b_iter
#mse = 1/n*sum([value**2 for value in (y-y_predicted)]) # cost function to minimize
mse = 1/n*sum((y-y_predicted)**2) # cost function to minimize
if (last_mse - mse)/mse < 0.001:
break
# recall MSE formula is 1/n*sum((yi-y_predicted)^2), where y_predicted = m*x+b
# using partial deriv of MSE formula, d/dm and d/db
dm = -(2/n)*sum(x*(y-y_predicted))
db = -(2/n)*sum((y-y_predicted))
# use current predicted value to get the next value for prediction
# by using learning rate
m_iter = m_iter - learning_rate*dm
b_iter = b_iter - learning_rate*db
print('m is {}, b is {}, cost is {}, iteration {}'.format(m_iter,b_iter,mse,i))
last_mse = mse
#x = np.array([1,2,3,4,5])
#y = np.array([5,7,8,10,13])
#gradient_descent(x,y)
df = pd.read_csv('Linear_Data.csv')
x = df['Area']
y = df['Price']
gradient_descent(x,y)
My code works because I tried it with just numpy array as inputs. not sure why it doesnt work with panda.
Well no, your code also works with pandas dataframes:
df = pd.DataFrame({'Area': [1,2,3,4,5], 'Price': [5,7,8,10,13]})
x = df['Area']
y = df['Price']
gradient_descent(x,y)
Above will give you the same output as with numpy arrays.
Try to check what's in Linear_Data.csv and/or add some print statements in the gradient_descent function just to check your assumptions. I would suggest to first of all add a print statement before the condition with the break statement:
print(last_mse, mse)
if (last_mse - mse)/mse < 0.001:
break

Pandas Making multiple HTTP requests

I have below code that reads from a csv file a number of ticker symbols into a dataframe.
Each ticker calls the Web Api returning a dafaframe df which is then attached to the last one until complete. The code works , but when a large number of tickers is used the code slows down tremendously. I understand I can use multiprocessing and threads to speed up my code but dont know where to start and what would be the most suited in my particular case.
What code should I use to get my data into a combined daframe in the fastest possible manner?
import pandas as pd
import numpy as np
import json
tickers=pd.read_csv("D:/verhuizen/pensioen/MULTI.csv",names=['symbol','company'])
read_str='https://financialmodelingprep.com/api/v3/income-statement/AAPL?limit=120&apikey=demo'
df = pd.read_json (read_str)
df = pd.DataFrame(columns=df.columns)
for ind in range(len(tickers)):
read_str='https://financialmodelingprep.com/api/v3/income-statement/'+ tickers['symbol'][ind] +'?limit=120&apikey=demo'
df1 = pd.read_json (read_str)
df=pd.concat([df,df1], ignore_index=True)
df.set_index(['date','symbol'], inplace=True)
df.sort_index(inplace=True)
df.to_csv('D:/verhuizen/pensioen/MULTI_out.csv')
The code provided works fine for smaller data sets, but when I use a large number of tickers (>4,000) at some point I get the below error. Is this because the web api gets overloaded or is there another problem?
Traceback (most recent call last):
File "D:/Verhuizen/Pensioen/Equity_Extractor_2021.py", line 43, in <module>
data = pool.starmap(download_data, enumerate(TICKERS, start=1))
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00C33E30>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
It keeps giving the same error (for a larger amount of tickers)
code is exactly as provided:
def download_data(pool_id, symbols):
df = []
for symbol in symbols:
print("[{:02}]: {}".format(pool_id, symbol))
#do stuff here
read_str = BASEURL.format(symbol)
df.append(pd.read_json(read_str))
#df.append(pd.read_json(fake_data(symbol)))
return pd.concat(df, ignore_index=True)
It failed again with the pool.map, but one strange thing I noticed. Each time it fails it does so around 12,500 tickers (total is around 23,000 tickers) Similar error:
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 21, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x078D1BF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
I get the tickers also from a API call https://financialmodelingprep.com/api/v3/financial-statement-symbol-lists?apikey=demo (I noticed it does not work without subscription), I wanted to attach the data it as a csv file but I dont have sufficient rights. I dont think its a good idea to paste the returned data here...
I tried adding time.sleep(0.2) before return as suggested, but again I ge the same error at ticker 12,510. Strange everytime its around the same location. As there are multiple processes going on I cannot see at what point its breaking
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 24, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00F32C90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
Something very very strange is going on , I have split the data in chunks of 10,000 / 5,000 / 4,000 and 2,000 and each time the code breaks approx 100 tickers from the end. Clearly there is something going on that not right
import time
import pandas as pd
import multiprocessing
# get tickers from your csv
df=pd.read_csv('D:/Verhuizen/Pensioen/All_Symbols.csv',header=None)
# setting the Dataframe to a list (in total 23,000 tickers)
df=df[0]
TICKERS=df.tolist()
#Select how many tickers I want
TICKERS=TICKERS[0:2000]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def download_data(symbol):
print(symbol)
# do stuff here
read_str = BASEURL.format(symbol)
df = pd.read_json(read_str)
#time.sleep(0.2)
return df
if __name__ == "__main__":
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.map(download_data, TICKERS)
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
df.to_csv('D:/verhuizen/pensioen/Income_2000.csv')
In this particular example the code breaks at position 1,903
RPAI
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_testing.py", line 27, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0793EAF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
First optimization is to avoid concatenate your dataframe at each iteration.
You can try something like that:
url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
df = []
for symbol in tickers["symbol"]:
read_str = url.format(symbol)
df.append(pd.read_json(read_str))
df = pd.concat(df, ignore_index=True)
If it's not sufficient, we will see to use async, threading or multiprocessing.
Edit:
The code below can do the job:
import pandas as pd
import numpy as np
import multiprocessing
import time
import random
PROCESSES = 4 # number of parallel process
CHUNKS = 6 # one process handle n symbols
# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
"XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
"TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
"AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
"MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
"CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]
# create a list of n sublist
TICKERS = [TICKERS[i:i + CHUNKS] for i in range(0, len(TICKERS), CHUNKS)]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def fake_data(symbol):
dti = pd.date_range("1985", "2020", freq="Y")
df = pd.DataFrame({"date": dti, "symbol": symbol,
"A": np.random.randint(0, 100, size=len(dti)),
"B": np.random.randint(0, 100, size=len(dti))})
time.sleep(random.random()) # to simulate network delay
return df.to_json()
def download_data(pool_id, symbols):
df = []
for symbol in symbols:
print("[{:02}]: {}".format(pool_id, symbol))
# do stuff here
# read_str = BASEURL.format(symbol)
# df.append(pd.read_json(read_str))
df.append(pd.read_json(fake_data(symbol)))
return pd.concat(df, ignore_index=True)
if __name__ == "__main__":
with multiprocessing.Pool(PROCESSES) as pool:
data = pool.starmap(download_data, enumerate(TICKERS, start=1))
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
In this example, I split the list of tickers into sublists for each process retrieves data for multiple symbols and limits overhead due to create and destroy processes.
The delay is to simulate the response time from the network connection and highlight the multiprocess behaviour.
Edit 2: simpler but naive version for your needs
import pandas as pd
import multiprocessing
# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
"XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
"TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
"AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
"MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
"CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def download_data(symbol):
print(symbol)
# do stuff here
read_str = BASEURL.format(symbol)
df = pd.read_json(read_str)
return df
if __name__ == "__main__":
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.map(download_data, TICKERS)
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
Note about pool.map: for each symbol in TICKERS, create a process and call function download_data.

Correct way of passing dataframe to ray

I am trying to do the simplest thing with Ray, but no matter what I do it just never releases memory and fails.
The usage case is simply
read parquet files to DF -> pass to pool of actors -> make changes to DF -> return DF
class Main_func:
def calculate(self,data):
#do some things with the DF
return df.copy(deep=True) <- one of many attempts to fix the problem, but didnt work
cpus = 24
actors = []
for _ in range(cpus):
actors.append(Main_func.remote())
from ray.util import ActorPool
pool = ActorPool(actors)
import os
arr = os.listdir("/some/files")
def to_ray():
try:
filename = arr.pop(0)
pf = ParquetFile("/some/files/" + filename)
df = pf.to_pandas()
pool.submit(lambda a,v:a.calculate.remote(v),df.copy(deep=True)
except Exception as e:
print(e)
for _ in range(cpus):
to_ray()
while(True):
res = pool.get_next_unordered()
write('./temp/' + random_filename, res,compression='GZIP')
del res
to_ray()
I have tried other ways of doing the same thing, manually submitting rather than the map command, but whatever i do it always locks memory and fails after a few 100 dataframes.
Does each task needs to preserve state among different files? Ray has tasks abstraction that should simplify things:
import ray
ray.init()
#ray.remote
def read_and_write(path):
df = pd.read_parquet(path)
... do things
df.to_parquet("./temp/...")
import os
arr = os.listdir("/some/files")
results = ray.get([read_and_write.remote(path) for path in arr])

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be 'map'ed to various partitions of a dask dataframe which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I have gotten. The result 'res' returned on compute is a list of 3 delayed objects. When I try to compute each of them in a loop (last tow lines of code) this results in a "TypeError: 'DataFrame' object is not callable" After going through the examples for map_partitions, I also tried altering the input DF (inplace) in the function with no return value which causes a similar TypeError with NoneType. What am I missing?
Also, looking at the visualization (attached) I feel like there is a need for reducing the individually computed (folded) partitions into a single DF. How do I do this?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import math
foldbucketsecs=30
periodicitysecs=15
secsinday=24 * 60 * 60
chunksizesecs=60 # 1 minute
numts = 5
start = 1525132800 # 01/05
end = 1525132800 + (3 * 60) # 3 minute
c = Client('127.0.0.1:8786')
def fold(df, start, bucket):
return df
def reduce_folds(df):
return df
def load(epoch):
idx = []
for ts in range(0, chunksizesecs, periodicitysecs):
idx.append(epoch + ts)
d = np.random.rand(chunksizesecs/periodicitysecs, numts)
ts = []
for i in range(0, numts):
tsname = "ts_%s" % (i)
ts.append(tsname)
gts.append(tsname)
res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
res.index = pd.to_datetime(arg=res.index, unit='s')
return res
gts = []
load(start)
cols = len(gts)
idx1 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+periodicitysecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx1[:0], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(start, end, chunksizesecs)]
from_delayed = dd.from_delayed(dfs, meta, 'sorted')
nfolds = int(math.ceil((end - start)/foldbucketsecs))
cprime = nfolds * cols
gtsnew = []
for i in range(0, cprime):
gtsnew.append("ts_%s,fold=%s" % (i%cols, i/cols))
idx2 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+foldbucketsecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx2[:0], data=[], columns=gtsnew, dtype=np.float64)
folded_df = from_delayed.map_partitions(delayed(fold)(from_delayed, start, foldbucketsecs), meta=meta)
result = c.submit(reduce_folds, folded_df)
c.gather(result).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
res = c.gather(result).compute()
for f in res:
f.compute()
Never mind! It was my fault, instead of wrapping my function in delayed I simply passed it to the map_partitions call like so and it worked.
folded_df = from_delayed.map_partitions(fold, start, foldbucketsecs, nfolds, meta=meta)

Is there a way to get tensorflow tf.Print output to appear in Jupyter Notebook output

I'm using the tf.Print op in a Jupyter notebook. It works as required, but will only print the output to the console, without printing in the notebook. Is there any way to get around this?
An example would be the following (in a notebook):
import tensorflow as tf
a = tf.constant(1.0)
a = tf.Print(a, [a], 'hi')
sess = tf.Session()
a.eval(session=sess)
That code will print 'hi[1]' in the console, but nothing in the notebook.
Update Feb 3, 2017
I've wrapped this into memory_util package. Example usage
# install memory util
import urllib.request
response = urllib.request.urlopen("https://raw.githubusercontent.com/yaroslavvb/memory_util/master/memory_util.py")
open("memory_util.py", "wb").write(response.read())
import memory_util
sess = tf.Session()
a = tf.random_uniform((1000,))
b = tf.random_uniform((1000,))
c = a + b
with memory_util.capture_stderr() as stderr:
sess.run(c.op)
print(stderr.getvalue())
** Old stuff**
You could reuse FD redirector from IPython core. (idea from Mark Sandler)
import os
import sys
STDOUT = 1
STDERR = 2
class FDRedirector(object):
""" Class to redirect output (stdout or stderr) at the OS level using
file descriptors.
"""
def __init__(self, fd=STDOUT):
""" fd is the file descriptor of the outpout you want to capture.
It can be STDOUT or STERR.
"""
self.fd = fd
self.started = False
self.piper = None
self.pipew = None
def start(self):
""" Setup the redirection.
"""
if not self.started:
self.oldhandle = os.dup(self.fd)
self.piper, self.pipew = os.pipe()
os.dup2(self.pipew, self.fd)
os.close(self.pipew)
self.started = True
def flush(self):
""" Flush the captured output, similar to the flush method of any
stream.
"""
if self.fd == STDOUT:
sys.stdout.flush()
elif self.fd == STDERR:
sys.stderr.flush()
def stop(self):
""" Unset the redirection and return the captured output.
"""
if self.started:
self.flush()
os.dup2(self.oldhandle, self.fd)
os.close(self.oldhandle)
f = os.fdopen(self.piper, 'r')
output = f.read()
f.close()
self.started = False
return output
else:
return ''
def getvalue(self):
""" Return the output captured since the last getvalue, or the
start of the redirection.
"""
output = self.stop()
self.start()
return output
import tensorflow as tf
x = tf.constant([1,2,3])
a=tf.Print(x, [x])
redirect=FDRedirector(STDERR)
sess = tf.InteractiveSession()
redirect.start();
a.eval();
print "Result"
print redirect.stop()
I ran into the same problem and got around it by using a function like this in my notebooks:
def tf_print(tensor, transform=None):
# Insert a custom python operation into the graph that does nothing but print a tensors value
def print_tensor(x):
# x is typically a numpy array here so you could do anything you want with it,
# but adding a transformation of some kind usually makes the output more digestible
print(x if transform is None else transform(x))
return x
log_op = tf.py_func(print_tensor, [tensor], [tensor.dtype])[0]
with tf.control_dependencies([log_op]):
res = tf.identity(tensor)
# Return the given tensor
return res
# Now define a tensor and use the tf_print function much like the tf.identity function
tensor = tf_print(tf.random_normal([100, 100]), transform=lambda x: [np.min(x), np.max(x)])
# This will print the transformed version of the tensors actual value
# (which was summarized to just the min and max for brevity)
sess = tf.InteractiveSession()
sess.run([tensor])
sess.close()
FYI, using a logger instead of calling "print" in my custom function worked wonders for me as the stdout is often buffered by jupyter and not shown before "Loss is Nan" kind of errors -- which was the whole point in using that function in the first place in my case.
You can check the terminal where you launched the jupyter notebook to see the message.
import tensorflow as tf
tf.InteractiveSession()
a = tf.constant(1)
b = tf.constant(2)
opt = a + b
opt = tf.Print(opt, [opt], message="1 + 2 = ")
opt.eval()
In the terminal, I can see:
2018-01-02 23:38:07.691808: I tensorflow/core/kernels/logging_ops.cc:79] 1 + 2 = [3]
A simple way, tried it in regular python, but not jupyter yet.
os.dup2(sys.stdout.fileno(), 1)
os.dup2(sys.stdout.fileno(), 2)
Explanation is here: In python, how to capture the stdout from a c++ shared library to a variable
The issue that I faced was that one can't run a session inside a Tensorflow Graph, like in the training or in the evaluation.
That's why the options to use sess.run(opt) or opt.eval() were not a solution for me.
The best thing was to use tf.Print() and redirect the logging to an external file.
I did this using a temporal file, which I transferred to a regular file like this:
STDERR=2
import os
import sys
import tempfile
class captured:
def __init__(self, fd=STDERR):
self.fd = fd
self.prevfd = None
def __enter__(self):
t = tempfile.NamedTemporaryFile()
self.prevfd = os.dup(self.fd)
os.dup2(t.fileno(), self.fd)
return t
def __exit__(self, exc_type, exc_value, traceback):
os.dup2(self.prevfd, self.fd)
with captured(fd=STDERR) as tmp:
...
classifier.evaluate(input_fn=input_fn, steps=100)
with open('log.txt', 'w') as f:
print(open(tmp.name).read(), file=f)
And then in my evaluation I do:
a = tf.constant(1)
a = tf.Print(a, [a], message="a: ")