TensorFlow reproducibility for the following script?

For the following script https://github.com/abdulfatir/prototypical-networks-tensorflow/blob/master/ProtoNet-MiniImageNet-v2.ipynb, even when I add the statements tf.set_random_seed(1) and np.random.seed(1) at the beginning of the script, I still get different results across runs. I am not sure where the extra randomness is coming from.
Can anybody help me out?
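For reference, a minimal sketch of the seeding setup described above (assuming the TF 1.x API that the notebook uses):
import numpy as np
import tensorflow as tf
np.random.seed(1)      # seeds NumPy's global RNG
tf.set_random_seed(1)  # seeds TensorFlow's graph-level RNG (TF 1.x API)
Note that op-level seeds, GPU nondeterminism, and any shuffling done outside these two RNGs can still introduce run-to-run variation.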


Plots from Excel with pandas and seaborn: 'ufunc 'isfinite' not supported for the input types'

I am trying to configure a template for creating plots of my test data. I should say that I am pretty new to this in Python; I have already googled quite a lot, but what I found did not help. I have an Excel table with data in two columns, which I want to plot against each other. My code looks as follows:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
file = 'C:/Documents/Test/test_file.xlsx'
df1 = pd.read_excel(file, sheet_name='sheet1', header=0, engine="openpyxl")  # read_excel, not read_Excel
plt.figure()
sns.lineplot(data=df1, x="eps", y="sigma", sort=False, linewidth=0.8)  # pass the DataFrame itself, not df1[:,:]
The Excel file has, as mentioned, a header with eps and sigma as the x and y values. The values that follow are floats; when I check the datatype with df1.dtypes, the result is 'float64'. Does anyone have an idea what is not working? I get the error 'ufunc 'isfinite' not supported for the input types'.
This might be a library issue. I've been running into the same problem with example datasets and even a very simple:
sns.lineplot(x=[1], y=[1])
I'll update if I find a solution.
Edit: There seems to be an issue in NumPy that is breaking Seaborn here. The solution is to downgrade NumPy to 1.23 until 1.24.1 is released.
https://github.com/mwaskom/seaborn/issues/3192
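As a quick way to confirm whether that workaround applies to you, you can check the installed NumPy version before plotting (a minimal sketch; the affected release numbers are the ones mentioned in the issue above):
import numpy as np
import seaborn as sns
print(np.__version__)       # the downgrade applies if this prints 1.24.0
sns.lineplot(x=[1], y=[1])  # fails on the affected release, works on 1.23.x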

How to use Huggingface Data Collator

I was following this tutorial which comes with this notebook.
I plan to use Tensorflow for my project, so I followed this tutorial and added the line
tokenized_datasets = tokenized_datasets["train"].to_tf_dataset(columns=["input_ids"], shuffle=True, batch_size=16, collate_fn=data_collator)
to the end of the notebook.
However, when I ran it, I got the following error:
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.
Why didn't this work? How can I use the collator?
The issue is not your code, but how the collator is set up. (It's set up to not use Tensorflow by default.)
If you look at this, you'll see that their collator uses the return_tensors="tf" argument. If you add this to your collator, your code for using the collator will work.
In short, your collator creation should look like
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15, return_tensors="tf")
This will fix the issue.
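Put together, a minimal sketch of the fix (tokenizer and tokenized_datasets are the objects created earlier in the tutorial notebook):
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15, return_tensors="tf")  # return TF tensors instead of PyTorch ones
tf_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)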

How to make pandas show the entire dataframe without cropping it by columns?

I am trying to represent cubic spline interpolation information for function f(x) as a dataframe.
When I try to print it in Spyder, I find that the columns are being cut off. When I try to reproduce the output in JupyterLab, I get the same thing.
When I run it in IPython via the terminal, I get the desired full dataframe output.
I searched the internet and tried the pandas option-setting commands (pd.set_option()), but nothing came of it.
I attach a screenshot with the output in ipython.
In Jupyter you can use:
from IPython.display import display, HTML
and instead of
print(dataframe)
use, anywhere you like:
display(HTML(dataframe.to_html()))
This will create a nice table.
Unfortunately, this will not work in Spyder. There you can try to adjust the IPython display width as suggested in the other answers, but in most cases that makes the output cramped or unreadable.
After trying the dataframe display options, I found what appears to be the setting that controls this cropping.
In Spyder I used:
pd.set_option('expand_frame_repr', False)
print(dataframe)
This also explains why increasing max_columns didn't help me previously.
You can specify a maximum number of rows or columns using pd.set_option('display.max_columns', 1000).
But you don't have to pick an arbitrary value; use None instead to make sure every size is covered.
For rows, use:
pd.set_option('display.max_rows', None)
And for columns, use:
pd.set_option('display.max_columns', None)
It is a result of the display width. You can use the following set_option() calls:
pd.set_option('display.width', 1000)  # make it very wide
You may also have to raise max_columns, but pandas should be smart enough to adjust automatically once you make the width bigger:
pd.set_option('display.max_columns', None)
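Putting the options from these answers together, a minimal sketch that prints a wide DataFrame without the columns being cut off (the DataFrame here is just dummy data for illustration):
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)  # never hide columns
pd.set_option('display.width', 1000)        # total display width; or None to auto-detect in a terminal
pd.set_option('expand_frame_repr', False)   # keep all columns on one line instead of wrapping
dataframe = pd.DataFrame(np.random.rand(3, 30))  # 30 columns would normally be truncated
print(dataframe)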

How to get python to generate the tweedie deviance for xgboost?

Using statsmodel's GLM, the tweedie deviance is included in the summary function, but I don't know how to do this for xgboost. Reading the API didn't help either.
In Python this is how you do it. Suppose predictions is the output of your gradient-boosted tree and real are the actual target values. Then using statsmodels you would run this:
import statsmodels.api as sm
# note: the parameter is var_power, and the observed values come before the predictions
dev = sm.families.Tweedie(var_power=1.5).deviance(real, predictions)
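For a self-contained check with dummy numbers (the array values are made up for illustration; scikit-learn's mean_tweedie_deviance is an alternative that returns the mean rather than the total deviance):
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_tweedie_deviance
real = np.array([1.0, 0.0, 2.5, 4.0])         # observed targets
predictions = np.array([1.2, 0.1, 2.0, 3.5])  # e.g. xgboost predictions
total_dev = sm.families.Tweedie(var_power=1.5).deviance(real, predictions)
mean_dev = mean_tweedie_deviance(real, predictions, power=1.5)
print(total_dev, mean_dev)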

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built-in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from TensorFlow core to the TensorBoard backend. You can still use it to extract data from TensorBoard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
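If you want those values in a pandas DataFrame for further processing (a small add-on to the snippet above; 'Accuracy' is just the example tag used there):
import pandas as pd
df = pd.DataFrame({'wall_time': w_times, 'step': step_nums, 'value': vals})
print(df.head())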
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are protobufs encoded in RecordReader format. To get serialized strings out of the files you can use py_record_reader or build a graph with the TFRecordReader op, and to deserialize those strings into protobufs use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's EventAccumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e. g. the first one tagged 'generator/image/0'
img = acc.Images('generator/image/0')[0]  # Images() returns a list of ImageEvents
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)
You can also use tf.train.summary_iterator. To extract events from a ./logs folder where only classic scalars lr, acc, loss, val_acc and val_loss are present, you can use this GIST: tensorboard_to_csv.py
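A minimal sketch of reading an event file with summary_iterator directly (in TF 2.x it lives under tf.compat.v1; the event-file path is a placeholder):
import tensorflow as tf
for event in tf.compat.v1.train.summary_iterator('path/to/events.out.tfevents.xxx'):
    for value in event.summary.value:
        # simple_value holds classic scalars such as loss or accuracy
        print(event.step, value.tag, value.simple_value)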
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, TensorBoard will automatically sample them and only give you at most 10000 points. It is quite annoying underlying behavior, as it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all the data points, a slightly hacky way is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
    def __getitem__(self, key):
        return 0
    def __contains__(self, key):
        return True

event_acc = EventAccumulator('path/to/your/tfevents', size_guidance=FalseDict())
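After that, the accumulator is used exactly as in the earlier answers (the 'Accuracy' tag is just an example):
event_acc.Reload()
scalars = event_acc.Scalars('Accuracy')  # now contains every point, not a 10000-point sample
print(len(scalars))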
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
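A minimal sketch of that API, following the page linked above (the experiment ID is a placeholder returned when you upload your logs to TensorBoard.dev):
import tensorboard as tb
experiment = tb.data.experimental.ExperimentFromDev("<EXPERIMENT_ID>")
df = experiment.get_scalars()
print(df.head())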
You can also use the EventFileLoader to iterate through a TensorBoard event file:
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow  # or tensorflow-cpu
pip install -U tbparse  # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)