improper, flat priors in pymc3 - bayesian

I am "translating" selected models from the ARM book from Stan to pymc3
(I hope to post them on GitHub soon), and I have a question about "improper priors".
I understand that Stan's default is to use uniform priors on parameters. For an unbounded parameter this means uniform(-inf, +inf).
My question is: is there any way to specify such priors in pymc3?
Here is an example to illustrate the problem and what I have tried so far.
import numpy as np
import pymc3 as pm
light_speed = np.array([28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29, 22, 24, 21, 25,
                        30, 23, 29, 31, 19, 24, 20, 36, 32, 36, 28, 25, 21, 28, 29,
                        37, 25, 28, 26, 30, 32, 36, 26, 30, 22, 36, 23, 27, 27, 28,
                        27, 31, 27, 26, 33, 26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
                        29, 27, 28, 29, 16, 23])
In Stan (pystan):
import pystan as ps
# the model in pystan, as specified in the Stan ARM examples
model_string = '''
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  vector[1] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(beta[1], sigma);
}
'''
# Stan object
StanDSO = ps.StanModel(model_code=model_string)
# data
data = dict(N=len(light_speed),
            y=light_speed)
# fit and verify model results
fit_model = StanDSO.sampling(data=data, iter=5000, chains=2, thin=1)
print(fit_model)
In pymc3:
model_1 = pm.Model()
with model_1:
    # priors as specified in the Stan model
    mu = pm.Uniform('mu', lower=-np.inf, upper=np.inf)
    sigma = pm.Uniform('sigma', lower=0, upper=np.inf)
    # using vague priors works
    # mu = pm.Uniform('mu', lower = light_speed.std()/1000.0, upper= light_speed.std()*1000.0)
    # sigma = pm.Uniform('sigma', lower = light_speed.std()/1000.0, upper= light_speed.std()*1000.0)
    # define likelihood
    y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=light_speed)
    # inference: fitting the model
    # I have to use Slice because the following command
    # trace = pm.sample(5000)
    # produces the error
    # ValueError: Cannot construct a ufunc with more than 32 operands
    # (requested number were: inputs = 51 and outputs = 1)
    xstart = pm.find_MAP()
    xstep = pm.Slice()
    trace = pm.sample(5000, step=xstep, start=xstart, random_seed=123, progressbar=True)
pm.summary(trace)
I looked at the source code of glm, and it seems to use vague priors by default.
This practice is also recommended by Kruschke and used in many BUGS examples, and it works (see above).
However, the Stan reference manual (p. 52) says that they actively discourage users from using the default scale priors because they concentrate too much probability mass outside of reasonable posterior values (...) can have the profound effect of skewing posteriors.

Uniform priors are defined in Stan on the support of a parameter. So that if you declare a parameter real<lower=0> sigma; that declares sigma to have a uniform prior on positive values. It does that by log-transforming sigma to (-inf, inf) and then accounting for the Jacobian; I believe PyMC3 can do the same thing. Stan allows improper priors if the posterior is proper. Any other priors beyond the default uniform ones specified for a variable are multiplied in (added on the log scale), so they behave as expected (the default uniform distribution has no effect).
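The log transform with a Jacobian adjustment described above can be sketched in plain Python. This is only an illustration of the idea, not Stan's or PyMC3's actual implementation; the function name `log_posterior_unconstrained` and the toy data are made up for the example. With flat priors on mu and on sigma > 0, the log posterior is the log likelihood up to a constant, and sampling in theta = log(sigma) adds log|d sigma / d theta| = theta.

```python
import math

def log_posterior_unconstrained(mu, log_sigma, y):
    """Log density of (mu, log_sigma) under flat priors on mu and on sigma > 0."""
    sigma = math.exp(log_sigma)
    # Normal log likelihood; the flat priors only contribute a constant.
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi) - log_sigma - 0.5 * ((yi - mu) / sigma) ** 2
        for yi in y
    )
    # Jacobian of sigma = exp(log_sigma): log|d sigma / d log_sigma| = log_sigma.
    log_jacobian = log_sigma
    return log_lik + log_jacobian

y = [28.0, 26.0, 33.0, 24.0, 34.0]  # first few light_speed values from the question
print(log_posterior_unconstrained(mu=29.0, log_sigma=math.log(4.0), y=y))
```

In PyMC3 itself, the pm.Flat and pm.HalfFlat distributions are meant for declaring improper flat priors directly, assuming a release that includes them; as noted above, this only makes sense when the resulting posterior is proper.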
Having said that, we recommend in the Stan manual using at least weakly informative priors if not even stronger priors. Check out the papers by Gelman cited in the regression section's discussion of priors to see how priors can get skewed. And it's not just having priors that are too vague; often uniform priors on a closed interval (as in many of the BUGS examples) will have a strong effect on the posterior --- you can visualize this in terms of the truncated likelihood function. All of these can cause computational problems depending on the model's posterior geometry (and not just for HMC --- also for Gibbs or Metropolis).
The Gelman and Hill regression book examples (ARM) that we have on GitHub are not up to Stan's current modeling standards---they were just directly translated from the ARM code. Revising all of those is on our to-do list for 2016 (along with the BUGS examples). Some of the ARM examples are translated from examples in the book for R's lm() or glm() functions and some into the lme4 package's lmer() and glmer() functions, none of which accept priors. We're about to release the RStanARM package, which will accept R's regression (and lme4's multilevel regression) notation and allow either MLE (with error) or HMC estimates (with highly optimized model code on the back end).

Related

Can't visualize plotted Confusion Matrix

I am new to ML and learning the fundamentals.
I am working on the Dog-vision dataset (https://www.kaggle.com/c/dog-breed-identification) and I am trying to plot a confusion matrix, but I can't figure out what I am doing wrong.
My true labels look like this:
true_l[:10]
array([26, 96, 8, 15, 3, 10, 62, 82, 92, 16])
And my predicted labels look like this:
predicted_l[:10]
array([26, 96, 8, 15, 3, 10, 62, 82, 92, 16])
The two arrays are almost the same, but not every element matches.
Then I converted them into a pandas DataFrame with code like this:
import pandas as pd
from sklearn.metrics import confusion_matrix
classes = []
for i in range(0, 99):
    classes.append(i)
cf_matrix = confusion_matrix(true_l, predicted_l)
cf_matrix_df = pd.DataFrame(cf_matrix, index=classes,columns=classes)
cf_matrix_df
And then the output is like this:
Then I tried to plot the confusion matrix from this DataFrame,
but it is not plotted correctly. Here is the code and the output of my confusion matrix:
import matplotlib.pyplot as plt
import seaborn as sns
figure = plt.figure(figsize=(8, 8))
sns.heatmap(cf_matrix_df, annot=True,cmap=plt.cm.Blues)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
Output
If you need more info, please have a look at my notebook here.
https://colab.research.google.com/drive/1SoXJJNTnGx39uZHizAut-HuMtKhQQolk?usp=sharing
You can make the plot more readable by removing the annot=True argument, which writes the data value into every cell. Simply drop this argument to get a better visualization:
sns.heatmap(cf_matrix_df, cmap=plt.cm.Blues)
UPDATE: Increasing the figure size (the figsize argument of plt.figure) will also make the visualization clearer.

LGBMClassifier + Unbalanced data + GridSearchCV()

The dependent variable is binary, the classes are imbalanced 1:10, the dataset has 70k rows, the scoring metric is ROC AUC, and I'm trying to use LGBM + GridSearchCV to get a model. However, I'm struggling with the parameters, as sometimes they are not recognized even when I use them as the documentation shows:
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
params = {'num_leaves': [10, 12, 14, 16],
          'max_depth': [4, 5, 6, 8, 10],
          'n_estimators': [50, 60, 70, 80],
          'is_unbalance': [True]}
best_classifier = GridSearchCV(LGBMClassifier(), params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)
So:
What is the difference between putting the parameters in GridSearchCV() and in params?
As the data is unbalanced, I'm trying to use ROC AUC as the scoring metric, since it takes the imbalance into account. Should I use the argument scoring="roc_auc", or put it in the params argument?
The difference between putting the parameters in GridSearchCV() or in params is described in the GridSearchCV docs.
When you put them in params (the param_grid argument):
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries,
in which case the grids spanned by each dictionary in the list are
explored. This enables searching over any sequence of parameter
settings.
As for scoring: it is an argument of GridSearchCV itself, not a parameter of the estimator, so keep scoring="roc_auc" in the GridSearchCV() call rather than adding it to params. The keys of params must be parameters of the estimator.
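To make "the grids spanned by each dictionary" concrete, here is a rough sketch of how the params dictionary from the question expands into candidate settings. It is an illustration in plain Python, not scikit-learn's actual code, and the helper name `expand_grid` is hypothetical.

```python
from itertools import product

params = {
    "num_leaves": [10, 12, 14, 16],
    "max_depth": [4, 5, 6, 8, 10],
    "n_estimators": [50, 60, 70, 80],
    "is_unbalance": [True],
}

def expand_grid(grid):
    """Return every combination of the listed settings as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(params)
n_fits = len(candidates) * 3  # cv=3 folds per candidate, before the final refit
print(len(candidates), n_fits)
```

Each candidate dictionary is a set of keyword settings for the estimator, which is why only estimator parameters belong in the grid, while options such as cv and scoring configure the search itself.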

keras.preprocessing.text.Tokenizer equivalent in Pytorch?

Basically the title; is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find one that gives all the utilities without handcrafting things.
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
# [tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
# alternatively, use encoder.batch_encode()
It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.
PyTorch itself does not provide a function like this; you need to do it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices).
Alternatively, you can use Torchtext, which provides a basic abstraction for text processing. All you need to do is create a Field object. You can use string.split, spaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from the data. Then you just call the process method, which tokenizes the text and does the vocabulary lookup.
If you want something more complex, you might also consider AllenNLP, where tokenization and vocabulary lookup are done separately.
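The manual route mentioned above (tokenize, then look up indices in a dictionary) might look roughly like the sketch below. The helper names `fit_vocab` and `encode` are made up for this example, whitespace splitting stands in for "a tokenizer of your choice", and mapping unknown tokens to the padding id 0 is a simplification rather than what Keras does.

```python
def fit_vocab(texts, tokenize=str.split):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            # Assign the next free id the first time a token is seen.
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(text, vocab, length, tokenize=str.split, unk_id=0):
    """Look up token ids, then truncate or right-pad with 0 to a fixed length."""
    ids = [vocab.get(token, unk_id) for token in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))

texts = ["now this ain't funny", "so don't you dare laugh"]
vocab = fit_vocab(texts)
batch = [encode(t, vocab, length=10) for t in texts]
print(batch)
```

Because every row has the same length, the nested list can be handed to torch.tensor(batch) to get a single 2-D tensor.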

Tensorflow sliding window transformation over large data

I would like to feed my model with stride-1 windows from a very long data sequence (tens of millions of entries). This is similar to the aim presented in this thread, except that my data sequence may contain several features to begin with, so the final number of features is n_features * window_size. For example, with two original features and a window size of 3, this would mean transforming this:
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
to:
[[1, 2, 3, 6, 7, 8], [2, 3, 4, 7, 8, 9], [3, 4, 5, 8, 9, 10]]
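For reference, the target transformation can be written as a small pure-Python generator. This is only a sketch of the intended output, not a TensorFlow pipeline; the function name `sliding_windows` is hypothetical, and it assumes the feature-major layout of the example above (one inner list per feature).

```python
def sliding_windows(features, window_size):
    """Yield one flattened stride-1 window at a time.

    Windows are produced lazily, so the window_size-fold larger
    dataset is never materialised in memory all at once.
    """
    n_steps = len(features[0])
    for start in range(n_steps - window_size + 1):
        window = []
        for series in features:
            # Concatenate the same time slice from every feature.
            window.extend(series[start:start + window_size])
        yield window

data = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(list(sliding_windows(data, window_size=3)))
```

A reference like this is handy for checking the output of a tf.data pipeline on a tiny input before running it on the full sequence.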
I was trying to use slicing with map_fn or Dataset.map, applied to a sequence of indices (per the answer in the above-mentioned thread), as in:
ti = tf.range(data.shape[0] - window_size)
train_dataset = tf.data.Dataset.from_tensor_slices((ti, labels))
def get_window(l, label):
    wnd = tf.reshape(data_tensor[l:(l + window_size), :], (-1, window_size * n_features))
    wnd = tf.squeeze(wnd)
    return (wnd, label)
train_dataset = train_dataset.map(get_window)
train_dataset = train_dataset.batch(batch_size)
...
This is working in principle, but the training is extremely slow, with minimal GPU utilization (1-5%, probably in part because the mapping is done in the CPU).
When trying to do the same with tf.map_fn, the graph building becomes very lengthy, with tremendous memory utilization.
Another option I was trying is to transform all of the data in advance before I load it in Tensorflow. This works much faster (even when considering the pre-processing time, I wonder why - shouldn't it be the same operation as the mapping during training?) but is very inefficient in terms of memory and storage, as the data becomes window_size-fold larger. That is a deal-breaker for my large datasets.
I thought about splitting these transformed, bloated datasets into several files ("hyper-batches") and going through them in sequence for each epoch, but this seems very inefficient, and I was wondering if there is a better way to achieve this simple transformation.

Parallel h5py/hdf5 writing to large dataset skips data chunks

I am using MPI and h5py/HDF5 (HDF5 and h5py were compiled with parallel support, and everything runs on Python 3.4) on a cluster to stitch a dataset of overlapping tiles (200 or more images/numpy arrays, 2048x2048 each).
Each tile has an assigned index number that corresponds to the position where it should be written, and all the indexes are stored in an array:
Example tile_array:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
The issues I have are:
When running the code on the cluster:
- When I use multiple cores to write the stitched image in parallel, I get randomly missing tiles.
- When I run the stitching serially on rank==0, I still have missing tiles.
- If the dataset is small, there are no issues.
- If I do not activate atomic mode in the parallel writing, the number of errors increases. The number of errors also increases if atomic mode is activated when I run the writing on rank==0.
When running the code on a laptop:
- When I run the same code on a single core on a laptop, stripped of the MPI code, no tiles are missing.
Any comment will be greatly appreciated.
Highlight of the procedure:
0) Before stitching, each image of the set is processed and the overlapping edges are blended in order to get a better result in the overlapping regions
1) I open the hdf file where I want to save the data: stitching_file=h5py.File(fpath,'a',driver='mpio',comm=MPI.COMM_WORLD)
2) I create a big dataset that will contain the stitched images
3) I fill the dataset with zeros
4) The processed images will be written (added) to the big dataset by different cores. However, adjacent images have overlapping regions. To prevent different cores from writing overlapping tiles at the same time and breaking the write, I run four rounds of writing that together cover the whole dataset, with this kind of pattern:
Example tile_array:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
Writing rounds:
writing round one: array([ 0, 2, 8, 10], dtype=int32)
writing round two: array([ 1, 3, 9, 11], dtype=int32)
writing round three: array([ 4, 6, 12, 14], dtype=int32)
writing round four: array([ 5, 7, 13, 15], dtype=int32)
If I write on a single rank, I don't use this scheme and just write each position sequentially.
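The four-round scheme can be checked on its own, without MPI or HDF5. The sketch below is an illustration in plain Python with hypothetical helper names (`writing_rounds`, `has_adjacent_tiles`); it assumes row-major tile indices and that a tile only overlaps its immediate neighbours, including diagonal ones.

```python
def writing_rounds(n_rows, n_cols):
    """Split row-major tile indices into four rounds by row/column parity."""
    rounds = []
    for row_start in (0, 1):
        for col_start in (0, 1):
            rounds.append([
                r * n_cols + c
                for r in range(row_start, n_rows, 2)
                for c in range(col_start, n_cols, 2)
            ])
    return rounds

def has_adjacent_tiles(indices, n_cols):
    """True if any two tiles in the round touch (including diagonally)."""
    cells = [divmod(i, n_cols) for i in indices]
    return any(
        abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1
        for k, (r1, c1) in enumerate(cells)
        for (r2, c2) in cells[k + 1:]
    )

rounds = writing_rounds(4, 4)
print(rounds)
print([has_adjacent_tiles(r, 4) for r in rounds])
```

If a round ever contained two touching tiles, two ranks could write to the same overlapping region at the same time, so a check like this helps rule out the grouping itself as the source of the missing tiles.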
Pseudocode
# Preprocessing of the images
------------------------------
------------------------------
# MPI data chunking function
def tasks_chunking(tasks,size):
    # This function scatter any kind of list, not only ordered ones
    # If the number of cores is bigger than the length of the list
    # the size of the chunk will be zero and the
    # Chunk the list of tasks
    Chunked_list=np.array_split(tasks,size)
    NumberChunks=np.zeros(len(Chunked_list),dtype='int32')
    for idx,chunk in enumerate(Chunked_list):
        NumberChunks[idx]=len(chunk)
    # Calculate the displacement
    Displacement=np.zeros(len(NumberChunks),dtype='int32')
    for idx,count in enumerate(NumberChunks[0:-1]):
        Displacement[idx+1]=Displacement[idx]+count
    return Chunked_list,NumberChunks,Displacement
# Flush the file after the preprocessing
stitching_file.flush()
# Writing dictionary. Each group will contain a subset of the tiles to be written.
# Data is the size of the final array where to save the data
fake_tile_array = np.arange(Data.size,dtype='int32')
fake_tile_array = fake_tile_array.reshape(Data.shape)
writing_dict = {}
writing_dict['even_group1']=fake_tile_array[::2, ::2].flat[:]
writing_dict['even_group2']=fake_tile_array[::2, 1::2].flat[:]
writing_dict['odd_group1'] =fake_tile_array[1::2, ::2].flat[:]
writing_dict['odd_group2'] =fake_tile_array[1::2, 1::2].flat[:]
# Without activating the atomic mode the number of errors in parallel mode is higher.
stitching_file.atomic=True
# Loop through the dictionary items to start the writing
for key, item in writing_dict.items():
    # Chunk the positions that need to be written
    if rank==0:
        Chunked_list,NumberChunks,Displacement=tasks_chunking(item,size)
    else:
        NumberChunks = None
        Displacement = None
        Chunked_list= None
    # Make the cores aware of the number of jobs that will need to run
    # The variable cnt is filled by the Scatter call with the number of
    # tasks for this core and is different in every core
    cnt = np.zeros(1, dtype='int32')
    comm.Scatter(NumberChunks, cnt, root=0)
    # Define the local variable that will be filled up with the scattered data
    xlocal = np.zeros(cnt, dtype='int32') # Use cnt to determine the size of xlocal on the different cores
    # Scatter the value of tasks to the different cores
    comm.Scatterv([item,NumberChunks,Displacement,MPI.INT],xlocal, root=0)
    for tile_ind in xlocal:
        # This writing function is the same called when I run the writing on rank 0 or on a single core in a laptop.
        paste_in_final_image(joining, temp_file, stitched_group, tile_ind, nr_pixels)
    comm.Barrier()
stitching_file.flush()
stitching_file.close()