How can I get reproducible results in a Jupyter Notebook (Python3)?
Setting a seed for the main random generators does not seem to be enough; see the MWE below:
import numpy as np
import random
import os
random.seed(0)
np.random.seed(0)
os.environ['PYTHONHASHSEED']=str(0)
import networkx
from networkx.algorithms.mis import maximal_independent_set
G = networkx.Graph()
G.add_edges_from([ ('A', 'B'), ('B', 'C'), ('C', 'D'), ('D','E'), ('E','F'), ('A','E'), ('B','E') ])
for i in range(0, 10):
    print(maximal_independent_set(G, seed=0))
This gives the same result in each iteration of the loop. However, after restarting the kernel and running the cells again, the result changes to another subset.
The short answer to your question: with the current implementation it is not reproducible across kernel restarts.
Long Answer
You need to use OrderedGraph and re-implement the function maximal_independent_set with an OrderedSet (see Does Python have an ordered set?) to ensure the reproducibility you are looking for. As an alternative to OrderedSets, you can make the casts from sets to lists deterministic by sorting, see here.
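For illustration, here is a minimal sketch of the sorting variant. It is a hypothetical re-implementation of the greedy algorithm (it does not handle the optional nodes argument of the networkx function); every set is sorted before the random choice, so the result depends only on the seed and not on the per-process hash randomization:
import random
import networkx as nx

def maximal_independent_set_deterministic(G, seed=0):
    rng = random.Random(seed)
    # Start from a node chosen from a sorted (hence stable) node list.
    indep_nodes = [rng.choice(sorted(G.nodes()))]
    neighbors = set(G.adj[indep_nodes[0]])
    available_nodes = set(G.nodes()) - neighbors - set(indep_nodes)
    while available_nodes:
        # Sorting before the cast removes the dependence on set ordering.
        node = rng.choice(sorted(available_nodes))
        indep_nodes.append(node)
        available_nodes -= set(G.adj[node]) | {node}
    return indep_nodes

print(maximal_independent_set_deterministic(G, seed=0))  # same list on every run and after kernel restarts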
Debugging - based on the implementation
What is the cause of the problem? First observation: it's not the random generator. It is correctly instantiated and has the same state over multiple runs (at the beginning as well as after the end; check with seed.getstate()). The problematic calls within maximal_independent_set are the set operations. For your MWE, for example:
neighbors = set.union(*[set(G.adj[v]) for v in nodes])
# next line added for simple debugging:
print(nodes, neighbors, G.adj["D"])
# Output 1: ['D', 'A', 'F']
# {'D'} {'E', 'C'} OrderedDict([('C', OrderedDict()), ('E', OrderedDict())])
# Output 2: ['D', 'F', 'B']
# {'D'} {'C', 'E'} OrderedDict([('C', OrderedDict()), ('E', OrderedDict())])
As you can see, multiple runs produce sets with different orderings. The same holds for the other sets, in particular available_nodes. Hence, fixing the selected list index by fixing the random number generator does not ensure reproducibility, since the ordering of the elements within the list can differ.
I have a DataFrame from which I wanted to randomly select 20% of the data to use as test data. However, I need to remove said data from my original set to use as training data.
I have a list of the indexes the random sample is made up from (indexes of the original DF). When I use a for loop and the .pop() function, the indexes shift, so the elements removed after the first iteration are not the ones in my test data frame. I need help removing the data from the first data frame, but no function will take a list of indexes as an argument. What can I do about this? Is there a way to subtract one data frame from another?
Regarding your question,
Is there a way to subtract one data frame from another?
You can simply drop the indexes belonging to Test from the primary DataFrame to get your Train. Try this -
train = df.drop(test.index, axis=0)
#Where df is the main dataset from which test data has been sampled.
#train, test, df are all pd.DataFrames
However, if you are preparing data for a machine learning problem, I would recommend some better methods, as discussed in the next part of my answer.
1. Using Sklearn API (Recommended)
You could try using the sklearn.model_selection.train_test_split API, which saves you a lot of time doing such train/test splits.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(np.random.random((100, 10)))
train, test = train_test_split(df, test_size=0.2)
train.shape, test.shape
((80, 10), (20, 10))
2. Using pandas methods
Another way is to sample 20% data from df and then filter the rest for train.
test = df.sample(frac=0.2)
train = df.loc[~df.index.isin(test.index)]
train.shape, test.shape
((80, 10), (20, 10))
3. Starting with a list of indexes
Let's say you already have a list of indexes (test_idx), as you mention in your question. In that case, you can still work with pandas methods to do this without any for loops or pop()
test_idx = np.random.choice(range(100), 20, replace=False) #approx 20% random indexes
test = df.loc[df.index.isin(test_idx)]
train = df.loc[~df.index.isin(test_idx)]
train.shape, test.shape
((80, 10), (20, 10))
There are a couple of solutions to this problem. You could...
Iterate in reverse
Create another array to store the values
Use list comprehension
An example of the third method is as follows.
Let's say that you want to remove all 2's from an array:
data = [1, 2, 3, 2, 2, 1]
new_data = [n for n in data if n != 2]
# new_data = [1, 3, 1]
In my past experience this is always the method I use when cleaning/reconstructing arrays.
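For completeness, a minimal sketch of the first approach (iterating in reverse, so that removing an element never shifts the indexes of the elements not yet visited):
data = [1, 2, 3, 2, 2, 1]
for i in range(len(data) - 1, -1, -1):  # walk from the last index down to 0
    if data[i] == 2:
        data.pop(i)
# data is now [1, 3, 1]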
I am currently working on a project where I have to deal with Bayesian Networks and given the graphical nature of these probabilistic models, it is very essential to visualize them as a graph. I am using pgmpy for my project.
The model I am dealing with has a large number of variables, often with long names as data identifiers. I was therefore contemplating illustrating the graph with a legend, with each node carrying a 'code' or a 'number' mapping to a data identifier (perhaps a dict could be used).
The edges I have are in the following format:
[('A','B'), ('B', 'C'), ('C','A')]
In other words, an array of 2-tuples of strings.
It would be great if someone could help me in solving this particular issue.
pgmpy models (at least BayesianNetwork) inherit from nx.DiGraph and can be visualized using nx.draw, which takes a Matplotlib Axes object as an optional parameter.
Therefore, one can create an axes object, manually add a legend, hide the handles, relabel the nodes and draw the model.
Here is an example using a dict (as suggested) for identifier mapping:
import networkx as nx
from pgmpy.models import BayesianNetwork
import matplotlib.pyplot as plt
from matplotlib import patches
identifier_mapping = {'long_identifier_for_A': 'A',
                      'long_identifier_for_B': 'B',
                      'long_identifier_for_C': 'C'}

model = BayesianNetwork([('long_identifier_for_A', 'long_identifier_for_B'),
                         ('long_identifier_for_C', 'long_identifier_for_B')])
# add identifier mappings to legend
ax = plt.subplot()
handles = [patches.Patch(label=f'{identifier_mapping[node]}: {node}') for node in model.nodes()]
ax.legend(handles=handles, handlelength=0, handletextpad=0, fancybox=True)
# draw model
nx_graph = nx.relabel_nodes(model, identifier_mapping)
nx.draw(nx_graph, ax=ax, with_labels=True, pos=nx.random_layout(nx_graph))
plt.show()
PS:
To avoid the relabeling step one can directly create the model with short identifiers and store a mapping to the long ones (see the sketch below).
The edges from the question ([('A','B'), ('B', 'C'), ('C','A')]) form a cycle (Bayesian networks are acyclic).
In case nx.draw raises a StopIteration exception, check out this question.
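Regarding the first PS item, a minimal sketch reusing the hypothetical identifiers from the example above:
# Build the model with short identifiers from the start; keep the mapping
# only for the legend, so no relabeling is needed before drawing.
short_to_long = {'A': 'long_identifier_for_A',
                 'B': 'long_identifier_for_B',
                 'C': 'long_identifier_for_C'}
model = BayesianNetwork([('A', 'B'), ('C', 'B')])
handles = [patches.Patch(label=f'{short}: {long}')
           for short, long in short_to_long.items()]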
The matplotlib documentation for scatter() states:
In addition to the above described arguments, this function can take a data keyword argument. If such a data argument is given, the following arguments are replaced by data[<arg>]:
All arguments with the following names: ‘s’, ‘color’, ‘y’, ‘c’, ‘linewidths’, ‘facecolor’, ‘facecolors’, ‘x’, ‘edgecolors’.
However, I cannot figure out how to get this to work.
The minimal example
import matplotlib.pyplot as plt
import numpy as np
data = np.random.random(size=(3, 2))
props = {'c': ['r', 'g', 'b'],
         's': [50, 100, 20],
         'edgecolor': ['b', 'g', 'r']}
plt.scatter(data[:, 0], data[:, 1], data=props)
plt.show()
produces a plot with the default colors and sizes instead of the supplied ones.
Has anyone used that functionality?
This seems to be an overlooked feature added about two years ago. The release notes have a short example (https://matplotlib.org/users/prev_whats_new/whats_new_1.5.html#working-with-labeled-data-like-pandas-dataframes). Besides this question and a short blog post (https://tomaugspurger.github.io/modern-6-visualization.html), that's all I could find.
Basically, any dict-like object ("labeled data" as the docs call it) is passed in the data argument, and plot parameters are specified based on its keys. For example, you can create a structured array with fields a, b, and c
coords = np.random.randn(250, 3).view(dtype=[('a', float), ('b', float), ('c', float)])
You would normally create a plot of a vs b using
pyplot.plot(coords['a'], coords['b'], 'x')
but using the data argument it can be done with
pyplot.plot('a', 'b', 'x', data=coords)
The label b can be confused with a style string setting the line to blue, but the third argument clears up that ambiguity. It's not limited to x and y data either,
pyplot.scatter(x='a', y='b', c='c', data=coords)
will set the point color based on column 'c'.
It looks like this feature was added for pandas dataframes, and it handles them better than other objects. Additionally, it seems to be poorly documented and somewhat unstable (using x and y keyword arguments fails with the plot command but works fine with scatter, and the error messages are not helpful). That being said, it gives a nice shorthand when the data you want to plot has labels.
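For instance, here is a sketch along those lines using a plain dict as the labeled data; the keys x, y, c, s and edgecolors are simply the argument names from the docstring quoted in the question:
import numpy as np
import matplotlib.pyplot as plt

points = np.random.random(size=(3, 2))
labeled = {'x': points[:, 0],
           'y': points[:, 1],
           'c': ['r', 'g', 'b'],
           's': [50, 100, 20],
           'edgecolors': ['b', 'g', 'r']}
# Every string argument below is looked up in the data dict.
plt.scatter('x', 'y', c='c', s='s', edgecolors='edgecolors', data=labeled)
plt.show()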
In reference to your example, I think the following does what you want:
plt.scatter(data[:, 0], data[:, 1], **props)
That bit in the docs is confusing to me, and looking at the sources, scatter in axes/_axes.py seems to do nothing with this data argument. Remaining kwargs end up as arguments to a PathCollection, maybe there is a bug there.
You could also set these parameters after scatter with the various set methods of PathCollection, e.g.:
pc = plt.scatter(data[:, 0], data[:, 1])
pc.set_sizes([500,100,200])
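The other entries of props can be set the same way (assuming the same pc as above):
pc.set_facecolor(['r', 'g', 'b'])
pc.set_edgecolor(['b', 'g', 'r'])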
While the new Categorical Series support since pandas 0.15.0 is fantastic, I'm a bit annoyed with how they decided to make the underlying data inaccessible except through underscored variables. Consider the following code:
import numpy as np
import pandas as pd
x = np.empty(3, dtype=np.int64)
s = pd.DatetimeIndex(x, tz='UTC')
x
Out[17]: array([140556737562568, 55872352, 32])
s[0]
Out[18]: Timestamp('1970-01-02 15:02:36.737562568+0000', tz='UTC')
x[0] = 0
s[0]
Out[20]: Timestamp('1970-01-01 00:00:00+0000', tz='UTC')
y = s.values
y[0] = 5
x[0]
Out[23]: 5
s[0]
Out[24]: Timestamp('1970-01-01 00:00:00.000000005+0000', tz='UTC')
We can see that both in construction and when asked for the underlying values, no deep copies are made in this DatetimeIndex with regard to its underlying data. Not only is this potentially useful in terms of efficiency, but it's great if you are using a DataFrame as a buffer. You can easily get the numpy primitive containing the underlying data, and from there a pointer to the raw data, which some low-level C routine can use to copy into from some block of memory.
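For illustration, one way to get such a pointer from the shared buffer (a sketch; the receiving C routine is assumed to take an address and a byte count):
address = y.ctypes.data  # address of the first element of y (= s.values), as a Python int
nbytes = y.nbytes        # size of the shared buffer in bytes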
Now let's look at the behavior of the new Categorical Series. The underlying data, of course, is not the levels but the codes.
x2 = np.zeros(3, dtype=np.int64)
s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
s2
Out[27]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
x2[0] = 1
s2[0]
Out[29]: 'hello'
y2 = s2.codes
y2[0] = 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-0366d645c98d> in <module>()
----> 1 y2[0] = 1
ValueError: assignment destination is read-only
y2 = s2._codes
y2[0] = 1
s2[0]
Out[34]: 'bye'
The net effect of this behavior is that as a developer, efficient manipulation of the underlying data for Categoricals is not part of the interface. Also as a user, the from_codes constructor is slow as it deep copies the codes, which may often be unnecessary. There should at least be an option for this.
But the fact that codes is a read only variable and _codes needs to be used strikes me as worse. Why wouldn't .codes give the same behavior as .values? Is there some justification for this beyond the concept that the codes are "private"? I'm hoping some of the pandas gurus on stackoverflow can shed some light on this.
The Categorical type is different from almost all other types in that it is a compound type with a certain guarantee among its data, namely that the codes provide a factorization of the levels.
So the argument against mutability is that it would be easy to break the codes-categories mapping, and it could be non-performant. Of course these could possibly be mitigated with checking on the setitem instead (but with some added code complexity).
The vast majority of users are not going to manipulate the codes/categories directly (and only use the exposed methods), so this is really a protection against accidentally breaking these guarantees.
If you need to efficiently manipulate the underlying data, best/easiest is simply to pull out the codes/categories. Mutate them, then create a new Categorical (which is cheap if codes/categories are already provided).
e.g.
In [3]: s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
In [4]: s2
Out[4]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
In [5]: s2.codes
Out[5]: array([0, 0, 0], dtype=int8)
In [6]: pd.Categorical(s2.codes+1,s2.categories,fastpath=True)
Out[6]:
[bye, bye, bye]
Categories (2, object): [hello, bye]
Of course this is quite dangerous: if you added 2 instead, the expression would blow up because the resulting codes would be out of range. Manipulating the codes directly is simply buyer-beware.
I got the following error while using the NumPy argmax method. Could someone help me understand what happened?
import numpy as np
b = np.zeros(1, dtype={'names':['a','b'], 'formats': ['i4']*2})
b.argmax()
The error is
TypeError: expected a readable buffer object
While the following runs without a problem:
a = np.zeros(3)
a.argmax()
It seems the error is due to the structured array, but could anyone help explain the reason?
Your b is:
array([(0, 0)], dtype=[('a', '<i4'), ('b', '<i4')])
I get a different error message with argmax:
TypeError: Cannot cast array data from dtype([('a', '<i4'), ('b', '<i4')]) to dtype('V8') according to the rule 'safe'
But this works:
In [88]: b['a'].argmax()
Out[88]: 0
Generally you can't do math operations across the fields of a structured array. You can operate within each field (if it is numeric). Since the fields could be a mix of numbers, strings and other objects, there's been no effort to handle special cases where such operations might make sense.
If you really must do operations across the fields, try a different view, e.g.:
In [94]: b.view('<i4').argmax()
Out[94]: 0
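As a hedged alternative (requires NumPy 1.16 or newer), the structured array can first be converted to a plain 2-D array and the argmax taken over that:
from numpy.lib.recfunctions import structured_to_unstructured

structured_to_unstructured(b).argmax()  # argmax over all fields at once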