Indexing matrix in numpy using Node id - numpy

Is there a way to index a numpy matrix, built via networkx as an adjacency matrix, using node names?
(I built the networkx graph by parsing lines from a .txt file.
Each line represents an edge and is in the form SourceNode:DestNode:EdgeWeight.)
I need the matrix because I'm going to calculate the hitting probabilities of some nodes.

Regardless of how you constructed your graph, you can compute an adjacency matrix of it. The docs state that the order of the rows and columns in this matrix will be "as produced by G.nodes()" if you don't specify it.
For example,
import networkx as nx
# create your graph
G = nx.DiGraph()
with open("spec.txt") as f:
    for line in f:
        # each line looks like SourceNode:DestNode:EdgeWeight
        src, dest, weight = line.strip().split(':')
        G.add_edge(src, dest, weight=float(weight))
# create adjacency matrix
# - store the node order now, in case the graph is changed.
nodelist = list(G.nodes())
# extract matrix, and convert to dense representation
A = nx.adjacency_matrix(G, nodelist=nodelist).todense()
# normalise each row by its total outgoing edge weight, or whatever
B = A / A.sum(axis=1).reshape(-1, 1).astype(float)
Let us presume that your nodes are labelled alphabetically, C-G. The node ordering is just according to the dictionary hash, and for me it gave this sequence: ['C', 'E', 'D', 'G', 'F'].
If you want to look up information from the matrix, you could use a lookup like this:
ix = nodelist.index('D')  # ix is 2 here
print(A[ix, :])
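If you do many lookups, nodelist.index() scans the list every time. A minimal sketch of a dictionary-based lookup follows, assuming the node order from the example above. The matrix values and the helper name edge_weight are made up purely for illustration.

```python
import numpy as np

# Node order captured when the matrix was built (taken from the example above).
nodelist = ['C', 'E', 'D', 'G', 'F']

# Hypothetical 5x5 weight matrix standing in for the real adjacency matrix.
A = np.arange(25, dtype=float).reshape(5, 5)

# Build the name -> position mapping once.
node_index = {name: i for i, name in enumerate(nodelist)}

def edge_weight(src, dest):
    """Return the matrix entry for the edge src -> dest, looked up by node name."""
    return A[node_index[src], node_index[dest]]

row_D = A[node_index['D'], :]   # whole row for node 'D'
w_DE = edge_weight('D', 'E')    # single entry: row 'D', column 'E'
```

The mapping is only valid for the nodelist that was passed to adjacency_matrix, so it should be rebuilt whenever the matrix is.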

How can I find the optimal number of topics in LDA with scikit-learn?

I'm computing topic models through scikit-learn with this script (I'm starting with a dataset "df" which has one document per row in the column "Text")
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
#Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)), token_pattern=r'\w+|\$[\d\.]+|\S+')
# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()
# tf_feature_names tells us what word each column in the matrix represents
# (get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases)
tf_feature_names = vectorizer.get_feature_names_out()
number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'm interested in comparing models with different numbers of topics (roughly 2 to 20) through a coherence measure. How can I do it?
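As far as I know, LatentDirichletAllocation does not ship a coherence score, so one option is to compute one by hand. Below is a minimal sketch of the UMass coherence, computed from a topic-word weight matrix and a document-term count matrix. The function name umass_coherence and the toy data are assumptions for illustration, and the sketch assumes every top word occurs in at least one document.

```python
import numpy as np

def umass_coherence(topic_word, doc_term, top_n=10):
    """UMass coherence per topic (closer to 0 means more coherent).

    topic_word: array (n_topics, n_vocab) of topic-word weights.
    doc_term:   array (n_docs, n_vocab) of word counts.
    """
    present = (np.asarray(doc_term) > 0).astype(float)  # word occurs in doc?
    scores = []
    for weights in np.asarray(topic_word):
        top = np.argsort(weights)[::-1][:top_n]  # most probable words first
        sub = present[:, top]
        co = sub.T @ sub                          # co[i, j]: docs containing both words
        df = np.diag(co)                          # df[i]: docs containing word i
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                score += np.log((co[m, l] + 1.0) / df[l])
        scores.append(score)
    return np.array(scores)

# Toy data: 3 documents, 3 vocabulary words, 1 topic.
toy_doc_term = np.array([[1, 1, 0],
                         [1, 0, 1],
                         [1, 1, 0]])
toy_topic_word = np.array([[0.5, 0.3, 0.2]])
toy_scores = umass_coherence(toy_topic_word, toy_doc_term, top_n=3)
```

With the script above, this could be called as umass_coherence(model.components_, tf) inside a loop that refits the model for each candidate number of topics, keeping the value with the highest mean coherence.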

Tensorflow vocabularyprocessor

I am following the wildml blog on text classification using tensorflow. I am not able to understand the purpose of max_document_length in the statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract vocabulary from vocabularyprocessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the vocabularyprocessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.
    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.
    Args:
      raw_documents: An iterable which yield either str or unicode.
    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length).
Each row in raw_documents variable will be mapped to a vector of length max_document_length.
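To make the trimming and padding concrete without needing TensorFlow, here is a small numpy-only sketch that mirrors the loop in transform(). The helper name pad_or_trim and the example ids are hypothetical. It takes ids that have already been looked up rather than raw tokens.

```python
import numpy as np

def pad_or_trim(token_ids, max_document_length):
    """Return a fixed-length int64 vector: trimmed if too long, zero-padded if too short."""
    word_ids = np.zeros(max_document_length, np.int64)
    for idx, token_id in enumerate(token_ids):
        if idx >= max_document_length:
            break  # ignore everything past the maximum length
        word_ids[idx] = token_id
    return word_ids

short_doc = pad_or_trim([1, 2, 3], 5)              # padded with zeros at the end
long_doc = pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5)   # cut after five ids
```

Whatever the input length, the output vector always has max_document_length entries, which is what lets the documents be stacked into one matrix.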

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN Tutorial on the TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorials uses a queue to fetch the records from the file list provided. I tried to create my own such binary file by reshaping my images into 1-D arrays and attaching a label value at the front of each. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
Each row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read it back using
tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as given in the cifar10_input.py to read the binary file and putting them into the record object.
Now when I read from the files the labels and the image values are different. I can understand the reason for this to be that pickle dumps the extra information of braces and brackets also in the binary file and they change the fixed length record size.
The above example uses the filenames and pass it to a queue to fetch the files and then the queue to read a single record from the file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()
    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.
    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)
    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)
    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])
    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not tf.gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)
    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
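Before wiring the file into TensorFlow, it can help to check the record layout with numpy alone. The sketch below writes a small array with tofile(), confirms the file holds exactly one fixed-length record per row, and splits a record the same way read_my_data() does. The dimensions are hypothetical small values rather than the 22501-byte records from the question, and the depth-major pixel order is an assumption carried over from the CIFAR-10 format.

```python
import os
import tempfile
import numpy as np

# Hypothetical small dimensions; the question's records are 1 + 22500 bytes.
HEIGHT, WIDTH, DEPTH = 4, 5, 3
IMAGE_BYTES = HEIGHT * WIDTH * DEPTH
RECORD_BYTES = 1 + IMAGE_BYTES  # one label byte followed by the image bytes
NUM_RECORDS = 6

# Random stand-in data: each row is [label, pixel, pixel, ...].
rng = np.random.RandomState(0)
records = rng.randint(0, 256, size=(NUM_RECORDS, RECORD_BYTES)).astype(np.uint8)

# tofile() writes the raw bytes in row-major order with no header.
path = os.path.join(tempfile.mkdtemp(), "images.bin")
records.tofile(path)
file_size = os.path.getsize(path)

# Read the bytes back and cut them into fixed-length records.
restored = np.fromfile(path, dtype=np.uint8).reshape(-1, RECORD_BYTES)
labels = restored[:, 0].astype(np.int32)
# Assumes each image was flattened as [depth, height, width].
images = restored[:, 1:].reshape(-1, DEPTH, HEIGHT, WIDTH).transpose(0, 2, 3, 1)
```

If the file size is not an exact multiple of the record length, the fixed-length reader will return misaligned labels and pixels, which matches the symptom described for the pickled file.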
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
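# capacity is a required first argument; here the queue is sized to hold the whole array.
q = tf.FIFOQueue(capacity=images_and_labels_array.shape[0],
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])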
...then call sess.run(enqueue_op) to populate the queue.
Another, more efficient, approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead
of the filenames to some reader and it can fetch records one by one
from that array instead of the files.
tf.py_func, that wraps a python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)

Numpy: find mean coordinate of points along line

I have a bunch of points in a 2D space which all reside on a line (polygon). How can I compute the mean coordinate of these points on the line?
I don't mean the centroid of the points in the 2D space (as @rth initially proposed in his answer), but the mean location of the points along the line on which they reside. So basically, I could transform the line to a 1D axis, compute the mean location in 1D, and transform the location of the mean back into the 2D space.
Maybe these are exactly the necessary steps, but I think (or hope) that there is a function in numpy/scipy which allows me to do this in one step.
Edit: The approach you describe in the question is indeed probably the simplest way for solving this problem.
Here is an implementation that calculates the positions of the vertices along the line in 1D, takes their mean, and finally calculates the corresponding 2D position with parametric interpolation:
import numpy as np
from scipy.interpolate import splprep, splev
vert = np.random.randn(1000, 2) # vertices definition here
# calculate the Euclidean distances between consecutive vertices
# equivalent to a for loop with
# dl[i] = ((vert[i+1, 0] - vert[i, 0])**2 + (vert[i+1,1] - vert[i,1])**2)**0.5
dl = (np.diff(vert, axis=0)**2).sum(axis=1)**0.5
# pad with 0, so dl.shape[0] == vert.shape[0] for convenience
dl = np.insert(dl, 0, 0.0)
l = np.cumsum(dl) # 1D coordinates along the line
l_mean = np.mean(l) # mean in the line coordinates
# calculate the coordinate of l_mean in 2D space
# with parametric B-spline interpolation
tck, _ = splprep(x=vert.T, u=l, k=3)
res = splev(l_mean, tck)
print(res)
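If the B-spline fit is more than you need, the same idea works with plain linear interpolation along the polyline. The sketch below swaps splprep/splev for np.interp, so the result lies exactly on the straight segments between vertices. The function name mean_point_along_polyline is hypothetical, and it assumes the vertices are given in path order.

```python
import numpy as np

def mean_point_along_polyline(vert):
    """Mean position of the vertices measured along the polyline, returned as (x, y)."""
    vert = np.asarray(vert, dtype=float)
    # Segment lengths between consecutive vertices.
    seg = np.sqrt((np.diff(vert, axis=0) ** 2).sum(axis=1))
    # 1D coordinate of every vertex along the line, starting at 0.
    l = np.concatenate(([0.0], np.cumsum(seg)))
    l_mean = l.mean()
    # Map the mean 1D coordinate back to 2D, one axis at a time.
    x = np.interp(l_mean, l, vert[:, 0])
    y = np.interp(l_mean, l, vert[:, 1])
    return np.array([x, y])

# L-shaped path: the vertices sit at distances 0, 1 and 2 along the line.
corner = mean_point_along_polyline([(0, 0), (1, 0), (1, 1)])
```

For the L-shaped path the mean distance is 1, which is the corner vertex, whereas the plain 2D centroid of the three points would fall off the line.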
Edit2: Assuming now that you have a high-resolution set of points for your path, vert_full, and some approximate measurements vert_1, vert_2, etc., you could do the following.
Project each point of vert_1, etc. onto the exact path. Assuming that vert_full has many more data points than vert_1, we can simply look for the nearest neighbours of vert_1 in vert_full:
from scipy.spatial import cKDTree
tr = cKDTree(vert_full)
d, idx = tr.query(vert_1, k=1)
vert_1_proj = vert_full[idx] # this gives the projected coordinates onto vert_full
# I have not actually run this, so it might require minor changes
Use the above mean calculation with the new vert_1_proj vector.
Meanwhile I've found the answer to my question, although using Shapely instead of Numpy.
import numpy as np
from shapely.geometry import LineString, Point
# lists of points as (x,y) tuples
path_xy = [...]
points_xy = [...] # should be on or near path
path = LineString(path_xy) # create path object
pts = [Point(p) for p in points_xy] # create point objects
dist = [path.project(p) for p in pts] # distances along path
mean_dist = np.mean(dist) # mean distance along path
mean = path.interpolate(mean_dist) # mean point
mean_xy = (mean.x,mean.y)
This works perfectly!
(That is also why I have to accept it as the answer, though I highly appreciate @rth's help!)

How to get a subarray in numpy

I have a 3D array and I want to get a sub-array of size (2n+1) centered around an index indx. Using slices I can use
y[slice(indx[0]-n,indx[0]+n+1),slice(indx[1]-n,indx[1]+n+1),slice(indx[2]-n,indx[2]+n+1)]
which will only get uglier if I want a different size for each dimension. Is there a nicer way to do this?
You don't need to use the slice constructor unless you want to store the slice object for later use. Instead, you can simply do:
y[indx[0]-n:indx[0]+n+1, indx[1]-n:indx[1]+n+1, indx[2]-n:indx[2]+n+1]
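If you want to do this without specifying each index separately, you can build the slices in a comprehension and pass them as a tuple (newer numpy versions no longer accept a plain list of slices as an index):
y[tuple(slice(i-n, i+n+1) for i in indx)]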
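The same tuple-of-slices idea extends to a different window size per dimension. Here is a minimal sketch with a hypothetical helper named window. The clipping at the array edges is an addition of mine, not part of the answer above: without it, a negative start index would wrap around and silently return the wrong region.

```python
import numpy as np

def window(arr, center, half_widths):
    """Return the sub-array of size (2*h + 1) per axis around center, clipped at the edges."""
    slices = tuple(
        slice(max(c - h, 0), min(c + h + 1, size))
        for c, h, size in zip(center, half_widths, arr.shape)
    )
    return arr[slices]

y = np.arange(17 * 18 * 16).reshape(17, 18, 16)
sub = window(y, (5, 10, 8), (4, 3, 2))    # full window, different size per axis
edge = window(y, (0, 0, 0), (1, 1, 1))    # window clipped at the array corner
```

Because this uses basic slicing, the result is a view into y rather than a copy, so writing into sub also changes y.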
You can create numpy arrays for indexing into the different dimensions of the 3D array and then use the ix_ function to create an indexing map and thus get the sliced output. The benefit of ix_ is that it allows for broadcasted indexing maps. More info on this can be found here. You can then specify different window sizes for each dimension for a generic solution. Here's the implementation with sample input data -
import numpy as np
A = np.random.randint(0,9,(17,18,16)) # Input array
indx = np.array([5,10,8]) # Pivot indices for each dim
N = [4,3,2] # Window sizes
# Arrays of start & stop indices
start = indx - N
stop = indx + N + 1
# Create indexing arrays for each dimension
xc = np.arange(start[0],stop[0])
yc = np.arange(start[1],stop[1])
zc = np.arange(start[2],stop[2])
# Create mesh from multiple arrays for use as indexing map
# and thus get desired sliced output
Aout = A[np.ix_(xc,yc,zc)]
Thus, for the given data with window sizes array, N = [4,3,2], the whos info shows -
In [318]: whos
Variable   Type       Data/Info
-------------------------------
A          ndarray    17x18x16: 4896 elems, type `int32`, 19584 bytes
Aout       ndarray    9x7x5: 315 elems, type `int32`, 1260 bytes
The whos info for the output, Aout, is consistent with the intended output shape, which must be 2N+1 along each dimension.