Word2Vec Tensorflow tutorial weird output - tensorflow

I'm trying out the Word2Vec tutorial at tensorflow (see here: https://www.tensorflow.org/tutorials/text/word2vec)
While everything seems to run fine, the output is somewhat unexpected to me, especially the small cluster in the PCA projection. The 'closest' words in the embedding space also don't make much sense, especially compared to other examples.
Am I doing something (trivially) wrong? Or is this expected?
For completeness: I ran this in the nvidia-docker image, but I also got similar results running on CPU only.
Here is the projected embedding showing the cluster.

There can be various reasons.
One reason is the so-called hubness problem of embedding spaces, which is an artifact of the high-dimensional space. Some words end up close to a large part of the space and act as hubs in the nearest-neighbor search, so through these words you can quickly get from everywhere to everywhere.
Another reason might be that the model is simply undertrained for this particular word. Word embeddings are typically trained on very large datasets, so that every word appears in sufficiently many contexts. If a word does not appear frequently enough, or only in very ambiguous contexts, it also ends up being similar to basically everything.
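If you want to check whether hubs are behind the odd neighbours, one quick diagnostic is to compute cosine nearest neighbours directly from the exported embedding matrix. A minimal sketch, assuming weights is the (vocab_size, embedding_dim) matrix from the tutorial's w2v_embedding layer and vocab is the list returned by vectorize_layer.get_vocabulary():
import numpy as np

# Cosine nearest neighbours straight from the embedding matrix.
def nearest(word, weights, vocab, k=10):
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # unit-normalise rows
    sims = unit @ unit[vocab.index(word)]        # cosine similarity to every word
    best = np.argsort(-sims)[1:k + 1]            # skip the query word itself
    return [(vocab[i], round(float(sims[i]), 3)) for i in best]

# Words that keep showing up near very different queries are candidate hubs.
# nearest('king', weights, vocab)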

Related

Reproducibility, Controlling Randomness, Operator-level Randomness in TFF

I have TFF code that takes a slightly different optimization path during training across different runs, despite having set all the operator-level seeds, the numpy seeds for sampling clients in each round, etc. The FAQ section on the TFF website does talk about randomness and expectations in TFF, but I found the answer slightly confusing. Is it the case that some aspects of the randomness can't be directly controlled even after setting all the operator-level seeds one can, because one can't control the way sub-sessions are started and ended?
To be more specific, these are all the operator-level seeds that my code already sets: dataset.shuffle, create_tf_dataset_from_all_clients, keras.initializers and np.random.seed for per-round client sampling (which uses numpy). I have verified that the initial model state is the same across runs, but as soon as training starts, the model states start diverging across different runs. The divergence is gradual/slow in most cases, but not always.
The code is quite complex, so not adding it here.
There is one more source of non-determinism that would be very hard to control: summation of float32 numbers is order-dependent, because floating-point addition is not associative.
When you simulate a number of clients in a round, the TFF executor does not have a way to control the order in which the model updates are added together. As a result, there can be tiny differences at the bottom of the float32 range. While this may sound negligible, it can add up over a number of rounds (I have seen it after hundreds of rounds, but it could be fewer) and eventually cause different loss/accuracy/model-weight trajectories, as the gradients start to be computed at slightly different points.
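A quick way to see the effect with plain NumPy (nothing TFF-specific): summing the same float32 values in different orders typically gives results that differ in the last few bits.
import numpy as np

# Same float32 values, three summation orders: the results are usually not
# bit-identical, which is exactly the effect described above.
rng = np.random.default_rng(0)
updates = rng.normal(size=100_000).astype(np.float32)
print(updates.sum(), updates[::-1].sum(), np.sort(updates).sum())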
BTW, this tutorial has more info on best practices in controlling randomness in TFF.

Reverse Image search (for image duplicates) on local computer

I have a bunch of poor quality photos that I extracted from a PDF. Somebody I know has the good quality photos somewhere on her computer (Mac), but my understanding is that it will be difficult to find them.
I would like to
loop through each poor quality photo
perform a reverse image search using each poor quality photo as the query image, with this person's computer as the database to search for the higher quality images
and create a copy of each high quality image in one destination folder.
Example pseudocode
for each image in poorQualityImages:
    search ./macComputer for a higherQualityImage of image
    copy higherQualityImage to ./higherQualityImages
I need to perform this action once.
I am looking for a tool, GitHub repo or library that can perform this functionality, more so than a deep understanding of content-based image retrieval.
There's a post on reddit where someone was trying to do something similar.
imgdupes is a program that seems like it almost achieves this, but I do not want to delete the duplicates; I want to copy the highest-quality duplicate to a destination folder.
Update
Emailed my previous image processing prof and he sent me this
Off the top of my head, nothing out of the box.
No guaranteed solution here, but you can narrow the search space.
You’d need a little program that outputs the MSE or SSIM similarity
index between two images, and then write another program or shell
script that scans the hard drive and computes the MSE between each
image on the hard drive and each query image, then check the images
with the top X percent similarity score.
Something like that. Still not maybe guaranteed to find everything
you want. And if the low quality images are of different pixel
dimensions than the high quality images, you’d have to do some image
scaling to get the similarity index. If the poor quality images have
different aspect ratios, that’s even worse.
So I think it’s not hard but not trivial either. The degree of
difficulty is partly dependent on the nature of the corruption in the
low quality images.
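A minimal sketch of the comparison step described above, assuming scikit-image and Pillow are installed (the grayscale conversion and the resize-to-query-size step are assumptions, not part of the original advice):
import numpy as np
from PIL import Image
from skimage.metrics import mean_squared_error, structural_similarity

# Compare one query image against one candidate found on the hard drive.
def compare(query_path, candidate_path):
    q = Image.open(query_path).convert('L')                      # grayscale query
    c = Image.open(candidate_path).convert('L').resize(q.size)   # match pixel dimensions
    q = np.asarray(q, dtype=np.float64)
    c = np.asarray(c, dtype=np.float64)
    return mean_squared_error(q, c), structural_similarity(q, c, data_range=255)

# Rank every image on the drive per query (low MSE / high SSIM = more similar)
# and hand-check the top few percent, as suggested.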
UPDATE
Github project I wrote which achieves what I want
What you are looking for is called image hashing. In this answer you will find a basic explanation of the concept, as well as a go-to GitHub repo for a plug-and-play application.
Basic concept of Hashing
From the repo page: "We have developed a new image hash based on the Marr wavelet that computes a perceptual hash based on edge information with particular emphasis on corners. It has been shown that the human visual system makes special use of certain retinal cells to distinguish corner-like stimuli. It is the belief that this corner information can be used to distinguish digital images that motivates this approach. Basically, the edge information attained from the wavelet is compressed into a fixed length hash of 72 bytes. Binary quantization allows for relatively fast hamming distance computation between hashes. The following scatter plot shows the results on our standard corpus of images. The first plot shows the distances between each image and its attacked counterpart (e.g. the intra distances). The second plot shows the inter distances between altogether different images. While the hash is not designed to handle rotated images, notice how slight rotations still generally fall within a threshold range and thus can usually be matched as identical. However, the real advantage of this hash is for use with our mvp tree indexing structure. Since it is more descriptive than the dct hash (being 72 bytes in length vs. 8 bytes for the dct hash), there are much fewer false matches retrieved for image queries."
Another blogpost for an in-depth read, with an application example.
Available Code and Usage
A GitHub repo can be found here; there are obviously more out there.
After importing the package you can use it to generate and compare hashes:
>>> from PIL import Image
>>> import imagehash
>>> hash = imagehash.average_hash(Image.open('test.png'))
>>> print(hash)
d879f8f89b1bbf
>>> otherhash = imagehash.average_hash(Image.open('other.bmp'))
>>> print(otherhash)
ffff3720200ffff
>>> print(hash == otherhash)
False
>>> print(hash - otherhash)
36
The demo script find_similar_images, also in the mentioned GitHub repo, illustrates how to find similar images in a directory.
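To tie this back to the original question, here is a rough sketch that scans a drive with perceptual hashes and copies the closest match for each low-quality query into a destination folder. The paths, the file extensions and the distance threshold are placeholders to adapt:
import os, shutil
from PIL import Image
import imagehash

SEARCH_ROOT = '/Volumes/macComputer'      # where the good photos might live
QUERY_DIR = './poorQualityImages'
DEST = './higherQualityImages'
EXTS = ('.jpg', '.jpeg', '.png', '.tif', '.bmp')

def hash_file(path):
    try:
        return imagehash.phash(Image.open(path))
    except Exception:
        return None                        # unreadable or non-image file

# Hash every image found on the drive once.
candidates = {}
for root, _, files in os.walk(SEARCH_ROOT):
    for name in files:
        if name.lower().endswith(EXTS):
            p = os.path.join(root, name)
            h = hash_file(p)
            if h is not None:
                candidates[p] = h

# For each low-quality query, copy the candidate with the smallest hash distance.
os.makedirs(DEST, exist_ok=True)
for name in os.listdir(QUERY_DIR):
    q = hash_file(os.path.join(QUERY_DIR, name))
    if q is None or not candidates:
        continue
    best_path, best_dist = min(((p, q - h) for p, h in candidates.items()), key=lambda t: t[1])
    if best_dist <= 12:                    # arbitrary threshold; inspect matches by hand
        shutil.copy2(best_path, DEST)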
Premise
I'll focus my answer on the image processing part, as I believe the implementation details, e.g. traversing a file system, are not the core of your problem. Also, all that follows is just my humble opinion; I am sure there are better ways to retrieve your images that I am not aware of. Anyway, I agree with what your prof said, and I'll follow the same line of thought, so I'll share some ideas on possible similarity indexes you might use.
Answer
MSE and SSIM - This is a possible solution, as suggested by your prof. As I assume the low quality images also have a different resolution than the good ones, remember to downsample the good ones (and not upsample the bad ones).
Image subtraction (1-norm distance) - Subtract two images -> if they are equal you'll get a black image. If they are slightly different, the non-black pixels (or the sum of the pixel intensity) can be used as a similarity index. This is actually the 1-norm distance.
Histogram distance - You can refer to this paper: https://www.cse.huji.ac.il/~werman/Papers/ECCV2010.pdf. Comparing two images' histograms might be potentially robust for your task. Check out this question too: Comparing two histograms
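A tiny NumPy sketch of the two indexes above (image subtraction and histogram comparison). It assumes a and b are grayscale arrays already resized to the same shape, and uses a simple L1 distance between normalised histograms rather than the Earth Mover's Distance from the linked paper:
import numpy as np

# a, b: grayscale uint8 arrays of identical shape.
def l1_distance(a, b):
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).mean()   # mean per-pixel difference

def histogram_distance(a, b, bins=64):
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    return np.abs(ha - hb).sum()          # simple L1 between histograms, not EMD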
Embedding learning - Since you included tensorflow, keras and pytorch as tags, let's consider deep learning. This paper came to my mind: https://arxiv.org/pdf/1503.03832.pdf The idea is to learn a mapping from the image space to a Euclidean space, i.e. to compute an embedding of the image. In the embedding hyperspace, images are points. The paper learns an embedding function by minimizing the triplet loss, which is meant to maximize the distance between images of different classes and minimize the distance between images of the same class. You could train the same model on a dataset like ImageNet. You could augment the dataset by lowering the quality of the images, in order to make the model "invariant" to differences in image quality (e.g. down-sampling followed by up-sampling, image compression, adding noise, etc.). Once you can compute embeddings, you can compute the Euclidean distance (as a substitute for the MSE). This might work better than using MSE/SSIM as similarity indexes. Repo of FaceNet: https://github.com/timesler/facenet-pytorch. Another general-purpose approach (not related to faces) which might help you: https://github.com/zegami/image-similarity-clustering.
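If training with the triplet loss is more than you want to take on, a cheaper variant of the same idea is to reuse a pretrained ImageNet backbone as a fixed embedding function and compare embeddings with the Euclidean distance. A sketch with tf.keras (MobileNetV2 is just one convenient choice, not a recommendation from the papers above):
import numpy as np
import tensorflow as tf

# Pretrained backbone used as a fixed embedding function.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling='avg',
                                         input_shape=(224, 224, 3))

def embed(path):
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(np.asarray(img, dtype=np.float32))
    return base(x[None, ...]).numpy()[0]   # one feature vector per image

# Smaller Euclidean distance = more similar:
# d = np.linalg.norm(embed('query.jpg') - embed('candidate.jpg'))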
Siamese networks for predicting a similarity score - I am referring to this paper on face verification: http://bmvc2018.org/contents/papers/0410.pdf. The siamese network takes two images as input and outputs a value in the range [0, 1]. We can interpret the output as the probability that the two images belong to the same class. You can train a model of this kind to predict 1 for image pairs of the following kind: (good quality image, artificially degraded image). To degrade the image, again, you can combine e.g. down-sampling followed by up-sampling, image compression, adding noise, etc. Let the model predict 0 for image pairs of different classes (i.e. different images). The output of the network can be used as a similarity index.
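A minimal tf.keras sketch of such a siamese model; the small encoder, the |a - b| merge and the input size are placeholders, not the architecture from the paper:
import tensorflow as tf
from tensorflow.keras import layers

def make_encoder(shape=(128, 128, 1)):
    inp = layers.Input(shape)
    x = layers.Conv2D(32, 3, activation='relu')(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation='relu')(x)
    x = layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(inp, layers.Dense(64)(x))

encoder = make_encoder()
a = layers.Input((128, 128, 1))
b = layers.Input((128, 128, 1))
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([encoder(a), encoder(b)])
score = layers.Dense(1, activation='sigmoid')(diff)      # interpreted as P(same image)
siamese = tf.keras.Model([a, b], score)
siamese.compile(optimizer='adam', loss='binary_crossentropy')
# Train on pairs: (good image, degraded copy) -> 1, (two different images) -> 0.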
Remark 1
These different approaches can also be combined. They all provide you with similarity indexes, so you can very easily average the outcomes.
Remark 2
If you only need to do this once, the effort of implementing and training deep models might not be justified. I would not suggest it. Still, you can consider it if you can't find any other solution and that Mac is REALLY FULL of images and a manual search is not possible.
If you look at the documentation of imgdupes you will see there is the following option:
--dry-run
dry run (do not delete any files)
So if you run imgdupes with --dry-run you will get a listing of all the duplicate images but it will not actually delete anything. You should be able to process that output to move the images around as you need.
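A rough post-processing sketch, under the assumption that the dry-run listing prints duplicate file paths in groups separated by blank lines (fdupes-style) and that you redirected it to a file passed as the first argument; verify this against your actual output first. Picking the largest file as the highest-quality copy is only a heuristic:
import shutil, sys
from pathlib import Path

DEST = Path('./higherQualityImages')
DEST.mkdir(exist_ok=True)

# Parse blank-line-separated groups of paths from the saved listing.
groups, current = [], []
for line in Path(sys.argv[1]).read_text().splitlines():
    if line.strip():
        current.append(Path(line.strip()))
    elif current:
        groups.append(current)
        current = []
if current:
    groups.append(current)

# Copy the biggest file of each duplicate group as the presumed best version.
for group in groups:
    best = max(group, key=lambda p: p.stat().st_size)
    shutil.copy2(best, DEST)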
Try the similar image finder I developed to address this problem.
There is an explanation and a description of the algorithm there, so you can implement your own version if needed.

How to successfully convert math papers to plain text

Goals:
1. Develop a canonical method to use plain text to uniquely represent STEM papers in general and math papers in particular.
2. Develop software that can convert existing typed STEM papers into that canonical form with 100% accuracy. Note that I can't tolerate any inaccuracy, simply because as a single individual I can't proofread millions of papers to correct conversion errors, even at a rate of 0.001 errors per paper on average.
Problems:
1. All PDF-to-text, TeX-to-text etc. programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work, due to math symbols that cannot be processed.
2. PDF is really hard to process.
3. TeX is really hard to process because of the numerous macros STEM paper authors tend to add to their source files, which tend to break LaTeXML and other converters. It is very easy to process my own papers because I don't use a lot of new commands. However, there are many authors whose papers contain \def macros which cannot even be processed by de-macro. To actually get TeX to work, assuming that I can even get the source files of most papers on arXiv at all, I would pretty much have to write my own variant of a TeX engine that somehow expands all required macros and produces a plain text document.
Is there any other way to solve this problem? Currently the target format I prefer is pretty much just plain text plus math symbols written in LaTeX, without formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I can learn to set up a neural network and train it to recognize these printed math symbols, assuming that my laptop is sufficiently powerful. There are literally fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize due to the lack of variation. Shall I do that?
Yes, you can try that: recognition of the symbols, with subsequent transformation into LaTeX (for example, writing \sqrt for every square root).
You can further refer to this paper on the recognition issue:
https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y - Recognition of handwritten symbols, Torfinn Taxt, Jórunn B. Ólafsdóttir, Morten Dæhlen.
http://neuralnetworksanddeeplearning.com/chap1.html - here you can find out more, with code samples, on implementing a neural network for recognizing handwritten digits.
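If you do go down that route, a small CNN classifier is probably enough for fewer than 200 low-variation printed symbols. A hedged tf.keras sketch; the 32x32 crop size, the class count and the training data (which you would have to generate yourself, e.g. by rendering each LaTeX symbol at a few sizes and fonts) are all assumptions:
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 200                      # roughly one class per LaTeX symbol

# Small CNN over 32x32 grayscale symbol crops.
model = tf.keras.Sequential([
    layers.Input((32, 32, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(symbol_images, symbol_labels, epochs=..., validation_split=0.1)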

Lstm to improve tokenization

Recently I started toying with TensorFlow, DNNs, etc., and now I'm trying to implement something more serious: information retrieval from short sentences (doctor instructions).
Unfortunately the dataset I have is, as always, quite "dirty". As I'm trying to use word embeddings, I actually need "clean" data. Take one example:
"Take two pilleach day". There is a missing white space between pill and each. I am implementing "tokenizer improver" to look at each sentence and propose new tokenization based on joint probability of each word in sentence given the frequency of terms in whole document (tf) . As I was doing it today, a thought came to my mind: why bother writing suboptimal solution for this problem when I can employ powerful learning algorithms such as Lstm networks to do that for me. However, as of today, I have only a feeling that it's actually possible to do that. As we know, feelings are not best when it comes to architecting such complex problems. I don't know where to begin: what should be my training set and learning goal.
I know this is a broad question, but I know there are many brilliant people with more knowledge about tensorflow and neural nets, so I'm sure that somebody has either already solved similar problem or just knows how to approach this problem.
Any guidance is welcome; I do not expect you to solve this for me, of course :)
Kisses and all the best to the whole TensorFlow community :)
I had the same issue. I solved it by using a character-level net. Basically I reimplemented Character-Aware Neural Language Models, kicked out the whole "word" elements and just stayed at the character level.
Training data: I took the data I had, as dirty as it was, used the dirty data as the targets, and made it even more dirty to create the inputs.
So your "Take two pilleach day" will be handled because in many cases you also have a clean, similar phrase, e.g. "Take one pill each morning", which under the regime above serves as the target while you train the net on destroyed inputs like "Take oe pileach mornin".

Using tensorflow to identify lego bricks?

Having read this article about a guy who uses TensorFlow to sort cucumbers into nine different classes, I was wondering whether this type of process could be applied to a large number of classes. My idea would be to use it to identify Lego parts.
At the moment, a site like Bricklink lists more than 40,000 different parts, so it would be a bit different from the cucumber example, but I am wondering if it sounds suitable. There is no easy way to get hundreds of pictures for each part, but does the following process sound feasible:
take pictures of a part;
try to identify the part using TensorFlow;
if it does not identify the correct part, take more pictures and feed the neural network with them;
go on with the next part.
That way, each time we encounter a new piece we "teach" the network its reference so that it can be better recognized the next time. Like that, and after hundreds of iterations monitored by a human, could we imagine TensorFlow being able to recognize the parts? At least the most common ones?
My question might sound stupid, but I am not into neural networks, so any advice is welcome. At the moment I have not found any way to identify a Lego part based on pictures, and this "cucumber example" sounds promising, so I am looking for some feedback.
Thanks.
You can read about the work of Jacques Mattheij; he actually uses a customized version of Xception [1] running on https://keras.io/.
The introduction is Sorting 2 Metric Tons of Lego.
In Sorting 2 Tons of Lego, The software Side, you can read:
The hard challenge to deal with next was to get a training set large
enough to make working with 1000+ classes possible. At first this
seemed like an insurmountable problem. I could not figure out how to
make enough images and to label them by hand in acceptable time, even
the most optimistic calculations had me working for 6 months or longer
full-time in order to make a data set that would allow the machine to
work with many classes of parts rather than just a couple.
In the end the solution was staring me in the face for at least a week
before I finally clued in: it doesn’t matter. All that matters is that
the machine labels its own images most of the time and then all I need
to do is correct its mistakes. As it gets better there will be fewer
mistakes. This very rapidly expanded the number of training images.
The first day I managed to hand-label about 500 parts. The next day
the machine added 2000 more, with about half of those labeled wrong.
The resulting 2500 parts were the basis for the next round of
training 3 days later, which resulted in 4000 more parts, 90% of which
were labeled right! So I only had to correct some 400 parts, rinse,
repeat… So, by the end of two weeks there was a dataset of 20K images,
all labeled correctly.
This is far from enough, some classes are severely under-represented
so I need to increase the number of images for those, perhaps I’ll
just run a single batch consisting of nothing but those parts through
the machine. No need for corrections, they’ll all be labeled
identically.
A recent update is Sorting 2 Tons of Lego, Many Questions, Results.
[1] CHOLLET, François. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint arXiv:1610.02357, 2016.
I have started this using IBM Watson's Visual Recognition.
I had six different bricks to be recognized on the transport belt background.
I am actually thinking about TensorFlow, since I can have it running locally.
The codelab TensorFlow for Poets describes almost exactly what you want to achieve.
For a demo of the Watson version:
https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/Lego_bricks_recognition_with_Watosn_lego_and_raspberry_pi?lang=en
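For the TensorFlow route, the TensorFlow for Poets codelab essentially boils down to retraining a small head on top of a pretrained ImageNet model using your own labelled photo folders. A hedged tf.keras sketch of that setup; the bricks/<part_number>/ directory layout, the image size and the MobileNetV2 base are placeholders:
import tensorflow as tf

IMG_SIZE, BATCH = (224, 224), 32

# One sub-folder per part number, e.g. bricks/3001/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    'bricks/', image_size=IMG_SIZE, batch_size=BATCH)
num_classes = len(train_ds.class_names)

base = tf.keras.applications.MobileNetV2(include_top=False, pooling='avg',
                                         input_shape=IMG_SIZE + (3,))
base.trainable = False                   # keep the ImageNet features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),   # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, epochs=...)        # then photograph new parts, correct mistakes, retrain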