How to write text to a file using TextLineDataset - tensorflow

I am trying to read the text in a file Shakespear.txt line by line using
tensorflow's TextLineDataset, split the words in each line, and write the words to another file txt.txt, one word per line. Here is my code:
import tensorflow as tf
tf.enable_eager_execution()
BATCH_SIZE=2
#from tensorflow.keras.model import Sequential
dataset_in_lines=tf.data.TextLineDataset("Shakespear.txt")
dataset=dataset_in_lines.map(lambda string: tf.string_split([string]).values)
with open("txt.txt","w") as f:
    for k in dataset.take(2):
        for x in k:
            f.write("\n".join(x))
When I run it, it gives the error Cannot iterate over a scalar tensor
at the f.write line. Please help me figure out the issue.

It would be helpful if you could share the Shakespear.txt file, but based on your error, it seems like it is receiving the tensor, not the actual value.
So you first need to get the value from the tensor k; you can use k.numpy().
Replace for x in k: with for x in k.numpy():
Let us know if it works.
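For reference, a minimal sketch of the corrected loop, assuming eager execution as in your snippet (note: .numpy() on a string tensor yields bytes objects, so the decode call below is my addition, not part of the fix above):
with open("txt.txt", "w") as f:
    for k in dataset.take(2):
        for word in k.numpy():              # k.numpy() is an array of bytes objects
            f.write(word.decode("utf-8") + "\n")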

I found a better way: replace dataset=dataset_in_lines.map(lambda string: tf.string_split([string]).values) with tokenizer.tokenize. The following code achieves the objective (see https://www.tensorflow.org/tutorials/load_data/text for more details):
import tensorflow as tf
tf.enable_eager_execution()
import tensorflow_datasets as tfds
tokenizer = tfds.features.text.Tokenizer()
dataset_in_lines=tf.data.TextLineDataset("Shakespear.txt")
vocabulary_set = set()
for x in dataset_in_lines:
    k = tokenizer.tokenize(x.numpy())
    vocabulary_set.update(k)
with open("txt.txt","w") as f:
    for x in vocabulary_set:
        f.write(x+"\n")

Related

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have spent days trying to figure this out; hopefully someone can help.
I am loading a .mat file into Python using scipy.io, placing the struct into a DataFrame, which will then be used in Tensorflow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is (43249x6). Each cell in the 'A' column is a different sized matrix, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers, and with 'nz' and 'np' I have no issues.
I want to put all columns into a DataFrame and use column 'A', 'Ln', or 'types' as the labels and 'nz' and 'np' as the features; again, I have no issues with the latter. Can anyone help with this or suggest some kind of workaround?
The end goal is to have Tensorflow train on nz and np and give me either a matrix, an Ln, or a type.
What type of data is your .mat file? Is your application very time critical?
If you can collect all your data in a struct, you could give jsonencode a try: make the struct a JSON file and load it back into Python via json (see the json documentation on loading data).
Then you can create a pandas DataFrame via
pd.DataFrame.from_dict()
Of course this would only be a workaround. You would still have to ensure that the data in the MATLAB struct is correctly ordered before it is imported and transferred to a df.
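As a rough sketch of that workaround (the file name graphs.json and the MATLAB export step are assumptions, not part of this answer):
# MATLAB side (assumed): fid = fopen('graphs.json', 'w');
# fprintf(fid, '%s', jsonencode(Graphs)); fclose(fid);
import json
import pandas as pd

with open('graphs.json') as f:       # hypothetical file written by MATLAB
    raw = json.load(f)               # a list of dicts, one per struct element

df = pd.DataFrame(raw)               # or pd.DataFrame.from_dict for a dict of fields
print(df.dtypes)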
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work in Tensorflow because Tensorflow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs, then load them into StellarGraph.
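As a sketch of that conversion step (assuming the adjacency matrices come out of the 'A' column as square numpy arrays; the column access below is an assumption):
import networkx as nx
import numpy as np

adj = np.asarray(df['A'].iloc[0], dtype=float)  # one of the 18x18 / 16x16 matrices
g = nx.from_numpy_matrix(adj)                   # one node per row/column of adj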

Tensorflow: Load unknown TFRecord dataset

I got a TFRecord data file filename = train-00000-of-00001 which contains images of unknown size and maybe other information as well. I know that I can use dataset = tf.data.TFRecordDataset(filename) to open the dataset.
How can I extract the images from this file and save them as numpy arrays?
I also don't know if there is any other information saved in the TFRecord file, such as labels or resolution. How can I get this information? How can I save it as a numpy array?
I normally only use numpy arrays and am not familiar with TFRecord data files.
1.) How can I extract the images from this file and save them as numpy arrays?
What you are looking for is this:
record_iterator = tf.python_io.tf_record_iterator(path=filename)
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # Exit after 1 iteration as this is purely demonstrative.
    break
2.) How can I get this information?
Here is the official documentation. I strongly suggest that you read the documentation because it goes step by step in how to extract the values that you are looking for.
Essentially, you have to convert example to a dictionary. So if I wanted to find out what kind of information is in a tfrecord file, I would do something like this (in context with the code stated in the first question): dict(example.features.feature).keys()
3.) How can I save it as a numpy array?
I would build upon the for loop mentioned above. So on every iteration, it extracts the values that you are interested in and appends them to numpy arrays. If you want, you could create a pandas dataframe from those arrays and save it as a csv file.
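A sketch of that accumulate-then-save idea (the feature names 'image_raw' and 'label' are assumptions; inspect your file's keys first, as shown above):
import io
import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image

images, labels = [], []
for string_record in tf.python_io.tf_record_iterator(path=filename):
    example = tf.train.Example()
    example.ParseFromString(string_record)
    feats = dict(example.features.feature)
    images.append(np.array(Image.open(io.BytesIO(feats['image_raw'].bytes_list.value[0]))))
    labels.append(feats['label'].int64_list.value[0])

# Persist the labels; variable-size images are better saved individually as .npy files.
pd.DataFrame({'label': labels}).to_csv('labels.csv', index=False)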
But...
You seem to have multiple tfrecord files... tf.data.TFRecordDataset(filename) returns a dataset that is used to train models.
So in the event of multiple tfrecords, you would need a double for loop. The outer loop goes through each file; for that particular file, the inner loop goes through all of the tf.Examples.
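For instance, a minimal sketch of that double loop (the file names are placeholders):
filenames = ["train-00000-of-00002", "train-00001-of-00002"]
for filename in filenames:                                        # outer loop: one pass per file
    for string_record in tf.python_io.tf_record_iterator(path=filename):
        example = tf.train.Example()                              # inner loop: every tf.Example
        example.ParseFromString(string_record)
        print(dict(example.features.feature).keys())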
EDIT:
Converting to np.array()
import tensorflow as tf
from PIL import Image
import io
import numpy as np

record_iterator = tf.python_io.tf_record_iterator(path=filename)

for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # Get the values in a dictionary
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    image_array = np.array(Image.open(io.BytesIO(example_bytes)))
    print(image_array)
    break
Sources for the code above:
Base code
Converting bytes to PIL.JpegImagePlugin.JpegImageFile
Converting from PIL.JpegImagePlugin.JpegImageFile to np.array
Official Documentation for PIL
EDIT 2:
import tensorflow as tf
from PIL import Image
import io
import numpy as np
# Load image
cat_in_snow = tf.keras.utils.get_file('320px-Felis_catus-cat_on_snow.jpg', 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')
#------------------------------------------------------Convert to tfrecords
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def image_example(image_string):
    feature = {
        'image_raw': _bytes_feature(image_string),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.python_io.TFRecordWriter('images.tfrecords') as writer:
    image_string = open(cat_in_snow, 'rb').read()
    tf_example = image_example(image_string)
    writer.write(tf_example.SerializeToString())
#------------------------------------------------------
#------------------------------------------------------Begin Operation
record_iterator = tf.python_io.tf_record_iterator(path='images.tfrecords')
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)

    # OPTION 1: convert bytes to arrays using PIL and IO
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    PIL_array = np.array(Image.open(io.BytesIO(example_bytes)))

    # OPTION 2: convert bytes to arrays using Tensorflow
    with tf.Session() as sess:
        TF_array = sess.run(tf.image.decode_jpeg(example_bytes, channels=3))
    break
#------------------------------------------------------
#------------------------------------------------------Compare results
(PIL_array.flatten() != TF_array.flatten()).sum()
PIL_array == TF_array
PIL_img = Image.fromarray(PIL_array, 'RGB')
PIL_img.save('PIL_IMAGE.jpg')
TF_img = Image.fromarray(TF_array, 'RGB')
TF_img.save('TF_IMAGE.jpg')
#------------------------------------------------------
Remember that tfrecords is simply a way of storing information for tensorflow models to read in an efficient manner.
I use PIL and io to essentially convert the bytes to an image: io takes the bytes and converts them to a file-like object that PIL.Image can then read.
Yes, there is a pure tensorflow way to do it: tf.image.decode_jpeg.
Yes, there is a difference between the two approaches when you compare the two arrays.
Which one should you pick? Tensorflow is not the way to go if you are worried about accuracy. As stated in Tensorflow's github: "The TensorFlow-chosen default for jpeg decoding is IFAST, sacrificing image quality for speed". Credit for this information belongs to this post.

How to modify the code when changing the backend from Theano to TensorFlow?

Here is the code sample:
https://github.com/keiserlab/keras-neural-graph-fingerprint/blob/master/NGF/layers.py
in the beginning, it has
import theano.tensor as T
import keras.backend as K
However, I want to use tensorflow as backend. How can I modify the code?
Also, later the programmer uses some code like:
# Set pad value and set subtensor of actual tensor
output = T.set_subtensor(output[:, :paddings[0], :], padvalue)
output = T.set_subtensor(output[:, paddings[1]:, :], padvalue)
output = T.set_subtensor(output[:, paddings[0]:x.shape[1] + paddings[0], :], x)
Do I need to also change this code since I changed the backend?
Thanks
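One possible direction, as a sketch (this is not from the original thread): the three T.set_subtensor calls together fill a constant-padded region around x along axis 1, which TensorFlow can express directly with tf.pad, assuming paddings holds the before/after pad widths:
import tensorflow as tf

def pad_axis1(x, paddings, padvalue):
    # Add paddings[0] rows before and paddings[1] rows after x along axis 1,
    # filled with the constant padvalue -- the same effect the T.set_subtensor
    # sequence achieves by writing into a preallocated tensor.
    return tf.pad(x,
                  [[0, 0], [paddings[0], paddings[1]], [0, 0]],
                  mode="CONSTANT",
                  constant_values=padvalue)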

torch.pow does not work

I'm trying to create a custom loss function using PyTorch, and am running into a simple error.
When I try to use torch.pow to take the exponent of a PyTorch Variable, I get the following error message:
AttributeError: 'torch.LongTensor' object has no attribute 'pow'
In the python terminal, I created a simple Variable, and attempted to do the same, and received the same error. Here's a snippet that should recreate the problem:
import torch
from torch.autograd import Variable
import numpy as np
v = Variable(torch.from_numpy(np.array([1, 2, 3, 4])))
torch.pow(v, 2)
I can't find any information on this issue, and nothing is showing up in search results. Help?
EDIT: this problem also occurs when I try to use torch.sqrt()
EDIT: the same problem also happens if I try to do
v.pow(2)
pow is definitely a method of v, and the docs clearly state that pow is a method that exists and takes a tensor as its argument. I really don't see how this is happening, and it seems to me that the docs are just flat out wrong and these methods don't actually work.
You need to initialize the tensor with floats, because pow always returns a Float.
import torch
from torch.autograd import Variable
import numpy as np
v = Variable(torch.from_numpy(np.array([1, 2, 3, 4], dtype="float32")))
torch.pow(v, 2)
You can cast it back to integers afterwards:
torch.pow(v, 2).type(torch.LongTensor)
yields
Variable containing:
1
4
9
16
[torch.LongTensor of size 4]
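The same float-first fix covers the torch.sqrt() case from the question's edit; a quick check (same setup as above):
import torch
from torch.autograd import Variable
import numpy as np

v = Variable(torch.from_numpy(np.array([1, 4, 9, 16], dtype="float32")))
print(torch.sqrt(v))   # works once the underlying data is a FloatTensor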

numpy concatenate dimension mismatch

I've been running into an issue with numpy's concatenate that I've been unable to make sense of, and was hoping that someone had encountered and resolved the same problem. I'm attempting to join together two arrays created by scikit-learn's TfidfVectorizer and LabelBinarizer, but I get an error message that "arrays must have the same number of dimensions", despite the fact that the inputs are a (77946, 12157) array and a (77946, 1000) array, respectively. (As requested in the comments, a reproducible example is at the bottom.)
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()
tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape
print B.shape
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)
I initially thought the fact that B was encoded as a sparse matrix might be causing the issue, but converting it to a dense matrix didn't resolve it. I had the same issue using hstack instead.
Even more peculiar is that adding in a third LabelBinarizer matrix results in no error:
TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))
Here is the error message:
Traceback (most recent call last):
File "smallerdimensions.py", line 49, in <module>
Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions
Thank you for any help you can provide. Here is a reproducible example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np
tweets=["Jazz for a Rainy Afternoon","RT: #mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]
DS=pd.DataFrame({"tweet":tweets,"location":location})
TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
DS['location']=DS['location'].fillna("none")
tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)
Numpy is v 1.6.1, pandas is v 0.12.0, scikit is 0.14.1.
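Not from the original thread, but one common fix, sketched against the reproducible example above: np.concatenate does not understand scipy sparse matrices (it treats them as 0-d object arrays, which would explain the dimension error), so keep everything sparse and stack with scipy.sparse.hstack:
from scipy import sparse

A = tagvect.transform(DS['location'])            # dense label matrix
B = TV.transform(DS['tweet'])                    # sparse tf-idf matrix
Xtrain = sparse.hstack([sparse.csr_matrix(A), B]).tocsr()
print Xtrain.shape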