I have loaded the MNIST dataset using the following command:
from dataget import data
dataset = data("mnist").get()
How do I convert it to Sklearn-friendly format, i.e. features_train, labels_train, features_test, labels_test?
I have tried "np.loadtxt" but got this error:
ValueError: could not convert string to float: data
I have also tried the following lines of code:
df = next(dataset.training_set.random_batch_dataframe_generator(10))
df
And it has returned this error:
AttributeError: training_set
Can someone please help? I have been searching for alternative methods, but I still get errors. Thank you!
P.S. Here's another way I've used to obtain the MNIST dataset:
dataset = fetch_mldata('MNIST original')
#E.Z. helped me out with the answer!
features, labels = dataset.data, dataset.target
I then split them into training and test sets using the following lines of code:
# use one mask for both arrays so that rows and labels stay aligned
msk = np.random.rand(len(features)) < 0.8
features_train = features[msk]
features_test = features[~msk]
labels_train = labels[msk]
labels_test = labels[~msk]
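One way to make sure features and labels stay aligned is to index both with the same shuffled indices. The following is a minimal sketch, assuming NumPy arrays of equal length; the helper name split_train_test, the 80/20 ratio, and the small stand-in arrays are illustrative and not part of the original code:

```python
import numpy as np

def split_train_test(features, labels, train_fraction=0.8, seed=0):
    """Split features and labels using one shared set of shuffled indices."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))      # one permutation for both arrays
    n_train = int(train_fraction * len(features))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])

# Stand-in data with MNIST-like shapes (hypothetical, not the real dataset).
features = np.arange(100 * 784).reshape(100, 784)
labels = features[:, 0] // 784                  # label i identifies row i
features_train, labels_train, features_test, labels_test = split_train_test(
    features, labels)
```

Because the stand-in label of each row is derived from the row itself, it is easy to confirm that every feature row still sits next to its own label after the split.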
In my code, I am trying to extract data from a CSV file to use in the function, but it doesn't output anything and gives no error. The function itself works, because I tried it with plain NumPy arrays as inputs. I'm not sure why it doesn't work with pandas.
import numpy as np
import pandas as pd
import os
# change the current directory to the directory where the running script file is
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# finding best fit line for y=mx+b by iteration
def gradient_descent(x,y):
    m_iter = b_iter = 1 #starting point
    iteration = 10000
    n = len(x)
    learning_rate = 0.05
    last_mse = 10000
    #take baby steps to reach global minima
    for i in range(iteration):
        y_predicted = m_iter*x + b_iter
        #mse = 1/n*sum([value**2 for value in (y-y_predicted)]) # cost function to minimize
        mse = 1/n*sum((y-y_predicted)**2) # cost function to minimize
        if (last_mse - mse)/mse < 0.001:
            break
        # recall MSE formula is 1/n*sum((yi-y_predicted)^2), where y_predicted = m*x+b
        # using partial deriv of MSE formula, d/dm and d/db
        dm = -(2/n)*sum(x*(y-y_predicted))
        db = -(2/n)*sum((y-y_predicted))
        # use current predicted value to get the next value for prediction
        # by using learning rate
        m_iter = m_iter - learning_rate*dm
        b_iter = b_iter - learning_rate*db
        print('m is {}, b is {}, cost is {}, iteration {}'.format(m_iter,b_iter,mse,i))
        last_mse = mse
#x = np.array([1,2,3,4,5])
#y = np.array([5,7,8,10,13])
#gradient_descent(x,y)
df = pd.read_csv('Linear_Data.csv')
x = df['Area']
y = df['Price']
gradient_descent(x,y)
"My code works because I tried it with just numpy array as inputs. Not sure why it doesn't work with pandas."
Well no, your code also works with pandas dataframes:
df = pd.DataFrame({'Area': [1,2,3,4,5], 'Price': [5,7,8,10,13]})
x = df['Area']
y = df['Price']
gradient_descent(x,y)
The above gives you the same output as with NumPy arrays.
Check what is in Linear_Data.csv and/or add some print statements in the gradient_descent function to test your assumptions. I would start by adding a print statement just before the condition that leads to the break statement:
print(last_mse, mse)
if (last_mse - mse)/mse < 0.001:
    break
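To see why that check matters: with last_mse initialised to 10000, any data whose first-iteration MSE is larger than 10000 makes (last_mse - mse)/mse negative, so the loop breaks before the first print. The sketch below reproduces only that first test. The area and price values are made up (the contents of Linear_Data.csv are not shown in the question), and stops_immediately is a hypothetical helper name:

```python
import numpy as np

def stops_immediately(x, y, m=1.0, b=1.0, last_mse=10000):
    """Return True if the early-stop test fires on the very first iteration."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((y - (m * x + b)) ** 2)       # same cost as in gradient_descent
    return bool((last_mse - mse) / mse < 0.001)

# Small values from the commented-out NumPy test in the question.
small = stops_immediately([1, 2, 3, 4, 5], [5, 7, 8, 10, 13])

# Hypothetical house areas and prices, i.e. values on a much larger scale.
large = stops_immediately([2600, 3000, 3200], [550000, 565000, 610000])

print(small, large)
```

If the CSV data behaves like the second case, possible remedies are to start from last_mse = float('inf') and to scale x and y before training, since a learning rate of 0.05 is unlikely to converge on unscaled values of that size.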
I'm trying to get a "tf.transform encoded dict" with this tfx.components.Transform function.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file),
    instance_name="taxi")
context.run(transform)
I need a dict like this: "a dict of the data you load ({feature_name: feature_value})".
The Transform component above gives me a TFRecord file. How can I decode it properly?
Any help would be appreciated.
import tensorflow_transform as tft
def preprocessing_fn(inputs):
    NUMERIC_FEATURE_KEYS = ['PetalLengthCm', 'PetalWidthCm',
                            'SepalLengthCm', 'SepalWidthCm']
    TARGET_FEATURES = "Species"
    outputs = inputs.copy()
    del outputs['Id']
    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(outputs[key])
    return outputs
Write a module like this one. I wrote it for the iris dataset; it is simple to understand, and you can do the same for your dataset. The output will be saved as a TFRecord dataset.
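For intuition about what the module above does to each numeric column: tft.scale_to_0_1 applies min-max scaling using statistics computed over the whole dataset. A rough NumPy equivalent for a single column is sketched below. The function name min_max_scale and the sample values are made up, and the handling of a constant column is an arbitrary choice for this sketch rather than a description of the library's behaviour:

```python
import numpy as np

def min_max_scale(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant column: returning zeros is an arbitrary choice in this sketch.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical petal lengths in centimetres.
petal_length = min_max_scale([1.4, 4.7, 6.0, 1.3])
print(petal_length)
```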
I have my encode function that looks like this:
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
def encode(texts, tokenizer=tokenizer, maxlen=10):
    # import pdb; pdb.set_trace()
    inputs = tokenizer.encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True,
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
I want to get my data encoded on the fly by doing this:
x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
           .map(encode))
However, this chucks the error:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
From what I could tell after setting a breakpoint inside encode, the error occurs because I am passing something that is not a NumPy array. How do I get Hugging Face transformers to play nicely with TensorFlow strings as inputs?
If you need a dummy dataframe here it is:
df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})
What I tried
I tried to use from_generator so that I can pass the strings to the encode_plus function. However, this does not work with TPUs.
AUTO = tf.data.experimental.AUTOTUNE
def get_gen(df):
    def gen():
        for i in range(len(df)):
            yield encode(df.loc[i, 'comment_text']) , df.loc[i, 'toxic']
    return gen
shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))
train_dataset = tf.data.Dataset.from_generator(
    get_gen(df_train),
    ((tf.int32, tf.int32, tf.int32), tf.int32),
    shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)
Version Info:
transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')
The BERT tokenizer works on a string, a list/tuple of strings, or a list/tuple of integers, so first check whether your data is actually being converted to strings. To apply the tokenizer to the whole dataset I used Dataset.map, but that runs in graph mode, so I had to wrap the call in tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function. After going through py_function my data arrived as bytes, so I applied tf.compat.as_str to convert the bytes to strings.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def encode(lang1, lang2):
    lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
    lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
    return lang1, lang2
def tf_encode(pt, en):
    result_pt, result_en = tf.py_function(func = encode, inp = [pt, en], Tout=[tf.int64, tf.int64])
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en
train_dataset = dataset3.map(tf_encode)
BUFFER_SIZE = 200
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
                                                                padded_shapes=(60, 60))
a,p = next(iter(train_dataset))
When you create the TensorFlow dataset with
tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
TensorFlow converts your strings into tensors of string type, which is not an accepted input of tokenizer.encode_plus. As the error message says, it only accepts a string, a list/tuple of strings, or a list/tuple of integers. You can verify this by adding print(type(texts)) inside your encode function (output: <class 'tensorflow.python.framework.ops.Tensor'>).
I'm not sure what your follow-up plan is or why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:
import tensorflow as tf
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
texts = ['Today was a good day', 'Today was a bad day',
'Today was a rainy day', 'Today was a sunny day',
'Today was a cloudy day']
#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
    texts,
    return_tensors='tf',
    return_attention_masks=True,
    return_token_type_ids=True,
    pad_to_max_length=True,
    max_length=10
)
dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
                                              inputs['attention_mask'],
                                              inputs['token_type_ids']))
print(type(dataset))
Output:
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
I had this exact error, and my mistake was simple: I had a few NaNs in my texts.
So make sure to check whether there are NaNs in the text column of your dataframe.
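A quick way to run that check with pandas is sketched below. The comment_text column name follows the question; the sample rows, including the missing one, are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing text.
df_train = pd.DataFrame(
    {"comment_text": ["Today was a good day", np.nan, "Today was a bad day"]})

# Count the missing entries before tokenizing.
n_missing = int(df_train["comment_text"].isna().sum())
print("missing texts:", n_missing)

# Drop the missing rows and keep plain Python strings for the tokenizer.
clean_texts = df_train["comment_text"].dropna().astype(str).tolist()
```

Whether to drop the rows or fill them with a placeholder string depends on the task; dropping is only the simplest option.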
I am working on a sample project for the California housing price problem and get the above error while training my model.
I am following this article:
https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=firststeps-colab&hl=en#scrollTo=pDIxp6vcU809
In my case, the error was caused by passing:
my_feature = california_housing_dataframe["total_rooms"]
into:
ds = Dataset.from_tensor_slices((features, targets))
The solution is to pass:
my_feature = california_housing_dataframe[["total_rooms"]]
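The difference between the two spellings is the type that comes back: single brackets return a one-dimensional Series, while double brackets return a one-column DataFrame. A small illustration, with made-up values standing in for the real dataset:

```python
import pandas as pd

# Stand-in values; the real dataframe comes from the California housing CSV.
california_housing_dataframe = pd.DataFrame(
    {"total_rooms": [5612.0, 7650.0, 720.0]})

as_series = california_housing_dataframe["total_rooms"]     # 1-D Series
as_frame = california_housing_dataframe[["total_rooms"]]    # 2-D DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```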
y_train = train_set.pop("satisfaction")
train_input_fn = make_input_fn(train_set, y_train) #make_input_fn is the input function
linear_est.train(train_input_fn) # train
In my case, the error was that I wrote y_train = "satisfaction" instead of
y_train = train_set.pop("satisfaction"). The pop method removes the specified column from the dataframe and returns it, so here it ends up in the y_train variable. That column is then the value your model learns to predict later on.
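The effect of pop is easy to check on a tiny dataframe. In this sketch the "age" column and all values are hypothetical; only the "satisfaction" column name comes from the answer above:

```python
import pandas as pd

# Hypothetical training data with one feature column and the label column.
train_set = pd.DataFrame({"age": [25, 40, 31], "satisfaction": [1, 0, 1]})

# pop removes the column from train_set in place and returns it as a Series.
y_train = train_set.pop("satisfaction")

print(list(train_set.columns))
print(list(y_train))
```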
I wonder how to generate covariance matrices in batch in TensorFlow. If we use the following code
dim = 3
df = 5
ds = tf.contrib.distributions
scale_sqrt = tf.random_normal([dim, dim], seed=1234)
scale = tf.matmul(scale_sqrt, tf.transpose(scale_sqrt))
sigma = ds.WishartCholesky(df=df, scale=scale).sample()
it works. But if we try the batch version of this code by adding a batch dimension, TF throws an error. My batch version looks as follows:
dim = 3
df = 5
ds = tf.contrib.distributions
num_per_batch = 10
scale_sqrt = tf.random_normal([num_per_batch, dim, dim], seed=1234)
scale = tf.matmul(scale_sqrt, tf.transpose(scale_sqrt, [0,2,1]))
sigma = ds.WishartCholesky(df=df, scale=scale).sample()
Please let me know how to sample in batch efficiently.
For posterity: this bug has been fixed, and the fix should be available in the tf-nightly versions (when I call sess.run(...) on sigma, I get back values with no errors).
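Independently of the TensorFlow fix, the batched construction can be checked outside TF. The following NumPy sketch assumes an integer df of at least dim, so that each Wishart draw can be formed as a sum of df outer products of Gaussian vectors. sample_wishart_batch is a hypothetical helper, not a TensorFlow API:

```python
import numpy as np

def sample_wishart_batch(scale, df, rng):
    """Draw one Wishart(df, scale[i]) sample per batch entry (integer df >= dim)."""
    batch, dim, _ = scale.shape
    chol = np.linalg.cholesky(scale)               # [batch, dim, dim], lower triangular
    z = rng.standard_normal((batch, df, dim))      # df standard normal rows per entry
    x = z @ np.transpose(chol, (0, 2, 1))          # each row is now ~ N(0, scale[i])
    return np.transpose(x, (0, 2, 1)) @ x          # sum of df outer products

dim, df, num_per_batch = 3, 5, 10
rng = np.random.default_rng(1234)

# Same construction as in the question: scale = A @ A^T for each batch entry.
scale_sqrt = rng.standard_normal((num_per_batch, dim, dim))
scale = scale_sqrt @ np.transpose(scale_sqrt, (0, 2, 1))

sigma = sample_wishart_batch(scale, df, rng)
print(sigma.shape)
```

Every matrix operation here broadcasts over the leading batch dimension, which is the same idea as the [0,2,1] transpose in the batched TensorFlow code.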