How to concatenate + tokenize + pad strings in TFX preprocessing? - tensorflow

I'd like to perform the usual text preprocessing steps in a TensorFlow Extended pipeline's Transform step/component. My data is the following (strings in independent features, 0/1 integers in label column):
field1 field2 field3 label
--------------------------
aa bb cc 0
ab gfdg ssdg 1
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow_text import UnicodeCharTokenizer

def preprocessing_fn(inputs):
    outputs = {}
    outputs['features_xf'] = tf.sparse.concat(axis=0, sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    outputs['label_xf'] = tf.convert_to_tensor(inputs["label"], dtype=tf.float32)
    return outputs
but this doesn't work:
ValueError: Arrays were not all the same length: 3 vs 1 [while running 'Transform[TransformIndex0]/ConvertToRecordBatch']
(Later on I want to apply char-level tokenization and padding to MAX_LEN as well).
Any idea?
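One possible direction, as a minimal sketch (untested; it assumes the string features arrive in preprocessing_fn as SparseTensors, and MAX_LEN is a constant you define yourself): concatenating along axis=1 instead of axis=0 keeps the batch dimension aligned with the label column, which is what the "3 vs 1" length mismatch complains about, and tf_text.UnicodeCharTokenizer plus RaggedTensor.to_tensor cover the char-level tokenization and padding:

import tensorflow as tf
import tensorflow_text as tf_text

MAX_LEN = 40  # hypothetical maximum sequence length

def preprocessing_fn(inputs):
    outputs = {}
    # axis=1 concatenates the three fields per example; axis=0 stacked the
    # batches, tripling the row count relative to the label column.
    combined = tf.sparse.concat(
        axis=1, sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    # Join the fields into one string per example before tokenizing.
    dense = tf.sparse.to_dense(combined, default_value="")
    joined = tf.strings.reduce_join(dense, axis=1, separator=" ")
    # Char-level tokenization yields a RaggedTensor of Unicode code points.
    tokens = tf_text.UnicodeCharTokenizer().tokenize(joined)
    # Pad (or truncate) every example to MAX_LEN.
    outputs['features_xf'] = tokens.to_tensor(default_value=0, shape=[None, MAX_LEN])
    outputs['label_xf'] = tf.cast(inputs["label"], tf.float32)
    return outputs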

Related

statsmodels.formula.api.ols.fit().pvalues returns a Pandas Series instead of numpy array

So this may be hard to explain because it's a chunk of some really large code - I don't expect it to be reproducible.
But essentially it's a simulation which (using multiple simulated datasets) creates a one-way or two-way regression and calculates the respective t-values and p-values for them.
However, for some of the datasets (with the same information and no missing values), statsmodels.formula.api.ols(...).fit() returns the p-values / t-values as a pandas Series instead of a numpy array (even for one-way studies).
Could someone please explain why, and whether there is a way to specify the output?
An example dataframe looks like this: (x0-x187 is our y, genotype and treatment are the desired factors, staging is a factor used for normalisation)
                         x0        x1        ...  treatment  genotype
200926_ku20_e1_wt_veh    0.075821  0.012796  ...  veh        wt
201210_ku25_e7_wt_veh    0.082307  0.007596  ...  veh        wt
201127_ku55_e6_wt_veh    0.083049  0.008978  ...  veh        wt
201220_ku52_e2_wt_veh    0.078414  0.013488  ...  veh        wt
...                      ...       ...       ...  ...        ...
210913_b6ku_22297_e5_wt  0.067858  0.008081  ...  treat      wt
210821_b6ku_3_e5_wt      0.070417  0.012396  ...  treat      wt
And then the code:
import subprocess as sub
import os
import struct
from pathlib import Path
import tempfile
from typing import Tuple
import shutil
from logzero import logger as logging
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

pvals = []  # accumulators (assumed; defined somewhere in the omitted code)
tvals = []

for col in range(data.shape[1]):
    if not df[f'x{col}'].any():
        p = np.nan
        t = np.nan
    else:
        if two_way:
            # two-way model - if it's just the geno or treat comparison,
            # the one-factor col will be ignored
            # for some simulations smf is returning a Series.
            fit = smf.ols(formula=f'x{col} ~ genotype * treatment + staging', data=df, missing='drop').fit()
            # get all pvals except intercept and staging
            p = fit.pvalues[~fit.pvalues.index.isin(['Intercept', 'staging'])]
            t = fit.tvalues[~fit.tvalues.index.isin(['Intercept', 'staging'])]
        else:
            fit = smf.ols(formula=f'x{col} ~ genotype + staging', data=df, missing='drop').fit()
            p = fit.pvalues['genotype[T.wt]']
            t = fit.tvalues['genotype[T.wt]']
    pvals.append(p)
    tvals.append(t)

p_all = np.array(pvals)
print("example", p_all[0])
print(type(p_all[0][0]), p_all[0][0])
And finally some output:
Desired output:
example [1.63688492e-01 6.05907115e-06 7.70710934e-02]
<class 'numpy.float64'> 0.16368849176977607
"Error" output:
example genotype[T.wt]                     0.862423
treatment[T.veh]                           0.000177
genotype[T.wt]:treatment[T.veh]            0.522066
dtype: float64
<class 'numpy.float64'> 0.8624226150886212
I've manually corrected the data but I would rather not have to do dumb fixes in the future.
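A possible explanation, offered as an assumption rather than a verified diagnosis: fit.pvalues is always a pandas Series, and np.array(pvals) only collapses a list of Series into a plain 2-D float array when every Series has the same length; if any fit drops or adds a term, numpy falls back to an object array that still holds the Series themselves. Converting each slice explicitly sidesteps that coercion:

# Hedged sketch: convert the Series slices to numpy arrays before appending,
# so p_all becomes a plain float array regardless of what np.array() infers.
p = fit.pvalues[~fit.pvalues.index.isin(['Intercept', 'staging'])].to_numpy()
t = fit.tvalues[~fit.tvalues.index.isin(['Intercept', 'staging'])].to_numpy()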

tensorflow model trained on keras.preprocessing.timeseries_dataset_from_array yields unexpected output shape of (sequence_length, 1)

I'm trying to train a tensorflow model where my inputs are a lagged timeseries of multiple features and I want to predict a single value.
Somehow the output shape ends up as an array of (lag/sequence_length, 1) when my lagged dataset has more than one feature, but I haven't been able to figure out why that is. Here is a minimal example of what I'm trying to do:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

# generate some dummy data
x0 = np.array(range(300))
x1 = np.array(range(300)) * 2
df = pd.DataFrame({"x0": x0, "x1": x1})
y = np.array(range(100))
# also tried reshaping my y, but no help
# y = np.array(range(100)).reshape(100,1)

# make a dataset with lagged values
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=df,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)

# show an example of what we are working with
list(ds.take(1))

# define simple model and train it
model = tf.keras.Sequential(
    [
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)

# make predictions on dataset
predictions = model.predict(ds)
# show predictions
predictions
print(predictions.shape)
"""
(100, 3, 1)
"""
If I create the dataset with only a single feature as:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=x1,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
My outputs have the expected shape.
I would appreciate any pointers. I'm guessing something is getting broadcast, which then results in the output I'm seeing, but I haven't been able to figure out what exactly is going on.
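A plausible explanation, as a sketch rather than a verified diagnosis: Dense layers operate on the last axis only, so a (batch, 3, 2) window passes through as (batch, 3, 32) and then (batch, 3, 1), whereas the single-feature dataset yields (batch, 3) windows and hence (batch, 1) outputs. Flattening each window before the Dense stack would give one prediction per window:

# Hedged sketch: flatten each (sequence_length, n_features) window so the
# Dense layers see a single feature vector per sample.
model = tf.keras.Sequential(
    [
        layers.Flatten(),  # (batch, 3, 2) -> (batch, 6)
        layers.Dense(32),
        layers.Dense(1),   # -> (batch, 1)
    ]
)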

tensorflow error while training model - Labels dtype should be integer Instead got <dtype: 'string'>

I am currently learning tensorflow and am new to the concept. I am trying multi-class classification using LinearClassifier.
I have a dataset where I reduced the number of input variables to 30 using PCA (done with scikit-learn). I have named the PCA columns PCA_Col_0 through PCA_Col_29.
I then created tensorflow feature column for each of the 30 variables using the following code:
feat_cols = ['PCA_Col_0', ..., 'PCA_Col_29']
d = {}
for item in feat_cols:
    d[item] = tf.feature_column.numeric_column(item)
feat_cols2 = list(d.values())
I then initialized the model
import tensorflow as tf
n_classes = 3914
model = tf.estimator.LinearClassifier(feature_columns = feat_cols2, n_classes = n_classes)
input_fn = tf.estimator.inputs.pandas_input_fn(x = DF_Final_V1[feat_cols], y = DF_Final_V1['nUnique_ID'], shuffle = False)
model.train(input_fn)
I get the error Labels dtype should be integer Instead got <dtype: 'string'> from tensorflow.
I have verified the following:
The input dataset has only float64 entries
There are no null or nan values in the input dataset
The tf.feature_column objects show dtype as float32
Why isn't my model training and why am I getting this error?
Credit to @Bruce Swain in the comments: the code worked after I modified the output values to the range 0 to n_classes - 1.
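A short sketch of that fix (the column name label_code is hypothetical, and pandas is assumed imported as pd): LinearClassifier expects integer labels in [0, n_classes), and pd.factorize produces exactly that encoding:

# Hedged sketch: map the raw IDs onto consecutive integers 0 .. n_classes - 1.
DF_Final_V1['label_code'], uniques = pd.factorize(DF_Final_V1['nUnique_ID'])
input_fn = tf.estimator.inputs.pandas_input_fn(
    x=DF_Final_V1[feat_cols], y=DF_Final_V1['label_code'], shuffle=False)
model.train(input_fn)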

Get embedding vectors from Embedding Column in Tensorflow

I want to get the numpy vectors created using the "Embedding Column" in Tensorflow.
For example, creating a sample DF:
import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds
Converting the pandas DF to Tensorflow object
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    #print (ds)
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    #print (ds)
    ds = ds.batch(batch_size)
    return ds
Creating an embedding column:
from tensorflow import feature_column

tf_ds = df_to_dataset(ds)
# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)
Is there any way to get the embeddings as numpy vectors from the 'col_a_embedding' object?
Example,
The category "Apple" will be embedded into a vector size 8:
[a1 a2 a3 a4 a5 a6 a7 a8]
Can we fetch that vector?
I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column), because the result from feature columns seems to be a fixed-length tensor. They achieve that by using a combiner to aggregate individual embedding vectors (sum, mean, sqrtn etc.), so the dimension along the sequence of categories is actually lost.
But it's totally doable if you use lower-level apis.
First you could construct a lookup table to convert categorical strings to ids.
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt":
apple
mango
banana
unknown
Now you could initialize the embedding as a 2d variable. Its shape is [number of categories, embedding dimension].
num_categories = 3
embedding_dim = 64
category_emb = tf.get_variable(
    "embedding_table", [num_categories, embedding_dim],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
You could then lookup category embedding like below:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)
Note that the result in ids_embeddings is a concatenated long tensor. Feel free to reshape it into the shape you want.
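To actually pull numpy vectors out, as the question asks, here is a sketch assuming the same TF1-style graph mode as the code above:

# Hedged sketch (TF1 graph mode): evaluate the variable to get the raw numpy
# embedding matrix, or evaluate the lookup result to get one row per input id.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    emb_matrix = sess.run(category_emb)    # numpy array, shape (3, 64)
    looked_up = sess.run(ids_embeddings)   # one (64,) row per id in `ids`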
I suggest the easiest and fastest way is to do it like this, which is what I am doing in my own app:
1. Use pandas to read_csv your file into a string column of type "category" in pandas, using the dtype parameter. Let's call it field "f". This is the original string column, not a numerical column yet.
2. Still in pandas, create a new column and copy the original column's pandas cat.codes into the new column. Let's call it field "f_code". Pandas automatically encodes this as a compactly represented numerical column. It will have the numbers you need for passing to neural networks.
3. Now, in an Embedding layer in your keras functional API neural network model, pass the f_code to your model's Input layer. The value in f_code will be a number now, like int8. The Embedding layer will process it correctly now. Don't pass the original column to the model.
Below are some sample code lines copied out of my project doing exactly the steps above.
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}
<some code omitted>
d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)
<some code omitted>
from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.
cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)
<some code omitted>
# Actually add _code column for the selected columns
for cn in add_code_columns:
    codecolname = cn + "_code"
    if not codecolname in d.columns:
        d[codecolname] = d[cn].cat.codes
You can see the numeric codes pandas made for you:
d.info()
d.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid 99991 non-null int32
itemid 99991 non-null int32
rating 99991 non-null float32
job 99991 non-null category
job_code 99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
Finally, you can omit the job column and retain the job_code column, in this example, for passing into your keras neural network model. Here is some of my model code:
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
By the way, please also wrap np.array() around all pandas dataframes when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas dataframes cannot be safely passed in. Otherwise you get massive memory allocations which crash hosts.
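For instance, a one-line illustration of that advice (df_X and df_y are hypothetical dataframes holding the features and labels):

# Hedged sketch: convert dataframes to plain numpy arrays before fitting.
model.fit(np.array(df_X), np.array(df_y), epochs=10)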

Slicing on scan output with TensorFlow

I want to slice the output of a scan operation in TensorFlow, but I just get strange results:
k = 10
x = 2
out = tf.scan(lambda previous_output, current_input: previous_output * current_input,
              tf.fill([k], x), initializer=tf.constant(1))
result = out[-1]  # slice with tensorflow - doesn't work

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(out)[-1])  # works, but all values are computed and stored in a np array
    print(sess.run(result))   # doesn't work???
I get as output:
1024
3
The second value is obviously wrong and random (sometimes 0 or other values).
So my question is: why? The analogous code in Theano works, and Theano can even optimize the computation when querying just the last element of the output tensor.
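One hedged workaround sketch (the cause here is an assumption, not verified: very old TensorFlow releases did not reliably support Python-style negative indexing on tensors, so out[-1] could silently read garbage). Indexing from the front avoids negative indices entirely:

# Hedged sketch: take the last scan result by its positive index instead.
result = out[k - 1]
# or with an explicit slice op:
result = tf.slice(out, [k - 1], [1])[0]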