Efficiently handling time series data from multiple sources in TF.Keras - tensorflow

I have a dataset which looks something like this:
time | src | a | b | c | d | e | Label
----------------------------------------
0. | 1 | # | # | # | # | # | #
1. | 1 | # | # | # | # | # | #
2. | 1 | # | # | # | # | # | #
3. | 1 | # | # | # | # | # | #
4. | 1 | # | # | # | # | # | #
....
0. | 2 | # | # | # | # | # | #
1. | 2 | # | # | # | # | # | #
2. | 2 | # | # | # | # | # | #
3. | 2 | # | # | # | # | # | #
4. | 2 | # | # | # | # | # | #
I'm training a model to predict Label from a window of [a, b, c, d, e] values, so my X has shape (window_size, 5) and my y is the value of Label at the end of the window. Every row in a window must share the same src value (i.e. a window of data should only come from a single source).
I've previously been compiling X/y pairs myself, with a little tf.keras.utils.Sequence subclass to hack together semi-usable memory management. Looking for a better way, I found tf.keras.utils.timeseries_dataset_from_array, but, as far as I understand, it has no concept of src, meaning a single X window could span multiple srcs. How can I leverage something like tf.keras.utils.timeseries_dataset_from_array, but have it only extract windows that come from a single src?
Note: I'd like a rolling window, i.e. every possible overlapping window from each source.
Progress
1. I successfully used timeseries_dataset_from_array, but it doesn't respect src:
# ============= Prep ===========
import tensorflow as tf
import numpy as np

# creating numpy data structures representing the problem
X = np.random.random((100, 5))
y = np.random.random((100,))
src = np.array([0]*50 + [1]*50)
window_size = 5

# making a time series dataset which does not respect src
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(
    X, y, batch_size=2, sequence_length=window_size,
    sequence_stride=window_size, shuffle=True)
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
2. Implemented mdaoust's solution, but I'm getting shape errors when training:
# ============= Prep ===========
import tensorflow as tf
import numpy as np

# creating numpy data structures representing the problem
X = np.random.random((100, 5))
y = np.random.random((100,))
src = np.expand_dims(np.array([0]*50 + [1]*50), 1)
window_size = 5

# appending source information to X, for filtration
X = np.append(src, X, 1)

# making a time series dataset which does not respect src
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(
    X, y, sequence_length=window_size,
    sequence_stride=1, shuffle=True)

# filtering by and removing src info
def single_source(x, y):
    source = x[:, 0]
    return tf.reduce_all(source == source[0])

def drop_source(x, y):
    return x[:, 1:], y

def set_shapes(x, y, shape):
    x.set_shape(shape)
    return x, y

Xy_ds = Xy_ds.filter(single_source).map(drop_source)
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
Error:
ValueError: Input 0 is incompatible with layer sequential_3: expected shape=(None, None, 5), found shape=(None, None, 6)
Presumably related to this GitHub thread.
I tried this and something like this, but no dice.

The simplest thing you can do is:
1. Include the source as one of the columns of X.
2. Use timeseries_dataset_from_array.
3. Use filter to drop windows that have mixed sources.
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(...)

def single_source(x, y):
    source = x[:, 0]
    return tf.reduce_all(source == source[0])

def drop_source(x, y):
    return x[:, 1:], y

Xy_ds = Xy_ds.filter(single_source).map(drop_source)
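One pitfall worth noting with this recipe: timeseries_dataset_from_array batches by default (batch_size=128), so the x seen by filter and map has shape (batch, window, features), and x[:, 0] then indexes the window axis rather than the feature axis; this is what produces the shape error shown above. A minimal sketch of one fix, assuming a recent TF version where batch_size=None yields unbatched windows:

# assumes TF >= 2.6, where batch_size=None yields one window at a time,
# so the per-window single_source/drop_source functions work as written
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(
    X, y, sequence_length=window_size, sequence_stride=1,
    shuffle=True, batch_size=None)
Xy_ds = Xy_ds.filter(single_source).map(drop_source).batch(32)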

Based on mdaoust's answer; this is the final working code.
Prep
This will create the time series dataset and do all the manipulation needed to format it correctly.
# ============= Prep ===========
import tensorflow as tf
import numpy as np

batch_size = 32

# creating numpy data structures representing the problem
X = np.random.random((100, 5))
y = np.random.random((100,))
src = np.expand_dims(np.array([0]*50 + [1]*50), 1)
window_size = 5

# appending source information to X, for filtration
X = np.append(src, X, 1)

# making a time series dataset which does not respect src;
# batch_size=1 so that filter and map see one window at a time
Xy_ds = tf.keras.utils.timeseries_dataset_from_array(
    X, y, sequence_length=window_size, batch_size=1,
    sequence_stride=1, shuffle=True)

# filtering by and removing src info
def single_source(x, y):
    source = x[:, :, 0]
    # compare every timestep's src against the first src in the window;
    # comparing against source[0] (the whole first row) would compare
    # the window with itself and always pass
    return tf.reduce_all(source == source[0, 0])

def drop_source(x, y):
    return x[:, :, 1:], y

Xy_ds = Xy_ds.filter(single_source)
Xy_ds = Xy_ds.map(drop_source)
Xy_ds = Xy_ds.unbatch().batch(batch_size)
# printing the dataset
i = 0
for x, y in Xy_ds:
    i += 1
    print(x)
    print(y)
print('total batches: {}'.format(i))
Training
Just to sanity check that everything is working:
# ============= Train ===========
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, LSTM, Flatten
#training a model, to validate the dataset is working correctly
model = Sequential()
model.add(InputLayer(input_shape=[window_size,5]))
model.add(LSTM(3))
model.add(Flatten())
model.add(Dense(1, activation='relu'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(Xy_ds,epochs=1)
Important note: in order for this to work, batching must occur after the filter and map are applied. That's why batch_size is 1 initially, and the real batching happens afterwards via unbatch().batch(batch_size).
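As an alternative to filtering, you can avoid generating mixed-source windows in the first place by building one windowed dataset per source and concatenating them. A minimal sketch under the same toy data, using the raw 5-feature X (before src was appended); the helper name make_source_ds is mine, and batch_size=None assumes a recent TF version:

def make_source_ds(X, y, src, s):
    # windows drawn from a single source can never mix sources
    mask = (src.squeeze() == s)
    return tf.keras.utils.timeseries_dataset_from_array(
        X[mask], y[mask], sequence_length=window_size,
        sequence_stride=1, batch_size=None)

per_source = [make_source_ds(X, y, src, s) for s in np.unique(src)]
Xy_ds = per_source[0]
for ds in per_source[1:]:
    Xy_ds = Xy_ds.concatenate(ds)
Xy_ds = Xy_ds.shuffle(1000).batch(batch_size)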

Related

How to generate scatter plot from the vector values in each row for PCA?

I have successfully created label and feature vectors and am able to run PCA on them, but the generated column is of datatype vector and each row is a vector. How do I plot a scatter plot of the PCA components?
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
import matplotlib.pyplot as plt
import numpy as np

pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(df2)
result = model.transform(df2).select("pcaFeatures")
result.show(truncate=False)
result.printSchema()
sum(model.explainedVariance)
The output I get is as follows:
+-----------------------------------------+
|pcaFeatures                              |
+-----------------------------------------+
|[0.9636424850516068,0.3313811478935345] |
|[0.8373410183626885,0.3880024159205323] |
|[-0.10845002652578276,0.6564023408615134]|
|[-0.479560942670008,1.1082617061107987] |
|[0.9576794865061756,0.2714643678687506] |
|[0.7879027918969023,0.5145147352059565] |
|[0.5124304692668866,-0.1917648708243116] |
|[-0.7369547765884317,1.0356901001261056] |
|[-0.10282606527163515,0.671822806010155] |
|[1.0661514594145962,0.3285042864447201] |
|[-0.32474294634018674,0.8134787300694735]|
|[-0.2109752165189983,0.7625432021333773] |
|[0.9643915702012056,0.3276715407315949] |
|[0.8970032005901719,0.3514814197107741] |
|[0.47244006359864477,0.6034483574148226] |
|[0.7840860892766188,0.421458958932977] |
|[-0.7640855001185652,1.117508731487764] |
|[0.5078194714105165,0.5364599694359978] |
|[1.020982108328857,0.36510796039610344] |
|[-0.6823665987365033,-0.5902523648089859]|
+-----------------------------------------+
only showing top 20 rows
root
|-- pcaFeatures: vector (nullable = true)
0.4127855508907272
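A minimal sketch of one way to get the scatter plot, assuming the result DataFrame from the code above is small enough to collect to the driver: pull each pcaFeatures vector back as a numpy array and plot the two components.

# collect the PCA vectors to the driver and convert each to a numpy array
coords = np.array([row.pcaFeatures.toArray() for row in result.collect()])

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()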

Converting .tflite to .pb

Problem: How can I convert a .tflite (serialised flat buffer) to .pb (frozen model)? The documentation only talks about one-way conversion.
Use case: I have a model that was trained and converted to .tflite, but unfortunately I do not have the details of the model and I would like to inspect the graph. How can I do that?
I found the answer here.
We can use the Interpreter to analyze the model; the code looks like the following:
import numpy as np
import tensorflow as tf
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
Netron is the best analysis/visualising tool I have found; it can understand a lot of formats, including .tflite.
I don't think there is a way to restore a tflite back to pb, as some information is lost after the conversion. An indirect way to get a glimpse of what is inside a tflite model is to read back each of its tensors:
interpreter = tf.contrib.lite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()

# trial some arbitrary numbers to find out the number of tensors
num_layer = 89
for i in range(num_layer):
    detail = interpreter._get_tensor_details(i)
    print(i, detail['name'], detail['shape'])
and you would see something like the output below. As only a limited set of operations is currently supported, it is not too difficult to reverse-engineer the network architecture. I have put some tutorials on my GitHub too.
0 MobilenetV1/Logits/AvgPool_1a/AvgPool [ 1 1 1 1024]
1 MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd [ 1 1 1 1001]
2 MobilenetV1/Logits/Conv2d_1c_1x1/Conv2D_bias [1001]
3 MobilenetV1/Logits/Conv2d_1c_1x1/weights_quant/FakeQuantWithMinMaxVars [1001 1 1 1024]
4 MobilenetV1/Logits/SpatialSqueeze [ 1 1001]
5 MobilenetV1/Logits/SpatialSqueeze_shape [2]
6 MobilenetV1/MobilenetV1/Conv2d_0/Conv2D_Fold_bias [32]
7 MobilenetV1/MobilenetV1/Conv2d_0/Relu6 [ 1 112 112 32]
8 MobilenetV1/MobilenetV1/Conv2d_0/weights_quant/FakeQuantWithMinMaxVars [32 3 3 3]
9 MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6 [ 1 14 14 512]
10 MobilenetV1/MobilenetV1/Conv2d_10_depthwise/depthwise_Fold_bias [512]
11 MobilenetV1/MobilenetV1/Conv2d_10_depthwise/weights_quant/FakeQuantWithMinMaxVars [ 1 3 3 512]
12 MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Conv2D_Fold_bias [512]
13 MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6 [ 1 14 14 512]
14 MobilenetV1/MobilenetV1/Conv2d_10_pointwise/weights_quant/FakeQuantWithMinMaxVars [512 1 1 512]
15 MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6 [ 1 14 14 512]
16 MobilenetV1/MobilenetV1/Conv2d_11_depthwise/depthwise_Fold_bias [512]
17 MobilenetV1/MobilenetV1/Conv2d_11_depthwise/weights_quant/FakeQuantWithMinMaxVars [ 1 3 3 512]
18 MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Conv2D_Fold_bias [512]
19 MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6 [ 1 14 14 512]
20 MobilenetV1/MobilenetV1/Conv2d_11_pointwise/weights_quant/FakeQuantWithMinMaxVars [512 1 1 512]
I have done this with TOCO, using tf 1.12:

tensorflow_1.12/tensorflow/bazel-bin/tensorflow/contrib/lite/toco/toco \
  --output_file=coco_ssd_mobilenet_v1_1.0_quant_2018_06_29.pb \
  --output_format=TENSORFLOW_GRAPHDEF \
  --input_format=TFLITE \
  --input_file=coco_ssd_mobilenet_v1_1.0_quant_2018_06_29.tflite \
  --inference_type=FLOAT \
  --input_type=FLOAT \
  --input_array="" \
  --output_array="" \
  --input_shape=1,450,450,3 \
  --dump_graphviz=./
(you can remove the dump_graphviz option)

3D Tensor in a correct data shape for neural network

I'm starting out with neural networks and I'm having some issues with my data format. I have a pandas DataFrame with 130 rows and 4 columns, where each data point is an array of 595 items:
| Col 1 | Col 2 | Col 3 | Col 4 |
Row 1 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |
Row 2 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |
Row 3 | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] | [x1, ..., x595] |
I created X_train, X_test, y_train and y_test using train_test_split. However, when I check the shape of X_train it returns (52, 4), and I'm not able to create a model for my NN because it doesn't accept this shape. This is the error:
"ValueError: Error when checking input: expected dense_4_input to have
3 dimensions, but got array with shape (52, 4)"
I believe it should be (52, 4, 595), right? So I'm a bit confused: how can I specify the input shape correctly, or reshape my data into the appropriate format?
I am using pandas, keras, tensorflow and jupyter notebook.
You have to reshape your data into a 3D numpy array.
Suppose we have a data frame where each cell is a numpy array, as you described:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.zeros((130, 4))).astype('object')
for i in range(130):
    for k in range(4):
        data.iloc[i, k] = np.zeros(595)
We can then turn the data frame into a 3D numpy array:

dataar = data.values
# stack the four columns along axis 1 to get shape (130, 4, 595);
# stacking along axis 0 and then calling reshape would scramble the rows
dataar = np.stack([np.vstack(dataar[:, k]) for k in range(4)], axis=1)
dataar.shape
# (130, 4, 595)
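The same result can be had more directly by stacking each row's four arrays; a small sketch on the same data frame:

# stack each row's four (595,) arrays into (4, 595), then stack the rows
dataar = np.stack([np.stack(row) for row in data.values])
dataar.shape
# (130, 4, 595)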

Feature Union with Pandas: Properly Selecting Columns?

I have a df like so:
+------------+----------------------+---------+
| number_col | text_col             | outcome |
+------------+----------------------+---------+
| 42         | Here is a string     | 1       |
| 52         | And here is a string | 0       |
+------------+----------------------+---------+
I am trying to use an SVM to predict outcome, with a FeatureUnion of text_col transformed by a vectorizer and number_col transformed by a StandardScaler. Here is my code (drawing on this example):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

a = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for vectorizing text
            ('subject', Pipeline([
                ('selector', ItemSelector(key='text_col')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),
            # Pipeline for standardizing number_col
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='number_col')),
                ('std', StandardScaler()),
            ])),
        ]
    )),
    # Use an SVC classifier on the combined features
    ('svc', SVC()),
])

a.fit(df, df['outcome'])
I get KeyError: 'text_col', because evidently transform is dropping the other columns. How would I make my code work so that I vectorize text_col but don't drop number_col?
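One way around hand-rolled selectors, sketched below: on newer scikit-learn versions, ColumnTransformer selects DataFrame columns by name, so no ItemSelector is needed. Note that TfidfVectorizer wants a 1-D column of strings (bare column name), while StandardScaler wants 2-D input (column name in a list):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

a = Pipeline([
    ('union', ColumnTransformer([
        # 'text_col' as a bare string -> 1-D input for the vectorizer
        ('tfidf', TfidfVectorizer(min_df=50), 'text_col'),
        # ['number_col'] as a list -> 2-D input for the scaler
        ('std', StandardScaler(), ['number_col']),
    ])),
    ('svc', SVC()),
])

a.fit(df[['text_col', 'number_col']], df['outcome'])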

How to use LinearRegression across groups in DataFrame?

Let us say my spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF, apply a function (say a linear regression that depends on multiple columns, two columns in this case, of the aggregated DF) to each aggregated DF, and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression

lr_object = LinearRegression()

def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF contains all the data
    # within this function's scope, so that I can apply any function
    # that needs all the data in ith_DF
    X = [[i.earnings] for i in ith_DF.select("earnings").rdd.collect()]
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]

coefficient_collector = []

# the following iteration is not possible in Spark, as a 'GroupedData'
# object is not iterable; please consider it pseudo-code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))

model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using a pandas_udf. In fact, there is an example of fitting grouped regressions in the pandas_udf announcement here:
Introducing Pandas UDF for Python
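A minimal sketch of the grouped regression in the spirit of that blog post; on Spark 3.x the same idea is spelled groupBy(...).applyInPandas(...). The column names follow the question's DataFrame:

import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # fit one regression of health on earnings per id group
    lr = LinearRegression().fit(pdf[["earnings"]], pdf["health"])
    return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                         "intercept": [lr.intercept_],
                         "slope": [lr.coef_[0]]})

model_df = df.groupBy("id").applyInPandas(
    fit_group, schema="id long, intercept double, slope double")
model_df.show()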
What I'd do is filter the main DataFrame to create smaller DataFrames and do the processing, say a linear regression, on each.
You can then execute the linear regressions in parallel (on separate threads using the same SparkSession, which is thread-safe), with the main DataFrame cached; see the sketch after the p.s. below.
That should give you the full power of Spark.
p.s. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid-search-based model selection in Spark MLlib, and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".
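A minimal sketch of the filter-and-thread approach described above, assuming a modest number of distinct ids; each thread gets its own estimator, since a shared lr_object would not be thread-safe:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql.functions import col
from sklearn.linear_model import LinearRegression

df.cache()
ids = [r.id for r in df.select("id").distinct().collect()]

def fit_one(i):
    # pull one group's data to the driver and fit locally
    sub = df.filter(col("id") == i).select("earnings", "health").toPandas()
    lr = LinearRegression().fit(sub[["earnings"]], sub["health"])
    return (i, float(lr.intercept_), float(lr.coef_[0]))

# the SparkSession is safe to use from multiple threads
with ThreadPoolExecutor(max_workers=4) as ex:
    rows = list(ex.map(fit_one, ids))

model_df = spark.createDataFrame(rows, ["id", "intercept", "slope"])
model_df.show()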