Get embedding vectors from Embedding Column in Tensorflow - numpy

I want to get the numpy vectors created using the "Embedding Column" in Tensorflow.
For example, creating a sample DF:
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds
Converting the pandas DF to Tensorflow object
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    #print (ds)
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    #print (ds)
    ds = ds.batch(batch_size)
    return ds
Creating an embedding column:
tf_ds = df_to_dataset(ds)
# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)
Is there any way to get the embeddings as numpy vectors from the 'col_a_embedding' object?
Example,
The category "Apple" will be embedded into a vector size 8:
[a1 a2 a3 a4 a5 a6 a7 a8]
Can we fetch that vector?

I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column), because the result from feature columns seems to be a fixed-length tensor. They achieve that by using a combiner to aggregate individual embedding vectors (sum, mean, sqrtn, etc.), so the dimension along the sequence of categories is actually lost.
But it's totally doable if you use lower-level apis.
First you could construct a lookup table to convert categorical strings to ids.
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt":
apple
mango
banana
unknown
Now you could initialize the embedding as a 2d variable. Its shape is [number of categories, embedding dimension].
num_categories = 5  # 4 entries in "fruit.txt" plus 1 OOV bucket
embedding_dim = 64
category_emb = tf.get_variable(
    "embedding_table", [num_categories, embedding_dim],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
You could then look up the category embeddings like below:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)
Note that the result in ids_embeddings is one concatenated long tensor. Feel free to reshape it to the shape you want.
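If you also want the learned table back as plain numpy (the "numpy vectors" part of the question), the variable can simply be read out once it is initialized. A minimal sketch, assuming the TF 1.x session-style code above (in TF 2.x, where variables are eager, you could read it with category_emb.numpy()):
import tensorflow as tf

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())      # initializes the vocabulary lookup table
    emb_matrix = sess.run(category_emb)    # numpy array of shape [num_categories, embedding_dim]
    id_vectors = sess.run(ids_embeddings)  # numpy rows for the looked-up features
print(emb_matrix.shape)
Each row of emb_matrix is the embedding vector for the id the lookup table assigned to the corresponding category.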

I suggest the easiest and fastest way is to do it like this, which is what I am doing in my own app:
1. Use pandas read_csv to load your file, reading the string column with dtype "category" via the dtype parameter. Let's call it field "f". This is the original string column, not a numerical column yet.
2. Still in pandas, create a new column and copy the original column's cat.codes into it. Let's call it field "f_code". Pandas automatically encodes this as a compactly represented numerical column, and it will have the numbers you need for passing to neural networks.
3. Now, in an Embedding layer in your Keras functional API model, pass f_code to your model's Input layer. The value in f_code will be a number now, like int8, and the Embedding layer will process it correctly. Don't pass the original string column to the model. A minimal self-contained sketch of these three steps is shown right after this list.
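Here is that sketch; the tiny dataframe, the column name "f", and the one-layer Keras model are made up purely for illustration:
import numpy as np
import pandas as pd
import tensorflow as tf

# Step 1: read the string column as a pandas "category" dtype
d = pd.DataFrame({"f": ["Apple", "Mango", "Banana", "Apple"], "y": [1, 0, 1, 1]})
d["f"] = d["f"].astype("category")

# Step 2: add a compact integer code column
d["f_code"] = d["f"].cat.codes

# Step 3: feed only the code column into an Embedding layer of a functional API model
num_uniques = d["f"].nunique()
inp = tf.keras.Input(shape=(1,), name="f_code")
emb = tf.keras.layers.Embedding(input_dim=num_uniques + 1, output_dim=8)(inp)
out = tf.keras.layers.Dense(1, activation="sigmoid")(tf.keras.layers.Flatten()(emb))
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(np.array(d["f_code"]).reshape(-1, 1), np.array(d["y"]).reshape(-1, 1), epochs=1, verbose=0)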
Below are some sample code lines copied out of my project doing exactly the steps above.
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}
<some code omitted>
d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)
<some code omitted>
from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.
cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)
<some code omitted>
# Actually add _code column for the selected columns
for cn in add_code_columns:
    codecolname = cn + "_code"
    if codecolname not in d.columns:
        d[codecolname] = d[cn].cat.codes
You can see the numeric codes pandas made for you:
d.info()
d.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid 99991 non-null int32
itemid 99991 non-null int32
rating 99991 non-null float32
job 99991 non-null category
job_code 99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
Finally, you can omit the job column and retain the job_code column, in this example, for passing into your keras neural network model. Here is some of my model code:
# field_num0_X_cols, num_uniques, emb_len, reg, cn and input_x are all defined elsewhere in my project
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
By the way, please also wrap np.array() around all pandas dataframes when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas dataframes cannot be safely passed in; otherwise you get massive memory allocations which crash hosts.
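A minimal sketch of that wrapping (df_X and df_y are hypothetical dataframes holding only the numeric/code input columns and the target column):
import numpy as np

# Convert to plain numpy arrays before fitting to avoid the memory blow-up described above.
X_train = np.array(df_X)   # or df_X.to_numpy()
y_train = np.array(df_y)
model.fit(X_train, y_train, epochs=10, batch_size=256)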

Related

Is there an efficient way to select 5 regions of a tensor in Tensorflow?

For example, given a tensor m whose shape is [28, 28], I want to randomly select five regions within the tensor; the shape of each region is [3, 3].
Then, I want to modify the values of these regions.
One solution would be random extraction inside a loop:
import random
import tensorflow as tf

tensor = tf.ones(shape=(28, 28))
desired_shape = (3, 3)
dim1 = random.randint(0, tensor.shape[0] - desired_shape[0])
dim2 = random.randint(0, tensor.shape[1] - desired_shape[1])
extracted_tensor = tensor[dim1:dim1 + desired_shape[0], dim2:dim2 + desired_shape[1]]
First import the random module and create a tensor (or use your own). Set your desired_shape. Then create two random start indices, one for each dimension, and extract the sub-tensor via slicing.
But keep in mind that you cannot assign values to a tensor in TensorFlow, as this thread says. To work around that, first convert it to a numpy array, change the values, and then convert it back to a tensor, so this would be a solution for your issue.
np_arr = tensor.numpy()
for i in range(5):
    dim1 = random.randint(0, tensor.shape[0] - desired_shape[0])
    dim2 = random.randint(0, tensor.shape[1] - desired_shape[1])
    np_arr[dim1:dim1 + desired_shape[0], dim2:dim2 + desired_shape[1]] = [1, 2, 3]  # any values
new_tens = tf.convert_to_tensor(np_arr)
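If you are on TF 2.x and want to stay inside TensorFlow instead of round-tripping through numpy, tf.tensor_scatter_nd_update can overwrite a region; a rough sketch of that alternative:
import random
import tensorflow as tf

tensor = tf.ones(shape=(28, 28))
desired_shape = (3, 3)

for _ in range(5):
    dim1 = random.randint(0, tensor.shape[0] - desired_shape[0])
    dim2 = random.randint(0, tensor.shape[1] - desired_shape[1])
    # Build the [9, 2] index grid of the 3x3 region starting at (dim1, dim2)
    rows, cols = tf.meshgrid(
        tf.range(dim1, dim1 + desired_shape[0]),
        tf.range(dim2, dim2 + desired_shape[1]),
        indexing="ij")
    indices = tf.stack([tf.reshape(rows, [-1]), tf.reshape(cols, [-1])], axis=1)
    updates = tf.zeros([desired_shape[0] * desired_shape[1]])  # any replacement values
    tensor = tf.tensor_scatter_nd_update(tensor, indices, updates)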

How to flatten a tensorflow dataset along feature columns when using data.make_csv_dataset?

I am using tf.contrib.data.make_csv_dataset to read CSV files having differing numbers of feature columns.
After reading each file I want to concatenate all the feature columns.
dataset = tf.contrib.data.make_csv_dataset(
    file_names[0], 48,
    select_columns=['Load_residential_multi_0', 'Load_residential_multi_1'],
    shuffle=False)
dataset = dataset.batch(2)
get_batch = dataset.make_one_shot_iterator()
get_batch = get_batch.get_next()

with tf.Session() as sess:
    power_data = sess.run(get_batch)
    print(power_data.keys())
The above code gives an ordered dictionary with the two keys shown below:
odict_keys(['Load_residential_multi_0', 'Load_residential_multi_1'])
I can access individual features using the feature names. For example power_data['Load_residential_multi_0'] will give me,
array([[0.075 , 0.1225, 0.0775, 0.12  ],
       [0.0875, 0.1125, 0.095 , 0.1025]], dtype=float32)
However, I want both feature columns, 'Load_residential_multi_0' and 'Load_residential_multi_1', to be concatenated.
I think I can do this using dataset.flat_map(map_func), but I am not sure what I should use as the argument to flat_map().
By using dataset.map you can concatenate the two dictionary values:
dataset = dataset.map(lambda x: tf.stack(list(x.values())))
get_batch = dataset.make_one_shot_iterator()
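A quick illustration of what that map does, using a toy dict of two "columns"; by default tf.stack puts the features on the leading axis, so pass axis=-1 if you want shape [batch, num_features] instead:
import tensorflow as tf

batch = {
    "Load_residential_multi_0": tf.constant([0.075, 0.1225]),
    "Load_residential_multi_1": tf.constant([0.0875, 0.1125]),
}
stacked = tf.stack(list(batch.values()))                 # shape [2, 2] -> [feature, batch]
stacked_last = tf.stack(list(batch.values()), axis=-1)   # shape [2, 2] -> [batch, feature]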

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first holds word vectors, each stored as a 1x10000 sparse CSR matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Vector    (0, 0)\t1.0\n  (0, 1)\t1.0\n  (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness' sake, the following is the code I used to generate my dataframe, in case anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of each feature"""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific
    vectorization format, and a vector size, returns the dataset in vectorized form"""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that Pandas
        # incorrectly parses EOF chars
        if type(row[0]) == type('str'):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
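For reference, a minimal sketch of the fix mentioned in the update: stack the per-row 1x10000 sparse vectors into one (n_samples, 10000) CSR matrix before calling fit. It assumes a dataframe like the one vectorize_dataset returns (called vec_data here):
from scipy import sparse
from sklearn.dummy import DummyClassifier

X = sparse.vstack(vec_data["Vector"].tolist()).tocsr()  # one big sparse matrix instead of a matrix per cell
y = vec_data["Rating"].astype(int).to_numpy()

uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(X, y)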

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from a column of my dataframe called "merged":
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print X.shape
print type(X)
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and is made up of floats.
So to clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame column, merged["other_data"]. Below is an example row from merged["other_data"]:
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work.
df1 = pd.DataFrame(X.toarray())        # convert the sparse matrix to a dense array
df2 = YOUR_DF                          # your dataframe of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = list(map(float, row))  # wrap in list() on Python 3, where map returns an iterator
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
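For reference, a rough sketch of that pipeline idea using scikit-learn's ColumnTransformer; it assumes the 40 floats have already been split out of "other_data" into separate numeric columns (the column names and the LogisticRegression classifier are placeholders):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

numeric_cols = ["feat_0", "feat_1", "feat_2"]  # ... up to the 40 numeric feature columns

preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1, 2)),
     "kws_name_desc"),                        # the text column from the question
    ("numeric", "passthrough", numeric_cols),  # extra numeric features passed through as-is
])

clf = Pipeline([
    ("features", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
# clf.fit(merged, labels)  # labels = your target column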

tensorflow wide model: how to use one-hot feature?

I have read about the model in https://www.tensorflow.org/versions/r0.9/tutorials/wide_and_deep/index.html
The features in the article have two types: Categorical and Continuous.
In my case, I have a column which describes the userid, ranging from 0 to 10000000.
I treat this column as Categorical and use a hash bucket, but I only get a poor AUC value of about 0.50010.
1) Do I need to use one-hot encoding to process this id column?
2) If it's needed, how do I achieve this? I found "tf.contrib.layers.one_hot_encoding", but it doesn't support column names, so it cannot be used in the wide-n-deep demo.
No, you don't need to encode the UserID column. Each value is unique and is not really a categorical value; it makes sense to one-hot-encode when there are fewer than 1000 categories.
To answer your question on how to use one_hot_encoding, assuming you have a list of labels (note that they must be integers):
import tensorflow as tf

with tf.Session() as sess:
    labels = [0, 1, 2, 3]
    labels_t = tf.constant(labels)
    num_classes = len(labels)
    one_hot = tf.contrib.layers.one_hot_encoding(labels_t, num_classes=num_classes)
    print(one_hot.eval())