Axis in tf.linalg.normalize - tensorflow

I am trying to normalize my tensor with rows as samples and columns as features.
https://www.tensorflow.org/api_docs/python/tf/linalg/normalize
If I use axis=1, I get an error message. If I use axis=0, the output looks fine.
But how do I decide which axis to use?

I believe this is the general rule applied for axis
axis=0: Apply operation column-wise, across all rows for each column.
axis=1: Apply operation row-wise, across all columns for each row.
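
To make the rule concrete, here is a minimal sketch that uses NumPy as a stand-in (the array X and its values are made up for illustration). tf.linalg.normalize takes the same kind of axis argument and returns both the normalized tensor and the norms; the division below mimics that by hand.

```python
import numpy as np

# Hypothetical data: 2 samples (rows) x 2 features (columns).
X = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# axis=0: one norm per column, computed across all rows.
col_norms = np.linalg.norm(X, axis=0, keepdims=True)  # shape (1, 2)
X_by_col = X / col_norms

# axis=1: one norm per row, computed across all columns.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)  # shape (2, 1)
X_by_row = X / row_norms
```

With rows as samples, axis=0 rescales each feature and axis=1 rescales each sample, so the choice depends on which of the two should end up with unit norm.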

Related

Should I join features and targets dataframes for use with scikit-learn?

I am trying to create a regression model to predict deliverables (dataframe 2) using design parameters (dataframe 1). Both dataframes have an ID number that I used as the index.
Is it possible to use two dataframes to create a dataset for sklearn? Or do I need to join them? If I need to join them then what would be the best way?
import pandas as pd

# import data: one sheet holds the features, the other the targets
df1 = pd.read_excel('data.xlsx', sheet_name='Data1', index_col='Unnamed: 0')
df2 = pd.read_excel('data.xlsx', sheet_name='Data2', index_col='Unnamed: 0')
I have only used sklearn on a single dataframe that contained all of the columns for both the feature and target vectors, so I am not sure how to handle the case where one dataframe has the features and another has the targets.
All estimators in scikit-learn have a signature like estimator.fit(X, y), X being training features and y training targets.
Then, prediction will be achieved by calling some kind of estimator.predict(X_test), with X_test being the test features.
Even train_test_split takes as parameters two arrays X and y.
This means that, as long as you maintain the right order in rows, nothing requires you to merge features and targets.
I completely agree with Guillaume's answer.
Just be aware, as he said, of the row order. That's the key to your problem. If both dataframes have the same row order, you don't need to merge them and you can fit the model directly.
But if they are not in the same order, you have to combine both dataframes (similar to a join in SQL) so that the features and targets of each ID end up in the same row. You can do it like this:
df_final = pd.concat([df1, df2], axis=1)
Since you used the ID as the index, it should work properly: pd.concat with axis=1 aligns rows on the index and, by default, keeps the union of the IDs (an outer join). Be aware that NaN values will appear if an ID is present in one dataframe but not in the other. You will have to handle them.
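
As a rough illustration of the alignment described above, the sketch below uses two small made-up dataframes (the column names and ID values are hypothetical, not taken from the question). It shows the default outer join producing NaN rows, and join="inner" as one possible way to keep only the IDs present in both.

```python
import pandas as pd

# Hypothetical features and targets, indexed by ID, in different row order.
features = pd.DataFrame(
    {"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]},
    index=[101, 102, 103],
)
targets = pd.DataFrame({"deliverable": [30.0, 10.0, 40.0]}, index=[103, 101, 104])

# Default behaviour: rows are matched by index, unmatched IDs get NaN.
combined = pd.concat([features, targets], axis=1)

# Keep only IDs that appear in both dataframes.
aligned = pd.concat([features, targets], axis=1, join="inner")

# Split back into X and y, now guaranteed to share the same row order.
X = aligned[["width", "height"]]
y = aligned["deliverable"]
```

X and y can then be passed to estimator.fit(X, y) or train_test_split as usual.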

Simple question about slicing a Numpy Tensor

I have a Numpy Tensor,
X = np.arange(64).reshape((4,4,4))
I wish to grab the 2nd, 3rd, and 4th entries along the first dimension of this tensor, which you can do with
Y = X[[1,2,3],:,:]
Is there a simpler way of writing this instead of explicitly writing out the indices [1,2,3]? I tried something like [1,:], which gave me an error.
Context: for my real application, the shape of the tensor is something like (30000,100,100). I would like to grab the last (10000, 100,100) to (30000,100,100) of this tensor.
The simplest way in your case is to use X[1:4]. This selects the same elements as X[[1,2,3]], but notice that with X[1:4] you only need one pair of brackets because 1:4 already represents a range of values.
For an N-dimensional array in NumPy, if you specify indices for fewer than N dimensions you get all elements of the remaining dimensions. That is, for N equal to 3, X[1:4] is the same as X[1:4, :, :] or X[1:4, :]. You only need to pass : explicitly when you want to index a later dimension while keeping all elements of a dimension that comes before it, as in X[:, 2:4].
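
A quick check of these equivalences on the array from the question (the variable names by_list, by_slice, and middle are only for illustration). It also shows one difference worth knowing: a slice returns a view of X, while indexing with a list returns a copy.

```python
import numpy as np

X = np.arange(64).reshape((4, 4, 4))

by_list = X[[1, 2, 3], :, :]   # explicit index list
by_slice = X[1:4]              # slice; trailing dimensions are implied

# Indexing a later dimension requires ":" for the ones before it.
middle = X[:, 2:4]

same_values = np.array_equal(by_list, by_slice)
slice_is_view = np.shares_memory(by_slice, X)
list_is_view = np.shares_memory(by_list, X)
```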
If you wish to select from some row to the end of the array, simply use Python slicing notation as below:
X[10000:,:,:]
This will select all rows from 10000 to the end of array and all columns and depths for them.
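
The same idea on a scaled-down stand-in for the (30000, 100, 100) tensor; the shape (30, 2, 2) and the names big and tail are chosen only to keep the example small.

```python
import numpy as np

# Small stand-in: 30 entries along the first axis instead of 30000.
big = np.arange(30 * 2 * 2).reshape((30, 2, 2))

tail = big[10:, :, :]          # from index 10 to the end
tail_short = big[10:]          # trailing ":" can be omitted
tail_from_end = big[-20:]      # equivalently, the last 20 entries
```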

tf.nn.embedding_lookup - row or column?

This is a very simple question. I'm learning TensorFlow and converting my NumPy code to TensorFlow.
I have a word embedding matrix U with shape [embedding_size, vocab_size], so each column is the embedding vector of a word.
I converted U into TF like below:
U = tf.Variable(tf.truncated_normal([embedding_size, vocab_size], -0.1, 0.1))
So far, so good.
Now I need to look up each word's embedding for training. I assume it would be
tf.nn.embedding_lookup(U, word_index)
My question: because each embedding is a column vector, in NumPy I would look it up with U[:,x[t]].
How does TF figure out it needs to return the row OR column by word_index?
What's the default? Row or column?
If it's a row vector, then do I need to transpose my embedding matrix?
https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup
doesn't mention this. If anyone could point me to the right resource, I'd appreciate it.
If params is a single tensor, the tf.nn.embedding_lookup(params, ids) operation treats ids as the indices of rows in params. If params is a list of tensors or a partitioned variable, then ids still correspond to rows in those tensors, but the partition_strategy (either "div" or "mod") determines how the ids map to a particular row.
As Aaron suggests, it will probably be easiest to define your embedding U as having shape [vocab_size, embedding_size], so that you can use tf.nn.embedding_lookup() and related functions.
Alternatively, you can use the axis argument to tf.gather() to select columns from U:
embedding = tf.gather(U, word_index, axis=1)
U should be vocab_size x embedding_size, the transpose of what you have now.
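
The two layouts can be compared with a NumPy analogue (a sketch only; np.take stands in for tf.gather, and the sizes and indices below are made up). Gathering columns from the [embedding_size, vocab_size] matrix and gathering rows from its transpose return the same vectors, just laid out differently.

```python
import numpy as np

embedding_size, vocab_size = 3, 5

# Hypothetical embedding matrix with one column per word.
U = np.arange(embedding_size * vocab_size, dtype=float).reshape(
    embedding_size, vocab_size
)
word_index = np.array([4, 0])

# Column lookup, analogous to tf.gather(U, word_index, axis=1).
cols = np.take(U, word_index, axis=1)       # shape (embedding_size, 2)

# Row lookup on the transposed matrix, analogous to what
# tf.nn.embedding_lookup does on a [vocab_size, embedding_size] matrix.
rows = np.take(U.T, word_index, axis=0)     # shape (2, embedding_size)
```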

what is the difference between series/dataframe and ndarray?

Leaving aside that they come from two different libraries.
I know that a Series/DataFrame can hold any data type, and that an ndarray can also hold heterogeneous data.
And also all the slicing operations of numpy are applicable to series.
Is there any other difference between them?
After some research I found the answer to my question I asked above. For anyone who needs, here it is from pandas docs:
A key difference between Series and ndarray is that operations between
Series automatically align the data based on the label. Thus, you can
write computations without giving consideration to whether the Series
involved have the same labels.
An example:
s[1:] + s[:-1]
The result of the expression above contains NaN at both the first and the last index label.
If a label is not found in one Series or the other, the result will be marked as missing NaN.
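
A small sketch of that alignment, using a made-up Series s with string labels and .iloc for the positional slices. The same slices on the underlying ndarray are added purely by position, which gives a different result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# Series: values are matched by label, so "a" and "d" have no partner.
aligned_sum = s.iloc[1:] + s.iloc[:-1]

# ndarray: values are matched by position, labels play no role.
arr = s.to_numpy()
positional_sum = arr[1:] + arr[:-1]
```

Here aligned_sum has four labels with NaN at "a" and "d", while positional_sum has three elements and no missing values.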

Can apply in pandas split-apply-combine increase dimension?

In pandas documentation, it says that the apply step can be one of aggregation, transformation or filtration. Analogous to aggregation where the dimension is reduced and transformation where dimension does not change, is there a way to apply so that the dimension increases?
For example, instead of using just z-score to standardize (the transformation example in the doc), is there a way to apply a few ways of normalization (z-score, quantile, etc) at once?
Edit:
Thanks for the comments which pointed to agg and flexible apply.
I wonder how flexible the flexible apply can be? See comments below in excerpt from flexible apply documentation:
In [119]: grouped = df.groupby('A')['C']
In [120]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:
In [121]: grouped.apply(f)  # this works, but df.groupby("A").apply(f) does not
Without groupby() but closely related and mentioned in flexible apply documentation,
apply on a Series can operate on a returned value from the applied
function, that is itself a series, and possibly upcast the result to a
DataFrame
If the return value of the applied function is itself a dataframe, can an "upcast" to dataframe with multiindex happen automatically?
Similarly, how flexible can agg() be? In the examples in the documentation, the functions passed to agg() all seem to take a Series and return a single value.
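
For the specific goal of applying several normalizations at once, one possible approach is sketched below. It swaps flexible apply for groupby transform and assembles the result columns by hand; it is not taken from the documentation excerpt above, and the dataframe and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: group key "A", values "C".
df = pd.DataFrame({"A": ["x", "x", "y", "y"], "C": [1.0, 3.0, 10.0, 20.0]})

grouped = df.groupby("A")["C"]
group_mean = grouped.transform("mean")   # same length as df, one value per row
group_std = grouped.transform("std")     # sample standard deviation (ddof=1)

# One input column becomes several output columns, row-aligned with df.
result = pd.DataFrame(
    {
        "original": df["C"],
        "demeaned": df["C"] - group_mean,
        "zscore": (df["C"] - group_mean) / group_std,
    }
)
```

Because transform returns values aligned to the original index, result keeps the index of df and can be joined back to it directly.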