Concatenate tensors of different rank into a single tensor - tensorflow

I'm looking to feed an autoencoder my features as both training and target data. The majority of the features have rank 1; each is a single column of values like [1,2,3,4]. Some have been put through one-hot encoding, so those tensors have rank 2 and X columns, with X being the number of categorical values in the one-hot encoder, so something like:
['a', 'a', 'b', 'c'] -> [[1,0,0], [1,0,0], [0,1,0], [0,0,1]]
For some reason Keras's Model.fit doesn't accept y values if the training data are generators or datasets. So I have to provide my training data as a tuple of (features, targets), and in this case targets=features. At the same time, features is a dictionary of tensors, so I must concatenate all of the tensors in features into a single tensor.
I can do tf.concat across all of my feature columns except the one-hot encoded columns, which have rank 2 (instead of 1). How can I turn the one-hot encoded features into X individual tensors and then concat them together?
Same issue here where the OP solved his issue using tf.concat, but I can't do that here.

I was trying to comment your question to ask for a couple of details, but I can't do that because I don't have reputation score (I am new to stack overflow and this is my first post ever). So I will try to explain based on what I understood of your problem - hope it helps.
You can use Pandas to one-hot encode your categorical features.
Example Dataset
import pandas as pd
import random
import numpy as np
dataset = {'feature_A': np.random.randint(10, size=10),
'feature_B': np.random.randint(10, size=10),
'feature_C': np.random.randint(10, size=10),
'categorical_feature': np.array([chr(97 + random.randint(0, 2)) for i in range(10)])}
print(dataset)
{'feature_A': array([6, 8, 2, 0, 4, 8, 6, 4, 3, 8]),
'feature_B': array([0, 6, 8, 6, 7, 3, 4, 6, 1, 6]),
'feature_C': array([0, 7, 7, 3, 7, 4, 0, 3, 2, 7]),
'categorical_feature': array(['a', 'a', 'c', 'b', 'c', 'c', 'a', 'a', 'c', 'c'], dtype='<U1')}
Pandas DataFrame
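Transform the dataset into a Pandas DataFrame. (The data is random and the printed outputs in this answer come from separate runs, so the values and labels do not match from one step to the next.)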
df = pd.DataFrame(dataset)
print(df)
Feature_A Feature_B Feature_C Categorical_Feature
0 1 9 9 e
1 9 9 8 c
2 5 8 3 c
3 3 5 7 c
4 9 8 10 d
5 6 9 6 c
6 1 4 5 d
7 9 5 2 e
8 3 9 2 c
9 3 8 10 a
One-hot & concatenate
One-hot encode the categorical features and concatenate to the main DataFrame (and drop original categorical feature column)
df = pd.concat((df, pd.get_dummies(df['categorical_feature'])), axis=1).drop('categorical_feature', axis=1)
print(df)
feature_A feature_B feature_C a b c
0 0 0 1 0 0 1
1 5 7 0 0 0 1
2 9 2 6 0 1 0
3 1 1 4 0 0 1
4 5 8 8 0 0 1
5 9 1 8 1 0 0
6 5 6 8 1 0 0
7 9 5 0 1 0 0
8 9 4 5 1 0 0
9 8 3 5 1 0 0
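As a possible shortcut for the concat-and-drop step above, pd.get_dummies can also take the whole DataFrame together with a columns argument. The sketch below uses a small fixed dataset (hypothetical values, chosen so the result is reproducible); note that the new columns get the original column name as a prefix.

```python
import pandas as pd

# Small fixed dataset (hypothetical values) so the result is reproducible.
df = pd.DataFrame({
    'feature_A': [6, 8, 2, 0],
    'categorical_feature': ['a', 'a', 'c', 'b'],
})

# columns=[...] encodes only the listed columns and drops the originals;
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=['categorical_feature'], dtype=int)
print(encoded)
```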
NumPy array
Then you can simply get the values of the DataFrame as a NumPy array through the .values attribute. Each row is now one training example comprising all the numeric features plus the categorical feature as a one-hot encoded vector.
You can feed the NumPy array directly to your model or, if you wish, wrap it in a TensorFlow dataset with tf.data.Dataset.from_tensor_slices().
dataset = df.values
print(dataset)
array([[0, 0, 1, 0, 0, 1],
[5, 7, 0, 0, 0, 1],
[9, 2, 6, 0, 1, 0],
[1, 1, 4, 0, 0, 1],
[5, 8, 8, 0, 0, 1],
[9, 1, 8, 1, 0, 0],
[5, 6, 8, 1, 0, 0],
[9, 5, 0, 1, 0, 0],
[9, 4, 5, 1, 0, 0],
[8, 3, 5, 1, 0, 0]], dtype=int32)
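
If the features are already held in a dictionary of arrays, as in the question, the same result can probably be reached without Pandas: give each rank-1 array a trailing axis, then join everything along the feature axis. Below is a minimal NumPy sketch with a hypothetical features dictionary. In TensorFlow the analogous calls would be tf.expand_dims(t, axis=-1) and tf.concat(tensors, axis=1), but that variant is not shown or tested here.

```python
import numpy as np

# Hypothetical feature dictionary: two rank-1 features and one one-hot (rank-2) feature.
features = {
    'feature_A': np.array([1, 2, 3, 4]),
    'feature_B': np.array([5, 6, 7, 8]),
    'category': np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]),
}

# Give every rank-1 array a trailing axis so all arrays have shape (n, k).
columns = [v.reshape(-1, 1) if v.ndim == 1 else v for v in features.values()]

# Join along the feature axis: each row is one training example.
combined = np.concatenate(columns, axis=1)
print(combined.shape)  # (4, 5)
```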

Related

Calculating time increments individually in pandas

I have a data frame with user IDs all in one column and "time series" for each user, which looks like this:
df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'time': [0, 1, 3, 4, 8, 10, 20, 30, 80], 'score': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
I want to calculate time differences for each user_id:
df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'time': [0, 1, 2, 4, 4, 2, 20, 10, 50], 'score': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
I think np.diff would work if I could limit it to each user_id. This is my first question on StackOverflow, hope my question is clear enough.
Try using groupby with diff, then use fillna to replace each group's leading NaN with that row's own time value:
df['diff'] = df.groupby('user_id')['time'].diff().fillna(df['time'])
Output:
user_id time score diff
0 1 0 1 0.0
1 1 1 2 1.0
2 1 3 3 2.0
3 2 4 4 4.0
4 2 8 5 4.0
5 2 10 6 2.0
6 3 20 7 20.0
7 3 30 8 10.0
8 3 80 9 50.0
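Since the question mentions np.diff, here is a sketch of how it could be limited to each user_id through groupby and transform. It assumes the first row of each user should keep its own time value, which is what prepend=0 does; the result stays integer-typed because no NaN is introduced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'time': [0, 1, 3, 4, 8, 10, 20, 30, 80],
                   'score': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# np.diff with prepend=0 leaves the first time value of each user unchanged.
df['diff'] = df.groupby('user_id')['time'].transform(
    lambda s: np.diff(s.to_numpy(), prepend=0)
)
print(df)
```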

How to assign a dataframe mean to specific rows of a dataframe?

I have a data frame like this
df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12],
'b': [3, 5, 7, 9, 15]})
Out[112]:
a b
0 2 3
1 4 5
2 5 7
3 6 9
4 12 15
and its column means:
df_a.mean()
Out[118]:
a 5.800
b 7.800
dtype: float64
I want this;
df_a[df_a.index.isin([3, 4])] = df.mean()
But I'm getting an error. How do I achieve this?
I gave a small example here. In the data I am actually working with, there are many observations that I need to change, and I keep their index values in a list.
If you want to overwrite the values of rows in a list, you can do it with iloc
df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12], 'b': [3, 5, 7, 9, 15]})
idx_list = [3, 4]
df_a.iloc[idx_list,:] = df_a.mean()
Output
a b
0 2.0 3.0
1 4.0 5.0
2 5.0 7.0
3 5.8 7.8
4 5.8 7.8
edit
If you're using an older version of pandas and see NaNs instead of wanted values, you can use a for loop
df_a_mean = df_a.mean()
for i in idx_list:
    df_a.iloc[i,:] = df_a_mean
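A column-by-column variant is sketched below. It casts the frame to float first, on the assumption that some pandas versions warn about or reject writing float means into integer columns, and it uses loc, so idx_list is treated as index labels (these coincide with positions for the default index used here).

```python
import pandas as pd

df_a = pd.DataFrame({'a': [2, 4, 5, 6, 12], 'b': [3, 5, 7, 9, 15]})
idx_list = [3, 4]

# Cast to float first so the float means fit without an implicit upcast.
df_a = df_a.astype(float)
col_means = df_a.mean()

# Assign one column at a time; a scalar broadcasts over the selected rows.
for col in df_a.columns:
    df_a.loc[idx_list, col] = col_means[col]

print(df_a)
```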

Creating Multi-Column Index for a Dataframe

Is it possible to change a single level column dataframe to a multi-column dataframe? If we have a dataframe like this,
import pandas as pd
df = pd.DataFrame({
'a': [0, 1, 2, 3],
'b': [4, 5, 6, 7],
'c': [3, 5, 6, 2],
'd': [1, 5, 7, 0],
})
can we change its column names as below? Briefly, what I am trying to do is to have two levels of column index without changing the values of the dataframe.
   A     B
   a  b  c  d
0  0  4  3  1
1  1  5  5  5
2  2  6  6  7
3  3  7  2  0
Any help?
IIUC, use pd.MultiIndex.from_tuples to create multiindex header and assign to the dataframe.columns:
df = pd.DataFrame({
'a': [0, 1, 2, 3],
'b': [4, 5, 6, 7],
'a2': [3, 5, 6, 2],
'b2': [1, 5, 7, 0],
})
df.columns=pd.MultiIndex.from_tuples([('A','a'),('A','b'),('B','c'),('B','d')])
df
Output:
   A     B
   a  b  c  d
0  0  4  3  1
1  1  5  5  5
2  2  6  6  7
3  3  7  2  0
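When there are many columns, typing the tuples by hand gets tedious. One possible alternative is to derive them from a mapping of column name to top-level group. The group_of dictionary below is a hypothetical helper, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [4, 5, 6, 7],
                   'c': [3, 5, 6, 2], 'd': [1, 5, 7, 0]})

# Hypothetical mapping from each existing column to its top-level group.
group_of = {'a': 'A', 'b': 'A', 'c': 'B', 'd': 'B'}

# Build (group, column) pairs in the current column order.
df.columns = pd.MultiIndex.from_tuples([(group_of[c], c) for c in df.columns])

print(df['B'])  # selects the sub-frame with columns c and d
```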

Multiplying Dataframe by Column Value

I'm currently trying to convert a dataframe of local currency values to Canadian dollars by multiplying each row by its FX rate.
However, I keep getting this error:
ValueError: operands could not be broadcast together with shapes (12252,) (1021,)
This is the code I'm working with right now. It works when I have a handful of rows of data, but raises the ValueError once I use it on the full file (1022 rows of data incl. headers).
import pandas as pd
Local_File = ('RawData.xlsx')
df = pd.read_excel(Local_File, sheet_name = 'Local')
df2 = df.iloc[:,[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].multiply(df['FX Spot Rate'],axis='index')
print (df2)
My dataframe looks something like this with 1022 rows of data (incl. header)
Appreciate any help! Thank you!
df = pd.DataFrame({'A': [1, 2, 3, 3, 1],
'B': [1, 2, 3, 3, 1],
'C': [9, 7, 4, 3, 9]})
A B C
0 1 1 9
1 2 2 7
2 3 3 4
3 3 3 3
4 1 1 9
df.iloc[:,1:] = df.iloc[:,1:].multiply(df['A'][:], axis="index")
df
A B C
0 1 1 9
1 2 4 14
2 3 9 12
3 3 9 9
4 1 1 9
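The same row-wise multiplication can be sketched with named columns instead of positions. Only the 'FX Spot Rate' name comes from the question; the 'Jan' and 'Feb' columns and all values are hypothetical. Whether this avoids the asker's ValueError depends on the actual file, which is not shown.

```python
import pandas as pd

# Hypothetical data: two local-currency columns plus the rate column.
df = pd.DataFrame({
    'Jan': [100.0, 200.0, 300.0],
    'Feb': [110.0, 210.0, 310.0],
    'FX Spot Rate': [1.0, 0.5, 2.0],
})

value_cols = ['Jan', 'Feb']

# axis=0 aligns the rate Series with the row index, so row i is scaled by rate i.
converted = df[value_cols].mul(df['FX Spot Rate'], axis=0)
print(converted)
```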

Add rows as columns in pandas

I'm trying to change my dataset by making all the rows into columns in pandas.
5 6 7
8 9 10
Needs to be changed as
5 6 7 8 9 10
with different headers of course, any suggestions??
Use pd.DataFrame([df.values.flatten()]) as follows:
In [18]: df
Out[18]:
0 1 2
0 5 6 7
1 8 9 10
In [19]: pd.DataFrame([df.values.flatten()])
Out[19]:
0 1 2 3 4 5
0 5 6 7 8 9 10
Explanation:
df.values returns numpy.ndarray:
In [18]: df.values
Out[18]:
array([[ 5, 6, 7],
[ 8, 9, 10]], dtype=int64)
In [19]: type(df.values)
Out[19]: numpy.ndarray
and numpy arrays have .flatten() method:
In [20]: df.values.flatten?
Docstring:
a.flatten(order='C')
Return a copy of the array collapsed into one dimension.
In [21]: df.values.flatten()
Out[21]: array([ 5, 6, 7, 8, 9, 10], dtype=int64)
Pandas.DataFrame constructor expects lists/arrays of rows:
If we try this:
In [22]: pd.DataFrame([ 5, 6, 7, 8, 9, 10])
Out[22]:
0
0 5
1 6
2 7
3 8
4 9
5 10
Pandas thinks that it's a list of rows, where each row has one element.
So I've enclosed that array in square brackets:
In [23]: pd.DataFrame([[ 5, 6, 7, 8, 9, 10]])
Out[23]:
0 1 2 3 4 5
0 5 6 7 8 9 10
which will be understood as one row with 6 columns.
or just in one line:
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.values.flatten()
#out: array([1, 2, 3, 4, 5, 6])
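You can also use reduce(). As written, this works only for a DataFrame with exactly two rows, because after the first step the accumulator is a plain list rather than an (index, row) pair:
import pandas as pd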
from functools import reduce
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
df = pd.DataFrame([reduce(lambda x,y: list(x[1]) + list(y[1]), df.iterrows())])
df
0 1 2 3 4 5
0 5 6 7 8 9 10
Use the reshape function from numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
nparray = np.array(df.iloc[:,:])
x = np.reshape(nparray, -1)
df = pd.DataFrame([x]) # wrap in a list so the values form a single row
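
The question also asks for different headers on the flattened row. A minimal sketch, assuming generated names are acceptable (the col_0, col_1, ... labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 7], [8, 9, 10]])

# ravel() walks the values row by row and returns a 1-D array.
flat = df.to_numpy().ravel()

# Hypothetical header names, one per flattened value.
headers = [f'col_{i}' for i in range(flat.size)]

# Wrapping the array in a list makes it a single row.
wide = pd.DataFrame([flat], columns=headers)
print(wide)
```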