How to add the following feature to a tfidf matrix?

Hello I have a list called list_cluster, that looks as follows:
list_cluster=["hello,this","this is a test","the car is red",...]
I am using TfidfVectorizer to produce a model as follows:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# load the previously fitted vectorizer
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
tfidf2 = tdf.transform(list_cluster)
Then I would like to add new features to this matrix, tfidf2. I have a list as follows:
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
This list has the same length as list_cluster. Each date string has 12 positions, and the position of the 1 marks the corresponding month of the year;
for instance, '010000000000' represents February.
To use it as a feature, I first tried:
import numpy as np
dates=np.array(dates)
dates=np.transpose(dates)
to get a numpy array and then to transpose it in order to concatenate it with the first matrix tfidf2
print("shape tfidf2: "+str(tfidf2.shape),"shape dates: "+str(dates.shape))
in order to concatenate my vector and matrix I tried:
tfidf2=np.hstack((tfidf2,dates[:,None]))
However this is the output:
shape tfidf2: (11159, 1927) shape dates: (11159,)
Traceback (most recent call last):
File "Main.py", line 230, in <module>
tfidf2=np.hstack((tfidf2,dates[:,None]))
File "/usr/local/lib/python3.5/dist-packages/numpy/core/shape_base.py", line 278, in hstack
return _nx.concatenate(arrs, 0)
ValueError: all the input arrays must have same number of dimensions
The shapes look right, but I am not sure what is failing. I would appreciate help concatenating this feature to my tfidf2 matrix. Thanks in advance for your attention.

You need to convert all strings to numerics for sklearn. One way to do this is use the LabelBinarizer class in the preprocessing module of sklearn. This creates a new binary column for each unique value in your original column.
If dates is the same number of rows as tfidf2 then I think this will work.
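from scipy import sparse
from sklearn.preprocessing import LabelBinarizer
# create tfidf2 (a scipy sparse matrix)
tfidf2 = tdf.transform(list_cluster)
# create dates
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
# binarize dates: one 0/1 column per distinct date string
lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)
# tfidf2 is sparse, so stack with scipy.sparse.hstack rather than np.concatenate
new_tfidf = sparse.hstack((tfidf2, sparse.csr_matrix(b_dates))).tocsr()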
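The output of TfidfVectorizer is a SciPy sparse matrix, so one way to append extra columns is scipy.sparse.hstack. The following is only a minimal sketch under stated assumptions: a toy vectorizer fitted in place stands in for the pickled one, and each 12-character string is decoded into twelve numeric columns (one per month), which is a choice made for this sketch rather than the LabelBinarizer approach above.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data; the real lists from the question are much longer.
list_cluster = ["hello,this", "this is a test", "the car is red"]
dates = ["010000000000", "001000000000", "000000000001"]

# Assumption: a freshly fitted vectorizer replaces the pickled one.
tdf = TfidfVectorizer()
tfidf2 = tdf.fit_transform(list_cluster)  # sparse, one row per document

# Decode each 12-character string into a row of twelve 0/1 values.
month_features = np.array(
    [[int(ch) for ch in d] for d in dates], dtype=np.float64
)

# Stack column-wise while staying sparse.
combined = sparse.hstack(
    [tfidf2, sparse.csr_matrix(month_features)], format="csr"
)

n_text = tfidf2.shape[1]
print(tfidf2.shape, month_features.shape, combined.shape)
```

The combined matrix keeps the TF-IDF columns first and the twelve month columns last, so it can be passed to an estimator that accepts sparse input.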

Related

'pandas' has no attribute 'to_float'

I'm testing my data with an SVM classifier. My dataset is text, and I'm trying to convert it to float.
I have data that may look like this:
[image: dataset]
# Transform as float
df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
TypeError Traceback (most recent call last)
<ipython-input-66-74921537411d> in <module>
1 # Transform as float
----> 2 df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
3
TypeError: 'DataFrame' object is not callable
In general, arbitrary text cannot be converted to float. In your dataset, it seems that all the columns hold text values, and I'm not sure whether the values become numbers after rstrip('%') (the values are too long, so they are truncated in the image).
If the values of a column are numbers once rstrip('%') is applied, you can convert them. In addition, you are using (), not [], on the DataFrame. Because you wrote df(...), it looks like a function call, which is why you get the TypeError. If the column's values are numeric, you can do what you want as follows:
df['columns'] = df['columns'].str.rstrip('%').astype('float') / 100.0
Here is a full code sample:
import pandas as pd
df = pd.DataFrame({
    'column_name': ['111%', '222%'],
})
# df looks like:
#   column_name
# 0        111%
# 1        222%
df['column_name'] = df['column_name'].str.rstrip('%').astype('float') / 100.0
print(df)
#    column_name
# 0         1.11
# 1         2.22
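If some entries might not be numeric after stripping the percent sign, astype('float') raises an error. A hedged alternative is pd.to_numeric with errors="coerce", which turns unparseable values into NaN. The column names and sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two percentages and one value that is not a number.
df = pd.DataFrame({"column_name": ["111%", "22.5%", "not a number"]})

# Strip the trailing percent sign, then convert what can be converted.
stripped = df["column_name"].str.rstrip("%")
# errors="coerce" yields NaN for entries that cannot be parsed.
df["as_fraction"] = pd.to_numeric(stripped, errors="coerce") / 100.0

print(df)
```

Rows that end up as NaN can then be inspected or dropped before the data is passed to the classifier.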

How to convert a matrix as string to ndarray?

I have a csv file with this structure:
id;matrix
1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]
I'm trying to read the matrix field as a numpy.ndarray, using pandas.read_csv to read the file and df.to_numpy() to convert the matrix, but the resulting array has shape (1,0). I was expecting a shape of (3,2), as in:
matrix = [[1.2 1.3]
[1.2 1.3]
[1.2 1.3]]
I also tried numpy.asmatrix, but the result is the same as with df.to_numpy().
A solution with pandas
Provided the format of the matrix column is consistent with the example shown, replace the spaces with commas, then use literal_eval to turn the string into a list of lists, and then apply np.array.
import pandas as pd
from ast import literal_eval
import numpy as np
# read the data
df = pd.read_csv('file.csv', sep=';')
# replace the spaces
df['matrix'] = df['matrix'].str.replace(' ', ',')
# apply literal_eval
df['matrix'] = df['matrix'].apply(literal_eval)
# apply numpy array
df['matrix'] = df['matrix'].apply(np.array)
print(type(df.iloc[0, 1]))
# <class 'numpy.ndarray'>
Each row of the matrix column will be an ndarray
The two apply calls can be combined into:
df['matrix'] = df['matrix'].apply(lambda x: np.array(literal_eval(x)))
Or this hot mess:
df['matrix'] = df['matrix'].str.replace(' ', ',').apply(lambda x: np.array(literal_eval(x)))
I personally prefer one transformation per line for code clarity.
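For a self-contained check of the approach, here is a small sketch that reads the sample row from an in-memory string instead of file.csv. The helper parse_matrix is a hypothetical name, and it joins on any run of whitespace rather than replacing single spaces, which is a variation assumed here to tolerate repeated spaces. It would still fail if a space directly follows an opening bracket.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# In-memory stand-in for the CSV file from the question.
csv_text = "id;matrix\n1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")


def parse_matrix(text):
    """Turn '[[1.2 1.3] [1.2 1.3]]' into a 2-D float ndarray."""
    # Split on whitespace and rejoin with commas so the string
    # becomes a valid Python literal for a list of lists.
    as_literal = ",".join(text.split())
    return np.array(literal_eval(as_literal))


df["matrix"] = df["matrix"].apply(parse_matrix)
first = df.loc[0, "matrix"]
print(type(first), first.shape)
```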

Sklearn PCA: Correct Dimensionality of PCs

I have a dataframe, df, which contains a column called 'event' wherein there is a 24x24x40 numpy array. I want to:
extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a column in a new numpy array or dataframe;
perform PCA on the resulting matrix.
However, the PCA produces eigenvectors with the dimensions of 'the number of entries', not the 'number of dimensions in the data'.
To illustrate my problem, I demonstrate a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import numpy as np
import pandas as pd
hdf=pd.HDFStore('./afile.h5')
df=hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
_list = []
for index, row in df.iterrows():
    entry = df['event'][index].flatten()
    _list.append(entry)
X = np.asarray(_list)
pca = decomposition.PCA()
X_pca=pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result:(201, 201)
This has dimensions of the number of data, 201 entries!
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resultant numpy array in X in Example 2 can be reshaped and plotted as expected.
Any thoughts would be appreciated!
Kind regards!
Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
Now, heading to your example:
In your first example, the number of data samples (1797) is greater than the number of features (64), so PCA keeps all 64 dimensions (since you are not specifying the number of components). In your second example, however, the number of data samples (201) is far less than the number of features (23040), so sklearn's PCA reduces the number of dimensions to n_samples.
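The min(n_samples, n_features) rule can be checked on random data. The sketch below uses made-up shapes (100 x 8 and 20 x 50) rather than the data from the question. It also shows that the eigenvectors in components_ still have n_features entries each; only the number of components, and therefore the width of the transformed data, is capped by the sample count.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
tall = rng.normal(size=(100, 8))   # more samples than features
wide = rng.normal(size=(20, 50))   # fewer samples than features

tall_pca = PCA().fit(tall)
wide_pca = PCA().fit(wide)

# Each row of components_ is one eigenvector of length n_features.
print(tall_pca.components_.shape)   # (8, 8)
print(wide_pca.components_.shape)   # (20, 50)

# The projected data has one column per retained component.
wide_projected = wide_pca.transform(wide)
print(wide_projected.shape)         # (20, 20)
```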

Numpy: stack arrays whose internal dimensions differ

I have a situation similar to the following:
import numpy as np
a = np.random.rand(55, 1, 3)
b = np.random.rand(55, 626, 3)
Here the shapes represent the number of observations, then the number of time slices per observation, then the number of dimensions of the observation at the given time slice. So a is a full representation of 3 dimensions for each of the 55 observations at one new time interval.
I'd like to stack a and b into an array with shape 55, 627, 3. How can one accomplish this in numpy? Any suggestions would be greatly appreciated!
To follow up on Divakar's answer above, the axis argument in numpy is the index of a given dimension within an array's shape. Here I want to stack a and b by virtue of their middle shape value, which is at index = 1:
import numpy as np
a = np.random.rand(5, 1, 3)
b = np.random.rand(5, 100, 3)
# create the stacked result; shape here is (5, 101, 3), and (55, 627, 3) for the shapes in the question
stacked = np.concatenate((b, a), axis=1)
# validate that a was appended to the end of b
print(stacked[:, -1, :], '\n\n\n', a.squeeze())
This returns:
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
A purist might instead use np.all(stacked[:, -1, :] == a.squeeze()) to validate this equivalence. All glory to @Divakar!
Strictly for the curious, the use case for this concatenation is a kind of wonky data preparation pipeline for a Long Short Term Memory Neural Network. In that kind of network, the training data shape should be number_of_observations, number_of_time_intervals, number_of_dimensions_per_observation. I am generating new predictions of each object at a new time interval, so those predictions have shape number_of_observations, 1, number_of_dimensions_per_observation. To visualize the sequence of observations' positions over time, I want to add the new positions to the array of previous positions, hence the question above.
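As a sketch of that pipeline step, suppose the model returns the new predictions without a time axis, i.e. with shape (number_of_observations, number_of_dimensions). That is an assumption made here for illustration; the names history and new_prediction are hypothetical. A length-1 time axis then has to be added before concatenating along axis 1.

```python
import numpy as np

n_obs, n_steps, n_dims = 55, 626, 3

# Previous positions: observations x time slices x dimensions.
history = np.random.rand(n_obs, n_steps, n_dims)
# Hypothetical model output for one new time slice, without a time axis.
new_prediction = np.random.rand(n_obs, n_dims)

# Insert a length-1 time axis so both arrays are 3-D: (55, 1, 3).
step = new_prediction[:, np.newaxis, :]

# Append the new slice after the existing ones along the time axis.
extended = np.concatenate((history, step), axis=1)
print(extended.shape)
```

np.array_equal(extended[:, -1, :], new_prediction) can be used to confirm that the last time slice is the new prediction.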

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this, I just cannot seem to get it right and I don't understand what is the problem... So I tried to do the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size: [4785 rows x 1340 columns] ?
Anyway, the code results in a lot of error messages, something like e.g.
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I read some posts, and I know that pandas, when multiplying, can automatically do an elementwise multiplication only on data pairs where the date is the same (so if one DF has a time series for e.g. 1999-2017 and the other only for e.g. 2012-2015, only the common dates in 2012-2015 are multiplied, i.e. the intersection of the two data sets). So this question is about understanding the error message(s) and finding the solution. The whole problem comes down to calculating the btc_dkk variable (the price of Bitcoin in DKK) and plotting it.
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not on the index. This happens because you are multiplying a DataFrame by a Series: pandas aligns the Series index with the DataFrame's columns. If you had selected the column in usd_dkk_price, this would not have happened. Afterwards, just drop the rows with NaN.
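The difference can be reproduced on a toy example. The dates, prices and the column name "Value" below are invented for illustration and are not the Quandl data.

```python
import pandas as pd

# Two short daily series whose date ranges only partly overlap.
idx_btc = pd.date_range("2020-01-01", periods=4, freq="D")
idx_fx = pd.date_range("2020-01-03", periods=4, freq="D")

btc_usd = pd.Series([100.0, 110.0, 120.0, 130.0],
                    index=idx_btc, name="Weighted Price")
usd_dkk = pd.DataFrame({"Value": [6.0, 6.5, 7.0, 7.5]}, index=idx_fx)

# Series * DataFrame matches the Series index (dates) against the
# DataFrame columns, so the result gains one column per date.
wrong = btc_usd * usd_dkk

# axis="index" matches the Series index against the row index instead;
# dropna() keeps only the dates present in both inputs.
right = usd_dkk.multiply(btc_usd, axis="index").dropna()

print(wrong.shape)
print(right)
```

Only the two overlapping dates survive in right, which is the intersection behaviour the question expected.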