I have a pandas data frame like this:
a                         other-columns
0.3 0.2 0.0 0.0 0.0...    ....
I want to convert column a into a SciPy sparse CSR matrix. a is a probability distribution. I would like to do the conversion without expanding a into multiple columns.
This is the naive solution, which expands a into multiple columns:
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df_matrix = scipy.sparse.csr_matrix(df.values)
But I don't want to expand into multiple columns, as it blows up memory usage. Is it possible to do this while keeping a in a single column?
EDIT (Minimum Reproducible Example):
import pandas as pd
from scipy.sparse import csr_matrix
d = {'a': ['0.05 0.0', '0.2 0.0']}
df = pd.DataFrame(data=d)
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df = df.astype(float)
df_matrix = csr_matrix(df.values)
df_matrix
Output:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
I want to achieve the above, but without splitting into multiple columns. Also, in my real file, column a is a space-separated string of 36 values and there are millions of rows; every row is guaranteed to contain the same number of values.
Convert large csv to sparse matrix for use in sklearn
I cannot overstate how much you should not do the thing that follows this sentence.
import pandas as pd
import numpy as np
from scipy import sparse
df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
chunksize = 10000
sparse_coo = []
for i in range(int(np.ceil(df.shape[0] / chunksize))):
    # take the next chunk of rows
    chunk = df.iloc[i * chunksize:min(i * chunksize + chunksize, df.shape[0]), :]
    # split each string into floats and build a sparse block for this chunk
    sparse_coo.append(sparse.coo_matrix(
        chunk['a'].apply(lambda x: [float(y) for y in x.split()]).tolist()))
sparse_coo = sparse.vstack(sparse_coo)
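If you need CSR downstream (e.g. for scikit-learn), a quick follow-up check could look like this; it's a sketch that assumes sparse_coo was just built by the loop above:
# assumes sparse_coo from the chunked loop above
sparse_csr = sparse_coo.tocsr()   # convert to CSR for row slicing / sklearn
print(sparse_csr.shape)           # (200000, 2) for the example DataFrame
print(sparse_csr.nnz)             # only the nonzero values are actually stored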
You could get the dense array from the column without the expand:
In [179]: df = pd.DataFrame(data=d)
e.g.
In [180]: np.array(df['a'].str.split().tolist(),float)
Out[180]:
array([[0.05, 0. ],
[0.2 , 0. ]])
But I doubt that saves much memory (though I only have a crude understanding of DataFrame memory use).
You could convert each string to a sparse matrix:
In [190]: def foo(astr):
     ...:     alist = astr.split()
     ...:     arr = np.array(alist, float)
     ...:     return sparse.coo_matrix(arr)
In [191]: alist = [foo(row) for row in df['a']]
In [192]: alist
Out[192]:
[<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>]
In [193]: sparse.vstack(alist)
Out[193]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
I tried to make the coo directly from the alist, but that didn't trim out the zeros. There's just as much conversion, but if sufficiently sparse (5% or less) it could save quite a bit on memory (if not time).
sparse.vstack combines the data, row, and col arrays from the component matrices to define a new coo matrix. It's the most straightforward way of combining sparse matrices, if not the fastest.
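For intuition, here is a rough sketch of the bookkeeping sparse.vstack does for COO inputs (not the library's actual implementation): concatenate the data/row/col arrays, shifting each block's row indices by the number of rows already stacked.
import numpy as np
from scipy import sparse

def vstack_coo_sketch(blocks):
    # stack sparse blocks vertically by concatenating their COO triplets
    data, rows, cols = [], [], []
    offset = 0
    ncols = blocks[0].shape[1]
    for b in blocks:
        b = b.tocoo()
        data.append(b.data)
        rows.append(b.row + offset)   # shift row indices by rows already stacked
        cols.append(b.col)
        offset += b.shape[0]
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(offset, ncols))

# for the alist above, vstack_coo_sketch(alist).toarray() should match
# sparse.vstack(alist).toarray()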
Looks like I could use apply as well
In [205]: df['a'].apply(foo)
Out[205]:
0 (0, 0)\t0.05
1 (0, 0)\t0.2
Name: a, dtype: object
In [206]: df['a'].apply(foo).values
Out[206]:
array([<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>], dtype=object)
In [207]: sparse.vstack(df['a'].apply(foo))
Out[207]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
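To round that off for the original question (a sketch along the same lines, not taken verbatim from the session above), the final CSR matrix could be built like this:
import pandas as pd
import numpy as np
from scipy import sparse

df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0']})

def foo(astr):
    # one space-separated string -> 1xN sparse row (zeros are dropped)
    return sparse.coo_matrix(np.array(astr.split(), float))

df_matrix = sparse.vstack(df['a'].apply(foo)).tocsr()
print(repr(df_matrix))   # 2x2 CSR with 2 stored elements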
Related
I want to convert multidimensional climate data into a pandas data frame. The shape of my numpy array is temperature.shape -> (365, 100, 200) -> ["time", "longitude", "latitude"]. I would then like to have the following columns in my pandas dataframe: columns=["time", "lon", "lat", "temp"].
I tried this code:
df = pd.DataFrame(temperature, columns=['time', 'lat', 'lon', 'temp'])
I got this error:
ValueError: Must pass 2-d input
How can I solve it? I could not find any hint in suggested topics. Thanks.
Pandas expects a 2D array whose rows and columns correspond to those of the final data frame.
It looks like you're trying to unravel the (365,100,200) array into 365*100*200 = 7,300,000 individual records. This can be done by flattening the array, provided you have the values of each independent variable along each axis.
For example, unravelling a (3,4,5)-shaped 3D array whose X, Y and Z dimensions are given by the lists/arrays x_index, y_index, z_index (standing in for time, longitude and latitude), with M replacing temperature:
import numpy as np
import pandas as pd

nx = 3
ny = 4
nz = 5

M = np.ndarray((nx, ny, nz))
for i in range(nx):
    for j in range(ny):
        for k in range(nz):
            M[i, j, k] = (i + j) * k
# constructed nx by ny by nz matrix from the function f(x,y,z) = (x+y)*z

x_index = list(range(nx))
y_index = list(range(ny))
z_index = list(range(nz))
# Get arrays/lists giving the values of x/y/z

X, Y, Z = np.meshgrid(x_index, y_index, z_index, indexing='ij')
# Make (3,4,5) arrays of each independent variable; indexing='ij' keeps the
# grids aligned with M's axes (the default 'xy' would swap the first two axes)

pd.DataFrame({"M=(X+Y)*Z": M.flatten(), "X": X.flatten(), "Y": Y.flatten(), "Z": Z.flatten()})
# Flatten the data and independent variables to make 3*4*5=60 individual records
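Applied back to the climate question, the same pattern would look roughly like this; the coordinate arrays below are synthetic stand-ins with the question's shapes (their actual values are assumptions):
import numpy as np
import pandas as pd

time = np.arange(365)                    # hypothetical time axis
lon = np.linspace(-180, 180, 100)        # hypothetical longitudes
lat = np.linspace(-90, 90, 200)          # hypothetical latitudes
temperature = np.random.rand(365, 100, 200)

# indexing='ij' keeps the coordinate grids aligned with temperature's axes
T, LON, LAT = np.meshgrid(time, lon, lat, indexing='ij')
df = pd.DataFrame({"time": T.flatten(),
                   "lon": LON.flatten(),
                   "lat": LAT.flatten(),
                   "temp": temperature.flatten()})
print(df.shape)   # (7300000, 4)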
I have a pandas DataFrame which contains 610 rows, and every row contains a nested list of coordinate pairs; it looks like this:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used reduce to flatten your nested array, then itertools.chain to turn everything into a 1-D list. After that I reshaped the list into a 2-D array, which you can convert to the dataframe you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd

# flatten the list of lists using reduce, then turn everything into a 1d list
# using itertools.chain
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))

# reshape the 1d coordinate list to 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
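For example, with a small hypothetical geometry_list shaped like the question's data, the same steps give one row per coordinate pair:
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd

# hypothetical nested list: rows of coordinate-pair lists
geometry_list = [
    [[1377778.48, 6682395.3776], [6582395.3776, 2577778.48]],
    [[6582395.3776, 2577778.48]],
]
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)), columns=['X', 'Y'])
print(df)   # three rows, one per coordinate pair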
One thing you can do is use numpy. It allows you to perform a lot of list/array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to a pandas dataframe.
For example,
import numpy as np
import pandas as pd

# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999], [6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]

# convert the list to an array
coordinate_array = np.array(coordinate_list)

# print the shape of the array
print(coordinate_array.shape)

# reshape the array into pairs of coordinates
reshaped_array = np.reshape(coordinate_array, (3, 2))

df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np

data = np.arange(500).reshape([250, 2])
cols = ['coord']

# wrap each pair in a list so every DataFrame cell holds one coordinate pair
new_data = []
for item in data:
    new_data.append([item])

df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())

def expand(row):
    # pull the pair apart into separate x and y columns
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row

df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
coord
0 [0, 1]
1 [2, 3]
2 [4, 5]
3 [6, 7]
4 [8, 9]
x y
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
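As an alternative sketch (not the approach above), you could build the x/y columns directly from the list column and skip the row-wise apply; the setup mirrors the example data above:
import pandas as pd
import numpy as np

data = np.arange(500).reshape([250, 2])
df = pd.DataFrame({'coord': list(data)})   # each cell holds one length-2 array

# expand the list column into two columns in one shot
df[['x', 'y']] = pd.DataFrame(df['coord'].tolist(), index=df.index)
df = df.drop(columns='coord')
print(df.head())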
I have a dataframe, df, which contains a column called 'event' in which each entry is a 24x24x40 numpy array. I want to:
extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a column in a new numpy array or dataframe;
perform PCA on the resulting matrix.
However, the PCA produces eigenvectors with the dimensions of 'the number of entries', not the 'number of dimensions in the data'.
To illustrate my problem, I demonstrate a minimal example that works perfectly well:
EXAMPLE 1
from sklearn import datasets, decomposition
digits = datasets.load_digits()
X = digits.data
pca = decomposition.PCA()
X_pca = pca.fit_transform(X)
print (X.shape)
Result: (1797, 64)
print (X_pca.shape)
Result: (1797, 64)
There are 1797 entries in each case, with eigenvectors of dimension 64.
Now onto my example:
EXAMPLE 2
from sklearn import datasets, decomposition
import numpy as np
import pandas as pd
hdf=pd.HDFStore('./afile.h5')
df=hdf.select('batch0')
print(df['event'][0].shape)
Result: (1, 24, 24, 40)
print(df['event'][0].flatten().shape)
Result: (23040,)
_list = []
for index, row in df.iterrows():
    entry = df['event'][index].flatten()
    _list.append(entry)

X = np.asarray(_list)
pca = decomposition.PCA()
X_pca=pca.fit_transform(X)
print (X.shape)
Result: (201, 23040)
print (X_pca.shape)
Result:(201, 201)
Here the eigenvectors have dimension equal to the number of entries, 201!
I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resultant numpy array in X in Example 2 can be reshaped and plotted as expected.
Any thoughts would be appreciated!
Kind regards!
Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
Now, turning to your examples:
In your first example, the number of features (64) is smaller than the number of samples (1797), so min(n_samples, n_features) = 64 and the full dimensionality is kept (since you are not specifying the number of components). In your second example, however, the number of samples (201) is far smaller than the number of features (23040), so sklearn's PCA reduces the number of dimensions to n_samples = 201.
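Here is a minimal sketch that reproduces this behaviour with random data of the question's shape (the values themselves are arbitrary):
import numpy as np
from sklearn import decomposition

X = np.random.rand(201, 23040)   # 201 samples, 23040 features, as in the question

pca = decomposition.PCA()                    # n_components not specified
print(pca.fit_transform(X).shape)            # (201, 201): min(n_samples, n_features)

pca_50 = decomposition.PCA(n_components=50)  # you can request at most 201 components
print(pca_50.fit_transform(X).shape)         # (201, 50)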
I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from a column of my dataframe merged:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1, 2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
So, to clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame column, merged["other_data"]. Below is an example row from merged["other_data"]:
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work.
df1 = pd.DataFrame(X.toarray())        # convert the sparse matrix to a dense array
df2 = YOUR_DF                          # your 57k x 40 DataFrame of extra features
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
import numpy as np

for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = [float(v) for v in row]   # convert each string to a float
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
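As a quick sanity check (a sketch with small hypothetical stand-ins for X and n): hstack returns a COO matrix by default, so convert back to CSR if a downstream estimator needs row slicing.
import numpy as np
import scipy.sparse

X = scipy.sparse.random(5, 7, density=0.3, format='csr')  # stand-in for the TF-IDF matrix
n = np.random.rand(5, 2)                                   # stand-in for the extra features

combined = scipy.sparse.hstack([X, n])
print(type(combined))            # coo_matrix by default
print(combined.tocsr().shape)    # (5, 9): original columns plus the extra ones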
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
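For reference, a minimal sketch of that idea using scikit-learn's ColumnTransformer (a newer convenience than FeatureUnion); the column names and data here are made up:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'kws_name_desc': ['red shoes', 'blue hat', 'green shoes', 'blue hat sale'],
    'num_0': [0.43, 0.36, 0.48, 0.35],
    'num_1': [0.10, 0.20, 0.30, 0.40],
})
y = [0, 1, 0, 1]

features = ColumnTransformer([
    ('tfidf', TfidfVectorizer(), 'kws_name_desc'),   # text column (passed as a string)
    ('nums', 'passthrough', ['num_0', 'num_1']),     # numeric columns kept as-is
])
clf = Pipeline([('features', features), ('model', LogisticRegression())])
clf.fit(df, y)
print(clf.predict(df))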
I have sparse CSR matrices (from a product of two sparse vectors) and I want to convert each matrix to a flat vector. Indeed, I want to avoid using any dense representation or iterating over indexes.
So far, the only solution I came up with was to iterate over the non-null elements using the coo representation:
import numpy
from scipy import sparse as sp
matrices = [sp.csr_matrix([[1,2],[3,4]])]*3
vectorSize = matrices[0].shape[0]*matrices[0].shape[1]
flatMatrixData = []
flatMatrixRows = []
flatMatrixCols = []
for i in range(len(matrices)):
    matrix = matrices[i].tocoo()
    flatMatrixData += matrix.data.tolist()
    flatMatrixRows += [i]*matrix.nnz
    flatMatrixCols += [r + c*2 for r, c in zip(matrix.row, matrix.col)]
flatMatrix = sp.coo_matrix((flatMatrixData,(flatMatrixRows, flatMatrixCols)), shape=(len(matrices), vectorSize), dtype=numpy.float64).tocsr()
It is indeed unsatisfying and inelegant. Does anyone know how to achieve this in an efficient way?
Your flatMatrix is (3,4); each row is [1 3 2 4]. If a submatrix is x, then the row is x.A.T.flatten().
F = sp.vstack([x.T.tolil().reshape((1,vectorSize)) for x in matrices])
F is the same (dtype is int). I had to convert each submatrix to lil since csr has not implemented reshape (in my version of sparse). I don't know if other formats work.
Ideally sparse would let you do the whole range of numpy array (or matrix) manipulations, but it isn't there yet.
Given the small dimensions in this example, I won't speculate on the speed of the alternatives.
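As a quick check (a sketch using the question's setup), the reshape/vstack route matches the hand-built flatMatrix, since both flatten each submatrix in column-major order:
from scipy import sparse as sp

matrices = [sp.csr_matrix([[1, 2], [3, 4]])] * 3
vectorSize = matrices[0].shape[0] * matrices[0].shape[1]

# each submatrix x flattens column-major: x.A.T.flatten() == [1, 3, 2, 4]
F = sp.vstack([x.T.tolil().reshape((1, vectorSize)) for x in matrices])
print(F.toarray())
# [[1 3 2 4]
#  [1 3 2 4]
#  [1 3 2 4]]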