pandas convert series of ndarrays to dataframe

I have a Series S of 263 elements, each of which is an ndarray of shape 1x768.
I need to convert it to a dataframe, so the dataframe should have the shape 263x768 and contain the actual data from S.
What is the best way to do it?

You can use np.vstack:
import numpy as np
import pandas as pd
# list of ndarrays, each of shape (1, 768)
x = [np.ones((1, 768)) for _ in range(263)]
# stack the arrays row-wise and create the dataframe
df = pd.DataFrame(np.vstack(x))
df.shape
Output:
(263, 768)
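The answer above starts from a plain list, while the question starts from a Series. A minimal sketch of the same idea applied to a Series follows; the Series `S`, its `row…` labels, and its fill values are made up here for illustration, and the object array is filled element by element only to make sure each (1, 768) array is stored as a single cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for S: 263 arrays of shape (1, 768), one per cell.
arrays = [np.full((1, 768), i, dtype=float) for i in range(263)]
cells = np.empty(len(arrays), dtype=object)
for i, arr in enumerate(arrays):
    cells[i] = arr  # store each array as one object element
S = pd.Series(cells, index=[f"row{i}" for i in range(263)])

# tolist() hands np.vstack a plain list of arrays; reusing S.index keeps the labels.
df = pd.DataFrame(np.vstack(S.tolist()), index=S.index)
print(df.shape)
```

Passing `index=S.index` is optional; without it the dataframe gets a default integer index.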

Related

Make columns from np array elements

I've got a dataframe with shape (35,1) whose elements are np.arrays of length 50. I need to create a dataframe of shape (35,50). How can I do it?
I tried reshape(-1,1), but it's not suitable for this.
Convert the column to a list of arrays and build a new dataframe from it:
df = pd.DataFrame(df["col"].tolist(), index=df.index)
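To see the one-liner above in action, here is a small self-contained sketch. The column name `col` comes from the answer; the array contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 35 rows, one column, each cell a length-50 array.
df = pd.DataFrame({"col": [np.arange(50) + i for i in range(35)]})

# tolist() yields a list of 35 arrays, which pandas treats as 35 rows.
wide = pd.DataFrame(df["col"].tolist(), index=df.index)
print(wide.shape)
```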

Convert multidimensional climate numpy array to Pandas dataframe

I want to convert multidimensional climate data into a pandas data frame. The shape of my numpy array is temperature.shape -> (365,100,200) -> ["time", "longitude", "latitude"]. Then I would like to have the following columns in my pandas dataframe: columns=["time", "lon", "lat", "temp"].
I tried this code:
df = pd.DataFrame(temperature, columns=['time', 'lat', 'lon', 'temp'])
I got this error:
ValueError: Must pass 2-d input
How can I solve it? I could not find any hint in suggested topics. Thanks.
Pandas expects a 2D array whose rows and columns correspond to those of the final data frame.
It looks like you're trying to unravel the (365,100,200) array into 365*100*200=7,300,000 individual records. This can be done by flattening the array, provided you have the values of each independent quantity along each axis.
For example, unravelling a (3,4,5) shaped 3D array with X, Y and Z dimensions given by the lists/arrays x_index, y_index, z_index, rather than time, longitude, latitude and M replacing temperature:
import numpy as np
import pandas as pd
nx = 3
ny = 4
nz = 5
# Construct an nx by ny by nz array from the function f(x,y,z) = (x+y)*z
M = np.empty((nx, ny, nz))
for i in range(nx):
    for j in range(ny):
        for k in range(nz):
            M[i, j, k] = (i + j) * k
# Arrays/lists giving the values of x/y/z along each axis
x_index = list(range(nx))
y_index = list(range(ny))
z_index = list(range(nz))
# Make (3,4,5) arrays of each independent variable; indexing="ij" keeps the
# axis order (x, y, z), so the arrays line up with M element by element
X, Y, Z = np.meshgrid(x_index, y_index, z_index, indexing="ij")
# Flatten the data and independent variables to make 3*4*5=60 individual records
pd.DataFrame({"M=(X+Y)*Z": M.flatten(), "X": X.flatten(), "Y": Y.flatten(), "Z": Z.flatten()})
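The same unravelling can also be sketched with `pd.MultiIndex.from_product`, which enumerates the coordinate combinations in the same order as a C-order `ravel()`. The array size and the coordinate values below are small made-up stand-ins for the (365,100,200) temperature array, not the asker's data.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in: axes are (time, lon, lat).
n_time, n_lon, n_lat = 4, 3, 2
temperature = np.arange(n_time * n_lon * n_lat, dtype=float).reshape(n_time, n_lon, n_lat)

# Assumed coordinate values; replace with the real time/longitude/latitude arrays.
time = pd.date_range("2020-01-01", periods=n_time, freq="D")
lon = np.linspace(0.0, 20.0, n_lon)
lat = np.linspace(-10.0, 10.0, n_lat)

# The last level (lat) varies fastest, matching temperature.ravel().
index = pd.MultiIndex.from_product([time, lon, lat], names=["time", "lon", "lat"])
df = pd.Series(temperature.ravel(), index=index, name="temp").reset_index()
print(df.shape)
```

The list passed to `from_product` has to follow the array's axis order; otherwise the coordinates and temperatures no longer line up.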

How to convert a matrix as string to ndarray?

I have a csv file with this structure:
id;matrix
1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]
I'm trying to read the matrix field as a numpy.ndarray, using pandas.read_csv to read the file and df.to_numpy() to convert the matrix, but the resulting array has shape (1,0). I was expecting a shape of (3,2), as in:
matrix = [[1.2 1.3]
[1.2 1.3]
[1.2 1.3]]
I also tried numpy.asmatrix, but the result is the same as with df.to_numpy().
A solution with pandas
Provided the format of the matrix column is consistent with that shown in the example, replace the spaces with commas, then use literal_eval to turn the string into a list of lists, and then apply np.array.
import pandas as pd
from ast import literal_eval
import numpy as np
# read the data
df = pd.read_csv('file.csv', sep=';')
# replace the spaces
df['matrix'] = df['matrix'].str.replace(' ', ',')
# apply literal_eval
df['matrix'] = df['matrix'].apply(literal_eval)
# apply numpy array
df['matrix'] = df['matrix'].apply(np.array)
print(type(df.iloc[0, 1]))
# <class 'numpy.ndarray'>
Each row of the matrix column will be an ndarray.
The two apply calls can be combined into:
df['matrix'] = df['matrix'].apply(lambda x: np.array(literal_eval(x)))
Or this hot mess:
df['matrix'] = df['matrix'].str.replace(' ', ',').apply(lambda x: np.array(literal_eval(x)))
I personally prefer one transformation per line for code clarity.
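For a quick check without a file on disk, the steps above can be sketched against an in-memory copy of the sample. The `csv_text` string and the `parse_matrix` helper are hypothetical names introduced here; the parsing assumes single spaces between numbers and between rows, exactly as in the example.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# In-memory stand-in for file.csv, same layout as the sample in the question.
csv_text = "id;matrix\n1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")

def parse_matrix(text):
    # Turn "[[1.2 1.3] [1.2 1.3]]" into "[[1.2,1.3],[1.2,1.3]]" and evaluate it.
    return np.array(literal_eval(text.replace(" ", ",")))

df["matrix"] = df["matrix"].apply(parse_matrix)
print(df.loc[0, "matrix"].shape)  # (3, 2)
```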

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from the question "Split nested array values from Pandas Dataframe cell over multiple rows", but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used "reduce" to flatten your nested array and then "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array, which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd
# geometry_list is assumed to hold the nested coordinate lists.
# Flatten the lists of lists using reduce, then turn everything into a 1d list
# using itertools chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))
# reshape the 1d list of coordinates to 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
import pandas as pd
# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
# convert list to array
coordinate_array = np.array(coordinate_list)
# print shape of array
print(coordinate_array.shape)
# reshape array into (x, y) pairs
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]
   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
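None of the answers above keeps track of which original row a coordinate pair came from. A possible alternative sketch, assuming pandas 0.25 or newer (where `DataFrame.explode` is available), is shown below. The column names `id` and `coords` and the sample values are hypothetical, since the asker's actual data is not shown.

```python
import pandas as pd

# Hypothetical input: each row holds a list of [x, y] coordinate pairs.
df = pd.DataFrame({
    "id": [1, 2],
    "coords": [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]],
})

# explode() produces one row per coordinate pair and repeats the other columns.
exploded = df.explode("coords").reset_index(drop=True)

# Split each [x, y] pair into two numeric columns.
xy = pd.DataFrame(exploded["coords"].tolist(), columns=["X", "Y"])
result = pd.concat([exploded.drop(columns="coords"), xy], axis=1)
print(result)
```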

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
To clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
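This will do the work.
# Convert the sparse matrix to a dense array (this creates a dense copy, which can need a lot of memory for a matrix this size)
df1 = pd.DataFrame(X.toarray())
# df2 is your dataframe of size 57k x 40 (YOUR_DF is a placeholder)
df2 = YOUR_DF
# newDf is the required dataframe
newDf = pd.concat([df1, df2], axis=1)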
I figured it out:
First, iterate over my pandas column and create a list of lists:
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    # list() is needed on Python 3, where map returns an iterator
    row2 = list(map(float, row))
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
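The steps above can be sketched end to end on toy data. The 3 x 5 matrix and the three feature strings below are made-up stand-ins for the real TF-IDF matrix and `merged['other_data']`; the string splitting uses pandas' `str.split(..., expand=True)` instead of the explicit loop.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Hypothetical stand-ins: a 3 x 5 sparse "TF-IDF" matrix and 3 rows of 4 extra features.
X = scipy.sparse.csr_matrix(np.eye(3, 5))
other_data = pd.Series(["0.1,0.2,0.3,0.4", "0.5,0.6,0.7,0.8", "0.9,1.0,1.1,1.2"])

# Split every string on commas and convert to a dense float array of shape (3, 4).
extra = other_data.str.split(",", expand=True).astype(float).to_numpy()

# Wrap the dense block as sparse, stack column-wise, and convert the result to CSR.
combined = scipy.sparse.hstack([X, scipy.sparse.csr_matrix(extra)]).tocsr()
print(combined.shape)
```

Keeping the result sparse avoids materialising the full TF-IDF matrix as a dense array, which is the main difference from the `toarray()` approach.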
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.