Convert multidimensional climate numpy array to Pandas dataframe

I want to convert multidimensional climate data into a pandas data frame. The shape of my numpy array is temperature.shape -> (365,100,200) -> ["time", "longitude", "latitude"]. I would like to have the following columns in my pandas dataframe: columns=["time", "lon", "lat", "temp"].
I tried this code:
df = pd.DataFrame(temperature, columns=['time', 'lat', 'lon', 'temp'])
I got this error:
ValueError: Must pass 2-d input
How can I solve it? I could not find any hint in suggested topics. Thanks.

Pandas expects a 2D array where the columns and rows correspond to the final data frame.
It looks like you're trying to unravel the (365,100,200) array into 365*100*200=7,300,000 individual records. This can be done by flattening the array, provided you have the values of each independent quantity along each axis.
For example, unravelling a (3,4,5) shaped 3D array with X, Y and Z dimensions given by the lists/arrays x_index, y_index, z_index, rather than time, longitude, latitude and M replacing temperature:
import numpy as np
import pandas as pd
nx = 3
ny = 4
nz = 5
# Construct an nx by ny by nz array from the function f(x,y,z) = (x+y)*z
M = np.empty((nx, ny, nz))
for i in range(nx):
    for j in range(ny):
        for k in range(nz):
            M[i, j, k] = (i + j) * k
# Lists giving the values of x/y/z along each axis
x_index = list(range(nx))
y_index = list(range(ny))
z_index = list(range(nz))
# Make (3,4,5) arrays of each independent variable; indexing="ij" keeps the
# axis order (x, y, z) so the flattened values line up with M.flatten()
X, Y, Z = np.meshgrid(x_index, y_index, z_index, indexing="ij")
# Flatten the data and independent variables to make 3*4*5=60 individual records
pd.DataFrame({"M=(X+Y)*Z": M.flatten(), "X": X.flatten(), "Y": Y.flatten(), "Z": Z.flatten()})
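Applied to the layout in the question, the same idea might look like the sketch below. The array here is a small stand-in for the (365,100,200) one, and the coordinate values (time, lon, lat) are made up for illustration. pd.MultiIndex.from_product is used instead of meshgrid; it varies the last level fastest, which matches the order in which ravel() walks a C-ordered array.

```python
import numpy as np
import pandas as pd

# Small stand-in for the (365, 100, 200) array, axes = (time, lon, lat)
n_time, n_lon, n_lat = 4, 3, 2
temperature = np.arange(n_time * n_lon * n_lat, dtype=float).reshape(n_time, n_lon, n_lat)

# Assumed coordinate values along each axis (hypothetical)
time = np.arange(n_time)
lon = np.linspace(-10.0, 10.0, n_lon)
lat = np.linspace(40.0, 50.0, n_lat)

# One index entry per (time, lon, lat) combination, last level varying fastest
index = pd.MultiIndex.from_product([time, lon, lat], names=["time", "lon", "lat"])

# ravel() flattens in the same order, so each value meets its own coordinates
df = pd.DataFrame({"temp": temperature.ravel()}, index=index).reset_index()
```

Each row of df then holds one record, with columns time, lon, lat and temp.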

Related

How can i create a bubble chart using this data in seaborn?

i have all the data i need to plot in a single row e.g.:
mcc_name year_1 year_2 year_3 year_1_% year_2_% year_3_%
book shop 30000 1500.41 9006.77 NaN -0.4708 -0.60379
I want the x axis to be the values in columns [year_1, year_2, year_3], the y axis to be the percentage change columns [year_1_%, year_2_%, year_3_%], and the size of the bubble to be proportional to the values in [year_1, year_2, year_3].
sns.scatterplot(data=data_row , x=['year_1', 'year_2', 'year_3'], y=['year_1_%', 'year_2_%', 'year_3_%'], size="pop", legend=False, sizes=(20, 2000))
# show the graph
plt.show()
but i get this error:
ValueError: Length of list vectors must match length of `data` when both are used, but `data` has length 1 and the vector passed to `y` has length 3.
how can i plot??
You need to have your data in long format:
import pandas as pd
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.array([30000,1500.41,9006.77,np.nan,-0.4708,-0.60379]).reshape(1,-1),
                  columns = ['year_1','year_2','year_3','year_1_%','year_2_%','year_3_%'],
                  index = ['mcc_name'])
Usually you can use wide_to_long if your columns are formatted properly, but in this case it may be easier to melt the two groups of columns separately and join them:
values = df.filter(regex='year_[0-9]$', axis=1).melt(value_name="value",var_name="year")
perc = df.filter(regex='_%', axis=1).melt(value_name="perc",var_name="year")
perc.year = perc.year.str.replace("_%","")
sns.scatterplot(data=values.merge(perc,on="year"),x = "year", y = "perc", size = "value")
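For comparison, here is a rough sketch of the wide_to_long route mentioned above. It assumes the percentage columns are first renamed so that both groups follow a stub_number pattern. The names pct, period and value are my own choices, and the row is rebuilt here with mcc_name as an ordinary column.

```python
import numpy as np
import pandas as pd

# The single wide row from the question, with mcc_name as a regular column
wide = pd.DataFrame({
    "mcc_name": ["book shop"],
    "year_1": [30000.0],
    "year_2": [1500.41],
    "year_3": [9006.77],
    "year_1_%": [np.nan],
    "year_2_%": [-0.4708],
    "year_3_%": [-0.60379],
})

# Rename "year_N_%" to "pct_N" so every column looks like <stub>_<number>
renamed = wide.rename(columns={f"year_{n}_%": f"pct_{n}" for n in (1, 2, 3)})

# One row per (mcc_name, period); the stubs "year" and "pct" become columns
long = (
    pd.wide_to_long(renamed, stubnames=["year", "pct"], i="mcc_name", j="period", sep="_")
    .reset_index()
    .rename(columns={"year": "value"})
)
```

The resulting table has one row per period and could be passed to sns.scatterplot with x="period", y="pct" and size="value".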

Convert pandas single column to Scipy Sparse Matrix

I have a pandas data frame like this:
a other-columns
0.3 0.2 0.0 0.0 0.0... ....
I want to convert column a into SciPy sparse CSR matrix. a is a probability distribution. I would like to convert without expanding a into multiple columns.
This is naive solution with expanding a into multiple columns:
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df_matrix = scipy.sparse.csr_matrix(df.values)
But, I don't want to expand into multiple columns, as it shoots up the memory. Is it possible to do this by keeping a in 1 column only?
EDIT (Minimum Reproducible Example):
import pandas as pd
from scipy.sparse import csr_matrix
d = {'a': ['0.05 0.0', '0.2 0.0']}
df = pd.DataFrame(data=d)
df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
df = df.astype(float)
df_matrix = csr_matrix(df.values)
df_matrix
Output:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
I want to achieve above, but, without splitting into multiple columns. Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.
Convert large csv to sparse matrix for use in sklearn
I can not overstate how much you should not do the thing that follows this sentence.
import pandas as pd
import numpy as np
from scipy import sparse
df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
chunksize = 10000
sparse_coo = []
for i in range(int(np.ceil(df.shape[0]/chunksize))):
    chunk = df.iloc[i * chunksize:min(i * chunksize + chunksize, df.shape[0]), :]
    sparse_coo.append(sparse.coo_matrix(chunk['a'].apply(lambda x: [float(y) for y in x.split()]).tolist()))
sparse_coo = sparse.vstack(sparse_coo)
You could get the dense array from the column without the expand, e.g.:
In [179]: df = pd.DataFrame(data=d)
In [180]: np.array(df['a'].str.split().tolist(),float)
Out[180]:
array([[0.05, 0.  ],
       [0.2 , 0.  ]])
But I doubt that saves much memory (though I only have a crude understanding of DataFrame memory use).
You could convert each string to a sparse matrix:
In [190]: def foo(astr):
     ...:     alist = astr.split()
     ...:     arr = np.array(alist, float)
     ...:     return sparse.coo_matrix(arr)
In [191]: alist = [foo(row) for row in df['a']]
In [192]: alist
Out[192]:
[<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>]
In [193]: sparse.vstack(alist)
Out[193]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
I tried to make the coo directly from the alist, but that didn't trim out the zeros. There's just as much conversion, but if sufficiently sparse (5% or less) it could save quite a bit on memory (if not time).
sparse.vstack combines the data, rows and cols values from the component matrices to define a new coo matrix. It's the most straightforward way of combining sparse matrices, if not the fastest.
Looks like I could use apply as well
In [205]: df['a'].apply(foo)
Out[205]:
0 (0, 0)\t0.05
1 (0, 0)\t0.2
Name: a, dtype: object
In [206]: df['a'].apply(foo).values
Out[206]:
array([<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>,
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in COOrdinate format>], dtype=object)
In [207]: sparse.vstack(df['a'].apply(foo))
Out[207]:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
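The per-row conversion can also be batched. In the sketch below, each chunk of strings is parsed into a small dense float block, converted to CSR, and the pieces are stacked at the end, so only one chunk is ever held densely. The helper name strings_to_csr and the chunk size are my own assumptions, not something from the answers.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def strings_to_csr(column, chunksize=10_000):
    """Hypothetical helper: parse space-separated floats chunk by chunk into CSR."""
    blocks = []
    for start in range(0, len(column), chunksize):
        chunk = column.iloc[start:start + chunksize]
        # Only this chunk is dense: shape (len(chunk), values_per_row)
        dense = np.array(chunk.str.split().tolist(), dtype=float)
        # Converting to CSR keeps only the nonzero entries
        blocks.append(sparse.csr_matrix(dense))
    return sparse.vstack(blocks, format="csr")


df = pd.DataFrame({"a": ["0.05 0.0", "0.2 0.0", "0.0 0.7"]})
matrix = strings_to_csr(df["a"], chunksize=2)
```

This relies on every row holding the same number of values, as the question states. Whether it is faster or lighter than the alternatives would need to be measured on the real data.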

How to convert numpy one dimensional array to Pandas Series or Dataframe

I have spent quite some time on what seems to be a very easy thing. All I want is to convert a numpy array to a Series and then combine Series to make a dataframe. I have two numpy arrays.
import numpy as np
rooms = 2*np.random.rand(100, 1) + 3
price = 265 + 6*rooms + abs(np.random.randn(100, 1))
I wanted to convert rooms and price to series and then combine the two series into a dataframe to make lmplot
So could any one tell me how to do that? Thanks.
you can use ravel() to convert the arrays to 1-d data:
pd.DataFrame({
    'rooms': rooms.ravel(),
    'price': price.ravel()
})
The problem with passing the arrays directly to pd.Series is the dimensionality: rooms and price are 2d-array of shape (100,1) while pd.Series requires a 1d-array. To reshape them you can use different methods, one of which is .squeeze(), namely:
import pandas as pd
import numpy as np
rooms = 2*np.random.rand(100, 1) + 3
price = 265 + 6*rooms + abs(np.random.randn(100, 1))
rooms_series = pd.Series(rooms.squeeze())
price_series = pd.Series(price.squeeze())
Now to go from series to dataframe you can do:
pd.DataFrame({'rooms': rooms_series,
              'price': price_series})
Or directly from the numpy arrays:
pd.DataFrame({'rooms': rooms.squeeze(),
              'price': price.squeeze()})
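One practical difference between the two reshaping methods is worth knowing: ravel() always returns a 1-D array, while squeeze() drops every length-1 axis, so a (1,1) array collapses to a 0-D scalar array. The sketch below shows this on the shapes from the question. The seeded generator and the variable names are my own choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, an arbitrary choice for this sketch
rooms = 2 * rng.random((100, 1)) + 3
price = 265 + 6 * rooms + np.abs(rng.standard_normal((100, 1)))

# Both columns become 1-D arrays of length 100 before building the frame
df = pd.DataFrame({"rooms": rooms.ravel(), "price": price.ravel()})

# Edge case: an array with a single element
single = np.ones((1, 1))
ravel_shape = single.ravel().shape      # stays 1-D
squeeze_shape = single.squeeze().shape  # collapses to 0-D
```

For (100,1) inputs the two methods give the same result, so the difference only matters if an array might contain a single row.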

numpy.corrcoef() MemoryError

I can't understand the MemoryError I get when using numpy.corrcoef() to find the correlation coefficient between 2 vectors smin & smax, as follows:
import numpy as np
from numpy import random as rn
r=0.01
sigma=0.2
T=1
K=1
N=252
h=T/N
M = 50000
Z = rn.randn(M,N)
S=np.ones((M,N+1))
smax=np.ones((M,1))
smin=np.ones((M,1))
for i in range(0,N):
    S[:,i+1]=S[:,i]*(np.exp((r-(sigma**2)/2)*h+sigma*Z[:,i]*np.sqrt(h)))
for j in range(0,M):
    smax[j,:]=np.exp(-r*T)*(np.max(S[j,:])>K)*(np.max(S[j,:])-K)
    smin[j,:]=np.exp(-r*T)*(np.min(S[j,:])<K)*(K-np.min(S[j,:]))
c=np.corrcoef(smax,smin)
print(c)
If there is another way to find the correlation coefficient, like using pandas, that's also good.
The shape of your arrays is the problem here. The function documentation states that x is a "1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables." and that y is an additional set of variables and observations. Since smax and smin each have shape (50000, 1), every row is treated as a separate variable, so this is trying to allocate a correlation matrix of size (100000, 100000), which is huge.
If you just want to calculate the pearson correlation coefficient between two one dimensional vectors, you can use a much simpler formula than what is implemented here. This documentation has the formula I am referring to.
https://hydroerr.readthedocs.io/en/stable/api/HydroErr.HydroErr.pearson_r.html#HydroErr.HydroErr.pearson_r
But to still use the numpy version, both vectors need to be 1-D arrays; you can then pass them together in the single parameter x:
import numpy as np
simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
c = np.corrcoef([simulated_array, observed_array])[1, 0]
More explanation about this here.
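To make this concrete, the sketch below flattens two (n,1) arrays, compares np.corrcoef with the textbook Pearson formula written out by hand, and works out how much memory the (100000, 100000) result would need. The random data is only a stand-in for smax and smin.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for smax and smin, with the same (n, 1) layout as in the question
smax = rng.random((1000, 1))
smin = rng.random((1000, 1))

# Flatten to 1-D so each array counts as ONE variable with n observations
x = smax.ravel()
y = smin.ravel()
r_numpy = np.corrcoef(x, y)[0, 1]

# Pearson r by hand: covariance over the product of standard deviations
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Memory for the unflattened call: 2 * 50000 "variables", float64 output
n_vars = 2 * 50_000
bytes_needed = n_vars * n_vars * 8  # 80 GB
```

In the original script, np.corrcoef(smax.ravel(), smin.ravel())[0, 1] would return the single coefficient without the large allocation.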

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Please any ideas? Many thanks in advance!
Here is my new answer to your problem. I used "reduce" to flatten your nested array and then I used "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
#libraries
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd
#flatten lists of lists using reduce. Then turn everything into a 1d list using
#itertools chain. geometry_list is assumed to be your nested list of coordinates.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
                                                      geometry_list)))
#reshape the coordinates 1d list to a 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
import pandas as pd
#your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
#convert list to array
coordinate_array = np.array(coordinate_list)
#print shape of array
print(coordinate_array.shape)
#reshape array into (x, y) pairs
reshaped_array = np.reshape(coordinate_array, (-1, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]
   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
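Another option for the unnesting itself is Series.explode, sketched below. It assumes each cell holds a flat list of [x, y] pairs, possibly a different number of pairs per row. The column name geometry and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: each cell is a list of [x, y] pairs, lengths may differ
df = pd.DataFrame({
    "geometry": [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0]],
    ]
})

# explode() gives one row per pair and repeats the original row label
pairs = df["geometry"].explode()

# Split each [x, y] pair into two columns, keeping the repeated index
out = pd.DataFrame(pairs.tolist(), index=pairs.index, columns=["X", "Y"])
```

Because the original index is repeated, each coordinate pair can still be traced back to the row it came from. Deeper nesting would need to be flattened by one level first.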