Memory problems with turning a large Xarray to a DataFrame, Python - dataframe

I have an x,y,time xarray of about 2 million datapoints. I have tried turning it into a dask DataArray and then turning that into a dataframe, but there isn't enough memory. I use colab. Is there a different way to do this that might fit into colabs memory?
dask_array = da.array.from_array(xarray.values, chunks=
(xarray.shape[0], 100, 100))
#reshape and transpose to make rows of timeseries for each
point
dask_t =
dask_array.transpose(1,2,0).reshape(arr.shape[1]*arr.shape[2],
arr.shape[0])
#prepare column names for timeseries
feat_cols = ['time_'+str(i) for i in
range(dask_array.shape[0])]
#turn into a dataframe
pd.DataFrame(dask_t.compute(), columns=feat_cols)

Related

How to calculate and reshape more than 1 bn items of data into PySpark?

Our use case is to read data from BQ and caculate by using pandas and numpy.reshape to turn it into input for the model, sample code like:
import numpy as np
import pandas as pd
# Source Data
feature = spark.read.format('bigquery') \
.option('table', TABLE_NAME) \
.load()
test = feature.to_pandas_on_spark().sort_values(by = ['col1','col2'], ascending = True).drop(['col1','col3','col5'], axis = 1)
test = (test - test.mean())/(test.std())
row = int(len(test)/100)
row2 = 50
col3 = 100
feature_array = np.reshape(feature_nor.values, (row,row2,col3))
feature.to_pandas_on_spark() will collect all data into driver memory and for small amout of data it can work, but for more than 15 Billion data it can not handle this.
I try to convert to_pandas_on_spark() to spark dataframe so that it can compute in parallell:
sorted_df = feature.sort('sndr_id').sort('date_index').drop('sndr_id').drop('date_index').drop('cal_dt')
mean_df = sorted_df.select(*[f.mean(c).alias(c) for c in sorted_df.columns])
std_df = sorted_df.select(*[f.stddev(c).alias(c) for c in sorted_df.columns])
Since the function is different from the pandas api, so I cannot verify these code and for the last reshape operation(np.reshape(feature_nor.values, (row,row2,col3))) dataframe doesn't support this function, is there a good solution to replace it?
I want to know how to handle 1B data in a efficient way and without memory overflow, including how to use numpy's reshape and pandas's computation operations, any answers will be super helpful!
I would advise not to use pandas or numpy on a dataset of this size, there usually is some Spark function to solve your problem, even firing up a UDF or using pandas on spark comes with a significant performance loss.
What exactly are your reshape criteria?
Maybe pivot helps?

Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many numpy arrays. Each numpy array is (10,000, 5,000). Currently, I'm trying the following:
rdd_list = []
for np_array in np_arrays:
pandas_df = pd.DataFrame(np_array)
spark_df = sqlContext.createDataFrame(pandas_df) ##SLOW STEP
rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast, except converting the Pandas dataframe to Spark dataframe is very slow. If I use a subset of the numpy array, such (10,000, 500), it takes a couple minutes to convert it to a Spark dataframe. But if I use the full numpy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)
For my application I had used the class ArrayRDD from the sparkit-learn project for loading numpy arrays into spark RDDs. I had no complaints but your mileage may vary.

Huge sparse dataframe to scipy sparse matrix without dense transform

Have data with more then 1 million rows and 30 columns, one of the columns is user_id (more then 1500 different users).
I want one-hot-encode this column and to use data in ML algorithms (xgboost, FFM, scikit). But due to huge row numbers and unique user values matrix will be ~ 1 million X 1500, so need do this in sparse format (otherwise data kill all RAM).
For me convenient way to work with data through pandas DataFrame, which also now it support sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
Work pretty fast and have small size in RAM. But for working with scikit algos and xgboost it's necessary transform dataframe to sparse matrix.
Is there any way to do this rather than iterate through columns and hstack them in one scipy sparse matrix?
I tried df.as_matrix() and df.values, but all of first transform data to dense what arise MemoryError :(
P.S.
Same to get DMatrix for xgboost
UPDATE:
So i release next solution (will be thankful for optimisation suggestions):
def sparse_df_to_saprse_matrix (sparse_df):
index_list = sparse_df.index.values.tolist()
matrix_columns = []
sparse_matrix = None
for column in sparse_df.columns:
sps_series = sparse_df[column]
sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
curr_sps_column, rows, cols = sps_series.to_coo()
if sparse_matrix != None:
sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
else:
sparse_matrix = curr_sps_column
matrix_columns.extend(cols)
return sparse_matrix, index_list, matrix_columns
And the following code allows to get sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created sparse matrix 1,1 million rows x 1150 columns. But during creating it's still uses significant amount of RAM (~10Gb on edge with my 12Gb).
Don't know why, because resulting sparse matrix uses only 300 Mb (after loading from HDD). Any ideas?
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
This method, instead of taking a DataFrame (rows / columns) it takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
Does my answer from a few months back help?
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
It was accepted but I didn't get any further feedback.
I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.

IPython / pandas: Is there an canonical way to detect rapid changes in a timeseries?

Noob data analyst, analyzing some gas concentrations over a timeseries of a couple of thousand points (so small). I graphed it with Matplotlib, and there are some easy to see points where things change rapidly.
What is the canonical / easiest way to home in on those points?
import pandas as pd
from numpy import diff, concatenate
ff = pd.DataFrame( #acquire data here
columns=('Year','Recon'))
fd = diff(ff['Recon'], axis=-1)
ff['diff'] = concatenate([[0],fd],axis=0)
ff['rolling10'] = pd.rolling_mean(ff['diff'],10)
ff['rolling5'] = pd.rolling_mean(ff['diff'],5)
ff.plot('Year',['rolling5','rolling10'],subplots=False)
But note! my test data was evenly sampled. Looks like rolling_* don't apply to irregular time series yet, though there are some workarounds: Pandas: rolling mean by time interval

Pandas Timeseries plotting

I have a Pandas timeseries object with dates and corresponding values. But, when I try to plot it, the plot is a L-shape plot (the dates and values are automatically arranged in such a way that the highest value comes first...).
This is what did to generate the plot:
df = pd.read_csv('C:\data\test1.csv') # two-column dataframe
data_list = df['values'].tolist()
dates_list = df['date'].tolist()
df_ts = pd.Series(data_list, index=dates_list)
df_ts.plot()
I am not sure where I am making a mistake. I am reading in a csv file, converting to a timeseries obj and plotting it. Any suggestions is very much appreciated.
Thanks!
PD
don't bother creating the unnecessary intermediate data structures, just organize your DataFrame better.
df['date'] = pd.to_datetime(df.date) #make sure you're actually dealing with timestamps.
df.set_index('date', inplace=True)
df.sort(inplace=True)
df.plot()