Dataframe conversion from pandas to polars -- difference in the final dimensions - pandas

I'm trying to convert a pandas DataFrame to a Polars one.
I simply used result_polars = pl.from_pandas(result). The conversion proceeds without errors, but when I check the shapes of the two dataframes I find that the Polars one has roughly half the length of the original pandas DataFrame.
I believe that a length of 4172903059 is close to the maximum number of rows that a Polars dataframe allows.
Does anyone have suggestions?
Here is a screenshot of the shapes of the two dataframes.
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
With these dimensions the two dataframes have the same shape. If instead I use the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
The Polars dataframe ends up much smaller (97935773 rows).

The default polars wheel retrieved with pip install polars "only" allows for 2^32, i.e. roughly 4.3 billion, rows.
If you need more than that, uninstall it and install the big-index build instead: pip install polars-u64-idx.
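For illustration, a minimal sketch of a guard against that silent wraparound, assuming the default 32-bit-index wheel is installed (the small array below is just a stand-in for the real data):
import numpy as np
import pandas as pd
import polars as pl

df = pd.DataFrame(np.zeros((10, 1), dtype=np.uint8))  # stand-in for the real dataframe
if len(df) >= 2**32:
    raise OverflowError("row count exceeds 2^32; install polars-u64-idx instead of polars")
df_polars = pl.from_pandas(df)
assert df_polars.height == len(df)  # with the guard, the shapes always agree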

Related

Finding FFT gives KeyError: 'Aligned' - pandas

I have time series data.
I am trying to compute the FFT, but it raises KeyError: 'Aligned' when I try to get the values.
My data looks like below.
This is the code:
import datetime
import numpy as np
import scipy as sp
import scipy.fftpack
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
temp_fft = sp.fftpack.fft(data3)
Looks like your data is a pandas Series. fft works with numpy arrays rather than Series.
The easy resolution is to convert your Series into a numpy array, either via
data3.values
or
np.array(data3)
You can then pass that array into the fft function. So the end result is:
temp_fft = sp.fftpack.fft(data3.values)
This should work for you now.
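For completeness, a minimal runnable sketch, with a made-up sine series standing in for data3 from the question:
import numpy as np
import pandas as pd
import scipy.fftpack

data3 = pd.Series(np.sin(np.linspace(0, 20 * np.pi, 500)))  # made-up stand-in series
temp_fft = scipy.fftpack.fft(data3.values)  # pass the underlying ndarray, not the Series
print(temp_fft[:5])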

Numpy broadcasting comparison reports "'bool' object has no attribute 'sum'" error when dealing with a large dataframe

I use numpy broadcasting to get a differences matrix from a pandas dataframe. I find that when dealing with a large dataframe it reports a "'bool' object has no attribute 'sum'" error, while with a small dataframe it runs fine.
I posted the two CSV files at the following links:
large file
small file
import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matrix)
When running this, I get the difference matrix.
When I switch to the large file, it reports the following error. Does anybody know why this happens?
EDIT: The numpy version is 1.19.5:
np.__version__
'1.19.5'

How can I sort my dataframe with the most recent value at the bottom?

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("BTC-USD.csv")
df=df.drop(["Date","Adj Close","Volume","Low","Close"],axis=1)
x=df["Open"]
y=df["High"]
Here is my dataframe.
In my dataframe, the newest value is at the top. What I want to do here is put the newest value at the bottom and the oldest value at the top.
I understand that you just want to reverse the row order without considering the values in the columns.
Thus, you can use the pandas.DataFrame.reindex function as follows:
reordered_df = df.reindex(df.index[::-1])
The documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
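An equivalent positional slice works as well; a minimal sketch, assuming you only want to flip the existing row order:
import pandas as pd

df = pd.read_csv("BTC-USD.csv")
reordered_df = df.iloc[::-1].reset_index(drop=True)  # reverse the rows and rebuild a clean 0..n-1 index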

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same result for the intercept and coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# If I use the dataframe, train[['ENGINESIZE']] for 'x' and train[['CO2EMISSIONS']] for 'y'
# below, I get the same result
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is a selection of rows.
train[['ENGINESIZE']] is a one-column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a clean array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
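A minimal sketch illustrating the point, with made-up numbers standing in for the real FuelConsumption data (same column names as the question):
import pandas as pd
from sklearn import linear_model

cdf = pd.DataFrame({
    "ENGINESIZE": [1.6, 2.0, 2.4, 3.0, 3.5],      # made-up values
    "CO2EMISSIONS": [180, 210, 230, 260, 290],    # made-up values
})

# Fit once with the dataframe columns and once with explicit numpy arrays
regr_df = linear_model.LinearRegression().fit(cdf[["ENGINESIZE"]], cdf[["CO2EMISSIONS"]])
regr_np = linear_model.LinearRegression().fit(
    cdf[["ENGINESIZE"]].to_numpy(), cdf[["CO2EMISSIONS"]].to_numpy()
)

# Both calls are converted to the same arrays internally, so the results match
print(regr_df.coef_, regr_np.coef_)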

Linear 1D interpolation using the interp1d function in Python on pandas dataframe columns

I'm trying to use the interp1d function from scipy.interpolate to generate an interpolation from two columns of a pandas dataframe. I'm using Python 2.7. I'm able to generate the interpolation without errors, but the interpolation fails to show any reasonable output when values are supplied within the boundary conditions. For example, the column 'X-Co ordinate' in the 16-column x 200-row dataframe DF ranges between 0.5 and 10.5, while the 'Y-Co ordinate' column ranges between 1.5 and 99.4. I have generated the interpolation as follows:
from scipy.interpolate import interp1d
import pandas as pd
DF=pd.DataFrame() #This dummy dataframe will have the columns and rows as described above
InterpolatedFunction = interp1d(DF['X-Co ordinate'], DF['Y-Co ordinate'], bounds_error=False)
InterpolatedValue_For_X_Equals_5=interp1d(5)
Give the pandas built-in method df.interpolate(method='linear') a try.
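A minimal sketch of that pandas route, with made-up data standing in for DF; interpolate() linearly fills the NaN entries in a column from the surrounding values:
import numpy as np
import pandas as pd

DF = pd.DataFrame({
    "X-Co ordinate": [0.5, 2.0, 5.0, 7.5, 10.5],          # made-up values
    "Y-Co ordinate": [1.5, np.nan, 50.0, np.nan, 99.4],   # gaps to be filled
})

DF_filled = DF.interpolate(method="linear")  # fill the NaNs linearly, column by column
print(DF_filled)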