I created a pandas Series with a three-level index. Then I applied the unstack method followed by the stack method and checked whether the new and old objects are equal. Why are they different? Isn't unstacking the opposite of stacking? Here is my code:
import numpy as np
import pandas as pd
data = pd.Series([7]*9, index = [[1,2,3,2,4,9,6,7,9], ['a','c','f','a', 'k','f','c','d','a'], [np.nan]*9])
data2 = data.unstack().stack()
print(data2.equals(data))
The output is False, but I don't know why!
I'm trying to convert a pandas DataFrame to a Polars one.
I simply used result_polars = pl.from_pandas(result). The conversion runs without error, but when I check the shapes of the two dataframes, the Polars one is half the size of the original pandas DataFrame.
I believe that a length of 4172903059 is close to the maximum size that a Polars dataframe allows.
Does anyone have suggestions?
Here is a screenshot of the shapes of the two dataframes.
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
Using these dimensions the two dataframes have the same size. If instead I put the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
The Polars dataframe has a much smaller number of rows (97935773).
The default Polars wheel retrieved with pip install polars "only" allows for 2^32 rows, i.e. ~4.3 billion rows.
If you need more than that, uninstall the previous installation and install the 64-bit index build with pip install polars-u64-idx.
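The row count reported in the question is consistent with a 32-bit row index wrapping around. Here is a quick check in plain Python, assuming the count is simply reduced modulo 2^32. That is an inference from the numbers in the question, and the helper name is hypothetical.

```python
# Row counts taken from the two examples in the question.
rows_ok = 4292903069
rows_too_many = 4392903069

U32_LIMIT = 2**32  # 4294967296, the row limit of the default wheel


def wrapped_row_count(n_rows: int) -> int:
    """Hypothetical helper: the count left after a 32-bit wraparound (assumed behaviour)."""
    return n_rows % U32_LIMIT


print(rows_ok < U32_LIMIT)               # True: the first example still fits
print(wrapped_row_count(rows_too_many))  # 97935773, the size seen in the question
```

The second value matches the 97935773 rows observed for the larger dataframe.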
I'm trying to store an ndarray from a pandas DataFrame in Postgres. Putting the ndarrays in a column and using to_sql() stores them very inefficiently. Is there a more memory-efficient way of doing this?
Note: Of course normalizing the ndarrays into rows in a table would be much better for searching and maybe reduce memory usage, but this is specifically about keeping the ndarray since the structure dimensions are not precisely known beforehand.
Using BytesIO in combination with numpy.save() seems to do the trick. Also, explicit types in to_sql ensure bytea is used. Something like:
import io
import numpy as np
import pandas as pd
from sqlalchemy import String, LargeBinary
df = pd.DataFrame([file_path],columns=["filename"])
f = io.BytesIO()
np.save(f, blob_data)
f.seek(0)
blob = f.read()
df['image'] = [blob]
And then save it like:
df.to_sql(con=engine, name=destination_table_name, schema=destination_schema_name, dtype={"filename": String, "image": LargeBinary})
To read it back do something like:
df2 = pull_dataframe_from_postgres_function()
f = io.BytesIO()
f.write(df2["image"][0])
f.seek(0)
data = np.load(f) # data as a ndarray
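The serialization step can be tried on its own, without a database. This is a minimal sketch of the same BytesIO + numpy.save() round trip. The two helper names are made up for illustration, and the bytes(...) call assumes the database driver may hand back a memoryview rather than bytes.

```python
import io

import numpy as np


def ndarray_to_bytes(arr: np.ndarray) -> bytes:
    """Serialize an array to the .npy format in memory (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_ndarray(blob) -> np.ndarray:
    """Rebuild the array from a bytes-like value read back from the bytea column."""
    return np.load(io.BytesIO(bytes(blob)), allow_pickle=False)


# Example array whose shape is not known to the table schema.
original = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
blob = ndarray_to_bytes(original)
restored = bytes_to_ndarray(blob)

print(restored.shape, restored.dtype)
print(np.array_equal(original, restored))
```

Because the .npy header records dtype and shape, the array comes back with its original dimensions even though the column only holds raw bytes.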
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("BTC-USD.csv")
df=df.drop(["Date","Adj Close","Volume","Low","Close"],axis=1)
x=df["Open"]
y=df["High"]
Here is my dataframe
In my dataframe, the newest value is at the top. What I want to do is put the newest value at the bottom and the oldest value at the top.
I understand that you just want to reverse the row order without considering the values in the columns.
For that, you can use the pandas.DataFrame.reindex function as follows:
reordered_df = df.reindex(df.index[::-1])
The documentation is here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
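Here is a small sketch of the reversal on made-up numbers standing in for BTC-USD.csv. It also shows df.iloc[::-1], an alternative that is not part of the answer above, and reset_index(drop=True) in case the row labels should start again from 0.

```python
import pandas as pd

# Made-up values for illustration; the newest row comes first, as in the question.
df = pd.DataFrame({"Open": [30.0, 20.0, 10.0], "High": [33.0, 22.0, 11.0]})

# Reverse the row order by reindexing on the reversed index.
reordered_df = df.reindex(df.index[::-1])

# Positional slicing gives the same row order.
same_order = df.iloc[::-1]

# Optionally renumber the rows 0..n-1 after reversing.
renumbered = reordered_df.reset_index(drop=True)

print(reordered_df["Open"].tolist())
print(reordered_df.index.tolist())
print(renumbered.index.tolist())
```

Both approaches keep the original row labels attached to their rows, so the index of the reversed frame runs backwards until it is reset.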
I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same results for the intercept and the coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# If I use the dataframes directly below (train[['ENGINESIZE']] for 'x' and
# train[['CO2EMISSIONS']] for 'y'), I get the same result
regr.fit(train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way the fit is done with arrays derived from the dataframe.
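The three conversions mentioned above can be compared directly. This sketch uses a small made-up frame in place of the FuelConsumption data, and only pandas and numpy. It does not call scikit-learn.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the training frame in the question.
train = pd.DataFrame({"ENGINESIZE": [2.0, 2.4, 3.5], "CO2EMISSIONS": [196, 221, 255]})

x_frame = train[["ENGINESIZE"]]  # double brackets: one-column DataFrame, 2-D
x_series = train["ENGINESIZE"]   # single brackets: Series, 1-D

x_values = x_frame.values
x_to_numpy = x_frame.to_numpy()
x_asanyarray = np.asanyarray(x_frame)

print(type(x_asanyarray).__name__, x_asanyarray.shape)
print(np.array_equal(x_values, x_to_numpy), np.array_equal(x_values, x_asanyarray))
print(x_series.to_numpy().shape)
```

All three give the same 2-D array of shape (n_samples, 1), which is the layout fit expects for the features. The single-bracket Series converts to a 1-D array instead.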
I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a Dask dataframe from it. How do I do that? Normally, in pandas, the data can easily be loaded into a DataFrame with:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens, 'LocalFreq': local_freq_list,
'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?
If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.