Pandas: value_counts() function explanation

I wanted to ask if anyone knows whether it's possible to return a single value using value_counts() in pandas, or if there is a way to isolate a single value?
Thanks!

To answer the 'a possible way to isolate a single value' part of the question:
iat accesses a single value by integer position:
import pandas as pd
data = pd.Series([f"value_{x}" for x in range(20)])
data.iat[4]  # 'value_4'
Combining this with the value_counts() method:
import pandas as pd
numbers = [f"value_{x}" for x in range(20)]
numbers[5] = numbers[6]  # duplicate 'value_6' so one count stands out
data = pd.Series(numbers)
data.value_counts().iat[0]  # count of the most frequent value -> 2
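If the goal is the count of one specific value rather than a positional lookup, label-based access on the value_counts() result also works (a small sketch reusing the series built above):
counts = data.value_counts()
counts.loc['value_6']      # count for a specific label -> 2
counts.get('value_99', 0)  # safe lookup with a default for absent labels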

Related

How to calculate and reshape more than 1 billion items of data in PySpark?

Our use case is to read data from BigQuery (BQ) and do calculations using pandas and numpy.reshape to turn it into input for the model; sample code:
import numpy as np
import pandas as pd

# Source data
feature = spark.read.format('bigquery') \
    .option('table', TABLE_NAME) \
    .load()

test = feature.to_pandas_on_spark() \
    .sort_values(by=['col1', 'col2'], ascending=True) \
    .drop(['col1', 'col3', 'col5'], axis=1)
test = (test - test.mean()) / test.std()

row = int(len(test) / 100)
row2 = 50
col3 = 100
# reshape the normalized values into a 3-D array for the model
feature_array = np.reshape(test.values, (row, row2, col3))
feature.to_pandas_on_spark() will collect all data into driver memory. It works for a small amount of data, but it cannot handle more than 15 billion rows.
I tried converting the to_pandas_on_spark() result to a Spark DataFrame so that it can compute in parallel:
from pyspark.sql import functions as f

sorted_df = feature.sort('sndr_id').sort('date_index') \
    .drop('sndr_id').drop('date_index').drop('cal_dt')
mean_df = sorted_df.select(*[f.mean(c).alias(c) for c in sorted_df.columns])
std_df = sorted_df.select(*[f.stddev(c).alias(c) for c in sorted_df.columns])
Since the API is different from the pandas API, I cannot verify this code, and a Spark DataFrame does not support the final reshape operation (np.reshape(test.values, (row, row2, col3))). Is there a good solution to replace it?
I want to know how to handle 1 billion rows efficiently and without memory overflow, including how to use numpy's reshape and pandas's computation operations. Any answers will be super helpful!
I would advise against using pandas or numpy on a dataset of this size; there is usually a Spark function to solve your problem, and even firing up a UDF or using pandas-on-Spark comes with a significant performance loss.
What exactly are your reshape criteria?
Maybe pivot helps?
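The normalization step, at least, can stay entirely in Spark. A minimal sketch, assuming sorted_df is the Spark DataFrame from above and that all remaining columns are numeric:
from pyspark.sql import functions as f

# Collect the per-column mean and stddev once (a single-row result)
stats = sorted_df.select(
    *[f.mean(c).alias(f"{c}_mean") for c in sorted_df.columns],
    *[f.stddev(c).alias(f"{c}_std") for c in sorted_df.columns],
).first()

# Normalize every column in parallel on the executors
normalized = sorted_df.select(
    *[((f.col(c) - stats[f"{c}_mean"]) / stats[f"{c}_std"]).alias(c)
      for c in sorted_df.columns]
)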

Exponential moving average in pandas

I was having a bit of trouble making an exponential moving average for a pandas data frame. I managed to make a simple moving average but I'm not sure how I can make one that is exponential. I was wondering if there's a function in pandas or maybe another module that can help with this. Ideally the exponential moving average would be in another column in my data frame. This is my code below:
import pandas as pd
import datetime as dt
import yfinance as yf

# Get initial parameters
start = dt.date(2020, 1, 1)
end = dt.date.today()
ticker = 'SPY'

# Get df data
df = yf.download(ticker, start, end, progress=False)

# Make simple moving average
df['SMA'] = df['Adj Close'].rolling(window=75, min_periods=1).mean()
Thanks
Use the ewm method:
df['EMA'] = df['Adj Close'].ewm(span=75, min_periods=1).mean()
NB: check the parameters' documentation carefully, as there is no window parameter any more; you should use one of com, span, halflife or alpha instead.
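For reference, span is just one way of writing the smoothing factor: alpha = 2 / (span + 1). A small sketch (on a toy series, not the df above) showing the two spellings agree:
import pandas as pd

s = pd.Series(range(10), dtype=float)
span = 75
ema_span = s.ewm(span=span, min_periods=1).mean()
ema_alpha = s.ewm(alpha=2 / (span + 1), min_periods=1).mean()
print((ema_span - ema_alpha).abs().max())  # 0.0: equivalent specifications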

What is the Vaex command for pd.isnull().sum()?

Could someone please give me a Vaex alternative for this code:
df_train = vaex.open('../input/ms-malware-hdf5/train.csv.hdf5')
total = df_train.isnull().sum().sort_values(ascending=False)
Vaex does not at this time support counting missing values at the dataframe level, only at the expression (column) level, so you will have to do a bit of the work yourself.
Consider the following example:
import vaex
import vaex.ml
import pandas as pd

df = vaex.ml.datasets.load_titanic()

count_na = []  # to count the missing values per column
for col in df.column_names:
    count_na.append(df[col].isna().sum().item())

s = pd.Series(data=count_na, index=df.column_names).sort_values(ascending=True)
If you think this is something you might need to use often, it might be worth it to create your own dataframe method following this example.
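A hedged sketch of what such a helper could look like (count_missing is a hypothetical name, not part of the Vaex API):
import pandas as pd
import vaex

def count_missing(df):
    # Count missing values per column and return them as a sorted pandas Series
    counts = {col: df[col].isna().sum().item() for col in df.column_names}
    return pd.Series(counts).sort_values(ascending=False)

# Usage, mirroring the original pandas one-liner:
# df_train = vaex.open('../input/ms-malware-hdf5/train.csv.hdf5')
# total = count_missing(df_train)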

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean: I would like the coefficients table from the summary (the one with coef, std err, t, P>|t| and the confidence interval bounds, highlighted in a red frame in my screenshot) to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS

mod = OLS(endog=coded_design_poly_select.response.values,
          exog=coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results, which you can access via the corresponding methods/attributes.
For what you asked for, I believe the following code would work:
import pandas as pd

# conf_int() holds the lower/upper confidence bounds; wrapping it in a
# DataFrame makes the column selection below work whether the model was
# fit with numpy arrays (as here) or with pandas inputs
ci = pd.DataFrame(fitted_model.conf_int())
data = {'coef': fitted_model.params,
        'std err': fitted_model.bse,
        't': fitted_model.tvalues,
        'P>|t|': fitted_model.pvalues,
        '[0.025': ci[0],
        '0.975]': ci[1]}
pd.DataFrame(data).round(3)
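As an alternative, if your statsmodels version provides summary2(), it exposes the same table directly as a DataFrame (a hedged sketch, not verified against the exact model above):
import pandas as pd

coef_table = fitted_model.summary2().tables[1]  # coefficient table as a DataFrame

# Or round-trip the plain summary()'s HTML through pandas:
# coef_table = pd.read_html(fitted_model.summary().tables[1].as_html(),
#                           header=0, index_col=0)[0]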

Stacking after unstacking a data frame is different in pandas

I created a Series with a 3-level index. Later, I applied the unstack method followed by the stack method, then checked the new and old objects for equality. Why are they different? Is unstacking not the opposite of stacking? Here is my code:
import numpy as np
import pandas as pd

data = pd.Series([7] * 9,
                 index=[[1, 2, 3, 2, 4, 9, 6, 7, 9],
                        ['a', 'c', 'f', 'a', 'k', 'f', 'c', 'd', 'a'],
                        [np.nan] * 9])
data2 = data.unstack().stack()
print(data2.equals(data))
The output is False, but I don't know why!
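They are not exact inverses: unstack() sorts the index and fills missing label combinations with NaN, and stack() then drops those NaNs by default (a default newer pandas versions are moving away from); the all-NaN third index level above adds a further complication, since NaN labels may not survive the round trip at all. A minimal sketch of the sorting and NaN-dropping effects on a simpler series:
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples([(2, 'b'), (1, 'a'), (2, 'a')]))
round_trip = s.unstack().stack()

print(round_trip.equals(s))  # False: rows come back sorted and as floats
# unstack() created a NaN cell for the missing (1, 'b') pair (promoting the
# values to float), and stack() silently dropped it again:
print(s.sort_index().equals(round_trip.astype(s.dtype)))  # True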