Pandas: groupby ffill makes the grouping column disappear

I have a dataframe. I am doing a groupby followed by an ffill. After this, I can no longer see the column I grouped by. Why does that happen, and what can I do to mitigate it? My code is below:
df.groupby(["col1"], as_index=False).fillna(method="ffill")

Try this:
df.groupby('col1', as_index=False).apply(lambda x: x.fillna(method="ffill"))
Why use the apply method?
Groupby follows the split-apply-combine pattern.
Groupby operations can be divided into four categories:
Aggregation
Aggregation functions can be applied directly to the groupby object, because they are computed on each group.
df.groupby('col1', as_index=False).mean()
Transformation
Transformation allows us to perform some computation on the groups as a whole and then return the combined DataFrame. This is done using the transform() function.
df.groupby('col1', as_index=False).transform(lambda x: x.fillna(x.mean()))
Filtration
Filtration allows us to discard certain values based on computation and return only a subset of the group. We can do this using the filter() function in Pandas.
df.groupby('col1', as_index=False).filter(any_filter_function)
Apply
Pandas’ apply() function applies a function along an axis of the DataFrame. When using it with the GroupBy function, we can apply any function to the grouped result.
df.groupby('col1', as_index=False).apply(lambda x: x.fillna(method="ffill"))
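To make the difference concrete, here is a rough, self-contained sketch; the col1/col2 names and values are invented purely for illustration, and the exact behaviour of grouped fillna can vary a little by pandas version (newer versions prefer .ffill() over fillna(method="ffill")):
import numpy as np
import pandas as pd

# invented data, purely to illustrate the behaviour described above
df = pd.DataFrame({"col1": ["a", "a", "b", "b"],
                   "col2": [1.0, np.nan, 2.0, np.nan]})

# fillna on the groupby returns only the filled columns, so col1 disappears
filled = df.groupby("col1", as_index=False).fillna(method="ffill")
print(filled.columns.tolist())   # ['col2']

# apply forward-fills each group's whole frame, so col1 survives
kept = df.groupby("col1", as_index=False).apply(lambda x: x.fillna(method="ffill"))
print(kept.columns.tolist())     # ['col1', 'col2']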

Find dates and difference between extreme observations

The function passed to apply must take a DataFrame as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back into a single DataFrame or Series. apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that it can be quite a bit slower than more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than apply for their specific purposes, so try them before reaching for apply.
The easiest approach is an aggregation with groupby followed by a select:
# make index a column
df = df.reset_index()
# get min of holdings for each ticker
lowest = df[['ticker','holdings']].groupby('ticker').min()
print(lowest)
# select the lowest by performing a left join (minima against the original);
# this returns only the matching rows of df
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
print(lowest_dates)
If you just want a Series of the Date values, you can use this function.
def getLowest(df):
    df = df.reset_index()
    lowest = df[['ticker','holdings']].groupby('ticker').min()
    lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
    return lowest_dates['Date']
From my point of view it would be better to return the entire dataframe, so you know which ticker was lowest and when. In that case you can:
    return lowest_dates
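For reference, a rough usage sketch with invented data; the Date index and the ticker/holdings column names are assumptions about the original frame, mirroring the snippet above:
import pandas as pd

# invented frame indexed by Date, matching the shape assumed above
df = pd.DataFrame(
    {'ticker':   ['AAA', 'AAA', 'BBB', 'BBB'],
     'holdings': [10, 5, 7, 3]},
    index=pd.to_datetime(['2021-01-01', '2021-01-02',
                          '2021-01-01', '2021-01-02']),
)
df.index.name = 'Date'

# Series of the date on which each ticker hit its minimum holdings
print(getLowest(df))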

Split and merge nested DataFrame in Python

I have a dataframe which has two columns. One of the columns itself contains another dataframe in each row. It looks like below:
I want a dataframe with 3 columns, containing "Date_Region", "transformed_weight" and "Barcode", which would replicate each "Date_Region" row as many times as the length of its "Weights-Barcode" dataframe. The final dataframe should look like below:
This will do:
pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(
            Date_Region=row['Date_Region'],
        ),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
From the inside out:
final_df.apply(..., axis=1) will call the lambda function on each row.
The lambda function uses assign() to return the nested DataFrame from that row with an addition of the Date_Region column with the value from the outside.
Calling iter(...) on the resulting series results in an iterable of the DataFrames already including the added column.
Then pd.concat(...) concatenates that iterable of DataFrames. I'm using ignore_index=True here to just reindex everything again (your index doesn't appear to be meaningful, and not ignoring it means you'd end up with duplicates.)
Finally, I'm reordering the columns, so the added Date_Region column becomes the leftmost one.
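To make this concrete, here is a minimal sketch with invented nested frames (column names follow the question; the values are arbitrary):
import pandas as pd

# each cell of 'Weights-Barcode' holds its own small DataFrame, as in the question
nested_a = pd.DataFrame({'transformed_weight': [1.0, 2.0], 'Barcode': ['x1', 'x2']})
nested_b = pd.DataFrame({'transformed_weight': [3.0], 'Barcode': ['y1']})

final_df = pd.DataFrame({'Date_Region': ['2020-01-01_EU', '2020-01-01_US']})
final_df['Weights-Barcode'] = pd.Series([nested_a, nested_b], dtype='object')

flat = pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(Date_Region=row['Date_Region']),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]

print(flat)  # each Date_Region repeats once per row of its nested frame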

How do I stack 3-D arrays in a grouped pandas dataframe?

I have a pandas dataframe that consists of two columns: a column of string identifiers and a column of 3-D arrays. The arrays have been grouped by the ID. How can I stack all the arrays for each group so that there is a single stacked array for each ID? The code I have is as follows:
df1 = pd.DataFrame({'IDs': ids})
df2 = pd.DataFrame({'arrays':arrays})
df = pd.concat([df1, df2], axis=1)
grouped = df['arrays'].groupby(df['IDs'])
(I attempted np.dstack(grouped), but this was unsuccessful.)
I believe this is what you want:
df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()
It will apply the np.dstack(...) function to each group of arrays sharing an ID.
The apply() function returns a pd.Series (with IDs as index), so we then use to_frame() to create a DataFrame from it and reset_index() to put IDs back as a column.
(Note: The documentation for apply() talks about using agg() for efficiency, but unfortunately it doesn't seem to be possible to use agg() with a function that returns an ndarray, such as np.dstack. In that case, agg() wants to treat that array as multiple objects, as a series, rather than as a single object... My attempts with it resulted in an exception saying "function does not reduce".)
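A quick sketch with invented data shows the pattern (the 2x2x1 array shapes are arbitrary; only the group-then-dstack idea matters):
import numpy as np
import pandas as pd

# invented example: two IDs, each with two 2x2x1 arrays
ids = ['a', 'a', 'b', 'b']
arrays = [np.zeros((2, 2, 1)), np.ones((2, 2, 1)),
          np.zeros((2, 2, 1)), np.ones((2, 2, 1))]

df = pd.DataFrame({'IDs': ids})
df['arrays'] = pd.Series(arrays, dtype='object')

stacked = df.groupby('IDs')['arrays'].apply(np.dstack).to_frame().reset_index()
print(stacked['arrays'].iloc[0].shape)  # (2, 2, 2): the third axis grows per group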

How to translate a pandas group by without aggregation to pyspark?

I am trying to convert the following pandas line into pyspark:
df = df.groupby('ID', as_index=False).head(1)
Now, I am familiar with pyspark's df.groupby("col1", "col2") method, as well as the following to get whatever the first element is within a group:
df = df.withColumn("row_num", row_number().over(Window.partitionBy("ID").orderBy("SOME_DATE_COLUMN"))).where(col("row_num") < 2)
However, without an orderBy argument, this grouping and fetching of the first element in each group doesn't work (and I am literally trying to convert from pandas to spark, whatever the pandas line does):
An error occurred while calling o2547.withColumn.
: org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table
Looking at the pandas groupby documentation, I cannot grasp what groupby does without a subsequent sort/agg function applied to the groups; i.e. what is the default order within a group from which .head(1) fetches the first element?
It depends on the order of your pandas dataframe before the groupby. From the pandas groupby documentation:
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Converting the pandas behaviour exactly to pyspark is impossible, as pyspark dataframes aren't ordered. But if your data source can provide a row number or something like that, it is possible.
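If such a marker is available, or can be fabricated at read time, one possible sketch (the _row_marker name is made up here) is to tag each row with monotonically_increasing_id() and order the window by it; this only reproduces the pandas result when the source actually preserves row order:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# sketch only: monotonically_increasing_id() yields increasing (not consecutive)
# ids in the order Spark reads the data, which matches the original row order
# only if the data source preserves that order (an assumption, as noted above)
df = df.withColumn("_row_marker", F.monotonically_increasing_id())

w = Window.partitionBy("ID").orderBy("_row_marker")
df = (df.withColumn("row_num", F.row_number().over(w))
        .where(F.col("row_num") == 1)
        .drop("row_num", "_row_marker"))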

Linear 1D interpolation on multiple datasets using loops

I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs given here is just 3, but I'm running on a dataset that will run into thousands of runs, so I'd appreciate advice on how to build the interpolation functions iteratively.
from scipy.interpolate import interp1d

for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber] = interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
interpolating_functions = dict()
for run_number in range(1, max_runs):
    run_data = df[df['RUN']==run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    # the _call_linear method expects and returns an array
    interp_y = int_func._call_linear([row['X']])
    return interp_y[0]
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate,axis=1)
I'm using pandas version 0.20.3, and this gives me a new_df with the Y_interpolation column filled in as expected.
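For completeness, a minimal end-to-end sketch with made-up numbers; column names follow the answer, but this version calls the interp1d object directly instead of the private _call_linear method:
import pandas as pd
from scipy.interpolate import interp1d

# made-up source data: known (X, Y) pairs per RUN
df = pd.DataFrame({
    'RUN': [1, 1, 1, 2, 2, 2],
    'X':   [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    'Y':   [0.0, 10.0, 20.0, 5.0, 15.0, 25.0],
})

# rows whose Y we want to fill by interpolation
new_df = pd.DataFrame({'RUN': [1, 2], 'X': [0.5, 1.5]})

interpolating_functions = {}
for run_number in df['RUN'].unique():
    run_data = df[df['RUN'] == run_number]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])

# calling the interp1d object directly performs the linear interpolation
new_df['Y_interpolation'] = new_df.apply(
    lambda row: float(interpolating_functions[row['RUN']](row['X'])), axis=1)

print(new_df)  # Y_interpolation: 5.0 and 20.0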