Difference in count of groupby and its indices - pandas

What is the difference between the length of a groupby object and the length of its indices attribute? I expected both statements to return the same number.
len(Fees.groupby(['InstituteCode','Code','ProgramType','Status','AcademicYear']))
8000
Why do I get different numbers?
len(Fees.groupby(['InstituteCode','Code','ProgramType','Status','AcademicYear']).indices)
7433
Does it mean I have only 7433 distinct records for the given list of columns?

This was because the "Code" column was null for 568 records; those rows were skipped by groupby. It became clear when I checked for null values using:
df.apply(lambda x: x.isnull().sum())
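In other words, rows whose grouping keys contain NaN are dropped from the groups by default. A minimal made-up sketch (column names borrowed from the question, data invented):
import numpy as np
import pandas as pd

# Toy stand-in for Fees; only two of the grouping columns are used here
Fees = pd.DataFrame({
    'InstituteCode': ['A', 'A', 'B', 'B'],
    'Code':          ['X', np.nan, 'Y', np.nan],
    'Amount':        [10, 20, 30, 40],
})
keys = ['InstituteCode', 'Code']

# Rows with a null grouping key are dropped from the groups by default...
print(len(Fees.groupby(keys).indices))                # 2
# ...but they can be kept with dropna=False (pandas >= 1.1)
print(len(Fees.groupby(keys, dropna=False).indices))  # 4
# Quick way to spot nulls in the grouping columns
print(Fees[keys].isnull().sum())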

Related

Null value of mean in dataframe columns after imputation

I have a dataframe where I imputed nulls with the mean of each group ID. After imputation, the null count for all such columns is 0. Yet when I run describe, the mean of a few of the imputed columns shows as NaN. The null count for those columns is 0 and their data type is 'float16' (changed from 'float64' to save memory). How could this possibly happen?
Thanks in advance!
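No answer appears in the thread, but one plausible culprit, given that the columns were downcast to float16, is overflow rather than missing data: float16 tops out around 65504, so the running sum (and count) used for the mean can overflow once there are enough rows. A made-up sketch illustrating the effect, assuming the values are large or numerous enough to blow past that limit:
import numpy as np
import pandas as pd

s = pd.Series(np.full(100_000, 1000.0), dtype='float16')  # no nulls at all
print(s.isnull().sum())             # 0
print(s.mean())                     # non-finite: the float16 running sum and count overflow
print(s.astype('float64').mean())   # 1000.0 once computed in float64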

Is there a way in pandas to compare two values within one column and sum how many times the second value is greater?

In my pandas dataframe I have one column, score, whose rows are values such as [80, 100], [90, 100], etc. I want to go through this column and, whenever the second value in the list is greater than the first, count it, so that I end up with the number of times where, in [a, b], b was greater. How would I do this?
print(len([x for x in df['score'] if x[1] > x[0]]))
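A slightly more pandas-flavored alternative, under the same assumption that each entry of score is a two-element list like [80, 100] (the frame below is made up):
import pandas as pd

df = pd.DataFrame({'score': [[80, 100], [90, 100], [100, 90]]})

# .str[i] indexes into each list element-wise, so this compares b to a per row
count = (df['score'].str[1] > df['score'].str[0]).sum()
print(count)  # 2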

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation while also keeping the values of the other columns that correspond to the top 10 'totaldemand' values per year?
We usually do head after sort_values (grouping the sorted frame by its own index so the year labels stay aligned):
df_sorted = df.sort_values('totaldemand', ascending=False)
df = df_sorted.groupby(df_sorted.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
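For reference, a self-contained sketch of that approach on made-up hourly data (the column name 'totaldemand' is taken from the question; 'other_col' is invented):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', '2021-12-31 23:00', freq='h')
df = pd.DataFrame({'totaldemand': rng.random(len(idx)),
                   'other_col': rng.integers(0, 10, len(idx))}, index=idx)

top10 = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(10, 'totaldemand'))
print(top10.shape)  # (20, 2): 10 rows per year, with all columns kept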
Get the index of your query and use it to select the matching rows from your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)

How can I add continuous 'Ident' column to a dataframe in Pyspark, not as monotonically_increasing_id()?

I have a dataframe 'df', and I want to add an 'Ident' numeric column where the values are continuous. I tried monotonically_increasing_id(), but the values are not consecutive. As its description says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
So, my question is, how could I do that?
You could try something like this,
df = df.rdd.zipWithIndex().map(lambda x: [x[1]] + [y for y in x[0]]).toDF(['Ident']+df.columns)
This will give you the first column as your identifier, with consecutive values from 0 to N-1, where N is the total number of records in df.
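A minimal usage sketch of that zipWithIndex approach (the SparkSession setup and the one-column frame below are just for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['letter'])

df_with_ident = (
    df.rdd
      .zipWithIndex()                      # pairs each Row with its position
      .map(lambda x: [x[1]] + list(x[0]))  # put the index first
      .toDF(['Ident'] + df.columns)
)
df_with_ident.show()
# The Ident column now runs 0, 1, 2 with no gaps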

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now have about 200,000 rows of unique data that track any change to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (an error in the raw data). For example, the location is filled in for one row but blank in the next; I know the location has not changed, but because the field is blank the row is captured as a new unique row.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced in the pandas documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
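As a quick sanity check of the group-wise forward fill, here is a made-up frame shaped like the question's example (the second id is invented):
import numpy as np
import pandas as pd

wfm = pd.DataFrame({
    'id':       ['xyz987', 'xyz987', 'xyz987', 'abc123'],
    'location': ['CA', np.nan, 'CA', np.nan],
})

# Forward fill within each id; the NaN in group 'abc123' stays NaN because
# there is nothing earlier in that group to carry forward.
print(wfm.groupby('id').ffill())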
github/jreback: "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this." (https://github.com/pandas-dev/pandas/issues/11296)
According to jreback's answer, ffill() is not optimized when run on a groupby, but cumsum() is. One workaround along those lines: sort by id, forward fill the whole frame, then blank out any cell that was filled before its group's first valid value, so nothing leaks across group boundaries:
df = df.sort_values('id')
mask = df.notnull().astype(int).groupby(df['id']).cumsum() > 0
df = df.ffill().where(mask)
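A tiny made-up check that this mask keeps fills from leaking across group boundaries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'location': ['CA', np.nan, np.nan, 'NY']})

df = df.sort_values('id')
mask = df.notnull().astype(int).groupby(df['id']).cumsum() > 0
print(df.ffill().where(mask))
# The leading NaN in group 'b' stays NaN instead of inheriting 'CA' from group 'a'.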