Find indices of a groupby in pandas

I have this fake dataset:
And I want to find how many times a combination of BirthDate and ZipCode occur, like so:
Now, my question is: how can I find the positions in the dataset df where these occurrences happen? For example, how can I find the positions where 2000101 and 08002 are?
Thanks in advance.

Use GroupBy.agg if you need to aggregate counts and collect the index values as lists into a new column Pos:
df1 = (df.reset_index()
         .groupby(['BirthDate', 'ZipCode'])
         .agg(RowNumber=('BirthDate', 'size'), Pos=('index', list)))
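For concreteness, here is a minimal runnable sketch on made-up data (the question's dataset is not shown, so the values below are invented around the 2000101 / 08002 example):

import pandas as pd

# hypothetical stand-in for the original dataset
df = pd.DataFrame({'BirthDate': [2000101, 2000101, 1999123],
                   'ZipCode':   ['08002', '08002', '08001']})

df1 = (df.reset_index()
         .groupby(['BirthDate', 'ZipCode'])
         .agg(RowNumber=('BirthDate', 'size'), Pos=('index', list)))
print(df1)
#                    RowNumber     Pos
# BirthDate ZipCode
# 1999123   08001            1     [2]
# 2000101   08002            2  [0, 1]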

Related

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column of unequal-length lists which I want to split into multiple columns (the item values will be the column names). An example is given below.
I have done it through iterrows, iterating through the rows and examining the list in each row. It seems workable as my dataframe has few rows. However, I wonder if there is a cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
venue_df['school'] = venue_df.location.apply(lambda x: int('school' in x))
venue_df['office'] = venue_df.location.apply(lambda x: int('office' in x))
venue_df['home'] = venue_df.location.apply(lambda x: int('home' in x))
venue_df['public_area'] = venue_df.location.apply(lambda x: int('public_area' in x))
Hope this helps!
First let's explode your Location column so we can get your wanted end result:
s = df['Location'].explode()
Then let's use crosstab on that series so we can get your end result:
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
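Since the answer above is untested, here is a minimal runnable sketch, assuming a df whose Location column holds lists (the data and label values are made up):

import pandas as pd

# hypothetical frame; the asker's venue_df is not shown in the question
df = pd.DataFrame({'Location': [['school', 'home'],
                                ['office'],
                                ['home', 'public_area', 'home']]})

s = df['Location'].explode()

# one row per original index, one column per label, cells hold counts
print(pd.crosstab(s.index, s))
# Location  home  office  public_area  school
# 0            1       0            0       1
# 1            0       1            0       0
# 2            2       0            1       0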

Groupby returns the previous df without changing it

import pandas as pd

df = pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day','time'])
df_1.head()
What am I missing here? It returns the previous dataframe to me without the groupby applied.
We can print each group using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Using head on the grouped dataframe takes the first 5 rows of the dataframe even if they span groups, because that is how the groupby object is built. You can use @tlentali's approach to print out each group, but df_1 will not be assigned the grouped dataframe that way; instead it will be None (once per group), as that is the return value of print.
The way below gives a lot of control over how to show/display the groups and their keys.
This might also help you understand more about how the grouped data frame structure in pandas works.
df_1 = df.groupby(['day','time'])

# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
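If you only need one group rather than all of them, get_group fetches it directly; a small sketch (the ('Sun', 'Dinner') key assumes the usual day/time values of the tips dataset):

df_1 = df.groupby(['day', 'time'])

# fetch a single group by its key instead of looping over all of them
sun_dinner = df_1.get_group(('Sun', 'Dinner'))
print(sun_dinner.head())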

Is there a pandas function to get the variable names in a column?

I'm thinking of a hypothetical dataframe (df) with around 50 columns and 30000 rows, and one hypothetical column like e.g. Toy = ['Ball','Doll','Horse',...,'Sheriff',etc].
Now I only have the name of the column (Toy) and I want to know what values are inside the column, without duplicates.
I'm thinking of an output like the .describe() function
df['Toy'].describe()
but with more info, because now I'm getting only this output
count 30904
unique 7
top "Doll"
freq 16562
Name: Toy, dtype: object
In other words, how do I get the 7 unique values in this column? I was thinking of something like copying the column and deleting duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or should I use another library?
Thank you so much!
You can use the unique() function to list all the unique values in a column. In your case, to list the unique values in the column toys of the dataframe df, the syntax would look like
df["toys"].unique()
You can also use .drop_duplicates(), which returns a pandas Series:
df['toys'].drop_duplicates()
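The practical difference between the two is the return type; a quick sketch (the toys column is made up, following the answers above):

import pandas as pd

df = pd.DataFrame({'toys': ['Ball', 'Doll', 'Doll', 'Horse']})

print(df['toys'].unique())           # numpy array: ['Ball' 'Doll' 'Horse']
print(df['toys'].drop_duplicates())  # pandas Series, keeps the original index
print(df['toys'].nunique())          # 3, just the count of unique values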

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values for each year of that specific column, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values: sort first, then keep the first 10 rows of each year's group. Note the grouping key must be taken from the sorted frame so it stays aligned with the rows, and no column selection is needed since you want all columns:
df_sorted = df.sort_values('totaldemand', ascending=False)
out = df_sorted.groupby(df_sorted.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
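Note that the result of this apply carries the year as an extra outer index level; if you prefer the original index, one possible cleanup (assuming the default group_keys behaviour) is:

top = (df.groupby(df.index.year)
         .apply(lambda grp: grp.nlargest(10, 'totaldemand'))
         .droplevel(0))  # drop the added year level, keeping the original index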
Get the index of your query and use it as a mask on your original df:
idx = (df.groupby(df.index.year)['totaldemand']
         .apply(lambda grp: grp.nlargest(10))
         .index.get_level_values(1))
df.loc[idx]
(or something to that extent, I can't test now without any test data)

Find unique values across each row

I have a 2-dimensional numpy array, say as follows:
[["cat","dog","dog","mouse","man"],
["rhino","rhino","bat","rhino","dino","dino"],
["zebra","alien","alien","alien","alien"]]
I want to perform numpy.unique along each row in order to count the number of occurrences of each label. Unfortunately I don't think this is possible, as numpy.unique would return vectors of different lengths:
[["cat","dog","mouse","man"]
["rhino","bat","dino"]
["zebra","alien"]]
(similar then for the counts)
so this won't work obviously.
Does anybody know of a way I can get around this problem?
Try this:
import pandas as pd

a = pd.DataFrame([["cat","dog","dog","mouse","man"],
                  ["rhino","rhino","bat","rhino","dino","dino"],
                  ["zebra","alien","alien","alien","alien"]])

# dropna first: shorter rows are padded with None by the DataFrame
# constructor, which unique() would otherwise report as a value
a.apply(lambda x: pd.Series(x.dropna().unique()), axis=1)
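Since the goal was also to count the occurrences of each label, the same idea extends to value_counts per row; a small sketch:

import pandas as pd

a = pd.DataFrame([["cat", "dog", "dog", "mouse", "man"],
                  ["rhino", "rhino", "bat", "rhino", "dino", "dino"],
                  ["zebra", "alien", "alien", "alien", "alien"]])

# one column per label, cells hold per-row counts, NaN where a label is absent
counts = a.apply(lambda x: x.value_counts(), axis=1)
print(counts)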