Pandas dataframe - Unique counts based on different conditions - pandas

I have a dataframe like this:
[screenshot of the dataframe]
I want to find out :
unique viewers: 3
unique viewers who reviewed movies: 2
I am able to do that using the following code:
movie['Viewer_ID'].nunique()
movie.loc[movie['watched'] == 1, 'Viewer_ID'].nunique()
However, I was wondering if there is a better way to combine both into one, something like:
movie.agg({'Viewer_id': 'nunique',
           'watched': 'sum'})
Is there a way I can write a conditional count within the agg function?

You can use .groupby():
view_count = movie.groupby('Viewer_id').watched.sum()
Now view_count is a Series with the viewer id as index and the sum of watched as values. You can filter it with:
filtered = view_count.loc[view_count > 0]
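A runnable sketch of this approach, using toy data standing in for the screenshot (the viewer ids and values are assumptions):

```python
import pandas as pd

# Toy data: 3 unique viewers, 2 of whom watched at least one movie
movie = pd.DataFrame({
    'Viewer_id': ['A', 'A', 'B', 'C'],
    'watched':   [1,   0,   1,   0],
})

# Sum of watched per viewer; the viewer id becomes the index
view_count = movie.groupby('Viewer_id').watched.sum()

# Keep only viewers with at least one watched movie
filtered = view_count.loc[view_count > 0]

print(view_count.size)  # 3 unique viewers
print(filtered.size)    # 2 unique viewers who watched
```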

Apache spark: asc not working as expected

I have following code:
df.orderBy(expr("COUNTRY_NAME").desc, expr("count").asc).show()
I expect count column to be arranged in ascending order for a given COUNTRY_NAME. But I see something like this:
The last value of 12 is not as expected.
Why is that?
If you output df.printSchema(), you'll see that your "count" column is of the string datatype, resulting in the undesired alphanumeric sort.
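Plain Python shows the same effect: string sorting compares character by character, so "12" sorts before "2":

```python
counts = ['2', '10', '12']

# Lexicographic (string) sort: everything starting with '1' sorts before '2'
print(sorted(counts))           # ['10', '12', '2']

# Numeric sort after casting, analogous to .cast('int') in Spark
print(sorted(counts, key=int))  # ['2', '10', '12']
```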
In pyspark, you can accomplish what you are looking for by casting and then sorting on both keys in a single orderBy call (a second orderBy would simply re-sort the whole dataset and discard the first ordering):
df = df.withColumn('count', df['count'].cast('int'))
df.orderBy(['COUNTRY_NAME', 'count'], ascending=[False, True]).show()
You should create and apply your schema when the data is read in - if possible.

I have a dataframe that I am able to see in a Jupyter Notebook using df.display(), but when I write it to Excel, the Excel output has only 1 column

I need to group the data by the following columns and get the most recent date for answers
import numpy as np

df_Q = df.groupby(['question', 'user_id', 'options', 'answer'])
df_Date = df_Q.agg(Recent_Date=('datetime', np.max))
I would presume that this is because your dataframe only has one column, as the rest have been converted to "multiindex row-axis labels" by the groupby + agg.
Try adding:
df_Date = df_Date.reset_index()
Before the export to Excel.
Beware that you might get more than one answer per person per question, since you are including "answer" in your groupby. Depending on your goal, grouping by "user_id" and "question" alone might suffice.
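A minimal sketch of the fix with made-up rows (the column values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'question': ['q1', 'q1', 'q2'],
    'user_id':  [1, 1, 2],
    'options':  ['a', 'a', 'b'],
    'answer':   ['x', 'x', 'y'],
    'datetime': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-01-15']),
})

df_Q = df.groupby(['question', 'user_id', 'options', 'answer'])
df_Date = df_Q.agg(Recent_Date=('datetime', 'max'))

# The group keys are now MultiIndex labels, not columns;
# reset_index flattens them back into regular columns before export
df_Date = df_Date.reset_index()
print(df_Date.columns.tolist())
# ['question', 'user_id', 'options', 'answer', 'Recent_Date']
```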

How to use AND (&) operator in Pandas between 2 filters?

I am using the StackOverflow annual survey data to do a sample data analysis project. The data can be found in the below link:
Annual Data Survey
I want to filter 2 columns with a single command. The dtypes of the two columns are as follows:
Country 88751 non-null object
ConvertedComp 55823 non-null float64
I want to select a list of countries and then see if their ConvertedComp is greater than 10000.
To make a list of the required countries, I am using the following filter:
countries = ['United States', 'India','United Kingdom','Germany']
filt = (df['Country'].isin(countries) )
I am using the following filter on ConvertedComp:
filt1 = (df['ConvertedComp']>1000 )
I want to use both these conditions to make a single filter in a single cell. I am using the & operator as follows:
filter1 = (df['Country'].isin(countries) & df['ConvertedComp']>1000)
According to my understanding, when I apply the above filter to the dataframe, I should not see any countries except the ones in the list. However, when I apply the filter, the dataframe gives me 0 results.
Can anyone please explain how to correctly use the & operator while using filters?
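For what it's worth, the likely culprit is operator precedence: in Python, & binds more tightly than >, so each comparison needs its own parentheses. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['United States', 'India', 'France'],
    'ConvertedComp': [50000.0, 500.0, 20000.0],
})
countries = ['United States', 'India', 'United Kingdom', 'Germany']

# Parenthesize each condition; without the second pair of parentheses,
# the expression groups as (isin & ConvertedComp) > 1000
filter1 = (df['Country'].isin(countries)) & (df['ConvertedComp'] > 1000)
print(df.loc[filter1, 'Country'].tolist())  # ['United States']
```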

How to use aggregate with condition in pandas?

I have a dataframe.
Following code works
stat = working_data.groupby(by=['url', 'bucket_id'],
                            as_index=False).agg({'delta': 'max', 'id': 'count'})
Now I need to count ids with different statuses. The status can be "DOWNLOADED", "NOT_DOWNLOADED", or "DOWNLOADING".
I would like to have a df with the columns bucket_id, max, downloaded (how many have the "DOWNLOADED" status), not_downloaded (how many have the "NOT_DOWNLOADED" status), and downloading (how many have the "DOWNLOADING" status). How can I do that?
Input I have:
[screenshot]
Output I have:
[screenshot]
As you can see, the count isn't divided by status. But I want to know that there are x downloaded, y not_downloaded, and z downloading for each bucket_id (so they should be in separate columns, but the info for one bucket_id should be in one row)
One way is to use assign to create indicator columns, then aggregate these new columns:
working_data.assign(downloaded=working_data['status'] == 'DOWNLOADED',
                    not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
                    downloading=working_data['status'] == 'DOWNLOADING')\
            .groupby(by=['url', 'bucket_id'], as_index=False)\
            .agg({'delta': 'max',
                  'id': 'count',
                  'downloaded': 'sum',
                  'not_downloaded': 'sum',
                  'downloading': 'sum'})
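The same approach, runnable end to end with toy data (the urls, ids, and values are assumptions). Summing the boolean indicator columns yields the per-status counts:

```python
import pandas as pd

working_data = pd.DataFrame({
    'url':       ['u1', 'u1', 'u1', 'u2'],
    'bucket_id': [1, 1, 1, 2],
    'id':        [10, 11, 12, 13],
    'delta':     [5, 9, 2, 7],
    'status':    ['DOWNLOADED', 'NOT_DOWNLOADED', 'DOWNLOADING', 'DOWNLOADED'],
})

stat = (working_data
        .assign(downloaded=working_data['status'] == 'DOWNLOADED',
                not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
                downloading=working_data['status'] == 'DOWNLOADING')
        .groupby(['url', 'bucket_id'], as_index=False)
        .agg({'delta': 'max', 'id': 'count',
              'downloaded': 'sum', 'not_downloaded': 'sum',
              'downloading': 'sum'}))
print(stat)
# One row per (url, bucket_id), with the status counts in separate columns
```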

How to get the first tuple inside a bag when Grouping

I do not understand how to deal with duplicates when generating my output: I ended up getting several duplicates, but I want only one.
I've tried using LIMIT but that only applies when selecting I suppose. I also used DISTINCT but wrong scenario I guess.
grouped = GROUP wantedTails BY tail_number;
smmd = FOREACH grouped GENERATE wantedTails.tail_number as Tails, SUM(wantedTails.distance) AS totaldistance;
So for my grouped relation, I got something like (not the whole output):
({(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB)},44550)
but I expect (N983JB,44550). How can I delete those duplicates generated during grouping? Thank you!
The way I see it, there are two ways to de-duplicate data in Pig.
A less flexible but convenient way is to apply MAX to the columns that need to be de-duplicated after performing a GROUP BY, or SUM if you want to add up the values across duplicates:
dataWithDuplicates = LOAD '<path_to_data>';
grouped = GROUP dataWithDuplicates BY tail_number;
dedupedData = FOREACH grouped GENERATE
    -- since you have grouped on tail_number, it is already de-duped
    group AS tailNumber,
    MAX(dataWithDuplicates.distance) AS dedupedDistance,
    SUM(dataWithDuplicates.distance) AS totalDistance;
If you want more flexibility while de-duping, you can use a nested FOREACH in Pig. This question captures the gist of its usage: how to delete the rows of data which is repeating in Pig. Another reference for nested FOREACH: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html