Pandas aggregate by unique occurrence per group

In pandas, I'd like to analyze groups only if there is at least one occurrence of a value meeting a condition. I've included a sample dataframe below with a first attempt at identifying such groups. So, let's say, in the data frame below, I want to filter the original data frame to only the species of iris that ever had a sepal length greater than 6. In the last command, I'm counting the number of unique species groups that had a sepal length greater than 6 (so, at least I can count them).
But what I really want is the original dataframe restricted to rows whose species ever had a sepal length greater than 6 (so, it would be a dataframe without the species "setosa", since setosa never has one).
The longer explanation is that I have a real dataset of users. Each user will have values in certain columns that may exceed a threshold value of interest. I haven't figured out how to restrict the analysis to the users who ever hit those threshold values.
Perhaps a loop would be better: I might loop through each unique user name, check whether any row for that user ever exceeds a certain value, and flag it in some kind of new column (though I know loops are frowned upon in pandas, so I'm posting here to see if there's a well-known method of identifying groups by occurrence).
Thanks and let me know if I can make this question any more clear!
import pandas as pd
import seaborn as sns
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
iris = sns.load_dataset('iris')
iris['longsepal'] = iris['sepal_length'] > 7
iris['longpetal'] = iris['petal_length'] > 5
iris.groupby(['longsepal'])['species'].nunique()

Consider groupby().transform() to calculate an inline max aggregate per species, which can then be filtered on by value. Technically, the > 7 only returns one species, since versicolor's max sepal length reaches exactly 7.0. Below shows both the operator form and the functional form of the inequality logic.
iris['longsepal'] = iris.groupby(['species'])['sepal_length'].transform('max')
iris['longpetal'] = iris.groupby(['species'])['petal_length'].transform('max')
# DATA FILTERS
longsepal_iris = iris.loc[iris['longsepal'] > 7] # GREATER THAN OPERATOR FORM: >
longsepal_iris = iris.loc[iris['longsepal'].gt(7)] # GREATER THAN FUNCTIONAL FORM: gt()
longpetal_iris = iris.loc[iris['longpetal'] > 5] # GREATER THAN OPERATOR FORM: >
longpetal_iris = iris.loc[iris['longpetal'].gt(5)] # GREATER THAN FUNCTIONAL FORM: gt()
# SPECIES
longsepal_iris['species'].unique()
# ['virginica']
longpetal_iris['species'].unique()
# ['versicolor' 'virginica']
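For completeness, an equivalent way to keep the original rows without adding helper columns is groupby().filter(); the following is a minimal sketch on the same data, not part of the original answer:
# Keep only rows of species that ever had sepal_length > 7
longsepal_iris = iris.groupby('species').filter(lambda g: g['sepal_length'].gt(7).any())
longsepal_iris['species'].unique()
# ['virginica']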

Related

DataFrame append to DataFrame row by row and reset if condition is matched

I have a DataFrame which I want to slice into many DataFrames by adding rows one by one until the sum of the column Score of the slice is greater than 50,000. Once that condition is met, I want a new slice to begin.
Here is an example of what this might look like:
Sum Score cumulatively, floor divide it by 50,000, and shift it up one cell (since you want each group to be > 50,000 and not < 50,000).
import pandas as pd
import numpy as np
# Generating DataFrame with random data
df = pd.DataFrame(np.random.randint(1,60000,15))
# Creating new column that's a cumulative sum with each
# value floor divided by 50000
df['groups'] = df[0].cumsum() // 50000
# Values shifted up one and missing values filled with the maximum value
# so that values at the bottom are included in the last DataFrame slice
df.groups = df.groups.shift(-1, fill_value=df.groups.max())
Then as per this answer you can use pandas.DataFrame.groupby in a list comprehension to return a list of split DataFrames.
df_list = [df_slice for _, df_slice in df.groupby(['groups'])]
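As a quick sanity check (illustrative, not part of the original answer), you can print each slice's index, row count, and column-0 total:
# Inspect each resulting slice
for i, df_slice in enumerate(df_list):
    print(i, len(df_slice), df_slice[0].sum())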

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've come up with so far is something like the below.
(Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)
        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script, which will emulate data for the above scenario:
import numpy as np
import pandas as pd
colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')
def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_a),
        'Size_Group': np.random.choice(sizes, size_df_a),
        'Shape': np.random.choice(shapes, size_df_a),
        'Was_Sold': np.random.choice(sold, size_df_a),
        'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
)

df_b = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_b),
        'Size_Group': np.random.choice(sizes, size_df_b),
        'Shape': np.random.choice(shapes, size_df_b),
        'Was_Sold': np.random.choice(sold, size_df_b),
        'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
)
First of all, I think your computation would be much more efficient using a relational database and SQL queries. Indeed, the filters can be done by indexing columns, performing a database join, some advanced filtering, and counting the result. An optimized relational database can generate an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good at performing advanced requests like this efficiently. It is also very slow to iterate over a pandas dataframe, although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some Numpy and Python tricks to (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. Thus, let us convert that column:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve, though, due to the non-trivial conditions. A first solution is to group df_b by the col column ahead of time, so as to avoid an expensive complete iteration over df_b (previously done with df_b[col] == relevant_val). Then, the dates of the precomputed groups can be sorted so as to perform a fast binary search later. Finally, you can use Numpy to count boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]
        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration over the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half the computation time because of the huge overhead of pandas internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be improved by splitting colGroups into subgroups ahead of time (reaching O(Na log Nb)). This should help to completely remove the overhead of prior_sold_count. The resulting program should be about 10 times faster than the original one.
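As a hedged sketch of that last idea (my own illustration, not the answer's exact code): precomputing, for each group, a sorted NumPy array of dates together with a cumulative count of sold items turns the inner loop into a pure-NumPy binary search plus one array lookup, removing the repeated per-row pandas overhead:
# Assumes df_b.Was_Sold has already been converted to booleans as above
cols = ['Color', 'Size_Group', 'Shape']
for col in cols:
    # For each group value: sorted dates and a running count of sold toys
    lookups = {}
    for gr_id, gr_df in df_b.groupby(col):
        gr_df = gr_df.sort_values('Date_Made')
        dates = gr_df['Date_Made'].values                       # np.datetime64 array
        sold_cum = np.cumsum(gr_df['Was_Sold'].values.astype(np.int64))
        lookups[gr_id] = (dates, sold_cum)
    ratios = np.zeros(len(df_a))
    for i, (val, date) in enumerate(zip(df_a[col].values, df_a['Date_Made'].values)):
        dates, sold_cum = lookups[val]
        prior_count = np.searchsorted(dates, date)               # binary search, strict '<'
        if prior_count:
            ratios[i] = sold_cum[prior_count - 1] / prior_count
    df_a[col + '_Prior_Sold_Ratio'] = ratios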

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I now output the resulting data frame count_vect_df, it contains a lot of columns which are empty / contain only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix, which contains mostly zero values; the non-zero values represent the number of times a specific term appeared in a particular document.
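If the goal is simply to avoid materializing all those zeros in memory, one option (a sketch assuming a reasonably recent pandas and scikit-learn; get_feature_names_out replaces the older get_feature_names) is to keep the result sparse instead of calling todense():
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df["categorized.Hashtags"])   # scipy.sparse.csr_matrix
# DataFrame backed by sparse columns, so the zeros take almost no memory
count_vect_df = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names_out())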

Count frequency of multiple words

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row in my pandas dataframe. I am trying to repeat the same but for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this. Please can anyone help me out?
from collections import Counter
import pandas as pd
#check frequency for the following for each row, but no repetition for row
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
#sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# the data frame with one column tweet
df = pd.DataFrame(data,columns=['tweet'])
#no duplicates per row
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])
#captures all duplicates located in each row
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
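Since the question mentions collections.Counter and .most_common(k), here is how the per-word totals above could be fed into a Counter (a small illustrative addition, not part of the original answer):
# Wrap the per-word totals in a Counter to get the k most frequent words
counts = Counter({w: int(df.tweet.str.count(w).sum()) for w in Liberal})
print(counts.most_common(3))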
References:
Contains & match

How to find last occurrence of maximum value in a numpy.ndarray

I have a numpy.ndarray in which the maximum value will mostly occur more than once.
EDIT: This is subtly different from "numpy.argmax: how to get the index corresponding to the *last* occurrence, in case of multiple occurrences of the maximum values", because the author of that question says
Or, even better, is it possible to get a list of indices of all the occurrences of the maximum value in the array?
whereas in my case getting such a list may prove very expensive
Is it possible to find the index of the last occurrence of the maximum value by using something like numpy.argmax? I want to find only the index of the last occurrence, not an array of all occurrences (since there may be several hundred).
For example, this will return the index of the first occurrence, i.e. 2:
import numpy as np
a=np.array([0,0,4,4,4,4,2,2,2,2])
print(np.argmax(a))
However I want it to output 5.
numpy.argmax only returns the index of the first occurrence. You could apply argmax to a reversed view of the array:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
b = a[::-1]
i = len(b) - np.argmax(b) - 1
i # 5
a[i:] # array([4, 2, 2, 2, 2])
Note numpy doesn't copy the array but instead creates a view of the original with a stride that accesses it in reverse order.
id(a) == id(b.base) # True
If your array is made up of integers and has fewer than about 1e15 elements, you can also sort this out by adding a noise term that linearly increases the value of later occurrences.
>>> import numpy as np
>>> a = np.array([0,0,4,4,4,4,2,2,2,2])
>>> noise = np.array(range(len(a))) * 1e-15
>>> print(np.argmax(a + noise))
5