Does pandas have any kind of size limit on filters?

This is a weird question but I'm getting weird results. I have a dataframe containing data for college basketball games:
game_id season home_team away_team net_ortg net_drtg clock period home visitor ... total_seconds_elapsed win lead p_1 p_2 p_3 p_4 p_5 p_6 total_pts
627168 401173715 2020 Air Force UC Riverside 12.0 10.5 00:06:34 1 37 24 ... 806 1 13 1 0 0 0 0 0 61
320163 401174714 2020 Arkansas State Idaho 11.4 0.4 00:01:42 2 76 67 ... 2298 1 9 0 1 0 0 0 0 143
26942 401169867 2020 Vanderbilt Tulsa 1.5 10.9 00:07:50 1 24 18 ... 730 0 6 1 0 0 0 0 0 42
213142 401170184 2020 La Salle Wagner 2.3 -13.5 00:10:19 2 57 36 ... 1781 1 21 0 1 0 0 0 0 93
1631866 401255594 2021 Virginia Tech South Florida 8.4 -1.5 00:19:32 1 2 0 ... 28 1 2 1 0 0 0 0 0 2
1644302 401263600 2021 Nebraska South Dakota 1.2 -8.1 00:14:51 1 9 11 ... 309 1 -2 1 0 0 0 0 0 20
1181057 401170704 2020 Colorado Stanford 4.7 3.1 00:14:22 1 6 4 ... 338 1 2 1 0 0 0 0 0 10
1670578 401266749 2021 Texas Tech Troy 15.2 -17.9 00:07:54 2 67 33 ... 1926 1 34 0 1 0 0 0 0 100
27199 401170392 2020 Florida Gulf Coast Campbell -5.6 -2.0 00:17:46 1 2 0 ... 134 0 2 1 0 0 0 0 0 2
1588187 401262682 2021 UNLV Montana State 4.5 -0.8 00:02:54 1 23 39 ... 1026 0 -16 1 0 0 0 0 0 62
I am using train_test_split from sklearn to split the dataframe on game_id so I can do some ML tasks.
train_id, test_id = train_test_split(list(df.game_id), test_size=0.1)
train_mask = df['game_id'].isin(list(train_id))
test_mask = df['game_id'].isin(list(test_id))
print(df.shape)
print(len(train_id))
print(len(test_id))
>>(1326422, 22)
>>1193779
>>132643
Here's the weird thing (or at least the part I am not understanding):
>>train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
>>test_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
Ok, but what if I do the exact same statement but limit the size of train_id:
train_mask = df['game_id'].isin(list(train_id[0:100]))
train_mask.describe()
count 1326422
unique 2
top False
freq 1302107
Name: game_id, dtype: object
And just to check again using array indexing on the full list:
train_mask = df['game_id'].isin(list(train_id[0:-1]))
train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
For the life of me I can't figure out what is going on unless there is some limitation on the size of the queries that pandas is able to run. Help!
Edit: It appears the exact size where this behavior happens is 54,665:
>>train_mask = df['game_id'].isin(list(train_id[0:54665]))
>>train_mask.describe()
count 1326422
unique 2
top True
freq 1326180
Name: game_id, dtype: object
>>train_mask = df['game_id'].isin(list(train_id[0:54666]))
>>train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
Truly bizarre!

pd.Series.isin returns a Boolean Series the same length as the Series you call it on, no matter how many values you check against. So you won't change the shape of anything until you slice your DataFrame: df_train = df[train_mask]
To clarify a few things, the output of describe displays the following:
import pandas as pd
s = pd.Series([True]*10 + [False]*6)
s.describe()
#count 16 # length of the Series
#unique 2 # Number of unique values in the Series
#top True # Most common value
#freq 10 # How many times does the most common value appear
#dtype: object
So checking for different IDs will never change the count, but unique, top and freq all change to reflect how the mask itself changes.

If game_id is not a unique identifier, the same game_ids (and therefore rows from the same games) can end up in both your train and test sets. Instead, run train_test_split on the unique game_ids.
train_id, test_id = train_test_split(df.game_id.unique(), test_size=0.1)
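A minimal sketch of the full workflow (variable names follow the question; the split is random, so your exact sizes will vary):
from sklearn.model_selection import train_test_split

# Split the *unique* game_ids so each game lands in exactly one of the two sets
train_id, test_id = train_test_split(df['game_id'].unique(), test_size=0.1)

train_mask = df['game_id'].isin(train_id)   # Boolean Series, same length as df
test_mask = df['game_id'].isin(test_id)

df_train = df[train_mask]   # only rows whose game is in the training split
df_test = df[test_mask]     # only rows whose game is in the test split
print(df_train.shape, df_test.shape)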

I believe that what's going on is that your mask is a Series of True/False values with the same length as the DataFrame. When you limit the size of train_id, you are just reducing the number of True values rather than shortening the mask. Try the following to confirm:
print(len(df['game_id'].isin(list(train_id[0:100]))))
print(len(df['game_id'].isin(list(train_id[0:-1]))))
And then to see how many true values you have (sum works here because True is evaluated as a 1 and False as a 0):
df['game_id'].isin(list(train_id[0:100])).sum()
df['game_id'].isin(list(train_id[0:-1])).sum()

I'm just adding on to ALollz's solution, to show you the dataframes (so accept their answer). As stated, this will return a Series of True and False values:
import pandas as pd
df = pd.DataFrame(
[['401173715', '2020', 'Air Force'],
['401174714' , '2020', 'Arkansas State'],
['401169867' , '2020', 'Vanderbilt'],
['401170184' , '2020', 'La Salle'],
['401255594' , '2021', 'Virginia Tech'],
['401263600' , '2021', 'Nebraska'],
['401170704' , '2020', 'Colorado'],
['401266749' , '2021', 'Texas Tech'],
['401170392' , '2020', 'Florida Gulf'],
['401262682' , '2021', 'UNLV']],
columns = ['game_id', 'season', 'home_team' ])
from sklearn.model_selection import train_test_split
train_id, test_id = train_test_split(list(df.game_id), test_size=0.1)
train_mask = df['game_id'].isin(list(train_id))
test_mask = df['game_id'].isin(list(test_id))
So the description is right, as ALollz explains: the mask has up to 2 unique values (True and False), the top value depends on which mask you are looking at, the count is the same for both, and only freq changes. (Note that in your real data the game_ids repeat across many rows, so slicing with train_id[0:-1] only drops one copy of the last ID and the mask can still come out all True, i.e. 1 unique value.)
Now, what I assume you actually want is to get those rows (where the mask is True). So change the syntax to:
df_train = df[df['game_id'].isin(list(train_id))]
df_test = df[df['game_id'].isin(list(test_id))]
This will give you the two dataframes with the train ids and the test ids:
So play with this code to see:
import pandas as pd
df = pd.DataFrame(
[['401173715', '2020', 'Air Force'],
['401174714' , '2020', 'Arkansas State'],
['401169867' , '2020', 'Vanderbilt'],
['401170184' , '2020', 'La Salle'],
['401255594' , '2021', 'Virginia Tech'],
['401263600' , '2021', 'Nebraska'],
['401170704' , '2020', 'Colorado'],
['401266749' , '2021', 'Texas Tech'],
['401170392' , '2020', 'Florida Gulf'],
['401262682' , '2021', 'UNLV']],
columns = ['game_id', 'season', 'home_team' ])
from sklearn.model_selection import train_test_split
train_id, test_id = train_test_split(list(df.game_id), test_size=0.1)
train_mask = df['game_id'].isin(list(train_id))
test_mask = df['game_id'].isin(list(test_id))
df_train = df[train_mask]
df_test = df[test_mask]
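For this toy frame of 10 unique game_ids with test_size=0.1, the split puts 9 games in train and 1 in test, so you should see something like:
print(df_train.shape)   # (9, 3) -- 9 games in the training split
print(df_test.shape)    # (1, 3) -- 1 game in the test split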

Related

pandas return multiple DataFrames from apply function

EDIT: Based on comments, clarifying the examples further to depict a more realistic use case.
I want to call a function with df.apply. This function returns multiple DataFrames. I want to join each of these DataFrames into logical groups. I am unable to do that without using a for loop (which defeats the purpose of calling with apply).
I have tried calling the function for each row of the dataframe and it is slower than apply. However, with apply, combining the results slows things down again.
Any tips?
import numpy as np
import pandas as pd

# input data frame
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in = pd.DataFrame(data)
print(df_in)
Output>
Name Age Score
0 Ani 15 93
1 Bob 12 98
2 Cal 13 95
3 Dom 14 99
Function to be applied>
def func1(name, age):
    num_rows = np.random.randint(int(age/3))
    age_mul_1 = np.random.randint(low=1, high=age, size=num_rows)
    age_mul_2 = np.random.randint(low=1, high=age, size=num_rows)
    data = {'Name': [name]*num_rows, 'Age_Mul_1': age_mul_1, 'Age_Mul_2': age_mul_2}
    df_func1 = pd.DataFrame(data)
    return df_func1

def func2(name, age, score, other_params):
    num_rows = np.random.randint(int(score/10))
    score_mul_1 = np.random.randint(low=age, high=score, size=num_rows)
    data2 = {'Name': [name]*num_rows, 'score_Mul_1': score_mul_1}
    df_func2 = pd.DataFrame(data2)
    return df_func2

def ret_mul_df(row):
    df_A = func1(row['Name'], row['Age'])
    #print(df_A)
    df_B = func2(row['Name'], row['Age'], row['Score'], 1)
    #print(df_B)
    return df_A, df_B
What I want to do is essentially create two dataframes, df_A_combined and df_B_combined.
However, how I am currently combining them is as follows:
df_out = df_in.apply(lambda row: ret_mul_df(row), axis=1)
df_A_combined = pd.DataFrame()
df_B_combined = pd.DataFrame()
for ser in df_out:
    df_A_combined = df_A_combined.append(ser[0], ignore_index=True)
    df_B_combined = df_B_combined.append(ser[1], ignore_index=True)
print(df_A_combined)
Name Age_Mul_1 Age_Mul_2
0 Ani 7 8
1 Ani 1 4
2 Ani 1 8
3 Ani 12 6
4 Bob 9 8
5 Cal 8 7
6 Cal 8 1
7 Cal 4 8
print(df_B_combined)
Name score_Mul_1
0 Ani 28
1 Ani 29
2 Ani 50
3 Ani 35
4 Ani 84
5 Ani 24
6 Ani 51
7 Ani 28
8 Bob 32
9 Cal 26
10 Cal 70
11 Dom 56
12 Dom 53
How can I avoid the iteration?
func1 and func2 wrap calls to 3rd party libraries (which are very computation intensive), and several such calls are made. Also, the dataframes df_A_combined and df_B_combined cannot be combined with each other.
Note: This is a much simplified example, and splitting the function up would lead to a lot of redundancy.
If this isn't what you want, I'll update if you can post what the two dataframes should look like.
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in=pd.DataFrame(data)
print(df_in)
df_A = df_in[['Name','Age']].copy()   # copy to avoid SettingWithCopyWarning
df_A['Age_Multiplier'] = df_A['Age'] * 3
print(df_A)
Name Age Age_Multiplier
0 Ani 15 45
1 Bob 12 36
2 Cal 13 39
3 Dom 14 42
df_B = df_in[['Name','Score']].copy()   # copy to avoid SettingWithCopyWarning
df_B['Score_Multiplier'] = df_B['Score'] * 2
print(df_B)
Name Score Score_Multiplier
0 Ani 93 186
1 Bob 98 196
2 Cal 95 190
3 Dom 99 198
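For what it's worth, one way to avoid calling append inside the loop is to collect the per-row frames and concatenate each group once. This is only a sketch built on the objects defined in the question (the expensive func1/func2 calls are unchanged, so it mainly removes the repeated-append overhead; not benchmarked):
df_out = df_in.apply(ret_mul_df, axis=1)             # Series of (df_A, df_B) tuples
dfs_A, dfs_B = zip(*df_out)                           # unzip into two groups of frames
df_A_combined = pd.concat(dfs_A, ignore_index=True)   # single concat per group
df_B_combined = pd.concat(dfs_B, ignore_index=True)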

How do I extract information from nested duplicates in pandas?

I am trying to extract information from duplicates.
data = np.array([[100,1,0, 'GB'],[100,0,1, 'IT'],[101,1,0, 'CN'],[101,0,1, 'CN'],
[102,1,0, 'JP'],[102,0,1, 'CN'],[103,0,1, 'DE'],
[103,0,1, 'DE'],[103,1,0, 'VN'],[103,1,0, 'VN']])
df = pd.DataFrame(data, columns = ['wed_cert_id','spouse_1',
'spouse_2', 'nationality'])
I would like to categorise each wedding as either cross-national or not.
In my actual data set there can be more than 2 spouses to a marriage.
My aim is to obtain a data frame flagging each wedding as cross-national or not (the two desired output layouts were shown as images and are not reproduced here).
I have tried to find a way to filter the data using .duplicated(), and to negate .duplicated() with a not operator, but have not succeeded in working it out:
df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]
Dropping the duplicates drops too many observations. My data set allows for >2 spouses per wedding, creating the potential for duplication:
df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)
How do I do it?
Many thanks in advance.
I believe you need:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).view('i1'))
print(df)
Or:
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).view('i1')
                          .mul(df[['spouse_1','spouse_2']].prod(1)))
print(df)
wed_cert_id spouse_1 spouse_2 nationality cross_national
0 100 1 0 GB 1
1 100 0 1 IT 1
2 101 1 0 CN 0
3 101 0 1 CN 0
4 102 1 0 JP 1
5 102 0 1 CN 1
6 103 0 1 DE 1
7 103 0 1 DE 1
8 103 1 0 VN 1
9 103 1 0 VN 1
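If what you are after is one row per wedding rather than a flag on every spouse row, a small sketch along the same nunique idea (column names taken from the question):
per_wedding = (df.groupby('wed_cert_id')['nationality']
                 .nunique()            # distinct nationalities per wedding
                 .gt(1)                # more than one nationality -> cross-national
                 .astype(int)
                 .rename('cross_national')
                 .reset_index())
print(per_wedding)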

Data standardization of feat having lt/gt values among absolute values

One of the datasets I am dealing with has a few features that contain lt/gt (less-than/greater-than) values alongside absolute values. Please refer to the example below:
>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
foo
0 <10
1 23
2 34
3 22
4 >90
5 42
Note: foo is a percentage value, i.e. 0 <= foo <= 100.
How are such data transformed to run regression models on?
One thing you could do is, for values <10, impute a representative value such as the interval midpoint (5). Similarly, for those >90, impute 95.
Then add two extra boolean columns:
df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']]
df = df.replace('<10', 5).replace('>90', 95)
df = pd.concat([df, dummies], axis=1)
df
This will give you
foo foo_<10 foo_>90
0 5 1 0
1 23 0 0
2 34 0 0
3 22 0 0
4 95 0 1
5 42 0 0
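One caveat worth checking: after the replace calls the untouched entries ('23', '34', ...) are still strings, so foo stays object dtype; a final numeric conversion is needed before fitting a regression model. A minimal sketch:
df['foo'] = pd.to_numeric(df['foo'])   # '23' -> 23, etc.; the imputed 5 and 95 are already ints
print(df.dtypes)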

Iterate over every row and compare a column value of a dataframe

I have the following dataframe. I want to iterate over every row and compare the score column against each value in the cut_off list, checking whether the score is >= that value.
seq score status
7 TTGTTCTCTGTGTATTTCAGGCT 10.42 positive
56 CAGGTGAGA 9.22 positive
64 AATTCCTGTGGACTTTCAAGTAT 1.23 positive
116 AAGGTATAT 7.84 positive
145 AAGGTAATA 8.49 positive
172 TGGGTAGGT 6.86 positive
204 CAGGTAGAG 7.10 positive
214 GCGTTTCTTGAATCCAGCAGGGA 3.58 positive
269 GAGGTAATG 8.73 positive
274 CACCCATTCCTGTACCTTAGGTA 8.96 positive
325 GCCGTAAGG 5.46 positive
356 GAGGTGAGG 8.41 positive
cut_off = range(0, 11)
The code I tried so far is:
cutoff_list_pos = []
number_list_pos = []
cut_off = range(0, int(new_df['score'].max())+1)
for co in cut_off:
    for df in df_elements:
        val = (df['score'] >= co).value_counts()
        cutoff_list_pos.append(co)
        number_list_pos.append(val)
The desired output is:
cutoff true false
0 0 12.0 0
1 1 12.0 0
and so on..
If the score is >= the cutoff value, the row should be counted as true, else false.
You can use the keys parameter in concat with the values of cutoff_list_pos, then transpose and convert the index to a column with DataFrame.reset_index:
df = (pd.concat(number_list_pos, axis=1, keys=cutoff_list_pos, sort=False)
        .T
        .rename_axis('cutoff')
        .reset_index())
Another pandas implementation:
res_df = pd.DataFrame(columns=['cutoff', 'true'])
for i in range(1, int(df['score'].max()+1)):
    temp_df = pd.DataFrame(data={'cutoff': i, 'true': (df['score'] >= i).sum()}, index=[i])
    res_df = pd.concat([res_df, temp_df])
res_df
cutoff true
1 1 12
2 2 11
3 3 11
4 4 10
5 5 10
6 6 9
7 7 8
8 8 6
9 9 2
10 10 1
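If you want to skip the Python-level loop entirely, a vectorized sketch using NumPy broadcasting (assuming df holds the score column shown above, and counting both true and false per cutoff):
import numpy as np
import pandas as pd

cut_off = np.arange(0, int(df['score'].max()) + 1)
hits = df['score'].to_numpy()[:, None] >= cut_off   # shape (n_rows, n_cutoffs)
res_df = pd.DataFrame({'cutoff': cut_off,
                       'true': hits.sum(axis=0),
                       'false': (~hits).sum(axis=0)})
print(res_df)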

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP', 'PK'], ['PRICE', 'QTY']]))
to which I want to add, under each level-0 column, a column called "Value" that is the result of the following function;
def my_func(df, scale):
    return df['QTY'] * df['PRICE'] * scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that wasn't hard enough, I want to apply one "scale" value for the "OP" level 0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    # select second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create MultiIndex in columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)

print(my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
To multiply by a different scale value for each level, you can pass a list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
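If you would rather key the scales by the level-0 name instead of relying on list order, a small sketch along the same lines (the scales dict and the 'Value' label are assumptions, not part of the original answer):
scales = {'OP': 10, 'PK': 20}
qty = df.xs('QTY', axis=1, level=1)              # columns: 'OP', 'PK'
price = df.xs('PRICE', axis=1, level=1)
value = (qty * price).mul(pd.Series(scales))     # different scale per level-0 column
value.columns = pd.MultiIndex.from_product([value.columns, ['Value']])
out = pd.concat([df, value], axis=1).sort_index(axis=1)
print(out)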
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100