Data standardization of a feature having lt/gt values among absolute values - pandas

One of the datasets I am dealing with has a few features that mix lt/gt values (e.g. '<10', '>90') with absolute values. Please refer to the example below:
>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
foo
0 <10
1 23
2 34
3 22
4 >90
5 42
Note: foo is a percentage value, i.e. 0 <= foo <= 100.
How is such data transformed so that regression models can be run on it?

One thing you could do is, for values <10, impute the median of the censored range (5). Similarly, for those >90, impute 95.
Then add two extra boolean indicator columns:
import pandas as pd

df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
# indicator columns marking which rows were censored
dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']].astype(int)
# replace the censored values with the imputed midpoints, then make the column numeric
df = df.replace('<10', 5).replace('>90', 95)
df['foo'] = pd.to_numeric(df['foo'])
df = pd.concat([df, dummies], axis=1)
df
This will give you
foo foo_<10 foo_>90
0 5 1 0
1 23 0 0
2 34 0 0
3 22 0 0
4 95 0 1
5 42 0 0
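As a variation, here is a minimal sketch that parses the comparator prefixes with string methods instead of hard-coding each replacement; imputing the midpoint of the censored range (5 for '<10', 95 for '>90') is an assumption, not a rule:

import pandas as pd

df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
# flag the censored rows
df['foo_lt'] = df['foo'].str.startswith('<').astype(int)
df['foo_gt'] = df['foo'].str.startswith('>').astype(int)
# strip the comparator and convert to numbers
num = pd.to_numeric(df['foo'].str.lstrip('<>'))
# impute the midpoint of [0, bound) for '<' and of (bound, 100] for '>'
df['foo'] = num - df['foo_lt'] * num / 2 + df['foo_gt'] * (100 - num) / 2

This generalizes to other bounds without editing the replacement values by hand.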

Pandas: compare df and add missing rows

I have a list of dataframes which have 1 column in common ('label'). However, in some of the dataframes some rows are missing.
Example:
df1 = pd.DataFrame([['sample1',2,3], ['sample4',7,8]], columns=['label', 'B', 'E'], index=[1,2])
df2 = pd.DataFrame([['sample1',20,30], ['sample2',70,80], ['sample3',700,800]], columns=['label', 'B', 'C'], index=[2,3,4])
I would like to add rows so that the lengths of the dfs are the same, while preserving the right order. The desired output would be:
label B E
1 sample1 2 3
2 0 0 0
3 0 0 0
4 sample4 7 8
label B C
1 sample1 20 30
2 sample2 70 80
3 sample3 700 800
4 0 0 0
I was looking into "pandas three-way joining multiple dataframes on columns", but I don't want to merge my dataframes. And "pandas align() function : illustrative example" doesn't give the desired output either. I was also thinking about comparing the 'label' column with a list and looping through to add the missing rows. If somebody could point me in the right direction, that would be great.
You can get the full set of labels in the desired order, then reindex:
# here the order matters to get the preference
# for a sorted order use:
# unique = sorted(pd.concat([df1['label'], df2['label']]).unique())
unique = pd.concat([df2['label'], df1['label']]).unique()
out1 = (df1.set_axis(df1['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
       )
out2 = (df2.set_axis(df2['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
       )
outputs:
# out1
label B E
0 sample1 2 3
1 0 0 0
2 0 0 0
3 sample4 7 8
# out2
label B C
0 sample1 20 30
1 sample2 70 80
2 sample3 700 800
3 0 0 0
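Since the question mentions a list of dataframes, the same pattern generalizes; a small sketch with a hypothetical helper align_on_label (the name is mine, not a pandas function):

import pandas as pd

def align_on_label(dfs):
    # union of all labels, in order of first appearance across the list
    labels = pd.concat([d['label'] for d in dfs]).unique()
    return [d.set_axis(d['label'])
             .reindex(labels, fill_value=0)
             .reset_index(drop=True) for d in dfs]

# df2 first so its label order takes precedence, as in the answer above
out2, out1 = align_on_label([df2, df1])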

Is there a way I could work with this MultiIndex?

I have a dataframe like this one: https://i.stack.imgur.com/2Sr29.png. RBD is a code that identifies each school, LET_CUR corresponds to a class, and MRUN corresponds to the number of students in each class. What I need is the following:
I would like to know how many of the schools have at least one class with more than 45 students; so far I haven't figured out a way to code that.
Thanks.
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
RBD,LET_CUR,MRUN
1,A,65
1,B,23
1,C,21
2,A,22
2,B,20
2,C,34
3,A,54
4,A,23
4,B,11
5,A,15
5,C,16
6,A,76"""))
>>> df = df.set_index(['RBD', 'LET_CUR'])
>>> df
MRUN
RBD LET_CUR
1 A 65
B 23
C 21
2 A 22
B 20
C 34
3 A 54
4 A 23
B 11
5 A 15
C 16
6 A 76
As we want to know the number of schools with at least one class having more than 45 students, we can first filter the DataFrame on the MRUN column and then use the nunique() method to count the number of unique schools:
>>> df_filtered = df[df['MRUN'] > 45].reset_index()
>>> df_filtered['RBD'].nunique()
3
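The same count can also be obtained without resetting the index, by grouping on the first level of the MultiIndex; a sketch using the same df:
>>> (df.groupby(level='RBD')['MRUN'].max() > 45).sum()
3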
Try the following (here I build a similar dataframe structure to yours):
df = pd.DataFrame({'RBD': [1, 1, 2, 3],
                   'COD_GRADO': ['1', '2', '1', '3'],
                   'LET_CUR': ['A', 'C', 'B', 'A'],
                   'MRUN': [65, 34, 64, 25]},
                  columns=['RBD', 'COD_GRADO', 'LET_CUR', 'MRUN'])
print(df)
# filter strictly greater than 45, then count distinct schools rather than rows
n_schools = df.loc[df['MRUN'] > 45, 'RBD'].nunique()
print(f"Number of schools with more than 45 students is {n_schools}")
And the output for my example would be (table formatted for easier reading):
   RBD COD_GRADO LET_CUR  MRUN
0    1         1       A    65
1    1         2       C    34
2    2         1       B    64
3    3         3       A    25

Number of schools with more than 45 students is 2

Does pandas have any kind of size limit on filters?

This is a weird question but I'm getting weird results. I have a dataframe containing data for college basketball games:
game_id season home_team away_team net_ortg net_drtg clock period home visitor ... total_seconds_elapsed win lead p_1 p_2 p_3 p_4 p_5 p_6 total_pts
627168 401173715 2020 Air Force UC Riverside 12.0 10.5 00:06:34 1 37 24 ... 806 1 13 1 0 0 0 0 0 61
320163 401174714 2020 Arkansas State Idaho 11.4 0.4 00:01:42 2 76 67 ... 2298 1 9 0 1 0 0 0 0 143
26942 401169867 2020 Vanderbilt Tulsa 1.5 10.9 00:07:50 1 24 18 ... 730 0 6 1 0 0 0 0 0 42
213142 401170184 2020 La Salle Wagner 2.3 -13.5 00:10:19 2 57 36 ... 1781 1 21 0 1 0 0 0 0 93
1631866 401255594 2021 Virginia Tech South Florida 8.4 -1.5 00:19:32 1 2 0 ... 28 1 2 1 0 0 0 0 0 2
1644302 401263600 2021 Nebraska South Dakota 1.2 -8.1 00:14:51 1 9 11 ... 309 1 -2 1 0 0 0 0 0 20
1181057 401170704 2020 Colorado Stanford 4.7 3.1 00:14:22 1 6 4 ... 338 1 2 1 0 0 0 0 0 10
1670578 401266749 2021 Texas Tech Troy 15.2 -17.9 00:07:54 2 67 33 ... 1926 1 34 0 1 0 0 0 0 100
27199 401170392 2020 Florida Gulf Coast Campbell -5.6 -2.0 00:17:46 1 2 0 ... 134 0 2 1 0 0 0 0 0 2
1588187 401262682 2021 UNLV Montana State 4.5 -0.8 00:02:54 1 23 39 ... 1026 0 -16 1 0 0 0 0 0 62
I am using train_test_split from sklearn to split the dataframe on game_id so I can do some ML tasks.
train_id, test_id = train_test_split(list(df.game_id), test_size=0.1)
train_mask = df['game_id'].isin(list(train_id))
test_mask = df['game_id'].isin(list(test_id))
print(df.shape)
print(len(train_id))
print(len(test_id))
>>(1326422, 22)
>>1193779
>>132643
Here's the weird thing (or at least the part I am not understanding):
>>train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
>>test_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
Ok, but what if I do the exact same statement but limit the size of train_id:
train_mask = df['game_id'].isin(list(train_id[0:100]))
train_mask.describe()
count 1326422
unique 2
top False
freq 1302107
Name: game_id, dtype: object
And just to check again, slicing off only the last element of the full list:
train_mask = df['game_id'].isin(list(train_id[0:-1]))
train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
For the life of me I can't figure out what is going on unless there is some limitation on the size of the queries that pandas is able to run. Help!
Edit: It appears the exact size where this behavior happens is 54,665:
>>train_mask = df['game_id'].isin(list(train_id[0:54665]))
>>train_mask.describe()
count 1326422
unique 2
top True
freq 1326180
Name: game_id, dtype: object
>>train_mask = df['game_id'].isin(list(train_id[0:54666]))
>>train_mask.describe()
count 1326422
unique 1
top True
freq 1326422
Name: game_id, dtype: object
Truly bizarre!
pd.Series.isin returns a Boolean Series the same length as the Series you call it on, no matter how many values you check against. So you won't change the shape of anything until you slice your DataFrame: df_train = df[train_mask]
To clarify a few things, the output of describe displays the following:
import pandas as pd
s = pd.Series([True]*10 + [False]*6)
s.describe()
#count 16 # length of the Series
#unique 2 # Number of unique values in the Series
#top True # Most common value
#freq 10 # How many times does the most common value appear
#dtype: object
So checking for different IDs will never change the count. But unique, top and freq are all changing to reflect the fact that your mask itself changes.
If game_id is not a unique identifier, you may be ending up with the same game_ids in both your train and test lists, which likely results in the same records appearing in both sets. Instead, create the train/test split on the unique game_ids:
train_id, test_id = train_test_split(df.game_id.unique(), test_size=0.1)
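A minimal end-to-end sketch of that approach, assuming df and game_id as in the question (random_state is added here only for reproducibility):

from sklearn.model_selection import train_test_split

# split on the unique ids so a game never lands in both sets
train_ids, test_ids = train_test_split(df['game_id'].unique(),
                                       test_size=0.1, random_state=0)
df_train = df[df['game_id'].isin(train_ids)]
df_test = df[df['game_id'].isin(test_ids)]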
I believe that what's going on is that your mask is a Series of True/False values the same length as the DataFrame. When you limit the size of train_id, you are just reducing the number of True values rather than decreasing the length of the mask. Try the following to confirm:
print(len(df['game_id'].isin(list(train_id[0:100]))))
print(len(df['game_id'].isin(list(train_id[0:-1]))))
And then to see how many true values you have (sum works here because True is evaluated as a 1 and False as a 0):
df['game_id'].isin(list(train_id[0:100])).sum()
df['game_id'].isin(list(train_id[0:-1])).sum()
I'm just adding on to ALollz's solution, to show you the dataframes (so accept his/her answer). As stated, this will return a Series of True and False values:
import pandas as pd

df = pd.DataFrame(
    [['401173715', '2020', 'Air Force'],
     ['401174714', '2020', 'Arkansas State'],
     ['401169867', '2020', 'Vanderbilt'],
     ['401170184', '2020', 'La Salle'],
     ['401255594', '2021', 'Virginia Tech'],
     ['401263600', '2021', 'Nebraska'],
     ['401170704', '2020', 'Colorado'],
     ['401266749', '2021', 'Texas Tech'],
     ['401170392', '2020', 'Florida Gulf'],
     ['401262682', '2021', 'UNLV']],
    columns=['game_id', 'season', 'home_team'])

from sklearn.model_selection import train_test_split

train_id, test_id = train_test_split(list(df.game_id), test_size=0.1)
train_mask = df['game_id'].isin(list(train_id))
test_mask = df['game_id'].isin(list(test_id))
So the description is right, as ALollz explains: the mask has 2 unique values (True and False); top is whichever of the two is most common in the mask you are looking at; count stays the same; and freq changes with the mask. And if you slice the id list but drop only the last entry (train_id[0:-1]), you can still be left with only 1 unique value in each mask.
Now, what I am assuming you want is to actually get the rows where the mask is True. For that you need to slice the DataFrame with the mask:
df_train = df[train_mask]
df_test = df[test_mask]
This will give you the two dataframes, one with the train ids and one with the test ids. Play with this code to see how the masks and the sliced frames relate.

Creating columns from categories in a column in pandas

I have a dataframe in which I want to create new columns based on the levels of data in existing columns. For example,
Cust_ID MCC Date TRANS_AMT Frequency
1 1750 Jan 6633 1
1 1799 Jan 5584 1
2 3001 Mar 405 2
2 3174 Oct 1219 1
I want to create columns based on the levels of data I have in the MCC and Date columns. For each Cust_ID, I want the TRANS_AMT and Frequency for each MCC and Date combination.
Below is the required output:
Because the ordering of columns in the final DataFrame is important, convert the Date column to an ordered categorical, then create a MultiIndex with DataFrame.set_index and convert the TRANS_AMT and Frequency columns to an ordered CategoricalIndex too.
Then reshape with DataFrame.unstack and sort by the second level of the column MultiIndex with DataFrame.sort_index.
Last, flatten the column names with f-strings in a list comprehension and get the Cust_ID column back from the index with DataFrame.reset_index:
cats = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Date'] = pd.Categorical(df['Date'], categories=cats, ordered=True)
df1 = df.set_index(['Cust_ID', 'MCC', 'Date'])
df1.columns = pd.CategoricalIndex(df1.columns,
                                  categories=['TRANS_AMT', 'Frequency'],
                                  ordered=True)
df1 = df1.unstack(level=[1, 2], fill_value=0).sort_index(axis=1, level=1)
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
Cust_ID TRANS_AMT_1750_Jan Frequency_1750_Jan TRANS_AMT_1799_Jan \
0 1 6633 1 5584
1 2 0 0 0
Frequency_1799_Jan TRANS_AMT_3001_Mar Frequency_3001_Mar \
0 1 0 0
1 0 405 2
TRANS_AMT_3174_Oct Frequency_3174_Oct
0 0 0
1 1219 1
If the ordering is not important, remove the conversion to categoricals:
df1 = (df.set_index(['Cust_ID', 'MCC', 'Date'])
         .unstack(level=[1, 2], fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
Cust_ID Frequency_1750_Jan TRANS_AMT_1750_Jan Frequency_1799_Jan \
0 1 1 6633 1
1 2 0 0 0
TRANS_AMT_1799_Jan Frequency_3001_Mar TRANS_AMT_3001_Mar \
0 5584 0 0
1 0 2 405
Frequency_3174_Oct TRANS_AMT_3174_Oct
0 0 0
1 1 1219
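If the categorical ordering is not needed, the same reshape can also be sketched more compactly with DataFrame.pivot_table; aggfunc='sum' is an assumption here, in case a Cust_ID/MCC/Date combination repeats:

# starting from the raw df (Date not converted to categorical)
df1 = df.pivot_table(index='Cust_ID',
                     columns=['MCC', 'Date'],
                     values=['TRANS_AMT', 'Frequency'],
                     aggfunc='sum', fill_value=0)
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()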

Convert ordered levels to numeric in pandas

I was wondering, is there any function in pandas that allows me to do this?
I have a column with levels [low, medium, high].
I would like to translate them to [1, 2, 3] to perform linear regression. However, what I am currently doing is df[df['interest_level'] == 'low'] = 1. Is there a better way of doing this?
Thanks.
Use the pd.factorize() method:
df['interest_level'] = pd.factorize(df['interest_level'])[0]
Note that factorize assigns codes by order of first appearance, not by the low < medium < high ordering; keep the returned categories for backtracking. You can also categorize the new numerical values (this might save a lot of memory):
Sample DataFrame:
In [34]: df = pd.DataFrame({'interest_level':np.random.choice(['medium','high','low'], 10)})
In [35]: df
Out[35]:
interest_level
0 high
1 low
2 medium
3 high
4 low
5 high
6 high
7 low
8 low
9 medium
Solution:
In [36]: df['interest_level'], cats = pd.factorize(df['interest_level'])
In [37]: df['interest_level'] = pd.Categorical(df['interest_level'], categories=np.arange(len(cats)))
In [38]: df
Out[38]:
interest_level
0 0
1 1
2 2
3 0
4 1
5 0
6 0
7 1
8 1
9 2
In [39]: cats # this can be used for the backtracing ...
Out[39]: Index(['high', 'low', 'medium'], dtype='object')
In [40]: df.memory_usage()
Out[40]:
Index 80
interest_level 34 # <---- NOTE: only 34 bytes used for 10 integers
dtype: int64
In [41]: df.dtypes
Out[41]:
interest_level category
dtype: object
You can use map:
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
Sample:
df = pd.DataFrame({'interest_level':['medium','high','low', 'low', 'medium']})
print (df)
interest_level
0 medium
1 high
2 low
3 low
4 medium
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
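One caveat with map: any value absent from the dictionary becomes NaN and would silently propagate into the regression. A quick sanity check (a sketch, not part of the original answer):

# values not covered by d (e.g. typos) become NaN after mapping
assert df['interest_level'].notna().all(), 'unmapped interest_level values'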
Another solution is to cast to an ordered Categorical and then use cat.codes (the astype('category', categories=..., ordered=True) call was removed in later pandas versions, so use a CategoricalDtype):
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
df['interest_level'] = df['interest_level'].astype(cat_type).cat.codes + 1
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
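For completeness, a minimal sketch of feeding the encoded column into a regression; the target column price is hypothetical and not part of the question:

from sklearn.linear_model import LinearRegression

# 'price' is a made-up numeric target for illustration only
model = LinearRegression().fit(df[['interest_level']], df['price'])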