Use same category labeling criteria on two different dataframes - pandas

I have a DataFrame that contains a categorical feature, which I have encoded in the following way:
df['categorical_feature'] = df['categorical_feature'].astype('category')
df['labels'] = df['categorical_feature'].cat.codes
If I apply the same code as above to another DataFrame with the same category field, the mapping is shuffled, but I need it to be consistent with the first DataFrame.
Is there a way to apply the same category:label mapping to another DataFrame that has the same categorical values?

I think you are looking for pd.Series.map(), which maps values from category to label using a dictionary that has category: label mappings.
Create the mapping dictionary: you can do this using a dictionary comprehension in combination with zip, but there are also other ways of doing it:
col = 'categorical_feature'
mapping_dict = {k: v for k, v in zip(df[col], df[col].cat.codes)}
Now you can apply that category: label mapping to the other DataFrame (here other_df):
other_df['labels'] = other_df[col].map(mapping_dict)
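If you'd rather not build a dictionary at all, you can also reuse the first DataFrame's category order directly. A minimal sketch, assuming two DataFrames (here called df1 and df2) that share the same 'categorical_feature' column; the frame names and sample values are placeholders:
import pandas as pd

df1 = pd.DataFrame({'categorical_feature': ['b', 'a', 'c', 'a']})
df2 = pd.DataFrame({'categorical_feature': ['c', 'b', 'b']})

# Encode the first frame and remember its category order
df1['categorical_feature'] = df1['categorical_feature'].astype('category')
df1['labels'] = df1['categorical_feature'].cat.codes
categories = df1['categorical_feature'].cat.categories

# Reuse the same category order on the second frame so the codes line up
df2['labels'] = pd.Categorical(df2['categorical_feature'], categories=categories).codes
Because both frames now share the same categories, identical values get identical codes in both DataFrames.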


Split Get from Python dictionary

I have a dictionary from which I can use the get method to extract values, but I need to subset these values. For example:
dict_of_measures = {k: v for k, v in measures.groupby('Measure')}
And I am using get
BCS=dict_of_measures.get('BCS')
I have several measures and wanted to know if I could use a for loop to extract them from the dictionary and subset them into multiple dataframes, one per measure, using the get method. Is this possible? Something like:
for measure_name in dict_of_measures:
    dict_of_measures.get(measure_name)
You can use a dict comprehension:
result = []
keys_to_extract = ['key1', 'key2']
new_dict = {k: bigdict[k] for k in keys_to_extract}
result.append(new_dict)  # add dictionary to list; this can then be converted into a pandas dataframe
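Applied to the measures dictionary from the question, that pattern might look like the sketch below. This assumes measures is a DataFrame with a 'Measure' column, as in the question; the measure names and values are made up:
import pandas as pd

measures = pd.DataFrame({'Measure': ['BCS', 'BCS', 'XYZ'], 'value': [1, 2, 3]})
dict_of_measures = {k: v for k, v in measures.groupby('Measure')}

# Pull out only the measures you care about, one dataframe per key
keys_to_extract = ['BCS', 'XYZ']
subset = {k: dict_of_measures.get(k) for k in keys_to_extract}

for name, df in subset.items():
    print(name, len(df))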

Normalizing Data using pandas

I have the dataframe 'tt' below, in which the second column 'underlier' is a list of dictionaries with two keys: underliersecurityid and fxspot.
[image: dataframe tt, with the underlier column showing dictionary pairs]
I want to create an output dataframe that takes the keys out of underlier and puts them against each enterprise id, e.g.:
EnterpriseID, underliersecurityid, fxspot
I am able to normalize the underlier column itself, however I keep losing the enterprise id. Please suggest if there is some way to handle this.
tt = bn.iloc[:,[4,-7]]
tt
ttu = pd.DataFrame(bn.iloc[:,-7].values.tolist()).dropna()
ttu
ttu2 = pd.DataFrame(ttu.iloc[:,0].values.tolist()).dropna()
ttu2
Synthesising data: explode() the list, then use json_normalize() on the output of to_dict() to expand the dict into columns.
tt = pd.DataFrame([{"enterpriseid":"abcd","underlyer":[{"underlyersecurityid":"SWAP10Y","fmspot":[]}]}])
pd.json_normalize(tt.explode("underlyer").to_dict(orient="records"))
output
enterpriseid underlyer.underlyersecurityid underlyer.fmspot
abcd SWAP10Y []
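Applied to the column names from the question (EnterpriseID, underlier, underliersecurityid, fxspot), the same approach might look like the sketch below; the sample row is made up:
import pandas as pd

tt = pd.DataFrame([{"EnterpriseID": "E1",
                    "underlier": [{"underliersecurityid": "SWAP10Y", "fxspot": 1.23}]}])

flat = pd.json_normalize(tt.explode("underlier").to_dict(orient="records"))
# Columns come out as EnterpriseID, underlier.underliersecurityid, underlier.fxspot;
# rename them if you want plain names
flat = flat.rename(columns={"underlier.underliersecurityid": "underliersecurityid",
                            "underlier.fxspot": "fxspot"})
This keeps EnterpriseID on every exploded row, which is what was getting lost with the plain values.tolist() approach.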

Alternatives to iloc for searching dataframes

I have a simple piece of code that iterates through a list of ids. If an id is in a particular dataframe column (in this case, the column is called uniqueid), it uses iloc to get the value from another column on the matching row and then adds it as a value in a dictionary, with the id as the key:
union_cols = ['uniqueid', 'FLD_ZONE', 'FLD_ZONE_1', 'ST_FIPS', 'CO_FIPS', 'CID']
union_df = gpd.GeoDataFrame.from_features(records(union_gdb, union_cols))
pop_df = pd.read_csv(pop_csv, low_memory=False) # Example dataframe
uniqueid_inin = ['', 'FL1234', 'F54323', ....] # Just an example
inin_dict = dict()
for id in uniqueid_inin:
    if (id != '') and (id in pop_df.uniqueid.values):
        v = pop_df.loc[pop_df['uniqueid'] == id, 'Pop_By_Area'].iloc[0]
        inin_dict.setdefault(id, v)
This works, but it is very slow. Is there a quicker way to do this?
To resolve this issue (and make the process more efficient) I had to think about the process in a different way that took advantage of Pandas and didn't rely on a generic Python solution. I first had to get a list of only the uniqueids from my union_df that were absolutely in pop_df. If they were not, applying the .isin() method would throw an indexing error.
# Get list of uniqueids in pop_df
pop_uniqueids = pop_df['uniqueid'].unique()
# Get only the union_df rows where the uniqueid matches pop_uniqueids
union_df = union_df.loc[union_df['uniqueid'].isin(pop_uniqueids)]
union_df = union_df.reset_index(drop=True)
When the uniqueid_inin list is created from union_df (by just getting the uniqueids from rows where my zone_status column is equal to 'in-in'), it will only contain a subset of items that are definitely in pop_df, and empty values are no longer an issue. Then I simply create a subset dataframe using the list and zip the desired column values together in a dictionary:
inin_subset = pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
inin_pop_dict = dict(zip(inin_subset.uniqueid, inin_subset.Pop_By_Area))
I hope this technique helps.
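For reference, the same id-to-population dictionary can also be built without the intermediate subset frame, using set_index() and to_dict(). A minimal sketch, assuming pop_df has 'uniqueid' and 'Pop_By_Area' columns as in the question; the sample data is made up:
import pandas as pd

pop_df = pd.DataFrame({'uniqueid': ['FL1234', 'F54323', 'XX0000'],
                       'Pop_By_Area': [10, 20, 30]})
uniqueid_inin = ['', 'FL1234', 'F54323']

# Keep only non-empty ids, filter pop_df down to them, then map id -> value
wanted = [i for i in uniqueid_inin if i]
inin_pop_dict = (pop_df.loc[pop_df['uniqueid'].isin(wanted)]
                       .set_index('uniqueid')['Pop_By_Area']
                       .to_dict())
# {'FL1234': 10, 'F54323': 20}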

I got stuck converting a specific dictionary to a pandas dataframe

I have a dictionary and I got stuck while trying to convert it to a pandas dataframe.
It is the result of scoring an IBM ML model. The result comes in this format, and I would like to transform this dictionary into a pandas dataframe in order to merge it later with the original dataframe that was scored.
Dictionary:
{'predictions': [{'fields': ['prediction', 'probability'], 'values': [['Creditworthy', [0.5522992460276774, 0.4477007539723226]]]}]}
[image of the code which generated this dictionary]
I would like a pandas dataframe like this:
index predictions prediction probability
0 Creditworthy 0.552299 0.447701
Assume that the source dictionary is in a variable named dct.
Start by reading the column names:
cols = dct['predictions'][0]['fields']
Then create a DataFrame in a form which can be read from this dictionary:
df = pd.DataFrame(dct['predictions'][0]['values'],
                  columns=['predictions', 'val'])
For the time being, the values sit in the val column as a list:
predictions val
0 Creditworthy [0.5522992460276774, 0.4477007539723226]
Then break the val column into separate columns, setting the proper
column names (read earlier) at the same time:
df[cols] = pd.DataFrame(df.val.values.tolist())
The only thing left to do is to drop the val column:
df.drop(columns=['val'], inplace=True)
The result is:
predictions prediction probability
0 Creditworthy 0.552299 0.447701
Just as it should be.
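For convenience, here are the steps above collected into one runnable sketch, using the dictionary from the question:
import pandas as pd

dct = {'predictions': [{'fields': ['prediction', 'probability'],
                        'values': [['Creditworthy',
                                    [0.5522992460276774, 0.4477007539723226]]]}]}

cols = dct['predictions'][0]['fields']
df = pd.DataFrame(dct['predictions'][0]['values'], columns=['predictions', 'val'])
df[cols] = pd.DataFrame(df.val.values.tolist())
df.drop(columns=['val'], inplace=True)
print(df)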

Is there a faster way through list comprehension to iterate through two dataframes?

I have two dataframes: one contains screen names/display names and the other contains individuals. I am trying to create a third dataframe that contains all the data from both dataframes in a new row each time a last name appears in the screen name/display name. Functionally, this will create a list of possible matching names. My current code, which works correctly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')
# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')
userid, screen_name, real_name, last_name, first_name = [],[],[],[],[]
for index1, row1 in individuals.iterrows():
    for index2, row2 in usernames.iterrows():
        if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
            userid.append(row2['UserID'])
            screen_name.append(row2['Screen_Name'])
            real_name.append(row2['Real_Name'])
            last_name.append(row1['Last_Name'])
            first_name.append(row1['First_Name'])
cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row from each column, the list comprehension methods I have tried do not seem to work.
The thing you want is to iterate through a dataframe faster. Doing that with a list comprehension means taking data out of a pandas dataframe, handling it with plain Python operations, and then putting it back into a pandas dataframe. The fastest way (currently, with small data) is to handle it with pandas' own methods.
The next thing you want to do is work with two dataframes. Pandas has a tool for exactly this: merge (and the closely related join).
result = pd.merge(usernames, individuals, on=['Screen_Name', 'Last_Name'])
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
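Since the question is really about substring matches rather than exact key matches, one possible way to do the filtering after the merge is a cross join (pandas >= 1.2) followed by a row-wise containment check. A minimal sketch with made-up data; the column names follow the question:
import pandas as pd

usernames = pd.DataFrame({'UserID': [1, 2],
                          'Screen_Name': ['jsmith99', 'cool_cat'],
                          'Real_Name': ['John Smith', 'Anon']})
individuals = pd.DataFrame({'First_Name': ['John'], 'Last_Name': ['Smith']})

# Pair every username with every individual, then keep the pairs where the
# last name appears in either the screen name or the real name
pairs = usernames.merge(individuals, how='cross')
mask = [ln.lower() in sn.lower() or ln.lower() in rn.lower()
        for sn, rn, ln in zip(pairs['Screen_Name'], pairs['Real_Name'], pairs['Last_Name'])]
match_list = pairs[mask].reset_index(drop=True)
Note that the cross join can get large if both frames are big, so treat this as a sketch rather than a drop-in replacement.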