Normalizing Data using pandas

Normalizing Data using pandas - pandas

I have below dataframe 'tt' in which second column 'underlier' is a list of dictionary keys where two keys are underliersecurityid and fxspot
dataframe tt
column = underlier values showing dictionary pair
I want to create a dataframe as an output that takes out the keys from underlier and puts against each enterprise id. eg:
EnterpriseID, underliersecurityid, fxspot
I am able to normalize the underlier column itself, however I keep loosing enterprise id. plz suggest if there is some way to handle this
tt = bn.iloc[:,[4,-7]]
tt
ttu = pd.DataFrame(bn.iloc[:,-7].values.tolist()).dropna()
ttu
ttu2 = pd.DataFrame(ttu.iloc[:,0].values.tolist()).dropna()
ttu2

Synthesising data. explode() the list then use json_normalize() on output of to_dict() to expand dict into columns
tt = pd.DataFrame([{"enterpriseid":"abcd","underlyer":[{"underlyersecurityid":"SWAP10Y","fmspot":[]}]}])
pd.json_normalize(tt.explode("underlyer").to_dict(orient="records"))
output
enterpriseid underlyer.underlyersecurityid underlyer.fmspot
abcd SWAP10Y []

Related

How to create an array of dataframe by filtering a data frame with a Seq of lists in Scala

One Df1 with all data, one of the column with vl1,vl3,vl4,vl2,vl3,vl4..etc and another dataframe valueDf with a single column which have unique values vl1,vl2,vl3..etc
I want to split valueDf into an array of 4 dataframes like
Df(0)=(vl1,vl4,vl5)
Df(1)=(vl3,vl2,vl6)
Df(2)=(vl7,vl8,vl9)... etc
So that i can do leftsemi join on Df1 and write in 4 different locations
/folder_Df(0)/
/foldwer_Df(1)/.. Etc
I tried doing a randomSplit as below
val Dfs = valueDf.randomSplit(Array.range(1,5).map(_.toDouble), 1)
But now I have set of values which need to go into Df(0),Df(1),Df(2) based on this list (vl1,vl4,vl5),(vl3,vl2,vl6),(vl7,vl8,vl9).

Pandas series replace value ignoring case but only if exact match

As Title says, I'm looking for a perfect solution to replace exact string in a series ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ref hold the key-value pair where key is the existing value in a dataframe column and value is value to replace with.
Issue with this case is, service that pass the dictionary always holds dictionary key in upper case where dataframe might have value in lowercase.
expected output is stored in 'Result Column.
I tried including re.ignore = True which changes the last 2 values.
following code but that is not working as expected. it also converting values to upper case from previous iteration.
for k,v in ls.items():
print (k,v)
d['Data'] = d['Data'].astype(str).str.upper().replace({k:v})
print (d)
I'd appreciate any help.

Create a mapping series from the given dictionary, then transform the index of the mapping series to lower case, then using Series.map map the values in Data column to the values in mappings, then use Series.fillna to fill the missing values in the mapped series:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values--Will only be interested in NodeID AND GSE
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, use pivot and get the following result
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This gives me the index column to be with two values. (The header is blank for the first one, and Layers for the second) i.e 1,L1.
What I need help on is:
I can not find a way to rename that first blank in the index to 'NodeID'.
I want to name it that so that I can do the lookup function and use NodeID in both dataframes so that I can bring in the 'GSE' values from the first dataframe to the second.
I have been googling way to rename that first column in the second dataframe and I can not seem to find an solution. Any ideas help at this point. I think my pivot function might be wrong...
This is a picture of dataframe #2 before pivot. The number 1-4 are the Node ID.
when I export it to csv to see what the dataframe looks like I get this..

Try
df.rename(columns={"Index": "your preferred name"})
if it is your index then do -
df = df.reset_index()
df.rename(columns={"index": "your preferred name"})

How do you split All columns in a large pandas data frame?

I have a very large data frame that I want to split ALL of the columns except first two based on a comma delimiter. So I need to logically reference column names in a loop or some other way to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to ( i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters and the result was 2 new columns as I wanted.
See python code below
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR
'str has ni str attribute'

You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.

I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd
your_dataframe=pd.DataFrame({'a':['1,2,3', '9,8,7'],
'b':['4,5,6', '6,5,4'],
'c':['7,8,9', '3,2,1']})
import copy
def split_cols(df):
dict_of_df = {}
cols=df.columns.to_list()
for col in cols:
key_name = 'df'+str(col)
dict_of_df[key_name] = copy.deepcopy(df)
var=df[col].str.split(',', expand=True).add_prefix(col)
df=pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
return df
split_cols(your_dataframe)
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and create new dataframes for each column where you run the split() function. Then you merge everything back together on the index. I also:
included a prefix of the column name, so the column names did not have duplicate names and could be more easily identifiable
dropped the old column that we did the split on.
Just import copy and use the split_cols() function that I have created and pass the name of your dataframe.

Is there a faster way through list comprehension to iterate through two dataframes?

I have two dataframes, one contains screen names/display names and another contains individuals, and I am trying to create a third dataframe that contains all the data from each dataframe in a new row for each time a last name appears in the screen name/display name. Functionally this will create a list of possible matching names. My current code, which works perfectly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')
# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')
userid, screen_name, real_name, last_name, first_name = [],[],[],[],[]
for index1, row1 in individuals.iterrows():
for index2, row2 in usernames.iterrows():
if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
userid.append(row2['UserID'])
screen_name.append(row2['Screen_Name'])
real_name.append(row2['Real_Name'])
last_name.append(row1['Last_Name'])
first_name.append(row1['First_Name'])
cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row from each column, the list comprehension methods I have tried do not seem to work.

The thing you want is to iterate through a dataframe faster. Doing that with a list comprehension is, taking data out of a pandas dataframe, handling it using operations in python, then putting it back in a pandas dataframe. The fastest way (currently, with small data) would be to handle it using pandas iteration methods.
The next thing you want to do is work with 2 dataframes. There is a tool in pandas called join.
result = pd.merge(usernames, individuals, on=['Screen_Name', 'Last_Name'])
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Normalizing Data using pandas - pandas

Related

How to create an array of dataframe by filtering a data frame with a Seq of lists in Scala

Pandas series replace value ignoring case but only if exact match

How to index a column with two values pandas

How do you split All columns in a large pandas data frame?

Is there a faster way through list comprehension to iterate through two dataframes?

Categories

Resources