How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate, and I would like to concatenate the remaining column values for OTC-07; i.e., after consolidation OTC-07 should have Type A,B and Assertion a,b,c,d. Can anyone help me with this? :o

First, define a dataframe with the given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.nan],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
     Ctrl Type Assertion
0  OTC-05    A     a,b,c
1  OTC-06    A       c,b
2  OTC-07    A       a,c
3  OTC-07    B     b,c,d
4  OTC-08  NaN     a,b,c
Then replace NaN values with empty strings:
df = df.replace(np.nan, '')
Then group by column 'Ctrl' and aggregate columns 'Type' and 'Assertion'. Please note that the Assertion aggregation is a bit tricky: you need not a simple concatenation, but a sorted list of unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
})
Output:
       Type Assertion
Ctrl
OTC-05    A     a,b,c
OTC-06    A       b,c
OTC-07  A,B   a,b,c,d
OTC-08          a,b,c
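As a variant (my sketch, not part of the answer above; it assumes pandas 0.25+ for DataFrame.explode and that the NaN in Type has already been replaced with '' as shown), you could also normalize the comma-separated Assertion values first and then group:
# Split each Assertion into a list, then give every letter its own row
exploded = (df.assign(Assertion=df['Assertion'].str.split(','))
              .explode('Assertion'))
# Aggregate back to one row per Ctrl; discard the '' placeholder in Type
exploded.groupby('Ctrl').agg({
    'Type': lambda x: ','.join(sorted(set(x) - {''})),
    'Assertion': lambda x: ','.join(sorted(set(x)))
})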

Related

How do I fillna using data from the left column as the reference

I'd like to ask for help fixing missing values in a pandas dataframe (Python).
In this dataset (posted as a screenshot) I found missing values in the 'Item_Weight' column.
I don't want to drop the missing values, because by sorting the data I found out that the missing values are a "miss-type" by whoever encoded the data (see the sorted dataset screenshot).
Now I have created a lookup dataset so I can merge it to fill the missing values.
How can I merge or join them only to fill the missing (NaN) values using the lookup table I made? Or is there another way without using a lookup table?
Looking at this, you will probably want to use something along the lines of map instead of join/merge. Here is an example of how to use map with your data.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, np.nan, 3]
})
df
df_map = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, 2, 3]
})
df_map
# Find the rows where the column you specify is null, then use your map
# dataframe to look up Column2 by Column1; keep the original value elsewhere
df['Column2'] = np.where(df['Column2'].isna(),
                         df['Column1'].map(df_map.set_index('Column1')['Column2']),
                         df['Column2'])
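For reference, running the snippet above on these sample frames should leave df with the gap in row 1 filled from df_map:
>>> df
  Column1  Column2
0       A      1.0
1       B      2.0
2       C      3.0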
I had to create my own dataframes since you used screenshots. In the future, please post your data as text rather than screenshots; it makes it much easier for others to help you.
This will probably work:
df = df.sort_values(['Item_Identifier', 'Item_Weight']).ffill()
But I can't test it since you didn't give us anything to work with.
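For illustration, here is a minimal sketch of that idea on made-up data (the Item_Identifier and Item_Weight column names come from the question; the values are invented):
import pandas as pd
import numpy as np

# Hypothetical data: each identifier has at least one row with a known weight
df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDA15', 'DRC01', 'DRC01'],
    'Item_Weight': [9.3, np.nan, np.nan, 11.8]
})

# Sorting puts the known weight first within each identifier,
# so forward-fill copies it into the missing rows below it
df = df.sort_values(['Item_Identifier', 'Item_Weight']).ffill()
print(df)
One caveat: if an identifier has no known weight at all, a plain ffill will borrow the previous identifier's value; a groupby('Item_Identifier') fill would avoid that.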

find a value from df1 in df2 and replace other values of the matching rows

I have the following code with 2 dataframes (df1 & df2)
import pandas as pd
data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns = ['Name', 'Number'])
df2 = pd.DataFrame(replace, columns = ['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' of df1 with the 'NewName' of df2, otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code that I tried worked a bit better, but it replaced the 'Name' of df1 with the 'Number' of df1 if there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I stop this behavior in my last code or are there better ways in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in the Number column of df1 with the corresponding value from the NewName column of df2, and then fill the NaN values (the values that could not be substituted) in the mapped column with the original values from the Name column of df1:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
       Name  Number
0     Name1     456
1  NewName2    A977
2     Name3    132a
3  NewName4   6783r
4  NewName5  868354
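A more verbose alternative (my sketch, starting again from the original df1) is a left merge followed by fillna:
# Bring in NewName where Number matches ID; rows without a match get NaN
merged = df1.merge(df2, left_on='Number', right_on='ID', how='left')
df1['Name'] = merged['NewName'].fillna(df1['Name'])
map is the better fit here, though, since only a single looked-up column is needed.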

Get names of dummy variables created by get_dummies

I have a dataframe with a very large number of columns of different types. I want to encode the categorical variables in my dataframe using get_dummies(). The question is: is there a way to get the column headers of the encoded categorical columns created by get_dummies()?
The hard way to do this would be to extract a list of all categorical variables in the dataframe, then append the different text labels associated with each categorical variable to the corresponding column headers. I wonder if there is an easier way to achieve the same end.
I think the way that should work with all the different uses of get_dummies would be:
#example data
import pandas as pd
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
                   'R': [2, 3, 4]})
dummies = pd.get_dummies(df)
#get column names that were not in the original dataframe
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
new_cols gives:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
I think the non-encoded columns are kept first by get_dummies (here the numeric column 'R' is the only column preserved as-is), so you could also just take the column names after the first column:
dummies.columns[1:]
which on this test data gives the same result:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
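If you also need the list of columns that were encoded (rather than the dummy names), a small sketch: get_dummies encodes object and category dtypes by default, so you can select those up front:
# Columns that get_dummies will encode by default
cat_cols = df.select_dtypes(include=['object', 'category']).columns
dummies = pd.get_dummies(df, columns=cat_cols)
# The dummy columns are exactly the ones not present in the original frame
new_cols = dummies.columns.difference(df.columns)
Note that Index.difference returns the names sorted, whereas the isin approach above preserves the order they appear in the encoded frame.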

Drop columns from Pandas dataframe if they are not in specific list

I have a pandas dataframe and it has some columns. I want to drop the columns that are not present in a list.
pandas dataframe columns:
list(pandas_df.columns.values)
Result:
['id', 'name' ,'region', 'city']
And my expected column names:
final_table_columns = ['id', 'name', 'year']
After some operation, the result should be:
list(pandas_df.columns.values)
['id', 'name']
Use Index.intersection to find the intersection of an index and a list of (column) labels:
pandas_df = pandas_df[pandas_df.columns.intersection(final_table_columns)]
You could use a list comprehension to build the list of column names to drop():
final_table_columns = ['id', 'name', 'year']
df = df.drop(columns=[col for col in df if col not in final_table_columns])
To do it in-place:
df.drop(columns=[col for col in df if col not in final_table_columns], inplace=True)
To do it in-place, consider Index.difference. This was not documented in any prior answer.
df.drop(columns=df.columns.difference(final_table_columns), inplace=True)
To create a new dataframe, Index.intersection also works.
df_final = df.drop(columns=df.columns.difference(final_table_columns))
df_final = df[df.columns.intersection(final_table_columns)] # credited to unutbu
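A quick end-to-end check on the column names from the question (the row values are made up):
import pandas as pd

pandas_df = pd.DataFrame({'id': [1], 'name': ['a'], 'region': ['r'], 'city': ['c']})
final_table_columns = ['id', 'name', 'year']

# Keep only the columns that appear in the list
pandas_df = pandas_df[pandas_df.columns.intersection(final_table_columns)]
print(list(pandas_df.columns.values))  # ['id', 'name']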

Pandas - possible to aggregate two columns using two different aggregations?

I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns: date, textA and textB - but want to apply "sum" to numberA, but "min" to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions to two different columns.
I.e. sum(numberA), min(numberB)
The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})
grouped = df.groupby('A')
print(grouped.agg({
    'number A': 'sum',
    'number B': 'min'}))
yields
     number B  number A
A
bar         2         9
foo         0        19
This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
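A quick way to surface such a hidden character is to print the repr of each name; for example (constructing a frame with a NO-BREAK SPACE on purpose):
import pandas as pd

# 'number\xa0A' prints just like 'number A' but is a different name
df = pd.DataFrame({'number\xa0A': [1, 2]})
for col in df.columns:
    print(repr(col))  # prints 'number\xa0A' -- the hidden character is now visible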