split pandas column into many columns - pandas

I have a dataframe like below:
  ColumnA         ColumnB ColumnC
0     usr       usr1,usr2      X1
1     xyz  xyz1,xyz2,xyz3      X2
2     abc  abc1,abc2,abc3      X3
What I want to do is:
split ColumnB by ","
The problem is that some cells of ColumnB have 3 values (xyz1,xyz2,xyz3), some have 6, etc. The number of values per cell is not fixed.
Expected output:
  ColumnA         ColumnB usercol1 usercol2 usercol3 ColumnC
0     usr       usr1,usr2     usr1     usr2        -      X1
1     xyz  xyz1,xyz2,xyz3     xyz1     xyz2     xyz3      X2
2     abc  abc1,abc2,abc3     abc1     abc2     abc3      X3

Create a new dataframe with str.split() and expand=True.
Then concat the first two original columns, the new expanded dataframe, and the third original column. This adapts to uneven list lengths.
import numpy as np
import pandas as pd

df1 = df['ColumnB'].str.split(',', expand=True).add_prefix('usercol')
df1 = pd.concat([df[['ColumnA', 'ColumnB']], df1, df[['ColumnC']]], axis=1).replace(np.nan, '-')
df1
Out[1]:
  ColumnA         ColumnB usercol0 usercol1 usercol2 ColumnC
0     usr       usr1,usr2     usr1     usr2        -      X1
1     xyz  xyz1,xyz2,xyz3     xyz1     xyz2     xyz3      X2
2     abc  abc1,abc2,abc3     abc1     abc2     abc3      X3
Technically, this could be done with one line as well:
df = pd.concat([df[['ColumnA', 'ColumnB']],
                df['ColumnB'].str.split(',', expand=True).add_prefix('usercol'),
                df[['ColumnC']]], axis=1).replace(np.nan, '-')
df
Out[1]:
  ColumnA         ColumnB usercol0 usercol1 usercol2 ColumnC
0     usr       usr1,usr2     usr1     usr2        -      X1
1     xyz  xyz1,xyz2,xyz3     xyz1     xyz2     xyz3      X2
2     abc  abc1,abc2,abc3     abc1     abc2     abc3      X3
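Note that add_prefix('usercol') numbers the new columns from 0 (usercol0, usercol1, ...), while the expected output starts at 1. If the 1-based names matter, the split columns can be renamed before concatenating; a minimal sketch under that assumption:
import pandas as pd

# Split, then turn the integer column labels 0, 1, 2, ... into usercol1, usercol2, ...
split_cols = df['ColumnB'].str.split(',', expand=True)
split_cols.columns = [f'usercol{i + 1}' for i in split_cols.columns]

df1 = pd.concat([df[['ColumnA', 'ColumnB']], split_cols, df[['ColumnC']]], axis=1).fillna('-')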

Related

making list of categorical columns with unique values greater than a specific number pandas

I have a DF with categorical, numeric and date columns. I want to make a list of all categorical columns that have more than 2 unique values. My df looks something like this:
date_time1               date_time2               cat_col1  cat_col_2  num_col1  num_col2  cat_col3
2020-10-08 19:09:21.884  2021-11-08 15:18:26.864  ABC       xyz        20        40        PQR
2020-10-08 19:09:21.884  2021-11-08 15:18:26.864  BCD       xyz        30        50        ABC
2020-10-08 19:09:21.884  2021-11-08 15:18:26.864  ABC       yza        40        30        MNO
2020-10-08 19:09:21.884  2021-11-08 15:18:26.864  CDE       xyz        10        80        CDE
2020-10-08 19:09:21.884  2021-11-08 15:18:26.864  BCD       xyz        20        70        MNO
I now want to get a list of only the categorical column names that have more than 2 unique values. So in this case it should be
mylist =['cat_col1', 'cat_col3']
Can someone please help me with this?
If you want to select the columns just by the name:
[col for col in df.columns if col.startswith('cat_') and df[col].nunique() > 2]
Result:
['cat_col1', 'cat_col3']
If you want to select by type:
[col for col in df.select_dtypes(include='category').columns if df[col].nunique() > 2]
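For reference, a small self-contained sketch with made-up data showing both variants. One caveat: select_dtypes(include='category') only finds columns that actually use the pandas category dtype; plain string columns are usually object dtype, so include='object' may be the right filter there.
import pandas as pd

# Toy frame (made-up data): two category-dtype columns plus a numeric one.
df = pd.DataFrame({
    'cat_col1': pd.Categorical(['ABC', 'BCD', 'ABC', 'CDE', 'BCD']),
    'cat_col3': pd.Categorical(['PQR', 'ABC', 'MNO', 'CDE', 'MNO']),
    'num_col1': [20, 30, 40, 10, 20],
})

# By name prefix.
by_name = [col for col in df.columns if col.startswith('cat_') and df[col].nunique() > 2]

# By dtype (works here because the columns were created as category dtype).
by_type = [col for col in df.select_dtypes(include='category').columns if df[col].nunique() > 2]

print(by_name)  # ['cat_col1', 'cat_col3']
print(by_type)  # ['cat_col1', 'cat_col3']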

check sorting by year and quarter pandas dataframe

I have a df that looks like below
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
The df has over 200 rows. I need to -
check if the data is sorted by date
if not, then sort it by date
Can someone please help me with this?
Many thanks
Use DataFrame.sort_values with the key parameter, converting the values to datetimes:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
print (df)
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
EDIT: You can use Series.is_monotonic_increasing (the old Series.is_monotonic alias was removed in pandas 2.0) to test whether the values are already monotonically increasing:
if not df['date'].is_monotonic_increasing:
    df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
You can wrap your date column in a pd.Index (or set it as the index of your dataframe):
if not pd.Index(df['date']).is_monotonic_increasing:
    df = df.sort_values('date')
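Putting the check and the sort together; a minimal runnable sketch on the sample data, where the monotonicity test is done on the parsed dates so the comparison is chronological rather than lexicographic:
import pandas as pd

df = pd.DataFrame({
    'date': ['2000 Q1', '2000 Q2', '2000 Q3', '2000 Q4', '2001 Q1'],
    'col1': [123, 234, 345, 456, 567],
    'col2': [456, 567, 678, 789, 890],
})

# Chronological sort key: '2000 Q1' -> Timestamp('2000-01-01').
def quarter_key(s: pd.Series) -> pd.Series:
    return pd.to_datetime(s.str.replace(r'\s+', '', regex=True))

# Only sort when the dates are not already in chronological order.
if not quarter_key(df['date']).is_monotonic_increasing:
    df = df.sort_values('date', key=quarter_key, ignore_index=True)

print(df)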

Assign value to the new column based on the other columns value pandas

This question may look like a repetition of ones answered before, but it is a bit tricky.
Let us say I have the following data frame.
Id Col_1
1 aaa
1 ccc
2 bbb
3 aa
Based on the columns Id and Col_1, I want to create a new column and assign a value to it by checking whether aa occurs in Col_1. The value should be applied per Id, i.e. every row that shares the same Id gets it.
The expected result:
Id  Col_1  New_Column
 1  aaa    aa
 1  ccc    aa
 2  bbb
 3  aa     aa
I tried it with this:
df['New_Column'] = ((df['Id']==1) | df['Col_1'].str.contains('aa')).map({True:'aa', False:''})
and the result is
Id  Col_1  New_Column
 1  aaa    aa
 1  ccc
 2  bbb
 3  aa     aa
But as I mentioned above, I want to assign aa in the new column for all rows with the same Id as well.
Can anyone help with this?
Use GroupBy.transform with GroupBy.any to get a mask covering every group that has at least one value containing aa:
mask = df['Col_1'].str.contains('aa').groupby(df['Id']).transform('any')
Alternative with Series.isin, filtering the Id values whose Col_1 contains aa:
mask = df['Id'].isin(df.loc[df['Col_1'].str.contains('aa'), 'Id'])
df['New_Column'] = np.where(mask, 'aa','')
print (df)
  Id Col_1 New_Column
0  1   aaa         aa
1  1   ccc         aa
2  2   bbb
3  3    aa         aa
EDIT:
mask1 = df['Id'].isin(df.loc[df['Col_1'].str.contains('aa'), 'Id'])
mask2 = df['Id'].isin(df.loc[df['Col_1'].str.contains('bb'), 'Id'])
df['New_Column'] = np.select([mask1, mask2], ['aa','bb'],'')
print (df)
  Id Col_1 New_Column
0  1   aaa         aa
1  1   ccc         aa
2  2   bbb         bb
3  3    aa         aa
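For reference, a self-contained version of the np.select variant (imports included). np.select checks the conditions in order, so the first matching mask wins, and rows that match neither fall back to the default '':
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 3], 'Col_1': ['aaa', 'ccc', 'bbb', 'aa']})

# Ids whose group contains at least one value with 'aa' / 'bb'.
mask1 = df['Id'].isin(df.loc[df['Col_1'].str.contains('aa'), 'Id'])
mask2 = df['Id'].isin(df.loc[df['Col_1'].str.contains('bb'), 'Id'])

df['New_Column'] = np.select([mask1, mask2], ['aa', 'bb'], default='')
print(df)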

How to groupby in pandas where i have column values starting with similar letters

Suppose I have a column whose values (not column names) are L1 xyy, L2 yyy, L3 abc; I want to group L1, L2 and L3 together as L (any other name would also do).
Similarly I have other values like A1 xxx, A2 xxx that should be grouped as A, and so on for the other letters.
How do I achieve this in pandas?
I have L1, A1 and so on all in the same column, not in different columns.
Use indexing with str[0] to return the first letter of the column and then aggregate with some function, e.g. sum:
df = pd.DataFrame({'col': ['L1 xyy', 'L2 yyy', 'L3 abc', 'A1 xxx', 'A2 xxx'],
                   'val': [2, 3, 5, 1, 2]})
print (df)
      col  val
0  L1 xyy    2
1  L2 yyy    3
2  L3 abc    5
3  A1 xxx    1
4  A2 xxx    2
df1 = df.groupby(df['col'].str[0])['val'].sum().reset_index(name='new')
print (df1)
col new
0 A 3
1 L 10
If you need a new column with the first letter:
df['new'] = df['col'].str[0]
print (df)
      col  val new
0  L1 xyy    2   L
1  L2 yyy    3   L
2  L3 abc    5   L
3  A1 xxx    1   A
4  A2 xxx    2   A
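If the leading group can be longer than one character (say AB1, AB2), str[0] is no longer enough. A minimal sketch of an alternative, assuming the group key is the run of letters before the first digit:
# Group key = the leading letters, e.g. 'AB12 xxx' -> 'AB', 'L1 xyy' -> 'L'.
df['prefix'] = df['col'].str.extract(r'^([A-Za-z]+)', expand=False)
df1 = df.groupby('prefix')['val'].sum().reset_index(name='new')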

Merge two DataFrames on multiple columns

Hope you can help me.
I have two pretty big datasets.
DF1 Example:
id  A_Workflow_Type_ID  B_Workflow_Type_ID  ...
 1                 123                 456  ...
 2                 789                 222  ...
 3                 333                NULL  ...
DF2 Example:
Workflow  Operation  Profile  Type       Name  ...
     123          1        2  Low_Cost   xyz   ...
     456          2        5  High_Cost  z     ...
I need to merge the two datasets without creating lots of NaNs and duplicated columns, so I want to match both A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 against Workflow from DF2.
I tried several join operations and pandas merge, but it fails.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error saying that left_on and right_on have to be of equal length.
Thanks for the help!
You need to reshape first with melt and then merge:
# select all columns whose names do not contain 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
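One thing to be aware of: the inner merge above drops DF1 rows whose workflow id has no match in DF2, such as id 3 with the NULL value. If those rows should be kept, a left merge can be used instead; a minimal sketch reusing the melted df from above:
# Keep every melted DF1 row, even when its Workflow value has no match in DF2.
all_Data = pd.merge(left=df, right=DF2, on='Workflow', how='left')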