Pyspark Dataframe - How to create new column with only first 2 words - pandas

I have a dataframe df with a column for the full name (first, middle & last). The column name is full_name and the words are separated by a space (delimiter).
I'd like to create a new column containing only the first and middle names.
I have tried the following
df = df.withColumn('new_name', split(df['full_name'], ' '))
But this returns all the words in a list.
I also tried
df = df.withColumn('new_name', split(df['full_name'], ' ').getItem(1))
But this returns only the 2nd word in the list (the middle name).
Please advise how to proceed with this.

Try this
import pyspark.sql.functions as F
split_col = F.split(df['full_name'], ' ')
df = df.withColumn('new_name', F.concat_ws(' ', split_col.getItem(0), split_col.getItem(1)))
df.show()

It took me some time, but I came up with this:
import pyspark.sql.functions as f

df1 = df.withColumn('first_name', f.split(df['full_name'], ' ').getItem(0))\
    .withColumn('middle_name', f.split(df['full_name'], ' ').getItem(1))\
    .withColumn('New_Name', f.concat(f.col('first_name'), f.lit(' '), f.col('middle_name')))\
    .drop('first_name')\
    .drop('middle_name')
This code works and the output is as expected, but I am not sure how efficient it is. If someone has a better idea, please reply.
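Since the thread is also tagged pandas, here is the equivalent in plain pandas for reference — a sketch with made-up sample names standing in for the question's dataframe:

```python
import pandas as pd

# Hypothetical sample data; the real df comes from the question.
df = pd.DataFrame({"full_name": ["John Paul Smith", "Mary Ann Jones"]})

# Split on spaces, keep the first two tokens, and join them back together.
df["new_name"] = df["full_name"].str.split(" ").str[:2].str.join(" ")
```

The `.str[:2]` step slices each split list, so names with only one word simply keep that word.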

Related

pandas can't replace commas with dots

Please help.
I have this dataset:
https://drive.google.com/file/d/1i9QwMZ63qYVlxxde1kB9PufeST4xByVQ/view
I can't replace the commas (',') with dots ('.').
When I load this dataset with:
df = pd.read_csv('/content/drive/MyDrive/data.csv', sep=',', decimal=',')
it still contains commas, for example in the value '0,20'.
When I try this code:
df = df.replace(',', '.')
it runs without errors, but the commas still remain, although other values in the dataset can be changed this way...
You can do it like this:
df = df.replace(',', '.', regex=True)
But keep in mind that you then need to convert the affected columns to a numeric type (float, since these are decimal values), because for now they are of type object.
You can check for those cases with the below command:
df.dtypes
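Putting both steps together, a minimal sketch of replacing the decimal comma and then converting to float — the values here are made up to mimic the dataset:

```python
import pandas as pd

# Hypothetical column of decimal-comma strings, dtype object.
df = pd.DataFrame({"value": ["0,20", "1,50", "3,75"]})

# Replace the comma with a dot, then cast the column to float.
df["value"] = df["value"].str.replace(",", ".", regex=False).astype(float)
```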

How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the comma (,) with a dash (-). I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, and somehow both regex=True and inplace=True failed, this should work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Here apply with a lambda works like a for loop over the column: x represents each entry in the column in turn.
The only things you need to change are "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_')
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost similar to the answer by Nancy K, but note that inside apply the lambda receives scalar strings, so .str is not available there; use the vectorised .str.replace on the column directly. This works for me:
data["column_name"] = data["column_name"].str.replace("characters_need_to_replace", "new_characters")
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000'] (these are still strings; chain .astype(int) to get numbers)
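A complete sketch of that example, including the cast back to integers — the frame is constructed by hand to mirror the sample column:

```python
import pandas as pd

# Hypothetical currency strings matching the example column above.
data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Strip '$' and ',' with a regex character class, then cast to int.
cleaned = data["Column_Name"].str.replace("[$,]", "", regex=True).astype(int)
```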

pandas fillna with fuzzy search on col names

I have a dataframe with many column names containing _paid (e.g. A_paid, B_paid, etc.). I need to fill missing values in any column that has _paid as part of its name. (Note: I am not allowed to replace missing values in columns without _paid in the name.)
I tried to use .fillna(), but I'm not sure how to make it do a fuzzy search on the column names.
If you want to select any column that has _paid in it:
paid_cols = df.filter(like="_paid").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid", regex=False)]
and then
df[paid_cols] = df[paid_cols].fillna(...)
If you need _paid to be at the end only, then with $ anchor in a regex:
paid_cols = df.filter(regex="_paid$").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid$")]
then apply the same fillna as above.
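A small end-to-end sketch of the filter-then-fillna approach — the column names and the fill value of 0 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two "_paid" columns with gaps, plus an unrelated column.
df = pd.DataFrame({
    "A_paid": [1.0, np.nan, 3.0],
    "B_paid": [np.nan, 2.0, np.nan],
    "other":  [np.nan, 5.0, 6.0],
})

# Select only the "_paid" columns and fill their missing values with 0.
paid_cols = df.filter(like="_paid").columns
df[paid_cols] = df[paid_cols].fillna(0)
# The "other" column keeps its NaN untouched.
```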

Different behaviour between two ways of dropping duplicate values in a dataframe

I tested two ways of dropping duplicated rows in a dataframe but they didn't obtain the same result and I don't understand why.
First code:
file_df1 = open('df1.csv', 'r')
df1_list = []
for line in file_df1:
    new_line = line.rsplit(',')
    df1_firstcolumn = new_line[0]
    if df1_firstcolumn not in df1_list:
        df1_list.append(df1_firstcolumn)
    #else:
    #    print('firstcolumn: ' + df1_firstcolumn + ' is duplicated')
file_df1.close()
The second-way using pandas:
import pandas as pd
df1 = pd.read_csv('df1.csv', header=None, names=['firstcolumn','second','third','forth'])
df1.drop_duplicates(inplace=True)
I obtained more unique values using pandas.
The first way you posted drops duplicates based on the data in the first column only.
The pandas drop_duplicates function, by default, only drops rows where the values in all four columns are duplicated. The version below removes duplicates based on the first column only:
df1.drop_duplicates(subset=['firstcolumn'], inplace=True)
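The difference can be seen on a tiny hand-made example (two columns are enough to show it):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share the first column but differ in the second.
df1 = pd.DataFrame({
    "firstcolumn": ["a", "b", "a"],
    "second": [1, 2, 3],
})

all_cols = df1.drop_duplicates()                          # compares all columns: keeps all 3 rows
first_only = df1.drop_duplicates(subset=["firstcolumn"])  # compares first column only: keeps 2 rows
```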

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values; I'm only interested in NodeID and GSE:
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This gives me a two-level index (the header is blank for the first level and 'Layer' for the second), i.e. 1, L1.
What I need help on is:
I cannot find a way to rename that first blank level in the index to 'NodeID'.
I want to name it that so that I can do a lookup using NodeID in both dataframes and bring the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and I cannot seem to find a solution. Any ideas help at this point. I think my pivot function might be wrong...
This is a picture of dataframe #2 before pivot. The numbers 1-4 are the Node IDs.
When I export it to csv to see what the dataframe looks like, I get this...
Try
df = df.rename(columns={"Index": "your preferred name"})
and if it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})
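A minimal sketch of the reset_index + rename route, using a hand-made frame whose unnamed index plays the role of the blank NodeID level:

```python
import pandas as pd

# Hypothetical frame with an unnamed index standing in for the node IDs.
df = pd.DataFrame({"GSE": [10.0, 20.0]}, index=[1, 2])

# reset_index turns the unnamed index into a column called 'index';
# rename then gives it the desired name.
df = df.reset_index()
df = df.rename(columns={"index": "NodeID"})
```

After this, NodeID is an ordinary column and can be used as a merge key against the first dataframe.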