How to split names into different columns - apache-spark-sql

How to split the full name into different columns in pyspark.
input CSV:
Name,Marks
Sam Kumar Timberlake,83
Theo Kumar Biber,82
Tom Kumar Perry,86
Xavier Kumar Cruse,87
Output CSV should be:
FirstName,MiddleName,LastName,Marks
Sam,Kumar,Timberlake,83
Theo,Kumar,Biber,82
Tom,Kumar,Perry,86
Xavier,Kumar,Cruse,87

I am sure there is a better way, but the longer way is to do the work by hand. I duplicated the name column and manually cleaned the data into first, middle, and last names. I don't think there is any automated method that can tell you a person has two first names and one middle name, unless the person used a hyphen for a double first name or a double last name (born and married-into surnames); you have to use common sense for last names and be ready for mistakes. You have to do it manually unless, again, you are certain because you called the people up and know for sure.
The mathematical way would be to separate the last name from the rest. It is like calling someone John by their first name when they go by their middle name Gary: mistakes are inevitable as long as the person you address understands it is legally them. Not sure if that all makes sense.

This should work in your specific case:
import pyspark.sql.functions as F

# split the Name column on spaces into an array column
df = df.withColumn("arr", F.split(F.col("Name"), " "))

df = (
    df
    .withColumn('FirstName', F.col('arr').getItem(0))
    .withColumn('MiddleName', F.col('arr').getItem(1))
    .withColumn('LastName', F.col('arr').getItem(2))
)
If you want to include the case when someone has several middle names:
df = (
    df
    .withColumn('FirstName', df.arr.getItem(0))
    .withColumn('LastName', df.arr[F.size(df.arr) - 1])
)
# everything between the first and last name, with surrounding spaces trimmed
df = df.withColumn(
    'MiddleName',
    F.trim(F.expr("substring(Name, length(FirstName)+1, length(Name)-length(LastName)-length(FirstName))"))
)

Related

I am stuck writing an OR & AND condition in Python code. Actually there is one little tricky condition

Basically there are two columns: Customer_column and Company_column. Customer_column takes input from the customer, and there is a rule for how OR and AND conditions are written in it.
I want to check the values from one column against the other column. The tricky part is that the column values contain OR and AND conditions: the pipe (|) symbol stands for OR, and the comma (,) stands for AND.
For example:
(1) If Customer_column has (Oil|Leak), I want to check for 'oil' or 'leak'; at least one of them should be present in the corresponding Company_column value.
(2) If it contains ((Oil)|(Leak,engine)), I want to check whether the word 'oil' is present, or both 'leak' and 'engine' are present, in the corresponding column.
(3) If it contains ((Oil|Leak),engine), I want to check whether the combination 'oil' + 'engine' or the combination 'leak' + 'engine' is present in the corresponding column.
(4) If it contains ((Oil|Leak|Water),engine,Machine), I want to check whether 'oil' + 'engine' + 'machine', or 'leak' + 'engine' + 'machine', or 'water' + 'engine' + 'machine' is present in the corresponding column.
Below is my data frame:
import pandas as pd

data = {'Customer_column': ['(Oil|Leak)', '((Oil)|(Leak,engine))', '(Oil|Leak),engine)', '((Oil|Leak|Water),engine,Machine)', '(Leak,water,There)|(Mark,water,There)'],
        'Company_column': ['(leak is present in radiator)', '(engine is leaking)', '(water leak from radiator)', '(water & oil is available in engine machine)', '(there is a water leak mark at engine)']}
df = pd.DataFrame(data)
print(df)
Below is my expected output:
data = {'Customer_column': ['(Oil|Leak)', '((Oil)|(Leak,engine))', '(Oil|Leak),engine)', '((Oil|Leak|Water),engine,Machine)', '(Leak,water,There)|(Mark,water,There)'],
        'Company_column': ['(leak is present in radiator)', '(engine is leaking)', '(water leak from radiator)', '(water & oil is available in engine machine)', '(there is a water leak mark at engine)'],
        'Result': ['Leak', 'Leak,engine', 'None', 'oil engine machine,water engine machine', 'Leak water There,Mark water There']}
df = pd.DataFrame(data)
print(df)
I tried regex and the contains method to solve this. For the OR condition I got my result, but I am getting the wrong output where the AND condition is written.
import re

df['match'] = [m.group() if (m := re.search(fr'\b{re.escape(b)}\b', a, flags=re.I)) else None
               for a, b in zip(df['Customer_column'], df['Company_column'])]
The second code I tried:
def matches(cust, comp):
    words_comp = set(comp[1:-1].casefold().split())
    return '+'.join([x for x in cust[1:-1].split('|')
                     if set(x.casefold().split(','))
                        .issubset(words_comp)])

df['match'] = [matches(cust, comp) for cust, comp in
               zip(df['Customer_column'], df['Company_column'])]
df
The above one is somehow giving me the correct result, with some limitations around the brackets, but gives the wrong output for conditions (3) and (4).
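Not a full answer, but a minimal sketch of one way to check whether a whole Customer_column condition is satisfied by the Company_column text. It only returns True/False rather than the matching alternative, it assumes the parentheses in the expression are balanced (note the third sample row has a stray closing bracket), and condition_holds is a name introduced here for illustration. It also uses eval, so it should only be run on trusted input:
import re

def condition_holds(expr, text):
    # words that actually occur in the company text
    words = set(re.findall(r'\w+', text.casefold()))
    # replace every word in the expression with True/False depending on presence
    py_expr = re.sub(r'\w+', lambda m: str(m.group().casefold() in words), expr)
    # '|' means OR and ',' means AND
    py_expr = py_expr.replace('|', ' or ').replace(',', ' and ')
    return eval(py_expr)

condition_holds('((Oil|Leak|Water),engine,Machine)',
                '(water & oil is available in engine machine)')  # True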

How to read csv files correctly using pandas?

I have a CSV file like the one below. I need to check whether the maximum row length is greater than the number of header columns. For example:
name,age,profession
"a","24","teacher","cake"
"b",31,"Doctor",""
"c",27,"Engineer","tea"
If I try to read it using
print(pd.read_csv('test.csv'))
it will print as below.
name age profession
a 24 teacher cake
b 31 Doctor NaN
c 27 Engineer tea
But it's wrong. It happened because the header has fewer columns than the rows. So I need to identify this scenario as an invalid CSV format. What is the best way to test this, other than reading the file as strings and testing the length of each row?
And the important thing is that the columns can be different; there are no mandatory columns that must be present.
You can try passing header=None to .read_csv. Then pandas will throw a ParserError if the number of columns doesn't match the length of the rows. For example:
try:
    df = pd.read_csv("your_file.csv", header=None)
except pd.errors.ParserError:
    print("File Invalid")

Replacing substrings based on lists

I am trying to replace substrings in a data frame using the lists "name" and "lemma". As long as I enter the lists manually, the code delivers the result in the dataframe m.
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
m=sdf.replace(regex= name, value =lemma)
As soon as I read both lists from an Excel file, my code no longer replaces the substrings. I need to use an Excel file, since the lists are in a single table that is very large.
sdf= pd.read_excel('training_data.xlsx')
synonyms= pd.read_excel('synonyms.xlsx')
lemma=synonyms['lemma'].tolist()
name=synonyms['name'].tolist()
m=sdf.replace(regex= name, value =lemma)
Thanks for your help!
df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
In short, this method doesn't make changes at the series level, only on values.
This may achieve what you want:
m = sdf.replace(regex=synonyms['name'].tolist(), value=synonyms['lemma'].tolist())
If you are just trying to replace 'Charge' with 'Hallo', 'charge' with 'hallo', and 'Prepaid' with 'Hi', then you can use replace() and pass the list of words to find as the first argument and the list of words to replace with as the keyword argument value.
Try this:
df=df.replace(name, value=lemma)
Example:
name = ['Charge', 'charge', 'Prepaid']
lemma = ['Hallo', 'hallo', 'Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
                   ['Karen', 'V434', 'Prepaid', 'B442'],
                   ['Jill', 'V434', 'E333', 'charge'],
                   ['Hank', 'Charge', 'E333', 'B442']],
                  columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df = df.replace(name, value=lemma)
print(df)
Output:
Name ID_First ID_Second ID_Third
0 Bob Hallo E333 B442
1 Karen V434 Hi B442
2 Jill V434 E333 hallo
3 Hank Hallo E333 B442
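As for why the Excel-based version fails in the first place, one guess (a hedged sketch, not a confirmed diagnosis) is that columns read from Excel often contain NaN or non-string cells, which can break replace(regex=...). Cleaning the lists before using them may help:
import pandas as pd

sdf = pd.read_excel('training_data.xlsx')
synonyms = pd.read_excel('synonyms.xlsx')

# drop incomplete rows and force everything to str before building the lists
synonyms = synonyms.dropna(subset=['name', 'lemma'])
name = synonyms['name'].astype(str).tolist()
lemma = synonyms['lemma'].astype(str).tolist()

m = sdf.replace(regex=name, value=lemma)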

How to select columns based on value they contain pandas

I am working in pandas with a dataset that describes the population of a certain country per year. The dataset is constructed in a weird way: the years aren't the columns themselves; rather, the years are values within the first row of the set. The dataset describes every year from 1960 up until now, but I only need 1970, 1980, 1990, etc. For this purpose I've created a list with all those years and tried to make a new dataset which is equivalent to the old one but only has the columns that contain a value from said list, so I don't have all this extra info I'm not using. Online I can only find instructions for removing rows or selecting by column name; since neither criterion applies in this situation, I thought I should ask here.
The dataset is a CSV file which I've downloaded off some world population site. Here is a link to a screenshot of the data.
As you can see the years are given in scientific notation for some years, which is also how I've added them to my list.
pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv',
                  header=None, engine='python', skiprows=4)
display(pop)

years = ['1.970000e+03', '1.980000e+03', '1.990000e+03', '2.000000e+03', '2.010000e+03', '2.015000e+03', 'Country Name']
pop[pop.columns[pop.isin(years).any()]]
This is one of the things I've tried so far which I thought made the most sense, but I am still very new to pandas so any help would be greatly appreciated.
Using the data at https://data.worldbank.org/indicator/sp.pop.totl, copied into pastebin (first time using the service, so apologies if it doesn't work for some reason):
# actual code using CSV file saved to desktop
#df = pd.read_csv(<path to CSV>, skiprows=4)
# pastebin for reproducibility
df = pd.read_csv(r'https://pastebin.com/raw/LmdGySCf',sep='\t')
# manually select years and other columns of interest
colsX = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
         '1990', '1995', '2000']
dfX = df[colsX]

# select every fifth year
year_cols = df.filter(regex='19|20', axis=1).columns
colsY = year_cols[[int(col) % 5 == 0 for col in year_cols]]
dfY = df[colsY]
As a general comment:
The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set.
This is not correct. Viewing the CSV file, it is quite clear that row 5 (Country Name, Country Code, Indicator Name, Indicator Code, 1960, 1961, ...) does indeed hold the column names. You have read the data into pandas in such a way that those values are not column names, but your first step, before trying to subset your data, should be to ensure you have read the data in properly, which, in this case, would give you column headers named for each year.
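Tying that comment back to the original snippet, a minimal sketch (assuming the file has the usual World Bank layout, with the header on row 5) is to drop header=None so the year row becomes the column names, and then select the decades directly:
import pandas as pd

pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv', skiprows=4)
decades = ['Country Name', '1970', '1980', '1990', '2000', '2010', '2015']
pop_decades = pop[decades]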

Organizing data (pandas dataframe)

I have a data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
filepath = "./untitled.txt" # you need to change this to your file path
column_separator="\s{3,}" # we'll use a regex, I explain some caveats of this below...
# engine='python' surpresses a warning by pandas
# header=None is that so all lines are considered 'data'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0) # this takes column '0' and uses it as the dataframe index
df = df.T # this makes the data look like you were asking (goes from multiple rows+1column to multiple columns+1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need to make a few small adjustments:
filepath = "./untitled1.txt"
column_separator="\s{3,}"
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df.set_index(1).T.reset_index(drop=True) #I did all the 3 steps in 1 line, for brevity
In this last case, I would advise you to change the file so that line numbers appear on all lines (in the example you provided, the numbering starts at the second line; this might be an option in how you handle headers when exporting the data in whatever tool you are using).
Regarding the regex, the caveat is that "\s{3,}" looks for any block of 3 or more consecutive whitespace characters to determine the column separator. The problem is that we depend a bit on the data to find the columns. For instance, if 3 consecutive spaces happen to appear inside any of the values, pandas will raise an exception, since that line will have one more column than the others. One solution could be increasing the number to some other 'appropriate' value, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for two columns to be identified).
edit after realising what you meant by "stacked"
Whatever "line-number scenario" you have, you'll need to make sure you always have the same number of columns for all registers and reshape the continuous dataframe with something similar to this:
number_of_columns = 10  # you'll need to make sure all "registers" have the same number of columns, otherwise this will break
new_shape = (-1, number_of_columns)  # this tuple means "whatever number of lines" by 10 columns
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:10])
Again, take care to make sure that all registers have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns, wouldn't work). Also, this solution assumes all registers have the same fields in the same order.