Replacing substrings based on lists - pandas

I am trying to replace substrings in a dataframe using the lists name and lemma. As long as I enter the lists manually, the code delivers the expected result in the dataframe m.
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
m=sdf.replace(regex= name, value =lemma)
As soon as I read both lists from an Excel file, the code no longer replaces the substrings. I need to use an Excel file, since the lists are in a single table that is very large.
sdf= pd.read_excel('training_data.xlsx')
synonyms= pd.read_excel('synonyms.xlsx')
lemma=synonyms['lemma'].tolist()
name=synonyms['name'].tolist()
m=sdf.replace(regex= name, value =lemma)
Thanks for your help!

df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
In short, this method does not make changes at the Series level, only on the values themselves.
This may achieve what you want:
m = sdf.replace(regex=synonyms['name'].tolist(), value=synonyms['lemma'].tolist())
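If the manual lists work but the Excel-derived ones do not, one likely cause is that the columns read from Excel contain empty or non-string cells, which end up as NaN/float entries in the lists. A minimal sketch, assuming the file and column names from the question, that cleans the lists before replacing:
import pandas as pd

sdf = pd.read_excel('training_data.xlsx')
synonyms = pd.read_excel('synonyms.xlsx')

# Assumption: empty or numeric cells in the Excel columns become NaN/float
# entries in the lists, which breaks the regex replacement.
synonyms = synonyms.dropna(subset=['name', 'lemma']).astype(str)

name = synonyms['name'].tolist()
lemma = synonyms['lemma'].tolist()

m = sdf.replace(regex=name, value=lemma)
print(m)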

If you are just trying to replace 'Charge' with 'Hallo', 'charge' with 'hallo', and 'Prepaid' with 'Hi', then you can use replace() and pass the list of words to find as the first argument and the list of replacement words as the value keyword argument.
Try this:
df=df.replace(name, value=lemma)
Example:
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
                   ['Karen', 'V434', 'Prepaid', 'B442'],
                   ['Jill', 'V434', 'E333', 'charge'],
                   ['Hank', 'Charge', 'E333', 'B442']],
                  columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df=df.replace(name, value=lemma)
print(df)
Output:
    Name ID_First ID_Second ID_Third
0    Bob    Hallo      E333     B442
1  Karen     V434        Hi     B442
2   Jill     V434      E333    hallo
3   Hank    Hallo      E333     B442


Pandas Replace column values

Hello,
I am analyzing the following dataset.
The column ['program_number'] is an object, but I want to change it to an integer column.
I have tried to replace some values, but it doesn't work.
As you can see, some values like 6 are duplicated, e.g. '6 ' and 6.
How can I resolve this? Many thanks
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If there are values in the column which aren't numbers or which shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Cells that can't be converted become NaN. Be aware that this turns the column into a floating-point dtype.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
Old answer
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)

pandas contains regex

I would like to match all cells that begin with the number 978. But the following code matches 397854 or NaN too.
an_transaction_product["kniha"] = np.where(an_transaction_product["zbozi_ean"].str.contains('^978', regex=True) , 1, 0)
What am I doing wrong, please?
This doesn't work because .str.contains will check if the regex occurs anywhere in the string.
If you insist on using regex, .str.match does what you want.
But for this simple case .str.startswith("978") is clearer.
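A minimal sketch of both alternatives, assuming zbozi_ean is a string column as in the question (na=False also keeps NaN rows from being counted as matches):
import numpy as np
import pandas as pd

an_transaction_product = pd.DataFrame({"zbozi_ean": ["9781234", "397854", None]})

# .str.match anchors the regex at the start of the string.
an_transaction_product["kniha"] = np.where(
    an_transaction_product["zbozi_ean"].str.match("978", na=False), 1, 0
)

# Equivalent without regex: .str.startswith
an_transaction_product["kniha"] = np.where(
    an_transaction_product["zbozi_ean"].str.startswith("978", na=False), 1, 0
)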
Apart from regex, you can use .loc to find cells that start with '978'. The code below will assign 1 to such cells in column 'A', just as an example:
df.loc[df['A'].astype(str).str[:3]=='978', 'A'] = 1
Note: astype(str) converts the number to a string, str[:3] takes the first 3 characters, and the result is compared to '978'.

How to split the names into different columns

How can I split the full name into different columns in PySpark?
input CSV:
Name,Marks
Sam Kumar Timberlake,83
Theo Kumar Biber,82
Tom Kumar Perry,86
Xavier Kumar Cruse,87
Output CSV should be:
FirstName,MiddleName,LastName,Marks
Sam,Kumar,Timberlake,83
Theo,Kumar,Biber,82
Tom,Kumar,Perry,86
Xavier,Kumar,Cruse,87
I am sure there is a better way, but the longer way is to do the work by hand. I created a copy of the names and manually split the data into first, middle, and last names. I don't think any program can tell you that a person has two first names and one middle name unless the person used a dash for a double first name or double last name (born and married-into last names); you have to use common sense for last names and be ready for mistakes. You have to do it manually unless, again, you are certain because you called them up and know for sure.
The mathematical way would be to separate the last name from the rest. It is like calling someone by their first name, John, when they go by their middle name, Gary. Mistakes are inevitable as long as the person you address understands it is legally them. Not sure if all of that makes sense.
This should work in your specific case:
import pyspark.sql.functions as F
df = df.withColumn(
    "arr", F.split(F.col("Name"), " ")
)
df = (
    df
    .withColumn('FirstName', F.col('arr').getItem(0))
    .withColumn('MiddleName', F.col('arr').getItem(1))
    .withColumn('LastName', F.col('arr').getItem(2))
)
If you want to include the case when someone has several middle names:
df = (
    df
    .withColumn('FirstName', df.arr.getItem(0))
    .withColumn('LastName', df.arr[F.size(df.arr)-1])
)
df = df.withColumn(
    'MiddleName',
    F.trim(F.expr("substring(Name, length(FirstName)+1, length(Name)-length(LastName)-length(FirstName))"))
)
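For reference, a self-contained sketch that builds the sample data from the question and applies the simple three-part split (the SparkSession setup is boilerplate, not part of the original answer):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Sam Kumar Timberlake", 83),
     ("Theo Kumar Biber", 82),
     ("Tom Kumar Perry", 86),
     ("Xavier Kumar Cruse", 87)],
    ["Name", "Marks"],
)

df = df.withColumn("arr", F.split(F.col("Name"), " "))
df = (
    df
    .withColumn("FirstName", F.col("arr").getItem(0))
    .withColumn("MiddleName", F.col("arr").getItem(1))
    .withColumn("LastName", F.col("arr").getItem(2))
    .drop("Name", "arr")
)

df.select("FirstName", "MiddleName", "LastName", "Marks").show()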

Need explanation on how pandas.drop is working here

I have a dataframe, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
But I did not understand how this code works; can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either by row or by column. In this particular case, df.drop(..., 1) means you're dropping along axis 1, i.e., by column.
xyz.loc[:, ... ].columns
will return the column names (an Index) resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
This instruction counts the nulls in each column and normalizes by the number of rows, effectively computing the percentage of NaN in each column. The amount is then rounded to 2 decimal places, and finally it returns True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you use Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are finally used by the .drop instruction.
Hopefully this clears things up!
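A small worked sketch of those steps, with made-up data, may make the chain easier to follow:
import numpy as np
import pandas as pd

xyz = pd.DataFrame({
    "abc": [1, 2, np.nan, 4],
    "ghi": [np.nan, np.nan, np.nan, 1],   # 75% null
})

pct_null = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
print(pct_null)                            # abc 25.0, ghi 75.0
print(pct_null > 70)                       # abc False, ghi True (Boolean mask per column)
print(xyz.loc[:, pct_null > 70].columns)   # Index(['ghi'], dtype='object')

xyz = xyz.drop(xyz.loc[:, pct_null > 70].columns, axis=1)
print(xyz.columns)                         # only 'abc' remains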
If the code is hard to understand, you can just use dropna with thresh, since pandas already covers this case.
df = df.dropna(axis=1, thresh=round(len(df)*0.3))

Iteration in Spark SQL dataframe, getting the 1st row value in the first iteration and the second row value in the next iteration, and so on

Below is the query that gives the date and distance where the distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
This will give output like:
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
...
In the first iteration, I want to use the first row of the dataframe s and store the date value in a variable v.
In the next iteration it should pick the second row, and the corresponding date value should replace the old value of the variable,
and so on.
I think you should look at Spark "Window Functions"; you may find what you need there.
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach() :
df.foreach(lambda x: <<your code here>>)
foreach() takes a lambda function as an argument and applies it to each row of the dataframe without bringing all the data into the driver. But you can't use a simple local variable v inside the lambda function when overwriting is involved; you can use Spark accumulators for such a case.
E.g., if I want to sum all the values in the 2nd column:
counter = sc.accumulator(0)
df.foreach(lambda row: counter.add(row[1]))
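A minimal self-contained sketch of the accumulator approach (the column names mirror the question, but the data here is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

s = spark.createDataFrame(
    [("12/05/2018", 5), ("13/05/2018", 8), ("14/05/2018", 9)],
    ["date", "distance"],
)

# Accumulator shared between the driver and the executors.
total_distance = sc.accumulator(0)

# Runs on the executors; a plain Python variable would not be updated here.
s.foreach(lambda row: total_distance.add(row["distance"]))

print(total_distance.value)   # 22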