Change Column Values in a Dataframe column using Pandas - pandas

The data type of the column is object, but I still cast it to string using astype(str). I even used temp['Injury Severity'].str.strip() to remove spaces from the column values.
I want to replace all "Fatal(0)", "Fatal(1)", ... with just "Fatal", so I used temp['Injury Severity'] = temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True).
That did not work. I also tried temp.loc[temp['Injury Severity'] == 'Fatal(0)','Injury Severity'] = temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True)
In addition, I tried str.replace, but that did not work either. Lastly, I used regex = True, but no change was observed; the values remain the same.

I think it is solved. It seems the values had leading and trailing spaces. Thanks a lot for the help, everyone!

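Try this:
temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True)
No need to assign it again. With inplace = True, replace returns None, so assigning the result back overwrites the column with None.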
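A minimal sketch of one way to combine the whitespace fix and the replacement without inplace, assuming the column holds values such as " Fatal(0)" or "Fatal(1) " (the sample rows below are made up for illustration). It also shows why regex = True with an unescaped pattern leaves the values unchanged: the parentheses are read as a regex group, so the pattern looks for "Fatal0".

```python
import pandas as pd

# Hypothetical sample data; the real column comes from the asker's dataset.
temp = pd.DataFrame(
    {"Injury Severity": [" Fatal(0)", "Fatal(1) ", "Non-Fatal", "Fatal(12)"]}
)

# Strip surrounding whitespace first, otherwise exact matches fail.
cleaned = temp["Injury Severity"].astype(str).str.strip()

# Escape the parentheses so they match literally; \d+ covers any count.
temp["Injury Severity"] = cleaned.str.replace(
    r"^Fatal\(\d+\)$", "Fatal", regex=True
)

# Unescaped, "(0)" is a capture group, so the pattern means "Fatal0".
unescaped = pd.Series(["Fatal(0)"]).str.replace("Fatal(0)", "Fatal", regex=True)

print(temp["Injury Severity"].tolist())
print(unescaped.tolist())
```

Assigning the result of str.replace back to the column avoids inplace entirely, which sidesteps the None assignment seen in the question.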

Related

replacing a column value in a dataframe using map and replace, the difference, using pandas

I can replace a couple of values in the column 'qualify' with True or False as follows, and it works just fine:
df['qualify'] = df['qualify'].map({'yes':True, 'np':False})
But if I use it to change a name in another column, it changes that name but turns every other value in the column into NaN.
df['name'] = df['name'].map({'dick':'Harry'})
Of course, using replace does the job right. But I need to understand why map() does not work as I expect in the second case.
df['name']=df['name'].replace('dick','Harry')
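The difference can be seen in a small sketch (the names in the frame are invented for illustration). Series.map looks up every value in the dictionary and yields NaN for any value that is not a key, whereas Series.replace only touches the values that match and leaves the rest alone. The 'qualify' example works only because every value in that column happens to be a key of the dictionary.

```python
import pandas as pd

# Hypothetical data: only one of the three names appears in the mapping.
df = pd.DataFrame({"name": ["tom", "dick", "jane"]})

# map: every value is looked up; missing keys become NaN.
mapped = df["name"].map({"dick": "Harry"})

# replace: only matching values change; others are kept as-is.
replaced = df["name"].replace("dick", "Harry")

# One way to keep map but fall back to the original value.
mapped_keep = df["name"].map({"dick": "Harry"}).fillna(df["name"])

print(mapped.tolist())
print(replaced.tolist())
print(mapped_keep.tolist())
```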

Crosstabs includes missing values

When I make a cross tab (using SPSS version 22), my missing values are included (see image below). This is something I do not want. If anyone could tell me how I could exclude the missing values that would be great :)
Edit: It looks like your variable como_af is a string. String variables do not have missing values, they just have blanks. You might want to consider recoding it into a numeric variable for easier analysis:
if como_af = "Yes" como_af_num = 1.
if como_af = "No" como_af_num = 2.
if como_af = "" como_af_num = $sysmis.
or alternatively:
recode como_af ("Yes"=1) ("No"=2) (""=sysmis) into como_af_num.
Now, if you cross nihss_mild by como_af_num, the blanks (now sysmis) will be excluded.
As Martin mentioned, you need to set your user-missing values.
I'll just mention that for String variables (such as in your case), blank values are not considered missing by default. If your variable were Numeric, blanks would automatically be treated as system-missing.
To set empty values in a String variable to missing you can use:
MISSING VALUES comor_AF (" ") .
Edit: Martin's updated solution does the trick too.

pyspark.sql data.frame understanding functions

I am taking a MOOC.
It has one assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick. But initially I thought that the syntax should be sentence=column.lower(). I looked at the documentation and I couldn't figure out the problem with my syntax. Would it be possible to explain how I could have figured out that my syntax was wrong by searching the online documentation and function definitions?
I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence=lower(column)
    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation, you are NOT working with a single string; you are working with a Column. A Column is a different object from a string, which is why you should use a function that works with a Column.
Specifically, you are working with this pyspark sql method. The next time you are in doubt about which method you need, double-check the data type of your objects. Also, if you check the list of imports, you will see that the code calls lower from pyspark.sql.functions.
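This is how I managed to do it:
    lowered = lower(column)
    # \w would also keep underscores, so list the allowed characters explicitly
    np_lowered = regexp_replace(lowered, r'[^a-z0-9\s]', '')
    trimmed_np_lowered = trim(np_lowered)
    return trimmed_np_lowered
Alternatively, as a single expression (\p{Punct} is the Java regex class for ASCII punctuation):
    return trim(lower(regexp_replace(column, r"\p{Punct}", ""))).alias('sentence')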
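To check the regular expression without a Spark session, the same clean-up can be sketched on plain Python strings with the re module. This is only an approximation of the Column version above: remove_punctuation is a hypothetical helper, and Python's re syntax is not identical to the Java regex engine that Spark uses. It does illustrate that a \w-based character class keeps the underscore, which the assignment's docstring says should be removed.

```python
import re

def remove_punctuation(sentence: str) -> str:
    """Lower-case, drop everything except letters, digits and whitespace, then trim."""
    lowered = sentence.lower()
    no_punct = re.sub(r"[^a-z0-9\s]", "", lowered)
    return no_punct.strip()

samples = ["Hi, you!", " No under_score!", " * Remove punctuation then spaces * "]
cleaned = [remove_punctuation(s) for s in samples]

# For comparison: \w matches the underscore, so it survives this pattern.
kept_underscore = re.sub(r"[^\w\s]", "", " No under_score!".lower()).strip()

print(cleaned)
print(kept_underscore)
```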

Storing a Value of a Set analysis expression in a Variable

I am struggling to store the value of a set analysis expression in a variable.
I want to store the value of the expression below in a variable so that I can use it in further calculations.
Min({< Data_Period = {'Weekly'},Formatted_Date = {'>$(=$(vSelectedWeek))'}>} Date,2)
The above expression works fine if I use it in a text box on a sheet. However, it does not work if I try to store its value in a variable and use that variable.
Set vW1 = Min({< Data_Period = {'Weekly'},Formatted_Date = {'>$(=$(vSelectedWeek))'}>} Date,2);
Here vSelectedWeek is being calculated as follows:
Set vSelectedWeek = Date(Weekstart(Only(BaseData_Date)),'dd/MM/YYYY');
Please advise if I am doing anything wrong, or whether there is another way to achieve the same.
Thanks in advance.
If your var is truly working with that expression then try creating an input box object, define your var there and add the expression in the right column.
That should work.
If you find my answer to be pretty simple or not the way you want it, checking this link might help: https://community.qlik.com/thread/198307

How should I perform data masking with pentaho PDI (spoon)?

I need to mask data in more than 10 tables, each with more than 100 columns.
I tried to mask the data using the Pentaho PDI tool, but I couldn't figure out how to do it.
How should I perform data masking with Pentaho?
I think one way is to use the step named "Replace in string", but I couldn't change any string when I tried it.
My questions are:
Is "Replace in string" the correct way to do data masking?
If it is, how should I fill in the values in the respective fields?
I want to replace some values with *. For example, if the value is "this is sample value", it should become "txxx xx xxxxx xxxxe", or something like this.
Please help.
It's not about Kettle, it's about regular expressions.
I can confirm that the "Replace in string" step behaves unpredictably when a regex is used inside it. The official docs also say very little about this step.
Anyway, you can use the Regex Evaluation step to capture the needed part and replace it inside the original string.
But there is a workaround which makes it easier:
JavaScript step with str.replace
This can be done with a JavaScript step, like:
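// value to mask (data_to_mask is the incoming field)
var str = data_to_mask;
// remember the first and last character if they are alphanumeric
var first = str.match(/^[A-Za-z0-9]/);
var last = str.match(/[A-Za-z0-9]$/);
// replace every alphanumeric character with "x"
str = str.replace(/[A-Za-z0-9]/g, "x");
// restore the first and last character; match() returns null when nothing matched
if (first) str = str.replace(/^[A-Za-z0-9]/, first[0]);
if (last) str = str.replace(/[A-Za-z0-9]$/, last[0]);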
(Simar's answer works as well I think and maybe it's a bit more elegant :)
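To check the expected output of the masking rule outside PDI, the same logic can be sketched in Python. This is an illustration only, not a PDI step: mask_value is a hypothetical helper, and it assumes the rule is "replace every ASCII letter or digit with x, but keep the first and last character of the value".

```python
import re

ALNUM = r"[A-Za-z0-9]"

def mask_value(value: str) -> str:
    """Replace letters and digits with 'x', keeping the first and last character."""
    masked = re.sub(ALNUM, "x", value)
    # Restore the first character only if it was one of the masked characters.
    if value and re.fullmatch(ALNUM, value[0]):
        masked = value[0] + masked[1:]
    # Same for the last character.
    if value and re.fullmatch(ALNUM, value[-1]):
        masked = masked[:-1] + value[-1]
    return masked

masked = mask_value("this is sample value")
print(masked)
```

Spaces and punctuation are left untouched, so word boundaries stay visible in the masked value.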