Style.format column in dataframe to absolute values - pandas

I have the following problem. I have a column in a dataframe (let's call it df['Price']) and I need to format it to two decimal places, but for negative values I need the minus sign gone, since I already have conditional formatting that colors negative values red.
df.style.format({'Price': '{:,.2f}'})
This generic formatting works fine, but how do I change it to solve my problem? I basically need to send the absolute values of the column to the formatter instead of the actual values.

You can pass a callable to .format as well – see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Formatting-Values.
This should do the trick:
df.style.format({'Price': lambda value: f'{abs(value):,.2f}'})

As the pandas documentation notes, you can pass a callable as the formatter, so you can simply take the absolute value:
df.style.format({'Price': lambda x: f"{abs(x):,.2f}"})
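To tie the two pieces together, here is a minimal, self-contained sketch; the sample data and the red-for-negatives rule are assumptions based on the question, and note that Styler.applymap was renamed Styler.map in pandas 2.1:
import pandas as pd

# hypothetical data standing in for the question's 'Price' column
df = pd.DataFrame({'Price': [1250.5, -987.654, 42.0]})

styled = (
    df.style
    # show the magnitude only, to two decimal places
    .format({'Price': lambda v: f'{abs(v):,.2f}'})
    # the sign is conveyed by color instead of a minus sign
    .applymap(lambda v: 'color: red' if v < 0 else '', subset=['Price'])
)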

Formatting column in pandas to decimal places using table.style based on value

I am trying to format a column in a dataframe using style.
So far I successfully used the styling for a fixed number of decimals:
mytable.style.format('{:,.2f}', pd.IndexSlice[:, ['Price']])
but I need to expand this to format based on the value, like this:
if value is >=1000, then format to zero decimal places
if value is between 1000 and 1, then format to two decimal places
if value is < 1, then format to five decimal places
Does anyone have a solution for this?
Thank you!
Building upon @Code_beginner's answer – the callable should return the formatted string as output:
def my_format(val):
    if val >= 1000:
        return f"{val:,.0f}"
    if val >= 1:
        return f"{val:,.2f}"
    return f"{val:,.5f}"

mytable.style.format({'Price': my_format})
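If you want to keep the subset-style call from the question, Styler.format also accepts a subset argument alongside the callable (a sketch, assuming the same mytable):
mytable.style.format(my_format, subset=pd.IndexSlice[:, ['Price']])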
What you are looking for is called "conditional formatting". You set the conditions as you described and return the right format. There are examples in the documentation, where a lambda function is used, but you can also write a normal function, which might look something like this:
def customfunc(val):
    if val >= 1000:
        format = '{:,.0f}'
    if val < 1000 and val >= 1:
        format = '{:,.2f}'
    if val < 1:
        format = '{:,.5f}'
    return format

df.style.format({0: customfunc})
This should style your first column as described in your problem. If the column has a name, you have to adjust the key accordingly. If you have trouble, see the documentation linked above; there are more examples.
Just to make it visually clear, this is the line of code I ended up with:
df.style.format({'Price': customfunc})

replacing a column value in a dataframe using map and replace, the difference, using pandas

I can replace a couple of values in the column 'qualify' with True or False as follows, and it works just fine:
df['qualify'] = df['qualify'].map({'yes': True, 'no': False})
but if I use it to change a name in another column, it changes that name while turning every other value in the column to NaN.
df['name'] = df['name'].map({'dick':'Harry'})
Of course, using replace does the job right, but I need to understand why map() does not work correctly in the second case:
df['name']=df['name'].replace('dick','Harry')
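For reference, Series.map builds a completely new Series by looking every element up in the mapping, and anything missing from the dict comes back as NaN, whereas Series.replace only substitutes the values it matches. A minimal sketch with made-up data:
import pandas as pd

s = pd.Series(['tom', 'dick', 'harry'])

# map() looks up every element; 'tom' and 'harry' are not in the dict,
# so they become NaN
s.map({'dick': 'Harry'})      # NaN, 'Harry', NaN

# replace() substitutes only the matched value and leaves the rest alone
s.replace({'dick': 'Harry'})  # 'tom', 'Harry', 'harry'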

What's the difference between .col and ['col'] in pandas

I've been using pandas for a little while now and I've realised that I use
df.col
df['col']
interchangeably. Are they actually the same or am I missing something?
Following on from the link in the comments.
df.col
Simply refers to an attribute of the dataframe, similar to say
df.shape
Now if 'col' is a column name in the dataframe, then accessing this attribute returns the column as a Series. This will sometimes be sufficient, but
df['col']
will always work, and can also be used to add a new column to a dataframe.
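A short sketch of the difference (the column names here are made up):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})

df.col        # attribute access; works only when the name is a valid identifier
df['col']     # item access; always works

# only item access can create a new column; attribute assignment merely
# sets a plain instance attribute (pandas emits a warning)
df['doubled'] = df['col'] * 2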
I think this is kind of obvious.....
You cannot use df.col if the column name 'col' has a space in it. But df['col'] always works.
e.g.,
df['my col'] works, but df.my col will not.
I'll note there's a difference in how some methods consume data. For example, in the LifeTimes library, if I use dataframe.col with some methods, the method will treat the column as an ndarray and throw an exception saying the data must be 1-dimensional.
If however I use dataframe['col'] then the method will consume the data as expected.

pyspark.sql data.frame understanding functions

I am taking a MOOC.
It has an assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and couldn't figure out the problem with my syntax. How could I have figured out, from the online documentation and the function definitions, that my syntax was wrong?
I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation you are NOT working with a single string; you are working with a Column. A Column is a different object than a string, which is why you need a method that works with a Column.
Specifically, you are working with this pyspark.sql method. The next time you are in doubt about which method you need, double-check the datatype of your objects. Also, if you check the list of imports, you will see the code calls the lower method from pyspark.sql.functions.
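A quick way to see the distinction (a sketch, assuming a SparkSession is available as spark):
from pyspark.sql.functions import col, lower

df = spark.createDataFrame([('Hi, you!',)], ['sentence'])

# the cell values are strings, but col('sentence') is a Column expression,
# so str.lower() does not apply; lower() from pyspark.sql.functions
# builds a new Column expression instead
df.select(lower(col('sentence'))).show()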
This is how I managed to do it (as the body of removePunctuation):
    lowered = lower(column)
    np_lowered = regexp_replace(lowered, r'[^\w\s]', '')
    trimmed_np_lowered = trim(np_lowered)
    return trimmed_np_lowered
Or in a single expression:
    return trim(lower(regexp_replace(column, r'\p{Punct}', ''))).alias('sentence')

Pandas interpolate: slinear vs index

In pandas interpolate, what is the difference between using method='slinear' and method='index'?
method='slinear' will only interpolate "inside" values; it will not fill NaN values that do not have a valid value on both sides.
method='index' defaults to limit_direction='forward', which interpolates inside values and also fills anything after the last non-NaN value.
Notice that in the other posted example, if you set v to None where t is 100 or where t is 11 (the two ends of the sorted index), the result will not be interpolated when using method='slinear'. This is VERY useful for interpolating values inside a series when you don't want to extend the series.
Thanks to @TomAugspurger on GitHub for explaining this.
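A small sketch of that difference (method='slinear' requires SciPy; the series here is made up):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], index=[0, 1, 2, 4, 5])

# 'slinear' fills only the interior gap (index 2); the leading and
# trailing NaNs have no valid neighbour on both sides and stay NaN
s.interpolate(method='slinear')

# 'index' also interpolates by index position, but with the default
# limit_direction='forward' it fills past the last valid value too,
# so the trailing NaN becomes 3.0 (the leading NaN still stays NaN)
s.interpolate(method='index')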