Replacing trailing space in a pandas column

If I have a string with a space at the end, I can use rstrip() to remove the trailing space.
word='- Keane '
"".join(word.rstrip())
it returns '- Keane' which is what I want.
However, it doesn't do the same when passed over a pandas column using the apply method. Here is what I have (the first two rows of the column WL2019['Location']):
Location
'- Keane '
'- PBC-CALTEX '
I want:
'- Keane'
'- PBC-CALTEX'
The code I use:
WL2019['NewLoc']=WL2019['Location'].apply(lambda x: "".join(str(x).rstrip()))
But it doesn't do anything; it basically outputs the same as the Location column. Does anyone know why, and how I can get this fixed?
Thanks
EDIT: Okay, I failed to explain clearly what I have been doing. This is the problem:
I had a string column from which I had to extract the part of each entry between two dashes, like this:
'v102- Keane - ARC'
'v103- PBC-CALTEX -BARS'
I used the code below to extract the middle part. Once you do that, the output in each entry is of list type, and we can't use strip() on lists. I had to go through the mumbo jumbo below to fix it. I found a solution, but it isn't efficient yet; I might post a better solution later.
import re

def location(a):
    pat = r'[\s]+[\w\W]+[\s]+'
    pattern = re.compile(pat, re.IGNORECASE)
    return re.findall(pattern, a)  # returns a list for each entry

WL2019['NewLoc'] = WL2019['Account'].apply(location)
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip('[]'))  # drop the list brackets
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip("''"))  # drop single quotes
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip('""'))  # drop double quotes
WL2019['NewLoc'] = WL2019['NewLoc'].replace('- ', '', regex=True).replace(' -', '', regex=True)

Use pandas.Series.str.rstrip:
WL2019['NewLoc'] = WL2019['Location'].str.rstrip()
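For example, reproducing the data from the question (values assumed from the post):
import pandas as pd

WL2019 = pd.DataFrame({'Location': ['- Keane ', '- PBC-CALTEX ']})
WL2019['NewLoc'] = WL2019['Location'].str.rstrip()
print(WL2019['NewLoc'].tolist())  # ['- Keane', '- PBC-CALTEX']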

Related

Formatting column in pandas to decimal places using table.style based on value

I am trying to format a column in a dataframe using style.
So far I successfully used the styling for a fixed number of decimals:
mytable.style.format('{:,.2f}', pd.IndexSlice[:, ['Price']])
but I need to expand this to formatting based on value, like this:
if value is >=1000, then format to zero decimal places
if value is between 1000 and 1, then format to two decimal places
if value is < 1, then format to five decimal places
Does anyone have a solution for this?
Thank you!
Building upon @Code_beginner's answer, the callable should return the formatted string as output:
def my_format(val):
    if val >= 1000:
        return f"{val:,.0f}"
    if val >= 1:
        return f"{val:,.2f}"
    return f"{val:,.5f}"

mytable.style.format({'Price': my_format})
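As a quick sanity check, a toy frame (values invented here; only the 'Price' column name comes from the question) exercises all three branches:
import pandas as pd

mytable = pd.DataFrame({'Price': [1234.567, 12.3456, 0.123456]})
mytable.style.format({'Price': my_format})
# Renders as: 1,235 / 12.35 / 0.12346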
What you are looking for is called "conditional formatting", in which you set the conditions like you described and return the right format. There are examples in the documentation, where a lambda function is used, but you can also create a normal function, which might look something like this:
def customfunc(val):
    if val >= 1000:
        fmt = '{:,.0f}'
    if val < 1000 and val >= 1:
        fmt = '{:,.2f}'
    if val < 1:
        fmt = '{:,.5f}'
    return fmt

df.style.format({0: customfunc})
This should style your first column as described in your problem. If the column has a name, you have to adjust the key accordingly. If you have trouble, see the documentation linked above; there are more examples there.
Just to have it visually clear, this is my line of code:
df.style.format({'Price': customfunc})

How to convert object dtype with value '$1.2M' to float dtype in pandas [duplicate]

Trying to remove the commas and dollar signs from the columns. But when I do, the table prints them out and still has them in there. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.
import pandas as pd
import pandas_datareader.data as web
players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')
df1 = pd.DataFrame(players[0])
df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')
print (df1.head(10))
You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '')
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace('$', '')
df1['Avg_Annual'] = df1['Avg_Annual'].astype(int)
Alternatively:
df1['Avg_Annual'] = df1['Avg_Annual'].str.replace(',', '').str.replace('$', '').astype(int)
if you want to prioritize time spent typing over readability.
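One caveat worth flagging: in regex mode, '$' is an end-of-string anchor, and the default of the regex flag in Series.str.replace has changed across pandas versions, so it is safer to be explicit that the characters are literals:
# Passing regex=False makes '$' a literal character on any pandas version:
df1['Avg_Annual'] = (
    df1['Avg_Annual']
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .astype(int)
)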
Shamelessly stolen from this answer... but that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, in any number of columns.
# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']
# pass them to df.replace(), specifying each char and its replacement:
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
@shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns, or on both the dollar sign and comma simultaneously).
This answer is simply spelling out the details I found from others in one place for those like me (e.g. noobs to Python and pandas). Hope it's helpful.
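A minimal end-to-end sketch of that dictionary form (toy data and column names are mine, not from the original answer), finishing with the numeric cast:
import pandas as pd

df = pd.DataFrame({'col1': ['$1,200', '$850'], 'col2': ['$3,400', '$99']})
cols = ['col1', 'col2']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
print(df)  # all four values are now plain floats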
@bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.
Often the source of the data is a report generated for direct consumption, hence the presence of extra formatting like %, thousands separators, currency symbols, etc. All of these are useful for reading but cause problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one, then cast it back to the appropriate numerical format. A boilerplate function that retains only [0-9.] is tempting, but it causes problems where the thousands separator and decimal point are swapped, and also in the case of scientific notation. Here's my code, which I wrap into a function and apply as needed.
df[col] = df[col].astype(str)  # cast to string
# all the string surgery goes in here; Series.replace only matches whole cell
# values, so use str.replace (with regex=False to keep each symbol literal)
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousands separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)
df[col] = df[col].astype(float)  # cast back to the appropriate numerical type
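Wrapped into a reusable helper along the lines described above (the function name and symbol list are illustrative, not from the original answer):
def clean_numeric_column(df, col, symbols=('$', ',', '%')):
    """Strip formatting symbols from df[col] and cast the column to float."""
    df[col] = df[col].astype(str)
    for sym in symbols:
        df[col] = df[col].str.replace(sym, '', regex=False)
    df[col] = df[col].astype(float)
    return df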
I used this logic:
df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
When I got to this problem, this was how I got out of it.
df['Salary'] = df['Salary'].str.replace('$', '', regex=False).astype(float)  # regex=False so '$' is treated literally

Need to extract specific text from a column on excel using either Alteryx or Pandas

I have a column containing a specific piece of text that needs to be retained, with the rest removed or moved to another column. Unfortunately, I am not able to use a normal text-to-columns approach due to the variation in the text arrangement.
For example, I need the word Issue and the ID associated with it to be separated out. I am struggling to figure out a way to do this given the variation in how the text is arranged.
If someone can help me find a solution using Alteryx would be much appreciated, if not Pandas would also work.
Thanks all.
Use str.extract with a pattern to extract the specific text from the DataFrame [pandas]:
df['After'] = df['Before'].str.extract(pat=r'(ISSUE \d+|issue \d+)', expand=False)
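On invented sample rows (the post only describes the layout, so the data here is hypothetical), that looks like:
import pandas as pd

df = pd.DataFrame({'Before': ['Ticket closed, ISSUE 4521, follow-up',
                              'pending review, issue 77, urgent']})
df['After'] = df['Before'].str.extract(pat=r'(ISSUE \d+|issue \d+)', expand=False)
print(df['After'].tolist())  # ['ISSUE 4521', 'issue 77']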
For an Alteryx-only solution, the easiest way would be an Alteryx Formula using REGEX_Replace:
REGEX_Replace([Before], ".*(issue \d+).*", "$1", 1)
If you don't like RegEx, basic string manipulations can do it also: basically it's a Substring...
Substring([Before], *starting index*, *length*)
The starting index is easy: it's just FindString([Before],"ISSUE")
The length isn't too hard either: it's the index (using FindString again) of the first comma within the substring that starts at "ISSUE", i.e. within SubString([Before], FindString([Before], "ISSUE")).
Combining all that and spreading it out a bit:
Substring(
    [Before],
    FindString([Before], "ISSUE"),
    FindString(
        SubString(
            [Before],
            FindString([Before], "ISSUE")
        ),
        ","
    )
)
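If you end up doing this in pandas instead, the same find-then-slice idea can be sketched as below (assuming, as above, a 'Before' column where the fragment starts at 'ISSUE' and ends at the next comma):
def extract_issue(text):
    # Mirror of the Alteryx logic: locate 'ISSUE' (case-insensitively),
    # then cut at the next comma, or take the rest of the string.
    start = text.upper().find('ISSUE')
    if start == -1:
        return None
    end = text.find(',', start)
    return text[start:end] if end != -1 else text[start:]

df['After'] = df['Before'].apply(extract_issue)  # assumes non-null string cells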

pyspark.sql data.frame understanding functions

I am taking a MOOC.
It has one assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and couldn't figure out the problem with my syntax. Would it be possible to explain how I could have figured out that my syntax was wrong by searching the online documentation and function definitions?
I am especially confused, as this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)],
                                        ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation you are NOT working with a single string; you are working with a Column. A Column is a different object than a string, which is why you must use a method that works with a Column.
Specifically, you are working with this pyspark.sql method. The next time you are in doubt about which method you need, double-check the datatype of your objects. Also, if you check the list of imports, you will see it is calling the lower method from pyspark.sql.functions.
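A quick way to see the difference for yourself (an illustrative snippet, run inside an active Spark session; the exact error message may vary by PySpark version):
from pyspark.sql.functions import col, lower

"hello".lower()            # fine: Python strings have a .lower() method
# col('sentence').lower()  # fails: Column has no string methods (raises a TypeError)
lower(col('sentence'))     # correct: builds a Column expression that Spark evaluates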
This is how I managed to do it:
# inside removePunctuation(column):
lowered = lower(column)
np_lowered = regexp_replace(lowered, r'[^\w\s]', '')
trimmed_np_lowered = trim(np_lowered)
return trimmed_np_lowered
return trim(lower(regexp_replace(column, r'\p{Punct}', ''))).alias('sentence')

OpenRefine remove duplicates from list with jython

I have a column with values that are duplicated e.g.
VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113
I'm applying a transform using Jython that removes the duplicates (On error is set to "keep original"); here's the code:
return list(set(value.split(",")))
Which works in the preview, but isn't getting applied to the column. What am I doing wrong?
The map function is a powerful and underused tool in Python / Jython. It may be unclear what this code does internally, but it is extremely fast: it casts every value in the de-duplicated list to a string so that they can be joined back together with a separator character such as ', '.
deduped_list = list(set(value.split(",")))
return ', '.join(map(str, deduped_list))
There are probably other, even slightly faster variations than this, but this should get you going in the right direction.
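One such variation, for what it's worth: set() does not preserve the original order of the values, so if order matters, a seen-set sketch (same assumption of comma-separated input) keeps it:
# Order-preserving de-duplication; works in OpenRefine's Jython 2.x:
seen = set()
deduped = []
for v in value.split(","):
    if v not in seen:
        seen.add(v)
        deduped.append(v)
return ', '.join(deduped)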
Interestingly, you can also get the 'printable representation' with repr(object), which is acceptable to an eval like OpenRefine's and can be useful for inspecting the representation of your values; I just found this out while researching this answer in more depth for you.
deduped_list = list(set(value.split(",")))
return ', '.join(map(repr, deduped_list))
Preview implicitly formats things for display. Your expression returns a list (which can't be stored in a cell), so if you'd like it in string form, tack a ', '.join(...) on the end as shown above.