pyspark.sql data.frame understanding functions - apache-spark-sql

I am taking a MOOC.
It has an assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and couldn't figure out what was wrong with my syntax. Could someone explain how I could have worked out that my syntax was wrong by searching the online documentation and the function definitions?
I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.
    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.
    Args:
        column (Column): A Column containing a sentence.
    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))

You are correct. When you are working with a string and want to convert it to lowercase, you should use str.lower().
If you check the string page in the Python documentation, you will see that str has a lower method that works as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, your function removePunctuation is NOT working with a single string; it is working with a Column. A Column is a different object from a string, which is why you need a function that works with a Column.
Specifically, you are working with this pyspark.sql function. The next time you are in doubt about which method to use, double-check the data type of your objects. Also, if you check the list of imports, you will see that lower is imported from pyspark.sql.functions.

This is how I managed to do it:
lowered = lower(column)
# \w also matches the underscore, so list the characters to keep explicitly
np_lowered = regexp_replace(lowered, r'[^a-z0-9\s]', '')
trimmed_np_lowered = trim(np_lowered)
return trimmed_np_lowered

return trim(lower(regexp_replace(column, r"\p{Punct}", ""))).alias('sentence')
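To check the expected results without a Spark session, the same three steps can be mirrored on ordinary Python strings. This is only a sketch: clean_sentence is a hypothetical helper, not part of the assignment, and Python's re module stands in for Spark's Java-based regexp_replace (which is why an explicit character class is used instead of \p{Punct}, which re does not support).

```python
import re

def clean_sentence(text):
    """Plain-Python mirror of the Column pipeline: lower -> drop punctuation -> trim."""
    lowered = text.lower()                          # str method, works on a str
    no_punct = re.sub(r'[^a-z0-9\s]', '', lowered)  # keep letters, digits, whitespace
    return no_punct.strip()                         # drop leading/trailing spaces

samples = ['Hi, you!', ' No under_score!', ' * Remove punctuation then spaces * ']
for s in samples:
    print(repr(clean_sentence(s)))
```

The Spark version has the same shape, but each step is a function from pyspark.sql.functions applied to a Column rather than a method called on a str.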

Related

replacing trailing space in Pandas column

If I have a string with space at the end, I can use rstrip() to remove the trailing space.
word='- Keane '
"".join(word.rstrip())
It returns '- Keane', which is what I want.
However, it doesn't do the same when I pass it through a pandas column using the apply method. Here is what I have (the first two rows of the column WL2019['Location']):
Location
'-Keane '
'- PBC-CALTEX '
I want:
'- Keane'
'- PBC-CALTEX'
The code I use:
WL2019['NewLoc']=WL2019['Location'].apply(lambda x: "".join(str(x).rstrip()))
But it doesn't do anything. It basically outputs the same as the Location column. Does anyone know why, and how I can get this fixed?
Thanks
EDIT: Okay, I failed to explain clearly what I have been doing. This is the problem:
I had a string column from which I had to extract the part of each entry between two dashes, like this:
'v102- Keane - ARC'
'v103- PBC-CALTEX -BARS'
I used the code below to extract the middle part. Once you do that, each entry in the output is a list, and we can't use strip() on lists. I had to go through the mumbo jumbo below to fix it. I found a solution, but it is not efficient yet. I might post a better solution later.
import re
def location(a):
    pat = r'[\s]+[\w\W]+[\s]+'
    pattern = re.compile(pat, re.IGNORECASE)
    return re.findall(pattern, a)
WL2019['NewLoc'] = WL2019['Account'].apply(location)
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip('[]'))
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip("''"))
WL2019['NewLoc'] = WL2019['NewLoc'].apply(lambda x: str(x).strip('""'))
WL2019['NewLoc'] = WL2019['NewLoc'].replace('- ', '', regex=True).replace(' -', '', regex=True)
Use pandas.Series.str.rstrip:
WL2019['NewLoc']=WL2019['Location'].str.rstrip()
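For the extraction described in the edit, one way to avoid the list clean-up is to have the helper return a stripped string instead of the list that re.findall produces. The sketch below assumes the intent is to take the text between the first and last dash; middle_part is a hypothetical name, not something from the original code.

```python
import re

# Greedy (.*) runs to the last dash, so inner dashes such as in 'PBC-CALTEX' survive.
MIDDLE = re.compile(r'-(.*)-')

def middle_part(account):
    """Return the text between the first and last dash, stripped, or None."""
    match = MIDDLE.search(account)
    return match.group(1).strip() if match else None

for account in ['v102- Keane - ARC', 'v103- PBC-CALTEX -BARS']:
    print(repr(middle_part(account)))
```

With pandas this could be applied as WL2019['Account'].apply(middle_part), after which the strip('[]') passes should no longer be needed.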

How to remove several symbols from a string in BigQuery

I have strings which contain numbers, like this:
a20cdac0_19221bdc12022bab3fe05a43df4a7dbe
I need to get only the characters after the underscore:
19221bdc12022bab3fe05a43df4a7dbe
Unfortunately, the number of those characters is always different, so I can't just use the RIGHT function.
I know that REGEXP might help, but I can't work out how to use it exactly. I would be very grateful for any help.
The following is for BigQuery Standard SQL (using regexp):
regexp_extract(value, r'_(.*)') regexp_approach
Applied to the sample value from your question:
regexp_extract('a20cdac0_19221bdc12022bab3fe05a43df4a7dbe', r'_(.*)') regexp_approach
the result is
19221bdc12022bab3fe05a43df4a7dbe
Another regexp option is to use regexp_replace, as in the example below:
regexp_replace(value, r'^.*?_', '')
Note: using split is also an option here, unless the value contains more than one _, in which case you will get only the part between the first and second _:
split(value, '_')[safe_offset(1)]
Also, as you can see, you need safe_offset rather than offset to prevent an error when _ is absent.
You can use the split function like this:
select split('a20cdac0_19221bdc12022bab3fe05a43df4a7dbe','_')[ORDINAL(2)];
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split
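The difference between these approaches shows up when a value has more than one underscore, or none at all. The sketch below mimics the three expressions with Python's re and str methods; it illustrates the logic only and is not BigQuery code, and the helper names are made up for the illustration.

```python
import re

def via_extract(value):
    """Like regexp_extract(value, r'_(.*)'): everything after the first underscore."""
    match = re.search(r'_(.*)', value)
    return match.group(1) if match else None

def via_replace(value):
    """Like regexp_replace(value, r'^.*?_', ''): drop up to the first underscore."""
    return re.sub(r'^.*?_', '', value)

def via_split(value):
    """Like split(value, '_')[safe_offset(1)]: second piece, or None if absent."""
    parts = value.split('_')
    return parts[1] if len(parts) > 1 else None

for value in ['a20cdac0_19221bdc', 'a_b_c', 'nounderscore']:
    print(value, via_extract(value), via_replace(value), via_split(value))
```

For 'a_b_c' the two regexp variants keep 'b_c' while the split variant keeps only 'b'; with no underscore, the replace variant returns the value unchanged and the other two return nothing.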

OpenRefine remove duplicates from list with jython

I have a column with values that are duplicated e.g.
VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113
I'm applying a transform using jython that removes the duplicates (On error is set to keep original), here's the code:
return list(set(value.split(",")))
Which works in the preview, but isn't getting applied to the column. What am I doing wrong?
The map function is powerful and underused in Python / Jython. It may be unclear what this code does internally, but it is fast even when a cell holds a very large list of values: each value is mapped to a string, and the results are then joined with a separator such as a comma ', '
deduped_list = list(set(value.split(",")))
return ', '.join(map(str, deduped_list))
There are probably other, even slightly faster variations than this, but this should get you going in the right direction.
Interestingly, you can also map each value to its 'printable representation' with repr(object), which is acceptable to an evaluator like OpenRefine's and can be useful for inspecting your values. I only found out about this while researching this answer in more depth.
deduped_list = list(set(value.split(",")))
return ', '.join(map(repr, deduped_list))
Preview implicitly formats things for display. Your expression returns an array (which can't be stored in a cell), so if you'd like to get it in string form, join it into a string, e.g. ','.join(...) in Jython.
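Since set() does not keep the original order of the values, an order-preserving variant may be preferable. A minimal sketch, written as a plain function so it can be tried outside OpenRefine (dedupe_cell is a hypothetical name; in the Jython expression box the same body would operate on value and end with return):

```python
def dedupe_cell(value, sep=","):
    """Remove repeated items from a delimited string, keeping first occurrences."""
    seen = set()
    kept = []
    for item in value.split(sep):
        if item not in seen:      # first time this item appears
            seen.add(item)
            kept.append(item)
    return sep.join(kept)         # a string, so it can be stored in a cell

print(dedupe_cell("VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113"))
```

The explicit loop avoids relying on dictionary ordering, which older Python versions such as the 2.x line that Jython follows do not guarantee.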

SQL to return results for the following regex

I have the following regular expression:
WHERE A.srvc_call_id = '40750564' AND REGEXP_LIKE (A.SRVC_CALL_DN, '[^TEST]')
The row that contains 40750564 has "TEST CALL" in the column SRVC_CALL_DN and REGEXP_LIKE doesn't seem to be filtering it out. Whenever I run the query it returns the row when it shouldn't.
Is my regex pattern wrong? Or does SQL not accept [^whatever]?
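Outside square brackets, the caret anchors the expression to the start of a string; as the first character inside square brackets, it negates the set. By enclosing the letters T, E, S & T in square brackets you are searching, as barsju suggests, for individual characters (here, any single character that is not T, E or S), not for the string TEST.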
You say that SRVC_CALL_DN contains the string 'TEST CALL', but you don't say where in the string. You also say that you're looking for where this string doesn't match. This implies that you want to use not regexp_like(...
Putting all this together I think you need:
AND NOT REGEXP_LIKE (A.SRVC_CALL_DN, '^TEST[[:space:]]CALL')
This excludes every row from your query where the string starts with 'TEST CALL'. However, if this string may be in any position in the column, you need to remove the caret (^).
This also assumes that the string is always in upper case. If it's in mixed case or lower, then you need to change it again. Something like the following:
AND NOT REGEXP_LIKE (upper(A.SRVC_CALL_DN), '^TEST[[:space:]]CALL')
By upper-casing SRVC_CALL_DN you ensure that the match is case-insensitive, but the query can then no longer use a plain index on this column. I wouldn't worry about this particular point, as regular-expression queries can be fairly poor at using indexes anyway, and it appears as though SRVC_CALL_ID is indexed.
Also if it may not include 'CALL' you will have to remove this. It is best when using regular expressions to make your match pattern as explicit as possible; so include 'CALL' if you can.
Try with '^TEST' or '^TEST.*'
Your regexp matches any string that contains at least one character other than T, E or S.
But your case is simple: the value starts with TEST. Why not use a simple LIKE?
LIKE 'TEST%'
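The behaviour of the two patterns can be checked quickly with Python's re module, which treats a bracketed set and a leading caret the same way for these simple patterns. This is an illustration of the regex logic only, not Oracle code.

```python
import re

value = "TEST CALL"

# [^TEST] is a negated set: it matches any single character other than T, E or S.
# The space (and C, A, L) qualify, so the row is NOT filtered out.
print(bool(re.search(r"[^TEST]", value)))        # True

# ^TEST anchors the literal string TEST to the start of the value.
print(bool(re.search(r"^TEST\sCALL", value)))    # True

# Only strings made up solely of T, E and S fail the negated set.
print(bool(re.search(r"[^TEST]", "TEST")))       # False
```

To exclude the row, the anchored pattern has to be negated, which is what NOT REGEXP_LIKE does in the query above.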

Getting around the lack of a Left Trim(string, char[]) function in JET / Access

I need to remove leading zeros from a string field in an Access database that is destroyed and recreated every time it is used within a C# program. Most string libraries (even SQL ones) include a Trim function to remove leading or trailing whitespace. Unfortunately, Access does not seem to have an LTrim(string s, char[] trimChars) or something similar. To get around this, I concocted this monstrosity:
Replace(LTrim(Replace(ADDRNO,'0', ' ')),' ', '0')
But this resulted in an undefined function reference for Replace, even though it is obviously an Access function.
What I am looking for is a way to trim these zeros, either by getting the JET engine to let me use the Replace function or by some other method entirely.
EDIT: Fixed syntax of Replace function. Problem still persists.
I suggest
Val(ADDRNO)
It will return the number portion without the leading zeros.
I think it's just the order of your parameters that is wrong:
debug.? Replace("My string", "i", "o") -> "My strong"
You can use Trim and Replace.
I'm not sure what context you are running this in, but the page below seems to show that the parameter order is different and that it uses double quotes instead of single quotes (I haven't used Access in a while, so maybe it doesn't matter). Also try square brackets around the column name:
http://www.techonthenet.com/access/functions/string/replace.php
Replace(LTrim(Replace([ADDRNO], "0", " "))," ", "0")
If that gives the same error just try the replace function by itself to narrow down the problem:
Replace ("alphabet", "a", "e")
If this works then you know the Replace function works, and there is some other issue.
Edit: If it doesn't work at all, then Replace is likely a VBA function available only in the Access application, and is not part of Jet. You could try some combination of the Left/Right functions to chop the string up, but this can get quite ugly. I personally would just iterate over the record set and use C# code to modify the values. Hopefully you don't have such a large number of records that this would be a problem.
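If the values end up being fixed in application code, it is worth noting that the Replace/LTrim/Replace trick also turns any spaces inside the value into zeros. The sketch below uses Python string methods as stand-ins for the Access functions, purely to illustrate the logic; the function names and sample values are made up.

```python
def trick(addrno):
    """Stand-in for Replace(LTrim(Replace(ADDRNO, '0', ' ')), ' ', '0')."""
    return addrno.replace("0", " ").lstrip(" ").replace(" ", "0")

def strip_leading_zeros(addrno):
    """Remove leading zeros only, leaving the rest of the value untouched."""
    return addrno.lstrip("0")

for sample in ["00120", "0012 B"]:
    print(repr(trick(sample)), repr(strip_leading_zeros(sample)))
```

Both give '120' for '00120', but for '0012 B' the trick yields '120B' while stripping only the leading zeros keeps '12 B'.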