Remove repeated occurrences of a character from a string, keeping one, in a PySpark dataframe

If a dataframe looks like this: df = df.withColumn('NUM_COL', lit('Hey$$$ Hey$ T$$')), I need to turn the string into
Hey$ Hey$ T$. I couldn't find a proper solution for this. For instance, in IBM DataStage there is a way to do this using Trim(mylink.mystring,".") to remove redundant characters. What would be the best solution for this in PySpark?

regexp_replace should do the job:
df = df.withColumn('replaced', F.regexp_replace('NUM_COL', r'(.)\1+', '$1'))
The idea here is to use the (first) capture group from the pattern as the replacement.
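A minimal, runnable sketch of that answer (assuming a local SparkSession; the column name NUM_COL comes from the question):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.createDataFrame([('Hey$$$ Hey$ T$$',)], ['NUM_COL'])

# (.)\1+ matches any character followed by one or more repeats of itself;
# the replacement $1 writes a single copy of that character back.
df = df.withColumn('replaced', F.regexp_replace('NUM_COL', r'(.)\1+', '$1'))
df.show(truncate=False)  # replaced: 'Hey$ Hey$ T$'
Note that (.)\1+ collapses runs of any repeated character; if only the $ runs should be deduplicated, narrow the pattern to r'(\$)\1+'.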

Related

How do I print the first occurrence of a string after a special character in Hive using regexp_extract or split?

I have a dilemma in Hive. My data set looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the regexp_extract function, but alas I am getting only NULL values. Since Hive doesn't support regexp_substr, the following syntax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are welcome!
You can use a combination of regexp_replace and substr.
First remove all multiple occurrences of # from the string using regexp_replace():
regexp_replace(col,'#+','#') -- for the data '#####123##' this produces '#123#'
Then strip the leading # using substr, and use instr to fetch everything from there up to the next #:
substr(substr(str,2),1, instr(substr(str,2),'#')-1) -- this produces '123'
You can see the whole SQL below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed the string always starts with #. If it doesn't, add a check like if left(str,1)='#' ... and handle it according to the data.

replace() method not working on single quotes

I am trying to do:
df_flat = df_flat.replace("'", '"', regex=True)
to change single quotes to double quotes in the whole pandas df.
The dataframe is full of single quotes because I applied df.json_normalize, which changed all the double quotes from the source into single quotes, so I want to recover the double quotes.
I tried these different options:
df_flat = df_flat.apply(lambda s: s.replace("'", '"', regex=True))
df_flat = df_flat.replace({'\'': '"'}, regex=True)
and none of them works. Any idea what's happening?
I have pandas==1.3.2.
And the content of the columns are like:
{'A':'1', 'NB':'29382', 'SS': '686'}
Edit:
I need this because I then save the pandas df to a parquet file and copy it to AWS Redshift. When I try json_extract_path it doesn't work, as the content isn't valid JSON due to the single quotes. I could do a replace in Redshift for each field, but I'd rather store the data in the correct format.
You may need to treat it as a string first:
df_flat = df_flat.astype(str).replace("'",'"', regex=True)
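A minimal sketch of why the cast matters (assuming the column still holds Python dicts left over from json_normalize; the column name payload is made up):
import json
import pandas as pd

df_flat = pd.DataFrame({'payload': [{'A': '1', 'NB': '29382', 'SS': '686'}]})

# replace() leaves non-string cells untouched, so cast everything to str first
fixed = df_flat.astype(str).replace("'", '"', regex=True)
print(fixed.loc[0, 'payload'])  # {"A": "1", "NB": "29382", "SS": "686"}

# sturdier alternative: serialize the dicts properly instead of swapping quotes
df_flat['payload'] = df_flat['payload'].map(json.dumps)
Bear in mind that quote-swapping only yields valid JSON when the values themselves contain no single quotes; json.dumps avoids that edge case.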

Finding strings between dashes using REGEXP_EXTRACT in Bigquery

In BigQuery, I am trying to find a way to extract particular segments of a string based on how many dashes come before them. The total number of dashes in the string will always be the same. For example, I could be looking for the string after the second dash and before the third dash in the following string:
abc-defgh-hij-kl-mnop
Currently, I am using the following regex for the extraction, which counts the dashes from the back:
([^-]+)(?:-[^-]+){2}$
The problem is that if there is nothing in between the dashes, the regex doesn't work. For example, something like this returns null:
abc-defgh-hij--mnop
Is there a way to use regex to extract a string after a certain number of dashes and cut it off before the subsequent dash?
Thank you!
Below is for BigQuery Standard SQL.
The simplest way in your case is to use SPLIT and OFFSET, as in the example below:
SELECT SPLIT(str, '-')[OFFSET(3)]
The above returns an empty string for abc-defgh-hij--mnop instead of failing (adjust the index to the segment you need; OFFSET is zero-based, so the piece after the second dash is OFFSET(2)).
To prevent an error when referencing a non-existing element, it is better to use SAFE_OFFSET:
SELECT SPLIT(str, '-')[SAFE_OFFSET(3)]

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way or removing or replacing illegal characters from a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
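For instance (a minimal sketch; the \x07 control byte stands in for whatever character openpyxl is rejecting):
import pandas as pd

dataframe = pd.DataFrame({'col': ['bell\x07char']})
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
print(dataframe.loc[0, 'col'])  # bell\x07char -- now a literal backslash escape, plain ASCII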
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Then replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
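A slightly fuller sketch of the engine swap (file, sheet, and frame names are made up; per the answers above, xlsxwriter does not raise IllegalCharacterError on such strings):
import pandas as pd

df = pd.DataFrame({'col': ['some\x01text']})  # control character that trips openpyxl

with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)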
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes IllegalCharacterError to be raised.
Open cell.py, which is found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy it into your program and execute the code below before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
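Putting those pieces together, a minimal sketch (column and file names are made up):
import re
import pandas as pd

# the pattern copied from openpyxl's cell.py, as described above
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

df = pd.DataFrame({'col': ['ok', 'bad\x01value\x07']})
df = df.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub('', x) if isinstance(x, str) else x)
df.to_excel('clean.xlsx')  # no IllegalCharacterError now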
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets; if the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation of the source spreadsheets, it's best to remove the characters there to begin with.
I was also struggling with some weird characters in a data frame when writing it to html or csv. For example, with accented characters I couldn't write to an html file, so I needed to convert them into their unaccented equivalents.
My method may not be the best, but it helps me convert unicode strings into ASCII-compatible ones.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # note: the original was Python 2 and checked isinstance(s, unicode)
    if isinstance(s, str):
        try:
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            return unidecode(s)
    return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return whatever replacement string you want.
Hope this gives you some ideas for dealing with your problem.
You can use the built-in strip() method for Python strings.
For each cell:
text = str(illegal_text).strip()
For the entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd

df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)  # drive Excel itself instead of an ExcelWriter engine
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df  # dump the frame starting at A1
wb.save()
wb.close()

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with the value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try REGEX_EXTRACT_ALL (http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all):
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig takes the input as a plain string; it is not smart enough to identify which delimiters are data and which are not.
PigStorage works as a simple string tokenizer, so something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
doesn't solve your problem. But if we write our own load function instead of PigStorage(), we could possibly arrive at a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column on the delimiter. The limit caps the number of output fields, so the trailing field keeps any further delimiters intact; for example, STRSPLIT('foo:bar:baz', ':', 2) yields (foo,bar:baz).