Replacing multiple values in pandas column at once - pandas

This is an incredibly basic question but after reading through Stack and the documentation for str.replace, I'm not finding the answer. Trying to drop all of the punctuation in a column. What would be the proper syntax for this?
This works but it's absurd:
igog['text']=igog['text'].str.replace(",","").str.replace("/","").str.replace(".","").str.replace("\"","").str.replace("?","").str.replace(";","")
This doesn't:
igog['text'] = igog['text'].str.replace({","," ","/","",".","","\"","","?","",";",""}).
Because I keep getting "replace() missing 1 required positional argument: 'repl'".
Thanks in advance!

You can make a simple loop like this:
t=",/.\?;"
for i in t:
igog[text]=igog[text].replace(i,"")
or you can use regex:
igog['text'] = igog['text'].str.replace(r'[,/."?;]', '', regex=True)
or you can use re.sub(), applied element-wise, since re.sub expects a single string rather than a Series:
import re
igog['text'] = igog['text'].apply(lambda s: re.sub(r'[,/."?;]', '', s))
or you can define a translation table (mapping a character to None deletes it):
igog['text'] = igog['text'].str.translate({ord(ch): None for ch in ',/."?;'})
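For illustration, a minimal sketch of the regex approach on made-up data (the sample strings here are not from the question):
import pandas as pd

# Made-up sample data with the punctuation characters from the question.
igog = pd.DataFrame({'text': ['Hello, world!?', 'a/b; "c".']})
igog['text'] = igog['text'].str.replace(r'[,/."?;]', '', regex=True)
print(igog['text'].tolist())  # ['Hello world!', 'ab c']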

Related

Write pyspark dataframe with header as parquet

So if I do df = sql_context.read.csv("test_data_2019-01-01.csv", header=False) and then df.write.parquet("test_data_2019-01-01.parquet") everything works, but if I set header=True in read.csv and then try to write I get the following error:
An error occurred while calling o522.parquet.
: org.apache.spark.sql.AnalysisException: Attribute name " M6_Debt_Review_Ind" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
I need those headers, otherwise the column names appear as follows:
[Row(_c0='foo', _c1='bar', _c2='bla', _c3='bla2', _c4='blabla', _c5='bla3', _c6=' bla4'),
Row(_c0='1161057', _c1='57793622', _c2='6066807', _c3='2017-01-31', _c4='2017-01-31', _c5='1', _c6='0'),
Row(_c0='1177047', _c1='58973984', _c2='4938603', _c3='2017-02-28', _c4='2017-02-28', _c5='0', _c6='0')]
instead of
[Row(foo='1161057', bar='57793622', bla='6066807', bla2='2017-01-31', blabla='2017-01-31', bla3='1', M6_Debt_Review_Ind='0'),
Row(foo='1177047', bar='58973984', bla='4938603', bla2='2017-02-28', blabla='2017-02-28', bla3='0', bla4='0')]
Thanks in advance for any suggestions.
Never mind, stupid mistake: there is a space in the column name (" M6_Debt_Review_Ind", as the error says).
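For anyone else hitting this, a minimal sketch of one way to fix it: strip the whitespace from the column names before writing. This uses a SparkSession rather than the question's sql_context, with the file names from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("test_data_2019-01-01.csv", header=True)

# Strip leading/trailing whitespace from every column name,
# e.g. " M6_Debt_Review_Ind" -> "M6_Debt_Review_Ind".
df = df.toDF(*[c.strip() for c in df.columns])
df.write.parquet("test_data_2019-01-01.parquet")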

how to find column whose name contains a specific string

I have a pandas dataframe with the following columns:
code nozzle_no nozzle_var nozzle_1 nozzle_2 nozzle_3 nozzle_4
I want to get the column names nozzle_1, nozzle_2, nozzle_3, nozzle_4 from the above dataframe.
I am doing the following in pandas:
colnames = sir_df_subset.columns[sir_df_subset.columns.str.contains(pat='nozzle_')]
But it also includes nozzle_no and nozzle_var, which I do not want. How can I do this in pandas?
You can use the regex param of df.filter here:
df.filter(regex=r'nozzle_\d+')
.str.contains has a regex flag that is True by default, so you can pass a regex:
colnames = sir_df_subset.columns[sir_df_subset.columns.str.contains(pat=r'nozzle_\d+$')]
but the answer of @anky_91 with df.filter is MUCH better.
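For illustration, here is df.filter on a toy frame with the same column names (the data itself is made up):
import pandas as pd

sir_df_subset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7]],
    columns=['code', 'nozzle_no', 'nozzle_var',
             'nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4'])

print(list(sir_df_subset.filter(regex=r'nozzle_\d+').columns))
# ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']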

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, spark-hive 2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a dataframe created through hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to:
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my Java class where this filter occurs, I tried to replace the String variable, e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply", with
commentValueToFilterOut = commentValueToFilterOut.replaceAll("'", "\\\\'");
But when I apply the filter to the dataframe I'm getting the below error...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters that may cause issues, like the single quote) to the new column, with the special characters stripped:
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the same special characters from the filter value and apply the filter:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering:
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can also replace the special character with some "never occurs" placeholder string in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Once filtering is done, restore the actual values by reverse-applying the replacement. Note that the second argument of regexp_replace is a regex, so the carets have to be escaped here:
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though this doesn't answer the actual issue, someone having the same problem can try this out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame) and/or upgrade to spark-hive 2.12; I'll leave that for the experts to debate and answer.
PS: Thanks to KP, my lead
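As an aside, the escaping problem goes away entirely if the filter is built with the Column API instead of a SQL expression string, because the literal is passed as a value and never goes through the SQL parser. A rough sketch of that idea in pyspark rather than Java (column names taken from the question, data made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("someOtherValue", "That's Dany's Reply"), ("other", "plain comment")],
    ["someOtherColumn", "comment"])

# No quoting or escaping needed: the literal is passed as a plain value.
df = df.filter(~((col("someOtherColumn") == "someOtherValue")
                 & (col("comment") == "That's Dany's Reply")))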

SQL: Use a predefined list in the where clause

Here is an example of what I am trying to do:
def famlist = selection.getUnique('Family_code')
... Where """...
and testedWaferPass.family_code in $famlist
"""...
famlist is a list of objects
'selection' will change every run, so the list is always changing.
I want to return only the rows from my SQL search whose family_code is found in the list that I have created.
I realize it is supposed to look like: in ('foo','bar')
But no matter what I do, my list will not come out like that. So do I have to turn my list into a string?
('\${famlist.join("', '")}')
I've tried the above, but it wasn't working for me. Just thought I would throw that in there. Would love some suggestions. Thanks.
I am willing to bet there is a Groovier way to implement this than shown below, but this works. Here's the important part of my sample script. nameList originally contains the string names. Each entry in the list needs to be quoted, and then the [ and ] stripped from the toString result. I tried passing it as a prepared statement, but for that you need to dynamically create the string of ? placeholders for each element in the list; this quick hack doesn't use a prepared statement.
def nameList = ['Reports', 'Customer', 'Associates']
def nameListString = nameList.collect{"'${it}'"}.toString().substring(1)
nameListString = nameListString.substring(0, nameListString.length()-1)
String stmt = "select * from action_group_i18n where name in ($nameListString)"
db.eachRow(stmt) { row ->
    println "$row.action_group_id, $row.language, $row.name"
}
Hope this helps!
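For reference, the prepared-statement variant mentioned above (dynamically generating one ? per list element) looks roughly like this. It is sketched in Python with sqlite3 to show the idea rather than the exact Groovy API, and the table is a stand-in for action_group_i18n:
import sqlite3

name_list = ['Reports', 'Customer', 'Associates']
placeholders = ', '.join('?' for _ in name_list)  # "?, ?, ?"

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE action_group_i18n (name TEXT)")  # stand-in table
conn.executemany("INSERT INTO action_group_i18n VALUES (?)",
                 [(n,) for n in name_list])

# The driver binds each value safely; no manual quoting of the list.
stmt = f"SELECT name FROM action_group_i18n WHERE name IN ({placeholders})"
for row in conn.execute(stmt, name_list):
    print(row[0])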

Why am I getting the Missing Operator error?

I keep getting the same error: Missing Operator. I have looked over this code three times and cannot find it. Can someone with a keen eye please help?
WHERE ([Letter Status].[Letter_Status] = “Agreed” AND [Research].[Site] = 9)
OR ([Telephone Status].[Details]= “Agreed” AND [Research].[Site] = 9)
I'm not sure about those “” characters.
Maybe they are just from pasting into this post, but change them to straight "" quotes:
WHERE ([Letter Status].[Letter_Status] = "Agreed" AND [Research].[Site] = 9)
OR ([Telephone Status].[Details]= "Agreed" AND [Research].[Site] = 9)
Usually it is a misspelled table or field name, two spaces instead of just one, a space instead of "_", a missing letter, or something like that.
Create a new query in MS Access, paste the whole query in, and run it. The Access GUI will most probably tell you in more detail what exactly is missing here.