Write PySpark dataframe with header as Parquet - apache-spark-sql

So if I do df = sql_context.read.csv("test_data_2019-01-01.csv", header=False) and then df.write.parquet("test_data_2019-01-01.parquet") everything works, but if I set header=True in read.csv and then try to write, I get the following error:
An error occurred while calling o522.parquet.
: org.apache.spark.sql.AnalysisException: Attribute name " M6_Debt_Review_Ind" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
I need those headers, otherwise the column names appear as follows:
[Row(_c0='foo', _c1='bar', _c2='bla', _c3='bla2', _c4='blabla', _c5='bla3', _c6=' bla4'),
Row(_c0='1161057', _c1='57793622', _c2='6066807', _c3='2017-01-31', _c4='2017-01-31', _c5='1', _c6='0'),
Row(_c0='1177047', _c1='58973984', _c2='4938603', _c3='2017-02-28', _c4='2017-02-28', _c5='0', _c6='0')]
instead of
[Row(foo='1161057', bar='57793622', bla='6066807', bla2='2017-01-31', blabla='2017-01-31', bla3='1', M6_Debt_Review_Ind='0'),
Row(foo='1177047', bar='58973984', bla='4938603', bla2='2017-02-28', blabla='2017-02-28', bla3='0', bla4='0')]
Thanks in advance for any suggestions.

Never mind, it was my own mistake: there is a space in the column name.
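For anyone who lands here: once the bad header is identified, the columns can be cleaned up before writing. A minimal sketch, assuming sql_context is the same SQLContext as in the question and that underscores are an acceptable substitute for the rejected characters:
import re

df = sql_context.read.csv("test_data_2019-01-01.csv", header=True)
# Parquet rejects column names containing any of " ,;{}()\n\t=", so strip
# surrounding whitespace and replace the remaining offenders with "_".
clean_df = df
for old_name in df.columns:
    new_name = re.sub(r'[ ,;{}()\n\t=]', '_', old_name.strip())
    clean_df = clean_df.withColumnRenamed(old_name, new_name)
clean_df.write.parquet("test_data_2019-01-01.parquet")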

Related

Failed exporting df.to_csv using a variable name in the path

I am using a function MyFunction(DataName) that creates a pd.DataFrame(). After certain modifications to the data, I am able to export the dataframe to csv with this code:
df.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\out.csv', index=True, header=True)
This creates an 'out.csv' file which is overwritten every time the code is run. However, when I try to give it a specific name (for instance the name of the data used to fill in the dataframe), for multiple exports like this:
df.to_csv(fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv', index=True, header=True)
I have this error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in <module>
----> 1 MyFunction(DataName)
I am new to the programming world, so any ideas on how I can overcome this problem are very welcome. Thank you very much!
If I understand you right (and given that the fr in your code should simply be r), you want your to_csv statement to be dynamic, so that what goes inside the braces can change. So, assume your dataframe is df. Then do this:
DataName = "df"
NewFinger.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index=True, header=True)
Thanks for your help. In the beginning I was confused by 'NewFinger'; I thought it was some sort of module I needed to install and could not find any information on Google. However, I solved the issue based on your suggestion, with the following code:
DataName = "whichever name"
df.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index=True, header=True)
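For what it's worth, a raw f-string (which is what the fr prefix in the original attempt does) interpolates the same way as .format; this is only an equivalent sketch, not part of the fix above. If it still raises FileNotFoundError, it is worth checking that the value of DataName contains only characters that are valid in a Windows file name (no \ / : * ? " < > |).
DataName = "whichever name"
# raw f-string: the backslashes in the UNC path stay literal, {DataName} is interpolated
df.to_csv(fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv', index=True, header=True)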

Replacing multiple values in pandas column at once

This is an incredibly basic question but after reading through Stack and the documentation for str.replace, I'm not finding the answer. Trying to drop all of the punctuation in a column. What would be the proper syntax for this?
This works but it's absurd:
igog['text']=igog['text'].str.replace(",","").str.replace("/","").str.replace(".","").str.replace("\"","").str.replace("?","").str.replace(";","")
This doesn't:
igog['text'] = igog['text'].str.replace({","," ","/","",".","","\"","","?","",";",""}).
Because I keep getting "replace() missing 1 required positional argument: 'repl'".
Thanks in advance!
You can make a simple loop like this:
t = ',/."?;'
for i in t:
    igog['text'] = igog['text'].str.replace(i, "", regex=False)
or you can use regex:
igog['text'] = igog['text'].str.replace(r'[,/.?;"]', '', regex=True)
or you can use re.sub():
import re
igog['text'] = igog['text'].apply(lambda s: re.sub(r'[,/.?;"]', '', s))
or you can define a translation table :
igog['text'] = igog['text'].str.translate({ord(ch): None for ch in ',/.?;"'})
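If the goal really is all punctuation rather than just those six characters, one more option is to build the character class from string.punctuation; a minimal sketch (the column name igog['text'] is taken from the question):
import re
import string

# escape every punctuation character and drop all of them in one regex pass
igog['text'] = igog['text'].str.replace(f'[{re.escape(string.punctuation)}]', '', regex=True)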

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, Spark-hive 2.10(1.6.1), scala-reflect 2.11.2. I'm trying to filter a dataframe created through hive context...
df = hiveCtx.createDataFrame(someRDDRow,
someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my java class where this filter occurs, I tried to replace the String variable for e.g commentValueToFilterOut, which contains the value "That's Dany's Reply" with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when I apply the filter to the dataframe, I get the below error...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters that may cause issues, like a single quote) into the new column, stripped of those special characters.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Trim out the special characters from the condition and apply the filter.
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, restoring the original dataframe.
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can also replace the special character with some "never-happens" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once the filtering is done, restore the actual value by reverse-applying the string literal (the caret has to be escaped, since regexp_replace treats the pattern as a regex):
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though it doesn't answer the actual issue, anyone hitting the same problem can try this out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext), and/or Dataset (instead of DataFrame), and/or upgrade to spark-hive 2.12; I'll leave that for the experts to debate and answer.
PS: Thanks to KP, my lead
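For what it's worth, the escaping problem can be avoided entirely by building the predicate with the Column API instead of a SQL string, so the quote is never run through the SQL parser. A minimal PySpark sketch of the same idea (column names taken from the question; the Java Column/functions API is analogous):
from pyspark.sql import functions as F

comment_to_filter = "That's Dany's Reply"
# the single quotes are plain data here, nothing is parsed as SQL
df = df.filter(~((F.col("someOtherColumn") == "someOtherValue") &
                 (F.col("comment") == comment_to_filter)))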

Attribute Error when getting unique column values

I am new to Python and Pandas. I am trying to write a function to get a unique list of my column values. My function looks like the one below, where "placename" is the attribute whose unique values I want. 'placename' should be passed as a string argument corresponding to a column header of the csv file.
def get_city_list(state, type, placename):
    city_dir = os.path.join(baseDir, type + ".csv")
    city_df = pd.read_csv(city_dir, quotechar='"', skipinitialspace=True, sep=",")
    state_df = city_df.loc[city_df["state"] == state]
    city_list = state_df.placename.unique()
    return city_list
However, when I call this function, it throws an AttributeError saying 'DataFrame' object has no attribute 'placename'. It seems that "placename" should not be a string, and when I substitute it as
city_list = state_df.cityname.unique(), it works, where cityname (without quotes) is the actual header of the column in the original csv file. Since I want to make my function versatile, I want to find a way to deal with this case so that I don't have to manually change the content of "placename" every time.
Your help is greatly appreciated!
Thanks
The dot operator . is reserved for accessing attributes of an object. pandas is nice enough to make column names accessible via attributes, but you can't do something like df."myplace".
Change your code to:
state_df[placename].unique()
This way, you are passing placename on to the __getitem__ method.
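Putting it together, a sketch of the function with bracket indexing (baseDir and the csv layout are assumed to exist as in the question):
import os
import pandas as pd

def get_city_list(state, type, placename):
    # build the path to "<type>.csv" under baseDir, as in the question
    city_dir = os.path.join(baseDir, type + ".csv")
    city_df = pd.read_csv(city_dir, quotechar='"', skipinitialspace=True, sep=",")
    state_df = city_df.loc[city_df["state"] == state]
    # bracket indexing accepts the column name as a string argument
    return state_df[placename].unique()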

populate the content of two files into a final file using pentaho(different fields)

I have two files.
File A has these columns:
Sno,name,age,key,checkvalue
File B has these columns:
Sno,title,age
I want to merge these two into a final file C which has:
Sno,name,age,key,checkvalue
I tried renaming "title" to "name" and then I used "Add constants" to add the other two fields,
but when I try to merge these, I get the below error:
"
The name of field number 3 is not the same as in the first row received: you're mixing rows with different layout. Field [age String] does not have the same name as field [age String].
"
How can I solve this issue?
After getting the input from file B, use a Select values step and remove the title column. Then use an Add constants step to add the new columns name, key and checkvalue, and set "Set empty string?" to Y. Finally, do the join accordingly. It won't fail then, since both files have the same fields. Hope this helps.
Actually, there was an issue with the fields, a field mismatch. I used the "Rename" option and it got fixed.