How to type cast an Array as empty String array - apache-spark-sql

I can type cast NULL as a string.
How can I type cast an empty array as an empty array of strings?
I need to solve it inside the SQL query.
The following snippet throws a ValueError: Some of types cannot be determined after inferring
df = spark.sql("select Array()").collect()
display(df)

I only found a somewhat roundabout way of doing this purely in SQL:
select from_json("[]", "array<string>")
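For reference, here is that workaround run from PySpark (a minimal sketch, assuming an existing SparkSession named spark):
df = spark.sql("select from_json('[]', 'array<string>') as empty_strings")
df.printSchema()  # empty_strings: array (element: string)
df.collect()      # [Row(empty_strings=[])]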

I think just keeping the quotes empty should give an empty string array:
df = spark.sql("select array('')").collect()
display(df)
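For comparison (again assuming a SparkSession named spark), array('') gives a one-element array containing an empty string, while from_json('[]', 'array<string>') gives an array with no elements:
spark.sql("select array('') as arr").collect()                          # [Row(arr=[''])]
spark.sql("select from_json('[]', 'array<string>') as arr").collect()   # [Row(arr=[])]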

Related

Dealing with greater than and less than values in numeric data when reading csv in pandas

My csv file contains numeric data where some values have greater than or less than symbols e.g. ">244". I want my data type to be a float. When reading the file into pandas:
df = pd.read_csv('file.csv')
I get a warning:
Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.
I have checked this question: Pandas read_csv: low_memory and dtype options and tried specifying the data type of the relevant column with:
df = pd.read_csv('file.csv',dtype={'column':'float'})
However, this gives an error:
ValueError: could not convert string to float: '>244'
I have also tried
df = pd.read_csv('file.csv',dtype={'column':'float'}, error_bad_lines=False)
However this does not solve my problem, and I get the same error above.
My problem appears to be that my data has a mixture of string and floats. Can I ignore any rows containing strings in particular columns when reading in the data?
You can use:
df = pd.read_csv('file.csv', dtype={'column':'str'})
Then:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
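A self-contained sketch of that approach, using a small made-up CSV in place of file.csv:
import io
import pandas as pd

# Hypothetical sample data with a '>' value in the otherwise numeric column
csv_data = io.StringIO("id,column\n1,12.5\n2,>244\n3,7.0\n")

df = pd.read_csv(csv_data, dtype={'column': 'str'})
df['column'] = pd.to_numeric(df['column'], errors='coerce')  # '>244' becomes NaN
print(df['column'].dtype)  # float64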
I found a workaround, which was to read in my data:
df = pd.read_csv('file.csv')
Then remove any rows whose values start with '<' or '>':
df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']
Then convert to numeric with pd.to_numeric
df['column'] = pd.to_numeric(df['column'])
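Put together, the workaround looks roughly like this (illustrated with a small hypothetical DataFrame instead of the real file):
import pandas as pd

# Hypothetical stand-in for the data read from file.csv
df = pd.DataFrame({'column': ['12.5', '>244', '<3', '7.0']})

# Drop rows whose value starts with '<' or '>'
df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']

df['column'] = pd.to_numeric(df['column'])  # remaining values convert cleanly to float64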

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

I have a feature which lets me query a Databricks Delta table from a client app. This is the code I use for that purpose:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)
However, the second line throws me the error
Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp
I know what this error says: my date-typed field is out of bounds. I tried searching for a solution, but none of the ones I found fit my scenario.
The solutions I found deal with one specific dataframe column, but in my case the problem is global: I have tons of Delta tables and I don't know which columns are date-typed, so I can't do targeted type manipulation to avoid this.
Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how can I achieve what I'm trying to do?
Is it possible to find all Timestamp type columns and cast them to string?
Yes, that's the way to go. You can loop through df.dtypes and cast any column whose type is "timestamp" into a string before calling df.toPandas():
import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()
You can define this as a function that takes df as a parameter and use it with all your tables:
from pyspark.sql import DataFrame

def stringify_timestamps(df: DataFrame) -> DataFrame:
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c).alias(c)
        for c, t in df.dtypes
    ])
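For example, reusing the query from the question:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = stringify_timestamps(df).toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)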
If you want to preserve the timestamp type, you can instead consider nullifying the timestamp values that are greater than pd.Timestamp.max, as shown in this post, rather than converting them into strings.
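A minimal sketch of that alternative, keeping the same loop over df.dtypes and assuming pandas is imported as pd (the exact bound check is an assumption here):
import pandas as pd
import pyspark.sql.functions as F

# Largest value representable as a pandas timestamp[ns]
max_ts = pd.Timestamp.max.to_pydatetime()

df = df.select(*[
    F.when(F.col(c) > F.lit(max_ts), F.lit(None).cast("timestamp")).otherwise(F.col(c)).alias(c)
    if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()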

How to filter a column in a Spark dataframe using an Array of strings?

I have to filter a column in a Spark dataframe using an Array[String].
I have a parameter file like the one below:
variable1=100,200
I read the parameter file, split each row by "=", and load it into a Map[String,String].
To get the value, I pass the key "variable1" and split the value by ",":
val value1:Array[String] = parameterValues("variable1").split(",")
Now I need to use this value1 while filtering a dataframe:
val finalDf = testDf.filter("column1 in ($value1) and column2 in ($value1)")
I'm getting the below error,
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting <EOF>(line 1, pos 12)
== SQL ==
column1 in ([Ljava.lang.String;@760b9444) and column2 in ([Ljava.lang.String;@5397d41e)
------------^^^
Any suggestions?
You can filter a column using an array as you've done. To correct your SQL expression, you need to do two things.
First, you forgot to put the 's' string interpolator at the start of your string representing your SQL expression, as below:
s"column1 in ($value1) and column2 in ($value1)"
Then, you need to convert your Array[String] into a well-formatted String that will be understood as an SQL array. To do so, you can use the mkString method on your value1 array:
value1.mkString("'", "','","'")
On your array Array("100", "200"), this method returns the string "'100','200'".
Putting everything together, we get the following expression:
val finalDf = testDf.filter(s"column1 in (${value1.mkString("'", "','","'")}) and column2 in (${value1.mkString("'", "','","'")})")
To filter a column by an array, you can use the isin column method:
import org.apache.spark.sql.functions.col
val finalDf = testDf.filter(col("column1").isin(value1: _*) && col("column2").isin(value1: _*))

Pandas boolean indexing w/ Column boolean array

Dataset: the dataframe I am working on is named 'f500' (its first five rows were shown in a screenshot).
Goal: select only the columns that hold numeric values.
What I've tried:
1) I tried to use a boolean array to filter out the non-numeric columns, and there was no error:
numeric_only_bool = (f500.dtypes != object)
2) However, when I tried to do indexing with that boolean array, an error occurred:
numeric_only = f500[:, numeric_only_bool]
I saw index-wise (row-wise) boolean indexing examples, but could not find an example of column-wise boolean indexing.
Can anyone help how to fix this code?
Thank you in advance.
Use DataFrame.loc:
numeric_only = f500.loc[:, numeric_only_bool]
Another solution with DataFrame.select_dtypes:
import numpy as np

# only numeric
numeric_only = f500.select_dtypes(np.number)
# exclude object columns
numeric_only = f500.select_dtypes(exclude=object)
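A small self-contained check of both options, using made-up data in place of f500:
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500
f500 = pd.DataFrame({'company': ['A', 'B'], 'revenue': [10.5, 20.1], 'rank': [1, 2]})

numeric_only_bool = (f500.dtypes != object)
print(f500.loc[:, numeric_only_bool].columns.tolist())   # ['revenue', 'rank']
print(f500.select_dtypes(np.number).columns.tolist())    # ['revenue', 'rank']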

Convert df column to a tuple

I am having trouble converting a df column into a tuple that I can iterate through. I started with simple code that works, like this:
set = 'pare-10040137', 'pare-10034330', 'pare-00022936', 'pare-10025987', 'pare-10036617'
for i in set:
    ref_data = req_data[req_data['REQ_NUM'] == i]
This works fine, but now I want my set to come from a df. The df looks like this:
open_reqs
Out[233]:
REQ_NUM
4825 pare-00023728
4826 pare-00023773
.... ..............
I want all of those REQ_NUM values thrown into a tuple, so I tried open_reqs.apply(tuple, axis=1) and tuple(zip(open_reqs.columns, open_reqs.T.values.tolist())), but I'm not able to iterate through either of these.
My old set looks like this, so this is the format I need to match in order to iterate through it like before. I'm not sure if the Unicode is also an issue (when I print the above I get (u'pare-10052173',)).
In[236]: set
Out[236]:
('pare-10040137',
'pare-10034330',
'pare-00022936',
'pare-10025987',
'pare-10036617')
So basically I need the magic code to get a nice simple set like that from the REQ_NUM column of my open_reqs table. Thank you!
The following statement makes a list out of the specified column and then converts it to a tuple:
open_req_list = tuple(list(open_reqs['REQ_NUM']))
You can use the tolist() function to convert the column to a list, and then tuple() the whole list:
req_num = tuple(open_reqs['REQ_NUM'].tolist())
#type(req_num)
req_num
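Either way, the resulting tuple can be iterated over exactly like the hard-coded one from the question:
req_num = tuple(open_reqs['REQ_NUM'].tolist())
for i in req_num:
    ref_data = req_data[req_data['REQ_NUM'] == i]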
df.columns has the datatype object. To get a tuple of all the column names, use this code:
df = pd.DataFrame(data)
columns_tuple = tuple(df.columns)