How to cast from double to int in from_json Spark SQL (NULL output)

I have a table with a JSON string column. When I run this Spark SQL query:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:int>>')
I get a NULL, since the data types for some_number do not match (int vs. double)...
If I run this it works:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>')
Is there a way to CAST this on-the-fly?

You can run from_json first using array<struct<column_1:string,some_number:double>>, then cast the result to array<struct<column_1:string,some_number:int>>.
Example:
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").show()
//+-------------------------------------------------------+
//|jsontostructs([{"column_1":"hola", "some_number":1.0}])|
//+-------------------------------------------------------+
//|                                            [[hola, 1]]|
//+-------------------------------------------------------+
//printSchema
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").printSchema()
//root
// |-- jsontostructs([{"column_1":"hola", "some_number":1.0}]): array (nullable = true)
// |    |-- element: struct (containsNull = true)
// |    |    |-- column_1: string (nullable = true)
// |    |    |-- some_number: integer (nullable = true)
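For reference, the same parse-then-cast trick works in the Scala DataFrame API. A minimal sketch, assuming an active SparkSession named spark (spark.implicits._ is needed for toDF and $):

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val df = Seq("""[{"column_1":"hola", "some_number":1.0}]""").toDF("json")

df.select(
  // parse with the type the data actually contains...
  from_json($"json", "array<struct<column_1:string,some_number:double>>", Map.empty[String, String])
    // ...then cast to the type you want
    .cast("array<struct<column_1:string,some_number:int>>")
    .as("parsed")
).printSchema()
// root
//  |-- parsed: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- column_1: string (nullable = true)
//  |    |    |-- some_number: integer (nullable = true)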

Related

How to provide value from the same row to scala spark substring function?

I've got the following DataFrame with a fnamelname column that I want to transform:
+---+---------------+---+----------+--------+
| id|     fnamelname|age|       job|ageLimit|
+---+---------------+---+----------+--------+
|  1|    xxxxx xxxxx| 28|   teacher|      18|
|  2|  xxxx xxxxxxxx| 30|programmer|       0|
|  3|    xxxxx xxxxx| 28|   teacher|      18|
|  8|xxxxxxx xxxxxxx| 12|programmer|       0|
|  9| xxxxx xxxxxxxx| 45|programmer|       0|
+---+---------------+---+----------+--------+
only showing top 5 rows
root
 |-- id: string (nullable = true)
 |-- fnamelname: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- job: string (nullable = true)
 |-- ageLimit: integer (nullable = false)
I want to use ageLimit as the len value within the substring function, but somehow the .cast("Int") call doesn't give me the value of that row as an Int.
val readyDF: Dataset[Row] = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  substring(col("fnamelname"), 0, col("ageLimit").cast("Int")))
All I'm getting is:
found   : org.apache.spark.sql.Column
required: Int
    col("fnamelname"), 0, col("ageLimit").cast("Int")))
How to provide a value of another column as a variable to function within .withColumn()?
The substring function takes an Int argument for the substring length. col("ageLimit").cast("Int") is not an Int but another Column object, holding the integer values of whatever was in the ageLimit column.
Instead, use the substr method of Column. It has an overload that takes two Columns for the position and the substring length. To pass a literal 0 for the position column, use lit(0):
val readyDF = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  col("fnamelname").substr(lit(0), col("ageLimit")))
You can't do this directly with substring (or any other function with a similar signature). You must use expr, so the solution would be something like:
peopleWithJobsAndAgeLimitsDF
  .withColumn(
    "fnamelname",
    expr("substring(fnamelname, 0, ageLimit)")
  )
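For completeness, here is how both answers look end-to-end on a tiny stand-in for the asker's data. A minimal sketch, assuming a SparkSession named spark; the values are placeholders, not the asker's real rows:

import org.apache.spark.sql.functions.{col, expr, lit}
import spark.implicits._

val peopleWithJobsAndAgeLimitsDF = Seq(
  ("1", "xxxxx xxxxx", 28, "teacher", 18),
  ("2", "xxxx xxxxxxxx", 30, "programmer", 0)
).toDF("id", "fnamelname", "age", "job", "ageLimit")

// Column-based length via Column.substr(pos: Column, len: Column)
peopleWithJobsAndAgeLimitsDF
  .withColumn("fnamelname", col("fnamelname").substr(lit(0), col("ageLimit")))
  .show()

// The same trim expressed as a SQL string, compiled by expr
peopleWithJobsAndAgeLimitsDF
  .withColumn("fnamelname", expr("substring(fnamelname, 0, ageLimit)"))
  .show()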

How to drop duplicate columns based on another schema in spark scala?

Imagine I have two different dataframes with similar schemas:
df0.printSchema
root
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)
and:
df1.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: double (nullable = false)
Now I merge these two schemas like below and create a new dataframe with this merged schema:
val consolidatedSchema = df0.schema.++:(df1.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)
 |-- double: double (nullable = false)
But as you can see, I now have two fields named double, but with different data types.
How can I keep the one whose data type matches the one in df0's schema and drop the other?
I want the final schema to be like this:
finalDF.printSchema
root
 |-- newColumn: integer (nullable = false)
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)
I would really appreciate any other method you can suggest to merge these two schemas and reach my goal.
Thank you in advance.
You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:
val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
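Applied to schemas shaped like the question's df0 and df1, the duplicate double resolves in favor of df0. A minimal sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val df0Schema = StructType(Seq(
  StructField("single", IntegerType, nullable = false),
  StructField("double", IntegerType, nullable = false)))

val df1Schema = StructType(Seq(
  StructField("newColumn", IntegerType, nullable = false),
  StructField("single", IntegerType, nullable = false),
  StructField("double", DoubleType, nullable = false)))

// df0's fields win; df1 contributes only the names df0 lacks
val uniqueConsolidatedSchemas = StructType(
  df0Schema ++ df1Schema.filter(c1 => !df0Schema.exists(c0 => c0.name == c1.name)))

val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema()
// root
//  |-- single: integer (nullable = false)
//  |-- double: integer (nullable = false)
//  |-- newColumn: integer (nullable = false)

Note that the field order follows df0 first; if the original column order matters, reorder the struct fields afterwards.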

Dealing with too many terminal nodes in grammar

I'm trying to write a parser for protobuf3 using the grammar from https://github.com/antlr/grammars-v4/blob/master/protobuf3/Protobuf3.g4, and I'm trying to deal with the type_ rule in my grammar:
field
: ( REPEATED )? type_ fieldName EQ fieldNumber ( LB fieldOptions RB )? SEMI
;
type_
: DOUBLE
| FLOAT
| INT32
| INT64
| UINT32
| UINT64
| SINT32
| SINT64
| FIXED32
| FIXED64
| SFIXED32
| SFIXED64
| BOOL
| STRING
| BYTES
| messageDefinition
| enumType
;
Inside enterField I have this snippet:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
    MessageDefinition messageDefinition = this.messageStack.peek();
    Field field = new Field();
    field.setName(ctx.fieldName().ident().getText());
    field.setPosition(ctx.fieldNumber().getAltNumber());
    messageDefinition.addField(field);
    super.enterField(ctx);
}
However, I'm not sure how to deal with the type_ context here. It has too many terminal nodes (for the basic types), and it could also hold a messageType or an enumType.
For my use case, all I care about is: if it is a basic type, get the type name; if it is a complex type (such as another message or enum), get the definition name.
Is there a way to do this without having to check each possible outcome of ctx.type_()?
Thank you
If both messageDefinition and enumType return a single lexer token, you can make the access very easy by using a label:
type_
: value = DOUBLE
| value = FLOAT
| value = INT32
| value = INT64
| value = UINT32
| value = UINT64
| value = SINT32
| value = SINT64
| value = FIXED32
| value = FIXED64
| value = SFIXED32
| value = SFIXED64
| value = BOOL
| value = STRING
| value = BYTES
| value = messageDefinition
| value = enumType
;
With that you only need to use the field value:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
    ...
    String type = ctx.type_().value.getText();
    ...
    super.enterField(ctx);
}

Haskell: Substitute a character with a variable name

I want to execute something like my effort here:
hashTag :: Char
hashTag = "#"
So I can reference it later, such as when adding a # to a list or assigning it elsewhere:
point :: Int -> Int -> Point -> String
point a b x
  | firstPoint(x) == a && secondPoint(x) == b = "#"
  | otherwise = "."
There is a problem with hashTag: you declared it as a Char, but wrote a string literal. You should either use a character literal, or change the type of hashTag to String.
If we change the type to String, we can use:
hashTag :: String
hashTag = "#"
In that case we can use the variable we defined:
point :: Int -> Int -> Point -> String
point a b x
  | firstPoint x == a && secondPoint x == b = hashTag
  | otherwise = "."
If you define hashTag as a Char:
hashTag :: Char
hashTag = '#'
then you need to wrap it in a list to produce a String containing the single character hashTag:
point :: Int -> Int -> Point -> String
point a b x
  | firstPoint x == a && secondPoint x == b = [hashTag]
  | otherwise = "."

How to add a setter to discriminated unions in F#

I want to add a setter property to a discriminated union; how should I do it? For example:
type Factor =
    | Value of Object
    | Range of String

    let mutable myProperty = 123

    member this.MyProperty
        with get() = myProperty
        and set(value) = myProperty <- value
Here's how I might approach it:
type Value = { value: obj; mutable MyProperty: int }
type Range = { range: string; mutable MyProperty: int }

type Factor =
    | Value of Value
    | Range of Range

    member this.MyProperty
        with get() =
            match this with
            | Value { MyProperty = myProperty }
            | Range { MyProperty = myProperty } -> myProperty
        and set(myProperty) =
            match this with
            | Value x -> x.MyProperty <- myProperty
            | Range x -> x.MyProperty <- myProperty
and use it like so:
let v = Value { value = "hi" :> obj; MyProperty = 0 }
v.MyProperty <- 2

match v with
| Value { value = value } as record ->
    printfn "Value of value=%A with MyProperty=%i" value record.MyProperty
| _ ->
    printfn "etc."
I've used this technique in a similar scenario to yours with happy results in FsEye's watch model: http://code.google.com/p/fseye/source/browse/tags/2.0.0-beta1/FsEye/WatchModel.fs.
Why not use a class and an active pattern:
type _Factor =
    | Value_ of obj
    | Range_ of string

type Factor(arg: _Factor) =
    let mutable myProperty = 123
    member this._DU = arg
    member this.MyProperty
        with get() = myProperty
        and set(value) = myProperty <- value

let (|Value|Range|) (arg: Factor) =
    match arg._DU with
    | Value_(t) -> Value(t)
    | Range_(t) -> Range(t)
This will obviously be significantly slower, but it allows you to do what you want.
I'm not too familiar with F# yet, but I suppose you can't do this; it doesn't make sense. Discriminated unions, as the name suggests, are unions: they represent some kind of choice, and you're trying to incorporate state into one. What are you trying to achieve? What's the use case?
Perhaps all you need is to add an additional "parameter" to your DU. I.e., if you have:
type DU =
    | A of int
    | B of string
and you want to add a setter of type int, then you can extend DU like this:
type DU =
    | A of int * int
    | B of string * int

    member x.Set i =
        match x with
        | A(a1, a2) -> A(a1, i)
        | B(b1, b2) -> B(b1, i)