How to provide a value from the same row to the Scala Spark substring function?

I've got the following DataFrame with a fnamelname column that I want to transform:
+---+---------------+---+----------+--------+
| id| fnamelname|age| job|ageLimit|
+---+---------------+---+----------+--------+
| 1| xxxxx xxxxx| 28| teacher| 18|
| 2| xxxx xxxxxxxx| 30|programmer| 0|
| 3| xxxxx xxxxx| 28| teacher| 18|
| 8|xxxxxxx xxxxxxx| 12|programmer| 0|
| 9| xxxxx xxxxxxxx| 45|programmer| 0|
+---+---------------+---+----------+--------+
only showing top 5 rows
root
|-- id: string (nullable = true)
|-- fnamelname: string (nullable = true)
|-- age: integer (nullable = false)
|-- job: string (nullable = true)
|-- ageLimit: integer (nullable = false)
I want to use ageLimit as the len argument of the substring function, but the .cast("Int") call doesn't turn the column into the Int the function expects.
val readyDF: Dataset[Row] = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  substring(col("fnamelname"), 0, col("ageLimit").cast("Int")))
All I'm getting is:
found : org.apache.spark.sql.Column
required: Int
col("fnamelname"),0, col("ageLimit").cast("Int")))
How can I provide the value of another column as an argument to a function within .withColumn()?

The substring function takes an Int argument for the substring length, but col("ageLimit").cast("Int") is not an Int; it is another Column object holding the integer values of whatever was in the ageLimit column.
Instead, use the substr method of Column. It has an overload that takes two Columns for the position and the substring length. To pass a literal 0 for the position, use lit(0):
val readyDF = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  col("fnamelname").substr(lit(0), col("ageLimit")))

You can't do this directly using substring (or any other function with a similar signature). You must use expr instead, so the solution would be something like:
peopleWithJobsAndAgeLimitsDF
  .withColumn(
    "fnamelname",
    expr("substring(fnamelname, 0, ageLimit)")
  )
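
For an end-to-end check, here is a minimal sketch of both approaches, assuming a SparkSession named spark (the data is illustrative). Note that Spark's substring positions are 1-based, so the 0 used above behaves the same as 1:

import org.apache.spark.sql.functions.{col, expr, lit}
import spark.implicits._

val peopleWithJobsAndAgeLimitsDF = Seq(
  ("1", "xxxxx xxxxx", 28, "teacher", 18),
  ("2", "xxxx xxxxxxxx", 30, "programmer", 0)
).toDF("id", "fnamelname", "age", "job", "ageLimit")

// Column-based overload: position and length are both Columns.
val viaSubstr = peopleWithJobsAndAgeLimitsDF
  .withColumn("fnamelname", col("fnamelname").substr(lit(0), col("ageLimit")))

// SQL expression: substring resolves the column names itself.
val viaExpr = peopleWithJobsAndAgeLimitsDF
  .withColumn("fnamelname", expr("substring(fnamelname, 0, ageLimit)"))

viaSubstr.show()
viaExpr.show()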

Related

How to drop duplicate columns based on another schema in Spark Scala?

Imagine I have two different DataFrames with similar schemas:
df0.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
and:
df1.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: double (nullable = false)
Now I merge these two schemas as below and create a new DataFrame with the merged schema:
val consolidatedSchema = df0.schema.++:(df1.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- double: double (nullable = false)
But as you can see, I have two fields named double, but with different data types.
How can I keep the one whose data type matches the one in df0's schema and drop the other?
I want the final schema to be like this:
finalDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
I would really appreciate it if you could suggest any other method to merge these two schemas and reach my goal.
Thank you in advance.
You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:
val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
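
Plugging that into the question's flow, the duplicate field disappears. A minimal sketch, reusing spark, df0, and df1 from the question (the df0 fields now come first, which only changes column order, not content):

import org.apache.spark.sql.Row

val emptyDF = spark.createDataFrame(
  spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)

emptyDF.printSchema
// root
//  |-- single: integer (nullable = false)
//  |-- double: integer (nullable = false)
//  |-- newColumn: integer (nullable = false)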

Dealing with too many terminal nodes in grammar

I'm trying to write a parser for protobuf3 using the grammar from https://github.com/antlr/grammars-v4/blob/master/protobuf3/Protobuf3.g4, and I'm trying to deal with the type_ rule in my grammar:
field
: ( REPEATED )? type_ fieldName EQ fieldNumber ( LB fieldOptions RB )? SEMI
;
type_
: DOUBLE
| FLOAT
| INT32
| INT64
| UINT32
| UINT64
| SINT32
| SINT64
| FIXED32
| FIXED64
| SFIXED32
| SFIXED64
| BOOL
| STRING
| BYTES
| messageDefinition
| enumType
;
Inside enterField I have this snippet:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
    MessageDefinition messageDefinition = this.messageStack.peek();
    Field field = new Field();
    field.setName(ctx.fieldName().ident().getText());
    field.setPosition(ctx.fieldNumber().getAltNumber());
    messageDefinition.addField(field);
    super.enterField(ctx);
}
However, I'm not sure how to deal with the type_ context here. It has too many terminal nodes (for the basic types), and it could also contain a messageDefinition or an enumType.
For my use case, all I care about is: if it is a basic type, get the type name; if it is a complex type (such as another message or enum), get the definition name.
Is there a way to do this without having to check each possible outcome of ctx.type_()?
Thank you
If both messageDefinition and enumType return a single lexer token, you can make the entire access very easy by using a label:
type_
: value = DOUBLE
| value = FLOAT
| value = INT32
| value = INT64
| value = UINT32
| value = UINT64
| value = SINT32
| value = SINT64
| value = FIXED32
| value = FIXED64
| value = SFIXED32
| value = SFIXED64
| value = BOOL
| value = STRING
| value = BYTES
| value = messageDefinition
| value = enumType
;
With that you only need to use the field value:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
    ...
    String type = ctx.type_().value.getText();
    ...
    super.enterField(ctx);
}
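
Alternatively, if you'd rather not modify the grammar: the generated Type_Context has a null-checkable accessor per sub-rule, so two checks suffice instead of one per terminal. A sketch, assuming ANTLR's default code generation for the rule names quoted above:

@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
    Protobuf3Parser.Type_Context type = ctx.type_();
    String typeName;
    if (type.messageDefinition() != null) {
        // complex type: the definition name is somewhere in this subtree
        typeName = type.messageDefinition().getText();
    } else if (type.enumType() != null) {
        typeName = type.enumType().getText();
    } else {
        // every basic type is a single terminal, so getText() is the type name
        typeName = type.getText();
    }
    // ... use typeName when building the Field ...
    super.enterField(ctx);
}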

How to cast from double to int in from_json Spark SQL (NULL output)

I have a table with a JSON string
When running this Spark SQL query:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:int>>')
I get a NULL, since the data types for some_number do not match (int vs double)...
If I run this it works:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>')
Is there a way to CAST this on-the-fly?
You can do from_json first using array<struct<column_1:string,some_number:double>>, then cast the result to array<struct<column_1:string,some_number:int>>.
Example:
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").show()
//+-------------------------------------------------------+
//|jsontostructs([{"column_1":"hola", "some_number":1.0}])|
//+-------------------------------------------------------+
//| [[hola, 1]]|
//+-------------------------------------------------------+
//printSchema
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").printSchema()
//root
// |-- jsontostructs([{"column_1":"hola", "some_number":1.0}]): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- column_1: string (nullable = true)
// | | |-- some_number: integer (nullable = true)
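
The same two-step trick works from the DataFrame API. A minimal sketch, assuming a DataFrame df with a string column named json:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Parse with the types the JSON actually contains...
val parseSchema = ArrayType(StructType(Seq(
  StructField("column_1", StringType),
  StructField("some_number", DoubleType)
)))

// ...then cast the parsed value to the types you want.
val result = df.withColumn("parsed",
  from_json(col("json"), parseSchema)
    .cast("array<struct<column_1:string,some_number:int>>"))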

How to convert String to Long in Kotlin?

So, due to the lack of methods like Long.valueOf(String s), I am stuck.
How to convert String to Long in Kotlin?
1. string.toLong()
Parses the string as a [Long] number and returns the result.
@throws NumberFormatException if the string is not a valid representation of a number.

2. string.toLongOrNull()
Parses the string as a [Long] number and returns the result or null if the string is not a valid representation of a number.

3. string.toLong(10)
Parses the string as a [Long] number and returns the result.
@throws NumberFormatException if the string is not a valid representation of a number.
@throws IllegalArgumentException when [radix] is not a valid radix for string to number conversion.
public inline fun String.toLong(radix: Int): Long = java.lang.Long.parseLong(this, checkRadix(radix))

4. string.toLongOrNull(10)
Parses the string as a [Long] number and returns the result or null if the string is not a valid representation of a number.
@throws IllegalArgumentException when [radix] is not a valid radix for string to number conversion.
public fun String.toLongOrNull(radix: Int): Long? {...}

5. java.lang.Long.valueOf(string)
public static Long valueOf(String s) throws NumberFormatException
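
A compact sketch exercising the five options above (values are illustrative):

fun main() {
    println("99".toLong())                // 99
    println("abc".toLongOrNull())         // null
    println("ff".toLong(16))              // 255, radix 16
    println("zz".toLongOrNull(36))        // 1295, radix 36
    println(java.lang.Long.valueOf("10")) // 10
}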
String has a corresponding extension method:
"10".toLong()
Extension methods are available for Strings to parse them into other primitive types. Examples below:
"true".toBoolean()
"10.0".toFloat()
"10.0".toDouble()
"10".toByte()
"10".toShort()
"10".toInt()
"10".toLong()
Note: Answers mentioning jet.String are outdated. Here is current Kotlin (1.0):
Any String in Kotlin already has an extension function you can call toLong(). Nothing special is needed, just use it.
All extension functions for String are documented. You can find the others for the standard library in the API reference.
Actually, 90% of the time you also need to check that the 'long' is valid, so you need:
"10".toLongOrNull()
There is an 'orNull' equivalent for each 'toLong' of the basic types, and these allow invalid cases to be managed in keeping with Kotlin's nullable (?) idiom.
It's interesting. Code like this:
val num = java.lang.Long.valueOf("2");
println(num);
println(num is kotlin.Long);
makes this output:
2
true
I guess Kotlin converts java.lang.Long and the long primitive to kotlin.Long automatically in this case. So it's a solution, but I would be happy to see a tool that avoids the java.lang package.
In Kotlin, converting a String to a Long (which represents a 64-bit signed integer) is simple.
You can use any of the following examples:
val number1: Long = "789".toLong()
println(number1) // 789
val number2: Long? = "404".toLongOrNull()
println("number = $number2") // number = 404
val number3: Long? = "Error404".toLongOrNull()
println("number = $number3") // number = null
val number4: Long? = "111".toLongOrNull(2) // binary
println("numberWithRadix(2) = $number4") // numberWithRadix(2) = 7
With the toLongOrNull() method, you can use the let { } scope function after the ?. safe-call operator.
Such logic is good for executing a code block only with non-null values.
fun convertToLong(that: String) {
that.toLongOrNull()?.let {
println("Long value is $it")
}
}
convertToLong("123") // Long value is 123
One good old Java possibility that's not mentioned in the other answers is java.lang.Long.decode(String).
Decimal Strings:
Kotlin's String.toLong() is equivalent to Java's Long.parseLong(String):
Parses the string argument as a signed decimal long. ... The
resulting long value is returned, exactly as if the argument and the
radix 10 were given as arguments to the parseLong(java.lang.String, int) method.
Non-decimal Strings:
Kotlin's String.toLong(radix: Int) is equivalent to Java's Long.parseLong(String, int):
Parses the string argument as a signed long in the radix specified by
the second argument. The characters in the string must all be digits of the specified radix ...
And here comes java.lang.Long.decode(String) into the picture:
Decodes a String into a Long. Accepts decimal, hexadecimal, and octal
numbers given by the following grammar: DecodableString:
(Sign) DecimalNumeral | (Sign) 0x HexDigits | (Sign) 0X HexDigits | (Sign) # HexDigits | (Sign) 0 OctalDigits
Sign: - | +
That means that decode can parse Strings like "0x412", where other methods will result in a NumberFormatException.
val kotlin_toLong010 = "010".toLong() // 10, parsed as decimal
val kotlin_toLong10 = "10".toLong() // 10, parsed as decimal
val java_parseLong010 = java.lang.Long.parseLong("010") // 10, parsed as decimal
val java_parseLong10 = java.lang.Long.parseLong("10") // 10, parsed as decimal
val kotlin_toLong010Radix = "010".toLong(8) // 8, octal parsing forced by the radix
val kotlin_toLong10Radix = "10".toLong(8) // 8, octal parsing forced by the radix
val java_parseLong010Radix = java.lang.Long.parseLong("010", 8) // 8, octal parsing forced by the radix
val java_parseLong10Radix = java.lang.Long.parseLong("10", 8) // 8, octal parsing forced by the radix
val java_decode010 = java.lang.Long.decode("010") // 8, the leading 0 means octal
val java_decode10 = java.lang.Long.decode("10") // 10, parsed as decimal
If you don't want to handle NumberFormatException while parsing:
var someLongValue = string.toLongOrNull() ?: 0L
Actually, there are several ways:
Given:
var numberString : String = "numberString"
// number is the Long value of numberString (if any)
var defaultValue : Long = defaultValue
Then we have:
+—————————————————————————————————————————————+——————————+———————————————————————+
| numberString is a valid number ? | true | false |
+—————————————————————————————————————————————+——————————+———————————————————————+
| numberString.toLong() | number | NumberFormatException |
+—————————————————————————————————————————————+——————————+———————————————————————+
| numberString.toLongOrNull() | number | null |
+—————————————————————————————————————————————+——————————+———————————————————————+
| numberString.toLongOrNull() ?: defaultValue | number | defaultValue |
+—————————————————————————————————————————————+——————————+———————————————————————+
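
A minimal sketch walking through each row of the table (names are illustrative):

fun main() {
    val valid = "42"
    val invalid = "forty-two"
    val defaultValue = -1L

    println(valid.toLong())                         // 42
    // invalid.toLong() would throw NumberFormatException
    println(invalid.toLongOrNull())                 // null
    println(invalid.toLongOrNull() ?: defaultValue) // -1
}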
string.toLong()
where string is your variable.

OCaml: circularity between variant type and module definition

I'm switching from Haskell to OCaml but I'm having some problems. For instance, I need a type definition for regular expressions. I do so with:
type re = EmptySet
| EmptyWord
| Symb of char
| Star of re
| Conc of re list
| Or of (RegExpSet.t * bool) ;;
The elements inside the Or are in a set (RegExpSet), so I define it next (along with a map module):
module RegExpOrder : Set.OrderedType =
struct
let compare = Pervasives.compare
type t = re
end
module RegExpSet = Set.Make( RegExpOrder )
module RegExpMap = Map.Make( RegExpOrder )
However, when I do "ocaml [name of file]" I get:
Error: Unbound module RegExpSet
in the line of "Or" in the definition of "re".
If I swap these definitions, that is, if I write the module definitions before the re type definition, I obviously get:
Error: Unbound type constructor re
in the line of "type t = re".
How can I solve this?
Thanks!
You can try to use recursive modules. For instance, the following compiles:
module rec M :
  sig type re = EmptySet
              | EmptyWord
              | Symb of char
              | Star of re
              | Conc of re list
              | Or of (RegExpSet.t * bool)
  end =
  struct
    type re = EmptySet
            | EmptyWord
            | Symb of char
            | Star of re
            | Conc of re list
            | Or of (RegExpSet.t * bool)
  end
and RegExpOrder : (Set.OrderedType with type t = M.re) =
  struct
    let compare = Pervasives.compare
    type t = M.re
  end
and RegExpSet : (Set.S with type elt = M.re) = Set.Make( RegExpOrder )
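
With those definitions in place, values of the recursive type and sets of them can be built together. A small sketch with illustrative values:

let () =
  let subexpressions =
    RegExpSet.add (M.Symb 'a') (RegExpSet.add M.EmptySet RegExpSet.empty) in
  match M.Or (subexpressions, true) with
  | M.Or (set, _) ->
      Printf.printf "Or over %d sub-expressions\n" (RegExpSet.cardinal set)
  | _ -> print_endline "not an Or"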