How to drop duplicate columns based on another schema in Spark Scala?

Imagine I have two different dataframes with similar schemas:
df0.printSchema
root
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
and:
df1.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: double (nullable = false)
Now I merge these two schemas like below and create a new dataframe with this merged schema:
val consolidatedSchema = df0.schema.++:(df1.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
|-- double: double (nullable = false)
But as you can see, I end up with two fields named double, with different data types.
How can I keep the one whose data type matches the one in the df0 schema and drop the other?
I want the final schema to be like this:
finalDF.printSchema
root
|-- newColumn: integer (nullable = false)
|-- single: integer (nullable = false)
|-- double: integer (nullable = false)
I would really appreciate it if you could suggest any other method to merge these two schemas and reach my goal.
Thank you in advance.

You can filter the second schema to exclude the fields that are already present in the first one before you concatenate the lists:
val uniqueConsolidatedSchemas = StructType(
  df0.schema ++ df1.schema.filter(c1 =>
    !df0.schema.exists(c0 => c0.name == c1.name)
  )
)
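As a quick sanity check, a sketch (assuming df0 and df1 are defined as in the question): rebuilding the empty DataFrame with this schema keeps only the df0-typed double field. Note that the fields now come out in df0-first order:
// import org.apache.spark.sql.Row is assumed, as in the question
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], uniqueConsolidatedSchemas)
emptyDF.printSchema
// root
//  |-- single: integer (nullable = false)
//  |-- double: integer (nullable = false)
//  |-- newColumn: integer (nullable = false)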

Related

How to provide a value from the same row to the Scala Spark substring function?

I've got the following DataFrame with a fnamelname column that I want to transform:
+---+---------------+---+----------+--------+
| id| fnamelname|age| job|ageLimit|
+---+---------------+---+----------+--------+
| 1| xxxxx xxxxx| 28| teacher| 18|
| 2| xxxx xxxxxxxx| 30|programmer| 0|
| 3| xxxxx xxxxx| 28| teacher| 18|
| 8|xxxxxxx xxxxxxx| 12|programmer| 0|
| 9| xxxxx xxxxxxxx| 45|programmer| 0|
+---+---------------+---+----------+--------+
only showing top 5 rows
root
|-- id: string (nullable = true)
|-- fnamelname: string (nullable = true)
|-- age: integer (nullable = false)
|-- job: string (nullable = true)
|-- ageLimit: integer (nullable = false)
I want to use ageLimit as the len value in the substring function, but somehow .cast("Int") doesn't turn the column into the Int value of that row.
val readyDF: Dataset[Row] = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  substring(col("fnamelname"), 0, col("ageLimit").cast("Int")))
All I'm getting is:
found : org.apache.spark.sql.Column
required: Int
col("fnamelname"),0, col("ageLimit").cast("Int")))
How can I provide the value of another column as an argument to a function within .withColumn()?
The substring function takes an Int argument for the substring length. col("ageLimit").cast("Int") is not Int but another Column object holding the integer values of whatever was in the ageLimit column.
Instead, use the substr method of Column. It has an overload that takes two Columns for the position and the substring length. To pass a literal 0 for the position column, use lit(0):
val readyDF = peopleWithJobsAndAgeLimitsDF.withColumn("fnamelname",
  col("fnamelname").substr(lit(0), col("ageLimit")))
You can't do this directly using substring (or any other function with a similar signature). You must use expr, so the solution would be something like:
peopleWithJobsAndAgeLimitsDF
  .withColumn(
    "fnamelname",
    expr("substring(fnamelname, 0, ageLimit)")
  )
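For completeness, here is a minimal runnable sketch of the substr approach with made-up data (the rows and names below are hypothetical; the expr variant produces the same result). Note that Spark's substring positions are 1-based, and position 0 is treated like position 1:
import spark.implicits._ // for .toDF on a local Seq
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical rows shaped like the question's DataFrame.
val peopleWithJobsAndAgeLimitsDF = Seq(
  ("1", "John Smith", 28, "teacher", 4),
  ("2", "Jane Doe", 30, "programmer", 0)
).toDF("id", "fnamelname", "age", "job", "ageLimit")

// Truncate fnamelname to ageLimit characters, per row.
peopleWithJobsAndAgeLimitsDF
  .withColumn("fnamelname", col("fnamelname").substr(lit(1), col("ageLimit")))
  .show()
// +--+----------+---+----------+--------+
// |id|fnamelname|age|       job|ageLimit|
// +--+----------+---+----------+--------+
// | 1|      John| 28|   teacher|       4|
// | 2|          | 30|programmer|       0|
// +--+----------+---+----------+--------+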

Haskell: Substitute a character with a variable name

I want to execute something like my effort here:
hashTag :: Char
hashTag = "#"
So I can therefore reference it later, such as when adding a # to a list or assigning it, as in:
Point :: Int -> Int -> Point -> String
Point a b x
  | firstPoint(x) == a && secondPoint(x) == b = "#"
  | otherwise = "."
There is a problem with hashTag: you defined it as a Char, but you wrote a string literal. You should either use a character literal or change the type of hashTag to String.
If we change the type to String, we can use:
hashTag :: String
hashTag = "#"
in that case we can use the variable we defined (note that a function name must start with a lowercase letter, so point rather than Point):
point :: Int -> Int -> Point -> String
point a b x
  | firstPoint x == a && secondPoint x == b = hashTag
  | otherwise = "."
If you define hashTag as a Char:
hashTag :: Char
hashTag = '#'
then you need to wrap it in a list to generate a String with a single character, [hashTag]:
point :: Int -> Int -> Point -> String
point a b x
  | firstPoint x == a && secondPoint x == b = [hashTag]
  | otherwise = "."

How to cast from double to int in from_json Spark SQL (NULL output)

I have a table with a JSON string.
When running this Spark SQL query:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:int>>')
I get a NULL, since the data types for some_number do not match (int vs double)...
If I run this it works:
select from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>')
Is there a way to CAST this on-the-fly?
You can do from_json first using array<struct<column_1:string,some_number:double>>, then cast the result as array<struct<column_1:string,some_number:int>>.
Example:
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").show()
//+-------------------------------------------------------+
//|jsontostructs([{"column_1":"hola", "some_number":1.0}])|
//+-------------------------------------------------------+
//|                                            [[hola, 1]]|
//+-------------------------------------------------------+
//printSchema
spark.sql("""select cast(from_json('[{"column_1":"hola", "some_number":1.0}]', 'array<struct<column_1:string,some_number:double>>') as array<struct<column_1:string,some_number:int>>)""").printSchema()
//root
// |-- jsontostructs([{"column_1":"hola", "some_number":1.0}]): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- column_1: string (nullable = true)
// | | |-- some_number: integer (nullable = true)
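If you are calling this from Scala rather than SQL, the same parse-then-cast trick works with the DataFrame API. A sketch, assuming Spark 2.3+ (where from_json accepts a DDL schema string); the schema strings are the same DDL strings used above:
import org.apache.spark.sql.functions.{from_json, lit}

// Parse with the type that matches the JSON (double), then cast the whole
// array<struct<...>> down to the int-typed variant.
val parsed = spark.range(1).select(
  from_json(
    lit("""[{"column_1":"hola", "some_number":1.0}]"""),
    "array<struct<column_1:string,some_number:double>>",
    Map.empty[String, String]
  ).cast("array<struct<column_1:string,some_number:int>>").as("parsed")
)
parsed.printSchema
// root
//  |-- parsed: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- column_1: string (nullable = true)
//  |    |    |-- some_number: integer (nullable = true)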

Split main.rs into files that refer to each other

I have the following structure:
|-- Cargo.toml
|-- src
| |-- file1.rs
| |-- file2.rs
| `-- main.rs
src/file1.rs
pub fn function1() {}
src/file2.rs
// ERROR (1): error[E0583]: file not found for module `file1`
// mod file1;
use crate::file1;
pub fn function2() {
    file1::function1();
}
src/main.rs
// ERROR (2): no `file1` in the root
// use crate::file1;
mod file1;
mod file2;
fn main() {
    file1::function1();
    file2::function2();
}
Basically, I have to import function1 in a different way depending on whether I am in the crate root or in an arbitrary Rust file (please see ERROR (1) and ERROR (2)).
I am a bit lost on how Rust manages arbitrary files: they behave differently from the crate root, where a simple mod keyword does the trick.
So, the answer this was marked as a duplicate of only partially answers the question: it covers how to refer to a file from the crate root, not why referring to the same file from another file is different (use crate::<filename>).

Checking if string is empty in Kotlin

In Java, we've always been reminded to use myString.isEmpty() to check whether a String is empty. In Kotlin however, I find that you can use either myString == "" or myString.isEmpty() or even myString.isBlank().
Are there any guidelines/recommendations on this? Or is it simply "anything that rocks your boat"?
Thanks in advance for feeding my curiosity. :D
Don't use myString == ""; in Java this would be myString.equals(""), which also isn't recommended.
isBlank is not the same as isEmpty, and which one to use really depends on your use case.
isBlank checks that a char sequence has length 0 or that all of its characters are whitespace. isEmpty only checks that the char sequence's length is 0.
/**
* Returns `true` if this string is empty or consists solely of whitespace characters.
*/
public fun CharSequence.isBlank(): Boolean = length == 0 || indices.all { this[it].isWhitespace() }
/**
* Returns `true` if this char sequence is empty (contains no characters).
*/
@kotlin.internal.InlineOnly
public inline fun CharSequence.isEmpty(): Boolean = length == 0
For String? (nullable String) datatype, I use .isNullOrBlank()
For String, I use .isBlank()
Why? Because most of the time, I do not want to allow Strings containing only whitespace (and .isBlank() checks for whitespace as well as the empty String). If you don't care about whitespace, use .isNullOrEmpty() and .isEmpty() for String? and String, respectively.
Use isEmpty when you want to test that a String is exactly equal to the empty string "".
Use isBlank when you want to test that a String is empty or only consists of whitespace ("", " ").
Avoid using == "".
There are two methods available in Kotlin.
isNullOrBlank()
isNullOrEmpty()
And the difference is:
val data = " " // this is a text with a blank space
println(data.isNullOrBlank()) // true
println(data.isNullOrEmpty()) // false
You can use isNullOrBlank() to check whether a string is null or blank. This method considers whitespace-only strings to be blank.
Here is a usage example:
val s: String? = null
println(s.isNullOrBlank())
val s1: String? = ""
println(s1.isNullOrBlank())
val s2: String? = " "
println(s2.isNullOrBlank())
val s3: String? = " a "
println(s3.isNullOrBlank())
The output of this snippet is:
true
true
true
false
As someone mentioned in the comments, you can use ifBlank, like so:
fun getSomeValue(): String {
    // ...
    val foo = someCall()
    return foo.ifBlank { "some-default" }
}
Documentation: https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/if-blank.html