I have a DataFrame with many String columns that should be Float64 instead. I would like to convert all of these columns at once and turn the DataFrame into a Float64 matrix. How can I do this? Importantly, some columns are already floats.
df = DataFrame(a=["1", "2", "3"], b=["1.1", "2.2", "3.3"], c=[0.1, 0.2, 0.3])
# Verbose option
df.a = parse.(Float64, df.a)
df.b = parse.(Float64, df.b)
matrix = Matrix{Float64}(df)
# Is it possible to do this at once, especially when there are float columns too?
# Here parse.(Float64, df.c) would throw an error
One way of doing this is by looping over the String columns:
for c ∈ names(df, String)
    df[!, c] = parse.(Float64, df[!, c])
end
Note that you don't need Matrix{Float64} if you've already turned everything into Floats, just Matrix(df) will do.
I had the same question, landed on this page, and found that the code above did not work for me. A slight change made it work:
for c ∈ names(df, Any)
    df[!, c] = Float64.(df[!, c])
end
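Note that the second argument of names(df, Any) can be String or any other type, to restrict which columns are selected. Float64.(...) converts numeric columns (for example Int); String columns still need parse.(Float64, ...).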
As stated in the answer https://stackoverflow.com/a/57115894/3286489, asReversed() returns a reversed view of the list, so its contents change whenever an element of the original list changes.
val list = mutableListOf(0, 1, 2, 3, 4, 5)
val asReversed = list.asReversed()
val reversed = list.reversed()
println("Original list: $list")
println("asReversed: $asReversed")
println("reversed: $reversed")
list[0] = 10
println("Original list: $list")
println("asReversed: $asReversed")
println("reversed: $reversed")
Outputs
Original list: [0, 1, 2, 3, 4, 5]
asReversed: [5, 4, 3, 2, 1, 0]
reversed: [5, 4, 3, 2, 1, 0]
Original list: [10, 1, 2, 3, 4, 5]
asReversed: [5, 4, 3, 2, 1, 10]
reversed: [5, 4, 3, 2, 1, 0]
To me, that means the result can change only if the original list is a MutableList.
However, if the original List is immutable, its values cannot be changed, so there is essentially no difference between asReversed() and reversed() on it, right?
i.e.
val list = listOf(0, 1, 2, 3, 4, 5)
val asReversed = list.asReversed() // This is the same as below?
val reversed = list.reversed() // This is the same as above?
Did I miss any scenario where they are still different?
Updated
I even changed the elements to be mutable lists:
val list = listOf(
mutableListOf(1),
mutableListOf(2),
mutableListOf(3),
mutableListOf(4),
mutableListOf(5),
mutableListOf(6))
val asReversed = list.asReversed()
val reversed = list.reversed()
println("Original list: $list")
println("asReversed: $asReversed")
println("reversed: $reversed")
list[0][0] = 10
println("Original list: $list")
println("asReversed: $asReversed")
println("reversed: $reversed")
Output
Original list: [[1], [2], [3], [4], [5], [6]]
asReversed: [[6], [5], [4], [3], [2], [1]]
reversed: [[6], [5], [4], [3], [2], [1]]
Original list: [[10], [2], [3], [4], [5], [6]]
asReversed: [[6], [5], [4], [3], [2], [10]]
reversed: [[6], [5], [4], [3], [2], [10]]
This changes the result of both asReversed and reversed.
The most important difference between the two is that asReversed is always just a view on the original list. That means that if you alter the original list, the result of asReversed will always reflect the updated contents. In the end it's just a view.
reversed however always gives you a new disconnected list of the original list. Note however that when you deal with object references within the original list (like lists in a list or other kind of objects that are referenceable) then you will see all the adaptations of the original object of that list also adapted in the reversed ones, regardless of whether you used reversed or asReversed.
Regarding your update it is unfortunate you used such a list that contains references. With your original list however the difference becomes much clearer:
val asReversed = list.asReversed()
val reversed = list.reversed()
fun printAll(msg: String) {
    println(msg)
    println("Original list: $list")
    println("asReversed: $asReversed")
    println("reversed: $reversed")
}
printAll("Initial ----")
list as MutableList
list[2] = 1000
printAll("mutating original list ----")
reversed as MutableList
reversed[4] = 3000
printAll("mutating reversed list ----")
As asReversed returns a ReversedListReadOnly, you can't easily cast it to a MutableList. If that were possible, changes made through asReversed would be reflected in the original list, but not in the reversed one. The output of the above is the following:
Initial ----
Original list: [0, 1, 2, 3, 4, 5]
asReversed: [5, 4, 3, 2, 1, 0]
reversed: [5, 4, 3, 2, 1, 0]
mutating original list ----
Original list: [0, 1, 1000, 3, 4, 5]
asReversed: [5, 4, 3, 1000, 1, 0]
reversed: [5, 4, 3, 2, 1, 0]
mutating reversed list ----
Original list: [0, 1, 1000, 3, 4, 5]
asReversed: [5, 4, 3, 1000, 1, 0]
reversed: [5, 4, 3, 2, 3000, 0]
As you can see, changes to the original list are reflected in the asReversed list. Changes to the reversed list are not reflected anywhere else, and the reversed list does not pick up changes made to the original list either.
So, yes, you probably missed that scenario. The two lists (reversed and asReversed) aren't equivalent, not even if you have a read-only view as input, because no one can guarantee that the list isn't altered, neither the original nor the asReversed one.
To add to the other answers, this is an example of a general method-naming convention in Kotlin (and, to an extent, in Java):
toNoun() methods convert an object into a new form that's independent of the original.
asNoun() methods return a view onto the existing object; changes in that will be reflected in the view and (if appropriate) vice versa.
verb() methods mutate the object directly (and don't return anything).
verbed() methods return a mutated copy of the object, leaving the original unchanged.
This question provides examples of two of those cases.
These conventions are very lightweight, read well, and are used fairly consistently. After you've seen a few, you know almost instinctively how a method will behave!
(It's not universal, though; for example, map() is a verb, but doesn't mutate its receiver. However, methods like map() and filter() long pre-date Kotlin, and so it's arguably better to stick to well-known existing names like that.)
Their difference actually amazed me!
Okay so by browsing the std-lib this is what I've found.
The reversed function actually creates a copy of whatever the Iterable is as a MutableList and reverses the list "really" and then returns it.
public fun <T> Iterable<T>.reversed(): List<T> {
if (this is Collection && size <= 1) return toList()
val list = toMutableList()
list.reverse()
return list
}
When you call asReversed(), on the other hand, it doesn't create a new list by copying the elements.
It just creates an AbstractList implementation that delegates to the real list and overrides the getter.
public fun <T> List<T>.asReversed(): List<T> = ReversedListReadOnly(this)
private open class ReversedListReadOnly<out T>(private val delegate: List<T>) : AbstractList<T>() {
override val size: Int get() = delegate.size
override fun get(index: Int): T = delegate[reverseElementIndex(index)]
}
So there's no copying overhead: because the returned list is read-only, nothing else needs to be touched, and no new list has to be created or allocated. It simply uses the original list.
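To make the view-versus-copy distinction concrete, here is a minimal sketch in Python rather than Kotlin. The class name ReversedView is hypothetical; it only mimics the delegation idea behind ReversedListReadOnly and is not the Kotlin implementation.

```python
from collections.abc import Sequence


class ReversedView(Sequence):
    """Read-only reversed view that delegates to an underlying list (hypothetical name)."""

    def __init__(self, delegate):
        self._delegate = delegate  # keep a reference, do not copy

    def __len__(self):
        return len(self._delegate)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indices only")
        n = len(self._delegate)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError(index)
        # Map the index onto the original list, counted from the end.
        return self._delegate[n - 1 - index]


original = [0, 1, 2, 3, 4, 5]
view = ReversedView(original)        # comparable to asReversed()
copied = list(reversed(original))    # comparable to reversed()

original[0] = 10
print(list(view))  # [5, 4, 3, 2, 1, 10]
print(copied)      # [5, 4, 3, 2, 1, 0]
```

The view allocates nothing beyond the wrapper object, which is the trade-off described above: cheap to create, but its contents follow the original list.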
Found this example here https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/as-reversed.html
val original = mutableListOf('a', 'b', 'c', 'd', 'e')
val originalReadOnly = original as List<Char>
val reversedA = originalReadOnly.asReversed()
val reversedB = originalReadOnly.reversed()
println(original) // [a, b, c, d, e]
println(reversedA) // [e, d, c, b, a]
println(reversedB) // [e, d, c, b, a]
// changing the original list affects its reversed view
original.add('f')
println(original) // [a, b, c, d, e, f]
println(reversedA) // [f, e, d, c, b, a]
println(reversedB) // [e, d, c, b, a]
original[original.lastIndex] = 'z'
println(original) // [a, b, c, d, e, z]
println(reversedA) // [z, e, d, c, b, a]
println(reversedB) // [e, d, c, b, a]
It shows that a List can still change if it is backed by a MutableList that gets modified. Hence asReversed() still differs from reversed() in this case, because the List can be changed through the original MutableList.
Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
import spark.implicits._
case class ABC(a: Int, b: Int, c: Int)
val first = Seq(
ABC(1, 2, 3),
ABC(1, 3, 4),
ABC(2, 4, 5),
ABC(2, 5, 6)
).toDF("a", "b", "c")
val second = Seq(
(1, 2, (Date.valueOf("2018-01-02"), 30)),
(1, 3, (Date.valueOf("2018-01-01"), 20)),
(2, 4, (Date.valueOf("2018-01-02"), 50)),
(2, 5, (Date.valueOf("2018-01-01"), 60))
).toDF("a", "b", "c")
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b")).groupBy("a").agg(sort_array(collect_list("c2")))
.show(false)
}
Spark produces the following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark is sorting an array by date (since it is the first field), but I want to instruct Spark to sort by specific field from that nested struct.
I know I can reshape array to (value, date) but it seems inconvenient, I want a general solution (imagine I have a big nested struct, 5 layers deep, and I want to sort that structure by particular column).
Is there a way to do that? Am I missing something?
According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
This means that the array will be sorted lexicographically which holds true even with complex data types.
Alternatively, you can create a UDF to sort it (and witness performance degradation) based on the second element:
import org.apache.spark.sql.Row // needed for the Row pattern match below
val sortUdf = udf { (xs: Seq[Row]) => xs.sortBy(_.getAs[Int](1))
  .map { case Row(x: java.sql.Date, y: Int) => (x, y) } }
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(sortUdf(collect_list("c2")))
.show(false)
//+---+----------------------------------+
//|a |UDF(collect_list(c2, 0, 0)) |
//+---+----------------------------------+
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|
//+---+----------------------------------+
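The difference between the natural (lexicographic) ordering and sorting on the second field can be illustrated outside Spark. This is a plain-Python sketch using tuples in place of structs, with the values of group 2 from the question; it is an analogy, not Spark code.

```python
from datetime import date

rows = [(date(2018, 1, 2), 50), (date(2018, 1, 1), 60)]

# Natural ordering compares tuples field by field, so the date decides.
natural = sorted(rows)

# Sorting on the second field only, like the UDF above does.
by_value = sorted(rows, key=lambda row: row[1])

print(natural)
print(by_value)
```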
For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(collect_list("c2").alias("list"))
val df2 = df.withColumn(
"list",
expr(
"array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)"
)
)
df2.show(false)
//+---+------------------------------------+
//|a |list |
//+---+------------------------------------+
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
//+---+------------------------------------+
Where _2 is the name of the struct field you want to use for sorting.
If you have complex objects, it is much better to use a statically typed Dataset.
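The comparator contract quoted above (return -1, 0, or 1) is the same one used by comparator-based sorting elsewhere. As a rough analogy in plain Python, assuming tuples stand in for the structs and compare_second is a hypothetical helper name:

```python
from functools import cmp_to_key


def compare_second(left, right):
    """Return -1, 0, or 1 by comparing the second field of each element."""
    if left[1] < right[1]:
        return -1
    if left[1] > right[1]:
        return 1
    return 0


rows = [("2018-01-01", 60), ("2018-01-02", 50)]
ordered = sorted(rows, key=cmp_to_key(compare_second))
print(ordered)  # [('2018-01-02', 50), ('2018-01-01', 60)]
```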
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
.groupByKey(_.a)
.mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
.show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to use a udf, but in general this leads to inefficient and fragile code that quickly gets out of control as the complexity of the objects grows.
So, yet another problem using grouped DataFrames that I am getting so confused over...
I have defined an aggregation dictionary as:
aggregations_level_1 = {
'A': {
'mean': 'mean',
},
'B': {
'mean': 'mean',
},
}
And now I have two grouped DataFrames that I have aggregated using the above, then joined:
grouped_top = df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = df2.groupby(['group_lvl']).agg(aggregations_level_1)
Joining these:
df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10',
rsuffix='_low_10')
A_top_10 A_low_10 B_top_10 B_low_10
mean mean mean mean
group_lvl
a 3.711413 14.515901 3.711413 14.515901
b 4.024877 14.442106 3.694689 14.209040
c 3.694689 14.209040 4.024877 14.442106
Now, if I call index and columns I have:
print df3.index
>> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')
print df3.columns
>> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]])
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
Only A_top_10, B_top_10 for a and c?
I am pretty confused so any overall help would be great!
You need slicers, but first sort the columns with sort_index, otherwise you get an error:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
df = df.sort_index(axis=1)
idx = pd.IndexSlice
df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
print (df1)
A_low_10 A_top_10
mean mean
group_lvl
a 14.515901 3.711413
b 14.442106 4.024877
c 14.209040 3.694689
And:
idx = pd.IndexSlice
df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
print (df2)
A_top_10 B_top_10
mean mean
group_lvl
a 3.711413 3.711413
c 3.694689 4.024877
EDIT:
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
I think that is very close. It is more accurate to say that the DataFrame has a MultiIndex in its columns.
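As a rough sketch of the same selections in Python 3 with a recent pandas: the frame below is rebuilt by hand from the printed df3 output, so its construction is an assumption and not the asker's code. Passing lists of top-level labels instead of a slice should avoid the need to sort the columns first.

```python
import pandas as pd

# Hypothetical reconstruction of df3 from the values printed in the question.
columns = pd.MultiIndex.from_product(
    [["A_top_10", "A_low_10", "B_top_10", "B_low_10"], ["mean"]]
)
df3 = pd.DataFrame(
    [
        [3.711413, 14.515901, 3.711413, 14.515901],
        [4.024877, 14.442106, 3.694689, 14.209040],
        [3.694689, 14.209040, 4.024877, 14.442106],
    ],
    index=pd.Index(["a", "b", "c"], name="group_lvl"),
    columns=columns,
)

# A list of top-level labels selects those column groups for all rows.
a_cols = df3[["A_top_10", "A_low_10"]]

# Rows and columns together: lists of labels on both axes with .loc.
top_ac = df3.loc[["a", "c"], ["A_top_10", "B_top_10"]]

# A full (level 0, level 1) tuple addresses one column and returns a Series.
a_top_mean = df3[("A_top_10", "mean")]

print(a_cols)
print(top_ac)
```

The result keeps both column levels; a_top_mean is a plain Series indexed by group_lvl.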
Is there a simple way to divide a list into parts (maybe with some lambda) in Kotlin?
For example:
[1, 2, 3, 4, 5, 6] => [[1, 2], [3, 4], [5, 6]]
Since Kotlin 1.2 you can use Iterable<T>.chunked(size: Int): List<List<T>> function from stdlib (https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/chunked.html).
Given the list: val list = listOf(1, 2, 3, 4, 5, 6) you can use groupBy:
list.groupBy { (it + 1) / 2 }.map { it.value }
Or if your values are not numbers you can first assign an index to them:
list.withIndex()
.groupBy { it.index / 2 }
.map { it.value.map { it.value } }
Or if you'd like to save some allocations you can go a bit more manual way with foldIndexed:
list.foldIndexed(ArrayList<ArrayList<Int>>(list.size / 2)) { index, acc, item ->
if (index % 2 == 0) {
acc.add(ArrayList(2))
}
acc.last().add(item)
acc
}
The better answer is actually the one authored by VasyaFromRussia.
If you use groupBy, you have to add an index and then post-process the result to extract the value from each IndexedValue object.
If you use chunked, you simply need to write:
val list = listOf(10, 2, 3, 4, 5, 6)
val chunked = list.chunked(2)
println(chunked)
This prints out:
[[10, 2], [3, 4], [5, 6]]
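For readers who want to see what fixed-size chunking amounts to, here is a minimal sketch in Python. The function name chunked is borrowed for illustration only; it mirrors the behaviour described above (a shorter last chunk when the size does not divide evenly) and is not the Kotlin implementation.

```python
def chunked(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunked([10, 2, 3, 4, 5, 6], 2))  # [[10, 2], [3, 4], [5, 6]]
print(chunked([1, 2, 3, 4, 5], 2))      # [[1, 2], [3, 4], [5]]
```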
A nice way of dividing a list is the partition function. Unlike groupBy, it divides the list not by keys but by a predicate, and returns a Pair<List, List> as a result.
Here's an example:
val (favorited, rest) = posts.partition { post ->
post.isFavorited()
}
favoritedList.addAll(favorited)
postsList.addAll(rest)
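The idea behind partition, one pass that sorts each element into one of two lists, can be sketched in Python as follows. The helper name partition and the sample data are assumptions made for illustration.

```python
def partition(items, predicate):
    """Return (matching, rest): elements for which predicate is true, then the others."""
    matching, rest = [], []
    for item in items:
        (matching if predicate(item) else rest).append(item)
    return matching, rest


evens, odds = partition([1, 2, 3, 4, 5, 6], lambda n: n % 2 == 0)
print(evens)  # [2, 4, 6]
print(odds)   # [1, 3, 5]
```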
The API has a groupBy function, which should do what you want.
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/group-by.html
Or use subList and break the list up yourself:
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-list/sub-list.html
If you want to divide a list into N parts.
(and not divide a list into parts of size N)
You can still use the chunked answer:
https://stackoverflow.com/a/48400664/413127
Only, first you need to find your chunk size.
val parts = 2
val list = listOf(10, 2, 3, 4, 5, 6)
// Ceiling division, so that no more than `parts` chunks are produced
val chunkSize = (list.size + parts - 1) / parts
val chunked = list.chunked(chunkSize)
println(chunked)
This prints out
[[10, 2, 3], [4, 5, 6]]
or when
val parts = 3
This prints out
[[10, 2], [3, 4], [5, 6]]
Interesting answer in Python here: Splitting a list into N parts of approximately equal length
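Note that a chunk-size approach can return fewer than N parts for some inputs: 9 items with a chunk size of 3 give 3 chunks, even if 4 parts were requested. A minimal Python sketch that always returns exactly N parts of near-equal length is shown below. It is one possible approach and not necessarily the one in the linked answer; the function name split_into_parts is hypothetical.

```python
def split_into_parts(items, parts):
    """Split items into exactly `parts` consecutive slices whose lengths differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(items), parts)
    result = []
    start = 0
    for i in range(parts):
        # The first `extra` slices each take one additional element.
        end = start + base + (1 if i < extra else 0)
        result.append(items[start:end])
        start = end
    return result


print(split_into_parts([10, 2, 3, 4, 5, 6], 2))     # [[10, 2, 3], [4, 5, 6]]
print(split_into_parts([10, 2, 3, 4, 5, 6, 7], 3))  # [[10, 2, 3], [4, 5], [6, 7]]
```

If the list has fewer elements than parts, the trailing slices are empty lists.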