I am writing a model to calculate the maximum production capacity of a machine over a year based on 15-minute data. Since the maximum capacity is not the sum of the required capacity for every 15-minute interval over the year, I want to write a piece of code that finds the maximum value in the list and then adds that maximum value and the three consecutive values that follow it to a new variable. A simplified example would be:
fifteen_min_capacity = [10, 12, 3, 4, 8, 12, 10, 9, 2, 10, 4, 3, 15, 8, 9, 3, 4, 10]
The piece of code I want to write would determine the maximum capacity in this list (15) and then add this capacity plus the three consecutive values after it (8, 9, 3) to a new variable:
hourly_capacity = 35
Does anyone know the code that would give this output?
I have tried using max(), sum(), and a combination of both. However, I cannot get it to work. Any help would be much appreciated!
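For reference, here is a minimal sketch of one way to do this with index() and slicing; it assumes the maximum occurs once (index() returns the first occurrence) and simply sums fewer values if the maximum happens to fall within the last three entries:

fifteen_min_capacity = [10, 12, 3, 4, 8, 12, 10, 9, 2, 10, 4, 3, 15, 8, 9, 3, 4, 10]

# position of the (first) maximum, then sum it together with the next three values
max_index = fifteen_min_capacity.index(max(fifteen_min_capacity))
hourly_capacity = sum(fifteen_min_capacity[max_index:max_index + 4])

print(hourly_capacity)  # 15 + 8 + 9 + 3 = 35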
In Dask, or even pandas, how would you go about grouping a Dask DataFrame that has 3 columns of time / level / spread into a set of fixed ranges by time?
Time only moves in one direction, like a loop counting up. The end result would be start time and end time along with the high of level, low of level, first value of level, and last value of level over each fixed range. Example:
12:00:00, 10, 1
12:00:01, 11, 1
12:00:02, 12, 1
12:00:03, 11, 1
12:00:04, 9, 1
12:00:05, 6, 1
12:00:06, 10, 1
12:00:07, 14, 1
12:00:08, 11, 1
12:00:09, 7, 1
12:00:10, 13, 1
12:00:11, 8, 1
The fixed level range is 7, so within each bin the level cannot move more than 7 in total distance from high to low. The duration of each bin does not matter: in the example the first bin happens to span 8 seconds and the second only 2, but they could just as well have been 5 and 200. All that matters is that the high-to-low distance does not go past 7, the fixed bin size. So the first few rows in Dask would look something like this:
First Time, Last Time, High Level, Low Level, First Level, Last Level, Spread
12:00:00, 12:00:07, 13, 6, 10, 13, 1
12:00:07, 12:00:09, 14, 7, 13, 7, 1
12:00:09, X, 13, 7, X, X, X
How could this be aggregated in Dask with a fixed window of level moving forward in time, closing a bin each time the level moves beyond the fixed high-to-low range X?
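To make the intended logic concrete, here is a rough plain-pandas sketch of the sequential pass described above. The column names, the bin size of 7, and the boundary handling when a row breaks the range are all my assumptions, so it will not reproduce the sample output exactly, but it shows the shape of the computation. Because the binning is inherently sequential, in Dask you would likely need an ordered pass rather than a plain groupby.

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "time": ["12:00:%02d" % s for s in range(12)],
    "level": [10, 11, 12, 11, 9, 6, 10, 14, 11, 7, 13, 8],
    "spread": 1,
})

RANGE = 7            # fixed high-to-low distance allowed inside one bin
bins, start = [], 0
high = low = df["level"].iloc[0]

for i, level in enumerate(df["level"]):
    if max(high, level) - min(low, level) > RANGE:
        # this row would break the range: close the current bin at the previous row
        bins.append((df["time"].iloc[start], df["time"].iloc[i - 1], high, low,
                     df["level"].iloc[start], df["level"].iloc[i - 1]))
        start, high, low = i, level, level
    else:
        high, low = max(high, level), min(low, level)

# close whatever bin is still open at the end of the data
bins.append((df["time"].iloc[start], df["time"].iloc[-1], high, low,
             df["level"].iloc[start], df["level"].iloc[-1]))

out = pd.DataFrame(bins, columns=["first_time", "last_time", "high_level",
                                  "low_level", "first_level", "last_level"])
print(out)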
I have made a selection from a huge number of IDs, using the following query:
select ID from [tabelname] where id > 0 and id < 31
This gives me 30 IDs to work with.
What I would like to do now is to use 3 threads, with the first one using IDs 1, 4, 7, 10, etc., the second IDs 2, 5, 8, 11, etc., and the third IDs 3, 6, 9, 12, etc.
Up until now, I have only been able to have all threads work through IDs 1 to 30 in parallel. Would it be at all possible to do this?
Thanks in advance!
JMeter has a built-in operation that you can use in combination with a pre-processor to find the current thread number:
https://jmeter.apache.org/api/org/apache/jmeter/threads/JMeterContext.html#getThreadNum()
If you now use ctx.getThreadGroup().getNumThreads() to find the number of threads you're using, you can basically divide your test set into subsets and let each thread compute on its own subset (e.g. thread1 computes on 0..10, thread2 on 11..20, thread3 on 21..30, etc.).
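As a rough illustration of the interleaved split the question asks for, the selection rule is just a modulo on the ID. This is a plain-Python sketch of the logic only; in JMeter itself it would live in a JSR223/BeanShell pre-processor that reads the thread number from ctx.getThreadNum():

ids = list(range(1, 31))    # the 30 IDs returned by the query
num_threads = 3

for thread_num in range(num_threads):     # stand-in for ctx.getThreadNum()
    my_ids = [i for i in ids if (i - 1) % num_threads == thread_num]
    print(thread_num + 1, my_ids)
# thread 1 -> 1, 4, 7, ...  thread 2 -> 2, 5, 8, ...  thread 3 -> 3, 6, 9, ...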
I have a data frame which looks somewhat like this:
endPoint power time
device1 -4 0
device2 3 0
device3 -2 0
device4 0 0
device5 5 0
device6 -5 0
device1 4 1
device2 -3 1
device3 5 1
device4 -2 1
device5 1 1
device6 4 1
....
device1 6 x
device2 -5 x
device3 4 x
device4 3 x
device5 -1 x
device6 1 x
I want to change it into something like this:
span powerAboveThreshold time
device1-device3 true 0
device2-device6 true 0
...
devicex-devicey false w
I want to aggregate rows into two new columns and group this by time and span. The value of powerAboveThreshold depends on whether the power for each device in the span is above 0 - so if devicex or devicey is not above 0 then it will be false.
As a side-note, there is one span of devices which contains 4 devices - whereas the rest contain just 2 devices. I need to design with this in mind.
device1-device3-device6-device2
I am using the Apache Spark DataFrame API/Spark SQL to accomplish this.
edit:
Could I convert the DataFrame to an RDD and compute it that way?
edit2:
Follow-up questions to Daniel L:
Seems like a great answer from what I have understood so far. I have a few questions:
Will the RDD have the expected structure if I convert it from a DataFrame?
What is going on in this part of the program? .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults). I see that it runs aggregateSample() with each key-value pair (result, sample), but how does the addResults call work? Will it be called on each item relating to a key to add each successive Result generated by aggregateSample to the previous ones? I don't fully understand.
What is .map(_._2) doing?
In what situation will result.span be empty in the aggregateSample function?
In what situation will res1.span be empty in the addResults function?
Sorry for all of the questions, but I'm new to functional programming, Scala, and Spark so I have a lot to wrap my head around!
I'm not sure you can do the text concatenation as you want on DataFrames (maybe you can), but on a normal RDD you can do this:
val rdd = sc.makeRDD(Seq(
  ("device1", -4, 0),
  ("device2", 3, 0),
  ("device3", -2, 0),
  ("device4", 0, 0),
  ("device5", 5, 0),
  ("device6", -5, 0),
  ("device1", 4, 1),
  ("device2", -3, 1),
  ("device3", 5, 1),
  ("device4", 1, 1),
  ("device5", 1, 1),
  ("device6", 4, 1)))

val spanMap = Map(
  "device1" -> 1,
  "device2" -> 1,
  "device3" -> 1,
  "device4" -> 2,
  "device5" -> 2,
  "device6" -> 1
)

case class Result(var span: String = "", var aboveThreshold: Boolean = true, var time: Int = -1)

// Fold one sample into the running Result for its (time, span) key
def aggregateSample(result: Result, sample: (String, Int, Int)) = {
  result.time = sample._3
  result.aboveThreshold = result.aboveThreshold && (sample._2 > 0)
  if (result.span.isEmpty)
    result.span += sample._1
  else
    result.span += "-" + sample._1
  result
}

// Merge two partial Results produced for the same key
def addResults(res1: Result, res2: Result) = {
  res1.aboveThreshold = res1.aboveThreshold && res2.aboveThreshold
  if (res1.span.isEmpty)
    res1.span += res2.span
  else
    res1.span += "-" + res2.span
  res1
}

val results = rdd
  .map(x => (x._3, spanMap.getOrElse(x._1, 0)) -> x) // Create a key to aggregate with, by time and span
  .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults)
  .map(_._2)

results.collect().foreach(println(_))
And it prints this, which is what I understood you needed:
Result(device4-device5,false,0)
Result(device4-device5,true,1)
Result(device1-device2-device3-device6,false,0)
Result(device1-device2-device3-device6,false,1)
Here I use a map that tells me which devices go together (for your pairings and the 4-device exception); you might want to replace it with some other function, hardcode it as a static function to avoid serialization, or use a broadcast variable.
=================== Edit ==========================
Seems like a great answer from what I have understood so far.
Feel free to upvote/accept it; it helps me and others looking for things to answer :-)
Will the RDD have the expected structure if I convert it from a DataFrame?
Yes. The main difference is that a DataFrame includes a schema, so it can better optimize the underlying calls. It should be trivial to use this schema directly or to map to the tuples I used as an example; I did it mostly for convenience. Hernan just posted another answer that shows some of this (and also copied the initial test data I used, for convenience), so I won't repeat that piece, but as he mentions, your device-span grouping and presentation is tricky, which is why I preferred a more imperative approach on an RDD.
What is going on in this part of the program? .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults). I see that it runs aggregateSample() with each key-value pair (result, sample), but how does the addResults call work? Will it be called on each item relating to a key to add each successive Result generated by aggregateSample to the previous ones? I don't fully understand.
aggregateByKey is a very efficient function. To avoid shuffling all the data from one node to another only to merge it later, it first does a local aggregation of samples into a single result per key (the first function). Then it shuffles these partial results around and adds them up (the second function).
What is .map(_._2) doing?
It simply discards the key from the key/value RDD after the aggregation; you no longer care about it, so I just keep the result.
In what situation will result.span be empty in the aggregateSample function?
In what situation will res1.span be empty in the addResults function?
When you do an aggregation, you need to provide a "zero" value. For instance, if you were aggregating numbers, Spark would compute (0 + firstValue) + secondValue, etc. The if clause prevents adding a spurious '-' before the first device name, since we only put it between devices. It's no different from dealing with, for instance, an extra comma in a list of items. Check the documentation and samples for aggregateByKey; it will help you a lot.
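To make the zero-value mechanics concrete with plain numbers, here is the same aggregateByKey shape as a tiny PySpark sketch (the same API exists in Python; sc is assumed to be an existing SparkContext):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])

sums = pairs.aggregateByKey(
    0,                        # the "zero" value each partition starts from
    lambda acc, v: acc + v,   # fold one value into the local, per-partition result
    lambda a, b: a + b)       # merge partial results from different partitions

print(sums.collect())         # [('a', 3), ('b', 5)] (ordering may differ)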
This is the implementation for DataFrames (without the concatenated names):
val data = Seq(
  ("device1", -4, 0),
  ("device2", 3, 0),
  ("device3", -2, 0),
  ("device4", 0, 0),
  ("device5", 5, 0),
  ("device6", -5, 0),
  ("device1", 4, 1),
  ("device2", -3, 1),
  ("device3", 5, 1),
  ("device4", 1, 1),
  ("device5", 1, 1),
  ("device6", 4, 1)).toDF("endPoint", "power", "time")

val mapping = Seq(
  "device1" -> 1,
  "device2" -> 1,
  "device3" -> 1,
  "device4" -> 2,
  "device5" -> 2,
  "device6" -> 1).toDF("endPoint", "span")

data.as("A").
  join(mapping.as("B"), $"B.endpoint" === $"A.endpoint", "inner").
  groupBy($"B.span", $"A.time").
  agg(min($"A.power" > 0).as("powerAboveThreshold")).
  show()
Concatenated names are quite a bit harder; this requires you either to write your own UDAF (supported in the next version of Spark) or to use a combination of Hive functions.
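As a pointer only: in later Spark versions the Hive-style aggregates collect_list and concat_ws make the concatenated names fairly painless. A hedged PySpark sketch, assuming data and mapping are Python equivalents of the DataFrames above and that your Spark build exposes these functions:

from pyspark.sql import functions as F

result = (data.alias("A")
    .join(mapping.alias("B"), F.col("A.endPoint") == F.col("B.endPoint"))
    .groupBy(F.col("B.span"), F.col("A.time"))
    .agg(F.min(F.col("A.power") > 0).alias("powerAboveThreshold"),
         # build "device1-device2-..." from the devices that fall into each group
         F.concat_ws("-", F.collect_list(F.col("A.endPoint"))).alias("spanName")))
result.show()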
Let's say we are dealing with the keys 1-15. To get the worst-case performance of a regular BST, you would insert the keys in ascending or descending order, as follows:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Then the BST would essentially become a linked list.
For the best case of a BST you would insert the keys in the following order; they are arranged so that each key inserted is the midpoint of the range that remains to be inserted, so the first is 8 (the middle of 1-15), then 4 and 12 (the middles of the two halves), and so on:
8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15
Then the BST would be a well balanced tree with optimal height 3.
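A quick plain-Python sketch (an unbalanced BST, nothing more) that verifies the two insertion orders above give height 14 and height 3 respectively:

def insert(node, key):
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def height(node):
    return -1 if node is None else 1 + max(height(node["left"]), height(node["right"]))

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

worst = list(range(1, 16))                                   # 1, 2, ..., 15 in order
best = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]   # midpoints first

print(height(build(worst)))   # 14 -- effectively a linked list
print(height(build(best)))    # 3  -- a perfectly balanced tree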
The best case for a red black tree can also be constructed with the best case from a BST. But how do we construct the worst case for a red black tree? Is it the same as the worst case for a BST? Is there a specific pattern that will yield the worst case?
You are looking for a skinny tree, right? This can be produced by inserting [1 ... , 2^(n+1)-2] in reverse order.
You won't be able to. A Red-Black Tree keeps itself "bushy", so it would rotate to fix the imbalance. The length of your worst case above is limited to two elements for a Red-Black Tree, but that's still not a "bad" case; it's what's expected, since lg(2) = 1 and you have one layer past the root with two elements. As soon as you add the third element, you get this:
B              B
 \            / \
  R     =>   R   R
   \
    R