Let's say I have 3 different tables: Foo, Bar, and Baz. Each table has the same structure: a timestamp and a data value. We can also assume that each table is synchronized at the top row.
Foo               Bar               Baz
_____________     _____________     _____________
|Time |Value|     |Time |Value|     |Time |Value|
|1:00 |0    |     |1:00 |10   |     |1:00 |100  |
|1:15 |1    |     |1:10 |11   |     |1:20 |101  |
|1:30 |2    |     |1:40 |12   |     |1:50 |102  |
|1:45 |3    |     |1:50 |13   |     |1:55 |103  |
Is there a simple way to assemble these records into a single view where the value of each column is assumed to be the last known value, carried forward to populate the times that table does not provide?
_____________________________________
|Time |Foo.Value|Bar.Value|Baz.Value|
|1:00 |        0|       10|      100|
|1:10 |        0|       11|      100|
|1:15 |        1|       11|      100|
|1:20 |        1|       11|      101|
|1:30 |        2|       11|      101|
|1:40 |        2|       12|      101|
|1:45 |        3|       12|      101|
|1:50 |        3|       13|      102|
|1:55 |        3|       13|      103|
Edit:
What if I wanted to select a time range, but wished to have the last known value of each column brought forward into it? Is there a simple way to do so without producing the entire table and then filtering it down?
e.g. if I wanted records from 1:17 to 1:48, I would want the following...
_____________________________________
|Time |Foo.Value|Bar.Value|Baz.Value|
|1:20 |        1|       11|      101|
|1:30 |        2|       11|      101|
|1:40 |        2|       12|      101|
|1:45 |        3|       12|      101|
SQL Server 2008 doesn't support lag(), much less lag() with ignore nulls. So, I think the easiest way may be with correlated subqueries. Get all the times from the three tables and then populate the values:
select fbb.time,
       (select top 1 value from foo t where t.time <= fbb.time order by t.time desc) as foo,
       (select top 1 value from bar t where t.time <= fbb.time order by t.time desc) as bar,
       (select top 1 value from baz t where t.time <= fbb.time order by t.time desc) as baz
from (select time from foo union
      select time from bar union
      select time from baz
     ) fbb;
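Regarding the Edit: the same pattern should handle a range without materializing the whole view first. A minimal sketch (untested, assuming the Time column compares correctly against these literals): filter only the driving list of times, and leave the correlated lookups unbounded below so they can still reach back past 1:17 for the carried-forward values.
select fbb.time,
       (select top 1 value from foo t where t.time <= fbb.time order by t.time desc) as foo,
       (select top 1 value from bar t where t.time <= fbb.time order by t.time desc) as bar,
       (select top 1 value from baz t where t.time <= fbb.time order by t.time desc) as baz
from (select time from foo union
      select time from bar union
      select time from baz
     ) fbb
-- only rows inside the window are produced; each lookup may still match an earlier time
where fbb.time between '1:17' and '1:48';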
EDIT:
An alternative approach uses aggregation:
select time, max(foo) as foo, max(bar) as bar, max(baz) as baz
from (select time, value as foo, NULL as bar, NULL as baz from foo union all
select time, NULL, value, NULL from bar union all
select time, NULL, NULL, value from baz
) fbb
group by time
order by time;
This probably has better performance than the first method, because it scans each table once rather than running three correlated subqueries per output row.
Here is another alternative solution, since you are using SQL Server 2008:
SELECT *
FROM (
    SELECT t, [time], value
    FROM ( SELECT 'Foo' as t, * FROM #Foo
           UNION ALL
           SELECT 'Bar' as t, * FROM #Bar
           UNION ALL
           SELECT 'Baz' as t, * FROM #Baz
         ) un
    WHERE [time] BETWEEN '1:17' AND '1:48'
) AS fbb
PIVOT (MAX(value) FOR [t] IN ([Foo], [Bar], [Baz])) pvt
I have two tables. Table t1 defines the metadata, i.e. the attribute values an ideal transaction should contain. It also defines the order of importance of the attributes through the order of records in the array: the first record is the most important and has a weight of 1, the 2nd 0.9, the 3rd 0.8, the 4th 0.7, and so on. Anything beyond the 10th is of least importance. I need to find the quality of the data filled in transaction table t2: the percentage of attributes filled, and their quality rank.
t1
------------------------------------
| a_id | attribute_values |
------------------------------------
| 12345 | ["a1", "a2", "a3", "a5"] |
| 6789 | ["b1", "b4", "b7"] |
------------------------------------
t2
------------------------------------
| b_id | a_id | attribute_values|
------------------------------------
| B123 | 12345 | ["a2", "a5"] |
| B456 | 6789  | ["b1", "b7"]     |
-------------------------------------
I am looking for a way to calculate the quality rank for my t2 records as below:
------------------------------------------
| b_id | percent_complete | quality_rank |
------------------------------------------
| B123 | 50               | 0.4          |
| B456 | 66.66            | 0.6          |
------------------------------------------
B123 - (2 out of 4) 50% complete. quality rank - (0.9+0.7)/4 = 0.4
B456 - (2 out of 3) 66.66% complete. quality rank - (1+0.8)/3 = 0.6
Solved it by exploding both tables, calculating the weight and rank for the first table, and then joining with the other table. I was not able to do it in a single SQL statement, though.
scala> val t1 = Seq((12345, List("a1", "a2", "a3", "a5")), (6789, List("b1", "b5", "b7"))).toDF("a_id", "attribute_values")
scala> val t2 = Seq(("B123", 12345, List("a2", "a5")), ("B456", 6789, List("b1", "b7"))).toDF("b_id","a_id", "attribute_values")
scala> val t1_1 = t1.select($"a_id", posexplode($"attribute_values"))
scala> t1_1.show
+-----+---+---+
| a_id|pos|col|
+-----+---+---+
|12345| 0| a1|
|12345| 1| a2|
|12345| 2| a3|
|12345| 3| a5|
| 6789| 0| b1|
| 6789| 1| b5|
| 6789| 2| b7|
+-----+---+---+
scala> t1_1.createOrReplaceTempView("tab_t1_1")
scala> spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1").show
+-----+---+---+-----------+---------+
| a_id|pos|col|calc_weight|rec_count|
+-----+---+---+-----------+---------+
| 6789| 0| b1| 1.0| 3|
| 6789| 1| b5| 0.9| 3|
| 6789| 2| b7| 0.8| 3|
|12345| 0| a1| 1.0| 4|
|12345| 1| a2| 0.9| 4|
|12345| 2| a3| 0.8| 4|
|12345| 3| a5| 0.7| 4|
+-----+---+---+-----------+---------+
scala> val t1_2 = spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1")
scala> t1_2.createOrReplaceTempView("tab_t1_2")
scala> val t2_1 = t2.select($"b_id", $"a_id", explode($"attribute_values"))
scala> t2_1.show
+----+-----+---+
|b_id| a_id|col|
+----+-----+---+
|B123|12345| a2|
|B123|12345| a5|
|B456| 6789| b1|
|B456| 6789| b7|
+----+-----+---+
scala> t2_1.createOrReplaceTempView("tab_t2_1")
scala> spark.sql("Select b_id, t1.a_id, round(count(t2.col)*100/max(t1.rec_count),2) as percent_complete, round(sum(t1.calc_weight)/ max(t1.rec_count),2) as quality_rank from tab_t1_2 t1, tab_t2_1 t2 where t1.a_id = t2.a_id and t1.col = t2.col group by b_id, t1.a_id").show
+----+-----+----------------+------------+
|b_id| a_id|percent_complete|quality_rank|
+----+-----+----------------+------------+
|B123|12345| 50.0| 0.40|
|B456| 6789| 66.67| 0.60|
+----+-----+----------------+------------+
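For reference, the individual steps can probably be folded into one Spark SQL statement using LATERAL VIEW; a sketch (untested, with tab_t1 and tab_t2 as hypothetical view names for t1 and t2):
scala> t1.createOrReplaceTempView("tab_t1")
scala> t2.createOrReplaceTempView("tab_t2")
scala> spark.sql("""
  SELECT t2.b_id, t1.a_id,
         round(count(t2.col) * 100 / max(t1.rec_count), 2) AS percent_complete,
         round(sum(t1.calc_weight) / max(t1.rec_count), 2) AS quality_rank
  FROM (SELECT a_id, col, 1 - (pos * 0.1) AS calc_weight,   -- same weight formula as above
               count(col) OVER (PARTITION BY a_id) AS rec_count
        FROM tab_t1 LATERAL VIEW posexplode(attribute_values) ex AS pos, col) t1
  JOIN (SELECT b_id, a_id, col
        FROM tab_t2 LATERAL VIEW explode(attribute_values) ex AS col) t2
    ON t1.a_id = t2.a_id AND t1.col = t2.col
  GROUP BY t2.b_id, t1.a_id
""").show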
I have a production table in Hive which gets incremental data (changed or new records) from an external source on a daily basis. Values for a row may arrive spread across different dates. For example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
|  2|    | b2|
| 3| a3| |
+---+----+----+
which has new records as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing the following output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
|  2| a2| b2|
|  3| a3| b3|
|  4| a4|   |
+---+----+----+
The number of columns is pretty huge, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(p.s. it doesn't have to be sorted)
This can be achieved using COALESCE and a full outer join.
SELECT COALESCE(a.id, b.id)     AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM tbl1 a
FULL OUTER JOIN tbl2 b
  ON a.id = b.id
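Since the real table has 100-150 columns and is refreshed daily, here is a hedged sketch of how that might be operationalized (prod_table, daily_delta, and prod_table_next are hypothetical names): put the delta on the first COALESCE argument so changed values win, materialize the merge into a staging table, then swap it in, since reading and overwriting the same Hive table in one statement is unsafe. The long COALESCE list is usually generated from the column metadata rather than typed by hand.
-- Hypothetical daily merge; the delta goes first in each COALESCE so new values win.
CREATE TABLE prod_table_next AS
SELECT COALESCE(d.id,   p.id)   AS id,
       COALESCE(d.col1, p.col1) AS col1,
       COALESCE(d.col2, p.col2) AS col2
       -- ...one COALESCE per remaining column, typically generated from the schema
FROM prod_table p
FULL OUTER JOIN daily_delta d
  ON p.id = d.id;
-- then drop/rename prod_table_next into place before the next day's run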
I have some data like this:
a,timestamp,list,rid,sbid,avgvalue
1,1011,1001,4,4,1.20
2,1000,819,2,3,2.40
1,1011,107,1,3,5.40
1,1021,819,1,1,2.10
In the data above I want to find, for each tag (a) and timestamp, the row with the highest avgvalue. Like this:
For timestamp 1011 and a = 1:
1,1011,1001,4,4,1.20
1,1011,107,1,3,5.40
The output would be:
1,1011,107,1,3,5.40 //because for timestamp 1011 and tag 1 the highest avg value is 5.40
So I need to pick this row.
I tried this statement, but still it does not work properly:
val highvaluetable = df.registerTempTable("high_value")
val highvalue = sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from high_value")
highvalue.select($"a", $"timestamp", $"list", $"rid", $"sbid", $"avgvalue".cast(IntegerType).as("higher_value"))
  .groupBy("a", "timestamp")
  .max("higher_value")
highvalue.collect.foreach(println)
Any help will be appreciated.
After I applied some of your suggestions, I am still getting duplicates in my data.
+---+----------+----+---+----+--------+
|  a| timestamp|list|rid|sbid|avgvalue|
+---+----------+----+---+----+--------+
|  4|1496745915| 718|  4|   3|    0.30|
|  4|1496745918| 362|  4|   3|    0.60|
|  4|1496745913| 362|  4|   3|    0.60|
|  2|1496745918| 362|  4|   3|    0.10|
|  3|1496745912| 718|  4|   3|    0.05|
|  2|1496745918| 718|  4|   3|    0.30|
|  4|1496745911|1901|  4|   3|    0.60|
|  4|1496745912| 718|  4|   3|    0.60|
|  2|1496745915| 362|  4|   3|    0.30|
|  2|1496745915|1901|  4|   3|    0.30|
|  2|1496745910|1901|  4|   3|    0.30|
|  3|1496745915| 362|  4|   3|    0.10|
|  4|1496745918|3878|  4|   3|    0.10|
|  4|1496745915|1901|  4|   3|    0.60|
|  4|1496745912| 362|  4|   3|    0.60|
|  4|1496745914|1901|  4|   3|    0.60|
|  4|1496745912|3878|  4|   3|    0.10|
|  4|1496745912| 718|  4|   3|    0.30|
|  3|1496745915|3878|  4|   3|    0.05|
|  4|1496745914| 362|  4|   3|    0.60|
+---+----------+----+---+----+--------+
4|1496745918| 362| 4| 3|0.60|
4|1496745918|3878| 4| 3|0.10|
The same timestamp with the same tag; rows like these are what I consider duplicates.
This is my code:
rdd.createTempView("v1")
val rdd2=sqlContext.sql("select max(avgvalue) as max from v1 group by (a,timestamp)")
rdd2.createTempView("v2")
val rdd3=sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from v1 join v2 on v2.max=v1.avgvalue").show()
You can use the DataFrame API to find the max as below:
df.groupBy("timestamp").agg(max("avgvalue"))
this will give you output as
+---------+-------------+
|timestamp|max(avgvalue)|
+---------+-------------+
|1021 |2.1 |
|1000 |2.4 |
|1011 |5.4 |
+---------+-------------+
which doesn't include the other fields you require, so you can use first as
df.groupBy("timestamp").agg(max("avgvalue") as "avgvalue", first("a") as "a", first("list") as "list", first("rid") as "rid", first("sbid") as "sbid")
you should have output as
+---------+--------+---+----+---+----+
|timestamp|avgvalue|a |list|rid|sbid|
+---------+--------+---+----+---+----+
|1021 |2.1 |1 |819 |1 |1 |
|1000 |2.4 |2 |819 |2 |3 |
|1011 |5.4 |1 |1001|4 |4 |
+---------+--------+---+----+---+----+
The above solution would still not give you correct row-wise output, so you can use a window function and select the correct row as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("timestamp").orderBy("a")
df.withColumn("newavg", max("avgvalue") over windowSpec)
.filter(col("newavg") === col("avgvalue"))
.drop("newavg").show(false)
This will give row-wise correct data as
+---+---------+----+---+----+--------+
|a |timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
|1 |1021 |819 |1 |1 |2.1 |
|2 |1000 |819 |2 |3 |2.4 |
|1 |1011 |107 |1 |3 |5.4 |
+---+---------+----+---+----+--------+
You can use groupBy and find the max value for that particular group as
//If you have the dataframe as df then
df.groupBy("a", "timestamp").agg(max($"avgvalue").alias("maxAvgValue"))
Hope this helps
I saw the above answers. Below is another one which you can try as well:
val sqlContext = new SQLContext(sc)
case class Tags(a: Int, timestamp: Int, list: Int, rid: Int, sbid: Int, avgvalue: Double)
val rdd = sc.textFile("file:/home/hdfs/stackOverFlow")
  .map(x => x.split(","))
  .map(x => Tags(x(0).toInt, x(1).toInt, x(2).toInt, x(3).toInt, x(4).toInt, x(5).toDouble))
  .toDF
rdd.createTempView("v1")
val rdd2=sqlContext.sql("select max(avgvalue) as max from v1 group by (a,timestamp)")
rdd2.createTempView("v2")
val rdd3=sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from v1 join v2 on v2.max=v1.avgvalue").show()
Output:
+---+---------+----+---+----+--------+
| a|timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
| 2| 1000| 819| 2| 3| 2.4|
| 1| 1011| 107| 1| 3| 5.4|
| 1| 1021| 819| 1| 1| 2.1|
+---+---------+----+---+----+--------+
All the other solutions provided here did not give me the correct answer, so this is what worked for me with row_number():
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", row_number().over( windowSpec ))
.filter($"largest_avgvalue" === 1)
.drop("largest_avgvalue")
The other solutions had the following problems in my tests:
The solution with .agg( max(x).as(x), first(y).as(y), ... ) doesn't work because the first() function "will return the first value it sees" according to the documentation, which means it is non-deterministic.
The solution with .withColumn("x", max("y") over windowSpec.orderBy("m")) doesn't work because the result of the max will be the same as the value already selected for the row. I believe the problem there is the orderBy().
Hence, the following also gives the correct answer, with max():
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", max("avgvalue").over( windowSpec ))
.filter($"largest_avgvalue" === $"avgvalue")
.drop("largest_avgvalue")
Another time, another problem. I have the following table:
|assemb.|Repl_1|Repl_2|Repl_3|Repl_4|Repl_5|Amount_1|Amount_2|Amount_3|Amount_4|Amount_5|
|---------------------------------------------------------------------------------------|
|4711001|111000|222000|333000|444000|555000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
|4711002|222000|333000|444000|555000|666000| 1| 1| 1| 1| 1|
|---------------------------------------------------------------------------------------|
And here what I need:
|Article|Amount|
|--------------|
| 111000| 1|
|--------------|
| 222000| 2|
|--------------|
| 333000| 2|
|--------------|
| 444000| 2|
|--------------|
| 555000| 2|
|--------------|
| 666000| 1|
|--------------|
Repl_1 to Repl_10 are replacement articles of the assembly. I can have n assemblies with up to 10 replacement articles. At the end I need an overview of all articles with their amounts across all assemblies.
THX.
Best greetz
Vegeta
This is probably the quickest way of achieving it, using UNION ALL. However, I'd recommend normalising your table.
SELECT Article, SUM(Amount) AS Amount FROM (
SELECT Repl_1 AS Article, SUM(Amount_1) AS Amount FROM #Test GROUP BY Repl_1
UNION ALL
SELECT Repl_2 AS Article, SUM(Amount_2) AS Amount FROM #Test GROUP BY Repl_2
UNION ALL
SELECT Repl_3 AS Article, SUM(Amount_3) AS Amount FROM #Test GROUP BY Repl_3
UNION ALL
SELECT Repl_4 AS Article, SUM(Amount_4) AS Amount FROM #Test GROUP BY Repl_4
UNION ALL
SELECT Repl_5 AS Article, SUM(Amount_5) AS Amount FROM #Test GROUP BY Repl_5
) tbl GROUP BY Article
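If you stay with the denormalised table, the same unpivot can also be written once using CROSS APPLY with a table value constructor (available from SQL Server 2008 onward); a sketch against the same #Test table:
-- Sketch: unpivot each row into (Article, Amount) pairs, then aggregate once
SELECT v.Article, SUM(v.Amount) AS Amount
FROM #Test t
CROSS APPLY (VALUES
    (t.Repl_1, t.Amount_1),
    (t.Repl_2, t.Amount_2),
    (t.Repl_3, t.Amount_3),
    (t.Repl_4, t.Amount_4),
    (t.Repl_5, t.Amount_5)
) AS v (Article, Amount)
GROUP BY v.Article;
This scans #Test once instead of five times.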
I'm using SQLite browser, and I'm trying to write a query that finds the most frequent value in one column for each value of another column.
The table is called main:
| |Place |Value|
| 1| London| 101|
| 2| London| 20|
| 3| London| 101|
| 4| London| 20|
| 5| London| 20|
| 6| London| 20|
| 7| London| 20|
| 8| London| 20|
| 9| France| 30|
| 10| France| 30|
| 11| France| 30|
| 12| France| 30|
The result I'm looking for is the most frequent value, grouped by place:
| |Place |Most Frequent Value|
| 1| London| 20|
| 2| France| 30|
Or even better
| |Place |Most Frequent Value|Largest Percentage|2nd Largest Percentage|
| 1| London| 20| 0.75| 0.25|
| 2| France| 30| 1| 0|
You can group by place, then value, and order by frequency, e.g.
select place,value,count(value) as freq from main group by place,value order by place, freq;
This will not give exactly the answer you want, but something near it, like:
France | 30 | 4
London | 101 | 2
London | 20 | 6
Now select place and value from this intermediate table and group by place, so that only one row per place is displayed.
select place,value from
(select place,value,count(value) as freq from main group by place,value order by place, freq)
group by place;
This will produce the result like following:
France | 30
London | 20
This works for SQLite. But some other databases might not behave as expected and could return the place and value with the least frequency; in those, you can put order by place, freq desc instead to solve your problem.
The first part would be something like this.
http://sqlfiddle.com/#!7/ac182/8
with tbl1 as
(select a.place,a.value,count(a.value) as val_count
from table1 a
group by a.place,a.value
)
select t1.place,
t1.value as most_frequent_value
from tbl1 t1
inner join
(select place,max(val_count) as val_count from tbl1
group by place) t2
on t1.place=t2.place
and t1.val_count=t2.val_count
Here we derive tbl1, which gives us the count of each place and value combination. We then join it with another derived table, t2, which finds the max count per place, to get the required result.
I am not sure how you want the percentage in the second output, but if you understand this query, you can add some logic on top of it to derive the required output, as in the sketch below. Play around with the sqlfiddle. All the best.
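For the percentage part, one possible sketch (untested, reusing the same tbl1 derived table and the fiddle's table1): divide each place/value count by that place's total row count; the first two rows per place are then the Largest and 2nd Largest Percentage.
with tbl1 as
(select a.place, a.value, count(a.value) as val_count
 from table1 a
 group by a.place, a.value
)
select t1.place,
       t1.value,
       -- share of this value within its place, e.g. 6/8 = 0.75 for London/20
       round(t1.val_count * 1.0 / t2.total, 2) as pct
from tbl1 t1
inner join
   (select place, count(*) as total
    from table1
    group by place) t2
on t1.place = t2.place
order by t1.place, pct desc;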
RANK
SQLite now supports RANK, so we can use the exact same syntax that works on PostgreSQL, similar to https://stackoverflow.com/a/12448971/895245
SELECT "city", "value", "cnt"
FROM (
SELECT
"city",
"value",
COUNT(*) AS "cnt",
RANK() OVER (
PARTITION BY "city"
ORDER BY COUNT(*) DESC
) AS "rnk"
FROM "Sales"
GROUP BY "city", "value"
) AS "sub"
WHERE "rnk" = 1
ORDER BY
"city" ASC,
"value" ASC
This would return all rows in case of a tie. To return just one, you could use ROW_NUMBER instead of RANK, as sketched below.
Tested on SQLite 3.34.0 and PostgreSQL 14.3. GitHub upstream.
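Here is that ROW_NUMBER variant as a sketch; adding "value" as a second sort key makes the single surviving row deterministic when counts tie:
SELECT "city", "value", "cnt"
FROM (
    SELECT
        "city",
        "value",
        COUNT(*) AS "cnt",
        ROW_NUMBER() OVER (
            PARTITION BY "city"
            -- tiebreak on value so exactly one row per city survives
            ORDER BY COUNT(*) DESC, "value" ASC
        ) AS "rn"
    FROM "Sales"
    GROUP BY "city", "value"
) AS "sub"
WHERE "rn" = 1
ORDER BY "city" ASC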