Transform row-wise Postgres data to grouped column-wise data - SQL

I have stock market data which looks like this:
instrument_symbol|close |timestamp          |
-----------------|------|-------------------|
IOC              |134.15|2019-08-05 00:00:00|
YESBANK          | 83.75|2019-08-05 00:00:00|
IOC              |135.25|2019-08-02 00:00:00|
YESBANK          |  88.3|2019-08-02 00:00:00|
IOC              |136.95|2019-08-01 00:00:00|
YESBANK          |  88.4|2019-08-01 00:00:00|
IOC              | 139.3|2019-07-31 00:00:00|
YESBANK          |  91.2|2019-07-31 00:00:00|
YESBANK          | 86.05|2019-07-30 00:00:00|
IOC              | 133.5|2019-07-30 00:00:00|
IOC              |138.25|2019-07-29 00:00:00|
I want to transform it to
timestamp, IOC, YESBANK
2019-08-05 00:00:00 134.15 83.75
2019-08-02 00:00:00 135.25 88.3
......
.....
...
format.
Is there some Postgres query to do this? Or do we have to do this programmatically?

You can use conditional aggregation. In Postgres, I like the filter syntax:
select "timestamp",
max(close) filter (where instrument_symbol = 'IOC') as ioc,
max(close) filter (where instrument_symbol = 'YESBANK') as yesbank
from t
group by "timestamp"
order by 1 desc;
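If you want a quick way to try this, here is a minimal setup sketch: the table name t matches the queries in the answers, while the column types are assumptions based on the sample.
-- hypothetical DDL; adjust types to your real schema
create table t (
    instrument_symbol text,
    close             numeric,
    "timestamp"       timestamp
);
insert into t values
    ('IOC',     134.15, '2019-08-05 00:00:00'),
    ('YESBANK',  83.75, '2019-08-05 00:00:00'),
    ('IOC',     135.25, '2019-08-02 00:00:00'),
    ('YESBANK',  88.3,  '2019-08-02 00:00:00');
Running the query above against this data returns one row per "timestamp" with ioc and yesbank columns, matching the desired output.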

Use conditional aggregation.
select "timestamp" :: date, max( case
when instrument_symbol = 'IOC'
then close end ) as ioc,
max( case
when instrument_symbol = 'YESBANK'
then close end ) as yesbank FROM t
group by "timestamp" :: date
order by 1 desc

Related

How can I get all daily dates between two given dates in a DataFrame

I have this Spark DataFrame:
+-------------------+-------------------+
| first | last |
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
I need to get the dates from first (exclusive) to last (inclusive), day by day.
My desired output would be:
+-------------------+-------------------+-----------------------------------------------------------------+
| first | last | array_dates
+-------------------+-------------------+-----------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00| [2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00] |
+-------------------+-------------------+-----------------------------------------------------------------+
Since Spark 2.4, you can use the built-in sequence Spark function as follows:
import org.apache.spark.sql.functions.{col, date_add, expr, sequence}
val result = df.withColumn(
  "array_dates",
  sequence(
    date_add(col("first"), 1),
    col("last"),
    expr("INTERVAL 1 DAY")
  )
)
With the following input df:
+-------------------+-------------------+
|first |last |
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
you get the following output result:
+-------------------+-------------------+---------------------------------------------------------------+
|first |last |array_dates |
+-------------------+-------------------+---------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|[2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00]|
+-------------------+-------------------+---------------------------------------------------------------+
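The same thing can also be written in Spark SQL directly. A sketch, assuming the frame is registered as a temp view named df (the view name is an assumption; sequence needs Spark 2.4+):
-- assumes df.createOrReplaceTempView("df") has been called
SELECT `first`,
       `last`,
       sequence(`first` + interval 1 day, `last`, interval 1 day) AS array_dates
FROM df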

How to create a date range mapping in Spark?

I have a Spark dataframe with the following structure.
+-------+-------------------+
|country| date_published|
+-------+-------------------+
| UK|2020-04-15 00:00:00|
| UK|2020-04-14 00:00:00|
| UK|2020-04-09 00:00:00|
| UK|2020-04-08 00:00:00|
| UK|2020-04-07 00:00:00|
| UK|2020-04-06 00:00:00|
| UK|2020-04-03 00:00:00|
| UK|2020-04-02 00:00:00|
| UK|2020-04-01 00:00:00|
| UK|2020-03-31 00:00:00|
| UK|2020-03-30 00:00:00|
| UK|2020-03-27 00:00:00|
| UK|2020-03-26 00:00:00|
| UK|2020-03-25 00:00:00|
| UK|2020-03-24 00:00:00|
| UK|2020-03-23 00:00:00|
| UK|2020-03-20 00:00:00|
| UK|2020-03-19 00:00:00|
| UK|2020-03-18 00:00:00|
| UK|2020-03-17 00:00:00|
+-------+-------------------+
I want to create a date mapping based on this data. Conditions:
1. All dates till 2020-01-01 should be mapped as "YTD".
2. All dates till 2019-04-15 should be mapped as "LAST_1_YEAR".
3. All dates from 2019-01-01 till 2019-04-15 (last year's date as of today) should be mapped as "YTD_LAST_YEAR".
4. All dates before 2019-04-15 should be mapped as "YEAR_AGO_1_YEAR".
We can create two columns like ytd_map (conditions 1 and 3) and last_year_map (conditions 2 and 4).
There could be other countries in the list, and the above conditions should work for them as well.
The approach I tried is to create a dataframe with max_date_published for each country, but I'm not sure how to filter the dataframe for each country separately.
df_data = df_data_cleaned.select("date_published","country").distinct().orderBy(F.desc("date_published"))
df_max_dt = df_data.groupBy("country").agg(F.max(F.col("date_published")))
df_max_dt.collect()
I tried this and it is working as of now.
spark.sql("select country,\
date_published,\
(case when date_published >= max_date_published_last_year then 'LAST_1_YEAR'\
when date_published <= max_date_published_last_year and date_published >= add_months(max_date_published_last_year, -12) then 'YEAR_AGO_1_YEAR' else '' end) as MAT_MAPPING,\
(case when date_published >= date_published_start_of_year then 'YTD'\
when date_published <= max_date_published_last_year and date_published >= date_published_start_of_last_year\
then 'YTD_LAST_YEAR'\
else '' end) as YTD_MAPPING from\
(select t.country, t.date_published, t.date_published_ya, t.max_date_published_current_year,\
cast(add_months(t.max_date_published_current_year, -12) as timestamp) as max_date_published_last_year,\
date_trunc('year', max_date_published_current_year) AS date_published_start_of_year,\
date_trunc('year', cast(add_months(t.max_date_published_current_year, -12) as timestamp)) AS date_published_start_of_last_year\
from\
(select country,\
date_published,cast(add_months(date_published, -12) as timestamp) as date_published_ya,\
max(date_published)over(partition by country order by date_published desc) max_date_published_current_year from df_mintel_time) t) t2")

Concat values from a column and make another column

I am working with Spark SQL, and doing some SQL operations on a Hive Table.
My table is like this:
```
ID  COST  CODE
1   100   AB1
5   200   BC3
1   400   FD3
6   600   HJ2
1   900   432
3   800   DS2
2   500   JT4
```
I want to create another table out of this, which would have the total cost and the top 5 CODE values in a chain in another column, like this:
```
ID  TOTAL_COST  CODE  CODE_CHAIN
1   1400        432   432, FD3, AB1
```
Total cost is easy, but how do I concat the values from the CODE column to form another column?
I have tried the collect_set function, but the values cannot be limited and are also not properly sorted, probably due to distributed processing.
Is this possible with SQL logic?
EDIT:
I need the data sorted, so I get the top 5 values.
Use slice, sort_array, and collect_list:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax (spark is the SparkSession)
df
  .groupBy("id")
  .agg(
    sum("cost") as "total_cost",
    slice(sort_array(collect_list(struct($"cost", $"code")), false), 1, 5)("code") as "codes")
In Spark 2.3 you'll have to replace slice with manual indexing of the sorted array
val sorted = sort_array(collect_list(struct($"cost", $"code")), false)("code")
val codes = array((0 until 5).map(i => sorted.getItem(i)): _*) as "codes"
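For comparison, roughly the same logic can be written in Spark SQL (a sketch; the temp view name course is borrowed from the next answer, and slice/transform need Spark 2.4+):
-- assumes df.createOrReplaceTempView("course") as in the answer below
select id,
       sum(cost) as total_cost,
       transform(
         slice(sort_array(collect_list(struct(cost, code)), false), 1, 5),
         x -> x.code
       ) as code_chain
from course
group by id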
Use a window function and a WITH clause to filter on the first row_number. Check this out:
scala> val df = Seq((1,100,"AB1"),(5,200,"BC3"),(1,400,"FD3"),(6,600,"HJ2"),(1,900,"432"),(3,800,"DS2"),(2,500,"JT4")).toDF("ID","COST","CODE")
df: org.apache.spark.sql.DataFrame = [ID: int, COST: int ... 1 more field]
scala> df.show()
+---+----+----+
| ID|COST|CODE|
+---+----+----+
| 1| 100| AB1|
| 5| 200| BC3|
| 1| 400| FD3|
| 6| 600| HJ2|
| 1| 900| 432|
| 3| 800| DS2|
| 2| 500| JT4|
+---+----+----+
scala> df.createOrReplaceTempView("course")
scala> spark.sql(""" with tab1(select id,cost,code,collect_list(code) over(partition by id order by cost desc rows between current row and 5 following ) cc, row_number() over(partition by id order by cost desc) rc,sum(cost) over(partition by id order by cost desc rows between current row and 5 following) total from course) select id, total, cc from tab1 where rc=1 """).show(false)
+---+-----+---------------+
|id |total|cc |
+---+-----+---------------+
|1 |1400 |[432, FD3, AB1]|
|6 |600 |[HJ2] |
|3 |800 |[DS2] |
|5 |200 |[BC3] |
|2 |500 |[JT4] |
+---+-----+---------------+
scala>

SQL query for finding the most frequent value of a grouped by value

I'm using SQLite browser, and I'm trying to write a query that finds the most frequent value in one column, grouped by the values of another column.
The table is called main:
| |Place |Value|
| 1| London| 101|
| 2| London| 20|
| 3| London| 101|
| 4| London| 20|
| 5| London| 20|
| 6| London| 20|
| 7| London| 20|
| 8| London| 20|
| 9| France| 30|
| 10| France| 30|
| 11| France| 30|
| 12| France| 30|
The result I'm looking for is the most frequent value, grouping by place:
| |Place |Most Frequent Value|
| 1| London| 20|
| 2| France| 30|
Or even better
| |Place |Most Frequent Value|Largest Percentage|2nd Largest Percentage|
| 1| London| 20| 0.75| 0.25|
| 2| France| 30| 1| 0.75|
You can group by place, then value, and order by frequency, e.g.:
select place, value, count(value) as freq from main group by place, value order by place, freq;
This will not give exactly the answer you want, but something close to it, like:
London | 101 | 2
France | 30 | 4
London | 20 | 6
Now select place and value from this intermediate table and group by place, so that only one row per place is displayed.
select place, value from
  (select place, value, count(value) as freq from main group by place, value order by place, freq)
group by place;
This will produce a result like the following:
France | 30
London | 20
This works for SQLite. But in some other database systems, it might not work as expected and might return the place and value with the least frequency. In those cases, you can use order by place, freq desc instead to solve your problem.
The first part would be something like this.
http://sqlfiddle.com/#!7/ac182/8
with tbl1 as
(select a.place, a.value, count(a.value) as val_count
 from table1 a
 group by a.place, a.value
)
select t1.place,
       t1.value as most_frequent_value
from tbl1 t1
inner join
    (select place, max(val_count) as val_count
     from tbl1
     group by place) t2
on t1.place = t2.place
and t1.val_count = t2.val_count
Here we derive tbl1, which gives us the count of each place and value combination. We then join it with another derived table, t2, which finds the max count per place, to get the required result.
I am not sure how you want the percentage in the second output, but if you understood this query, you can build some logic on top of it to derive the required output. Play around with the sqlfiddle. All the best.
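If the percentage is simply each value's share of the rows for its place, a rough sketch on top of the same idea, written against the question's main table (SQLite 3.25+ for window functions):
-- "largest percentage" interpreted as top value's count / total rows per place (an assumption)
with tbl1 as (
  select place, value, count(*) as val_count
  from main
  group by place, value
),
ranked as (
  select place,
         value,
         1.0 * val_count / sum(val_count) over (partition by place) as share,
         rank() over (partition by place order by val_count desc) as rnk
  from tbl1
)
select place, value as most_frequent_value, round(share, 2) as largest_percentage
from ranked
where rnk = 1;
With the sample data this yields London | 20 | 0.75 and France | 30 | 1.0.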
RANK
SQLite now supports RANK, so we can use the exact same syntax that works on PostgreSQL, similar to https://stackoverflow.com/a/12448971/895245
SELECT "city", "value", "cnt"
FROM (
SELECT
"city",
"value",
COUNT(*) AS "cnt",
RANK() OVER (
PARTITION BY "city"
ORDER BY COUNT(*) DESC
) AS "rnk"
FROM "Sales"
GROUP BY "city", "value"
) AS "sub"
WHERE "rnk" = 1
ORDER BY
"city" ASC,
"value" ASC
This would return all in case of tie. To return just one you could use ROW_NUMBER instead of RANK.
Tested on SQLite 3.34.0 and PostgreSQL 14.3. GitHub upstream.
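For completeness, a sketch of the ROW_NUMBER variant mentioned above, which returns exactly one row per city even when there is a tie:
SELECT "city", "value", "cnt"
FROM (
    SELECT
        "city",
        "value",
        COUNT(*) AS "cnt",
        ROW_NUMBER() OVER (
            PARTITION BY "city"
            ORDER BY COUNT(*) DESC, "value" ASC  -- the tie-breaker on "value" is an arbitrary choice
        ) AS "rn"
    FROM "Sales"
    GROUP BY "city", "value"
) AS "sub"
WHERE "rn" = 1
ORDER BY "city" ASC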

(SQL) Assembling multiple tables of deltas into a single populated view?

Let's say I have 3 different tables: Foo, Bar, and Baz. Each table has the same structure: a timestamp and a data value. We can also assume that each table is synchronized at the top row.
Foo Bar Baz
________________ ________________ _________________
|Time |Value| |Time |Value| |Time |Value |
|1:00 |0 | |1:00 |10 | |1:00 |100 |
|1:15 |1 | |1:10 |11 | |1:20 |101 |
|1:30 |2 | |1:40 |12 | |1:50 |102 |
|1:45 |3 | |1:50 |13 | |1:55 |103 |
Is there a simple way to assemble these records into a single view where the value of each column is assumed to be the last known value, carried forward to populate the times that table doesn't provide?
________________________________________
|Time |Foo.Value|Bar.Value|Baz.Value|
|1:00 | 1| 10| 100|
|1:10 | 1| 11| 100|
|1:15 | 2| 11| 100|
|1:20 | 2| 11| 101|
|1:30 | 3| 11| 101|
|1:40 | 3| 12| 101|
|1:45 | 4| 12| 101|
|1:50 | 4| 13| 102|
|1:55 | 4| 13| 103|
Edit:
What if I wanted to select a time range, but wished to have the last known value of each column brought forward? Is there a simple way to do so without producing the entire table then filtering it down?
e.g. if I wanted records from 1:17 to 1:48, I would want the following...
________________________________________
|Time |Foo.Value|Bar.Value|Baz.Value|
|1:20 | 2| 11| 101|
|1:30 | 3| 11| 101|
|1:40 | 3| 12| 101|
|1:45 | 4| 12| 101|
SQL Server 2008 doesn't support lag(), much less lag() with ignore nulls. So, I think the easiest way may be with correlated subqueries. Get all the times from the three tables and then populate the values:
select fbb.time,
       (select top 1 value from foo t where t.time <= fbb.time order by t.time desc) as foo,
       (select top 1 value from bar t where t.time <= fbb.time order by t.time desc) as bar,
       (select top 1 value from baz t where t.time <= fbb.time order by t.time desc) as baz
from (select time from foo union
      select time from bar union
      select time from baz
     ) fbb;
EDIT:
An alternative approach uses aggregation:
select time, max(foo) as foo, max(bar) as bar, max(baz) as baz
from (select time, value as foo, NULL as bar, NULL as baz from foo union all
      select time, NULL, value, NULL from bar union all
      select time, NULL, NULL, value from baz
     ) fbb
group by time
order by time;
This probably has better performance than the first method.
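For the time-range requirement in the edit, the correlated-subquery version only needs a filter on the union of times: the subqueries still look back before the range, so the last known values are carried forward. A sketch, reusing the '1:17' / '1:48' literals from the example:
select fbb.time,
       (select top 1 value from foo t where t.time <= fbb.time order by t.time desc) as foo,
       (select top 1 value from bar t where t.time <= fbb.time order by t.time desc) as bar,
       (select top 1 value from baz t where t.time <= fbb.time order by t.time desc) as baz
from (select time from foo union
      select time from bar union
      select time from baz
     ) fbb
where fbb.time between '1:17' and '1:48';  -- assumes the time values compare correctly as stored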
Here is another alternative solution, since you are using SQL Server 2008:
SELECT *
FROM (
    SELECT t, [time], value
    FROM ( SELECT 'Foo' as t, * FROM #Foo
           UNION
           SELECT 'Bar' as t, * FROM #Bar
           UNION
           SELECT 'Baz' as t, * FROM #Baz
         ) un
    WHERE [time] BETWEEN '1:17' AND '1:48'
) AS fbb
PIVOT (MAX(value) FOR [t] IN (Foo, Bar, Baz)) pvt