How to generate a float sequence in Presto?

I want to generate a float range that can be unnested into a column in PrestoDb. I am following the documentation at https://prestodb.io/docs/current/functions/array.html and trying out sequence, but it looks like float ranges cannot be generated with sequence. I want to generate a table like the one below, with the value decreasing by 0.3:
| date | value |
| ---------- | ----- |
| 2020-01-31 | 47.6 |
| 2020-02-28 | 47.3 |
| 2020-03-31 | 47.0 |
I was trying to generate a sequence and then unnest it into column values. I am able to generate the date column using sequence in PrestoDb, but not the value column.
Any suggestions, please?

You can use sequence with bigint and convert to double after unnesting:
presto> SELECT x / 10e0 FROM UNNEST(sequence(476, 470, -3)) t(x);
_col0
-------
47.6
47.3
47.0
(verified on Presto 336)
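The question also asks for a date column alongside the values. Presto's UNNEST accepts multiple arrays and zips them positionally, so you can pair a date sequence with the scaled integer sequence in one query. A minimal sketch along those lines (note that adding months clamps to the last valid day, so the generated dates may need adjusting to true month-ends):

SELECT d AS "date", x / 10e0 AS value
FROM UNNEST(
    sequence(DATE '2020-01-31', DATE '2020-03-31', INTERVAL '1' MONTH),
    sequence(476, 470, -3)
) AS t(d, x);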

Related

How to round decimals smaller than .5 up to the next whole number in SQL?

I have a large database with over 1,000 products.
Some of them have prices like 12.3, 20.7, or 55.1, for example:
| Name | Price |
| -------- | -------------- |
| Product 1| 12.3 |
| Product 2| 20.7 |
| Product 3| 55.1 |
(and so on)...
What I've tried is UPDATE prices SET price = ROUND(price, 0.1).
The output of this is:
| Name | Price | After update |
| -------- | ----- | ------------ |
| Product 1 | 12.3 | 12.0 |
| Product 2 | 20.7 | 21.0 |
| Product 3 | 55.1 | 55.0 |
The prices with decimals < .5 remain the same, and I'm out of ideas. I'll appreciate any help.
Note: I need to update all rows. I'm trying to learn about CEILING(), but the examples only show how to use it with SELECT. Any idea how to perform an UPDATE with CEILING, or something similar?
It's not entirely clear what you're asking, but I can tell you the function call as shown makes no sense.
The second argument to the ROUND() function is the number of decimal places, not the size of the value you wish to round to. Additionally, the function only accepts integral types for that argument, so if you pass it the value 0.1, that value is first cast to an integer, and casting 0.1 to an integer yields 0.
Calling ROUND(price, 0.1) is therefore the same as calling ROUND(price, 0).
If you want to round to the nearest 0.1, that's one decimal place and the correct value for the ROUND() function is 1.
ROUND(price, 1)
Compare results here:
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=7878c275f0f9ea86f07770e107bc1274
Note the trailing 0s remain, because the underlying type of the value is unchanged. If you also want to remove the trailing 0s, then you're really moving into the realm of strings, and for that you should let the client code, application, or reporting tool handle the conversion.
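As for the UPDATE question at the end: CEILING() works inside an UPDATE just like any other scalar function. A minimal sketch, assuming the prices table from the question, if the goal really is to push every fractional price up to the next whole number:

UPDATE prices
SET price = CEILING(price); -- 12.3 -> 13, 20.7 -> 21, 55.1 -> 56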

Explode single DataFrame row into multiple ones across Year-Month column?

I'm trying to explode a DataFrame row made of a "yearMonth" column into multiple rows, where each one is a day of that month.
Here is an example. I want to go from this:
| Key | YearMonth |
| --- | --------- |
| xxx | 202101 |

to this:

| Key | YearMonthDay |
| --- | ------------ |
| xxx | 20210101 |
| xxx | 20210102 |
| xxx | ... |
| xxx | 20210131 |
Here is an example using some of the built-in Spark functions, which can be imported with:
import org.apache.spark.sql.functions.{add_months, concat, date_add, date_format, datediff, explode, lit, sequence, substring}
import spark.implicits._ // for $ and .toDF; assumes an active SparkSession named spark
The main idea of this solution can be described with the following steps:
convert the values into a date as the first day of the month.
calculate the number of days in the given month.
create an array of day offsets that will be exploded into a separate row each and added to the start date of the month.
drop the unnecessary columns.
List("202010", "202011")
.toDF("month_short")
.withColumn("month", functions.concat(substring($"month_short", 0, 4), lit("-"), substring($"month_short", 5, 2), lit("-"),lit("01")))
.withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
.withColumn("day_in_month", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
.drop("days_to_add","month")
.show(false)
output example:
+-----------+------------+
|month_short|day_in_month|
+-----------+------------+
|202010     |20201001    |
|202010     |20201002    |
|202010     |20201003    |
|202010     |20201004    |
|202010     |20201005    |
|202010     |20201006    |
|202010     |20201007    |
|202010     |20201008    |
|202010     |20201009    |
|202010     |20201010    |
|202010     |20201011    |
|202010     |20201012    |
Please use the below date functions to get the desired output (the final date_format call converts the exploded dates into the requested yyyyMMdd layout):
from pyspark.sql import functions as F

df1 = (input_df
    .withColumn("YearMonthDay", F.explode(F.sequence(
        F.to_date(F.col("YearMonth").cast("string"), "yyyyMM"),
        F.last_day(F.to_date(F.col("YearMonth").cast("string"), "yyyyMM")))))
    .withColumn("YearMonthDay", F.date_format("YearMonthDay", "yyyyMMdd"))
    .drop("YearMonth"))

Apache Pig: number extraction after a specific string

I have a file with 10,1900 lines and 5 delimiters ('|') per line (so 6 columns), and the sixth column contains statements like "Dropped 12 (0.01%)". I want to extract the number in brackets after "Dropped":
Actual -- Dropped 12 (0.01%)
Expected -- 0.01
I need a solution using Apache Pig.
You are looking for the REGEX_EXTRACT function.
Let's say you have a table A that looks like:
+--------------------+
| col1 |
+--------------------+
| Dropped 12 (0.01%) |
| Dropped 24 (0.02%) |
+--------------------+
You can extract the number in parenthesis with the following:
B = FOREACH A GENERATE REGEX_EXTRACT(col1, '.*\\((.*)%\\)', 1) AS percent;
+---------+
| percent |
+---------+
| 0.01 |
| 0.02 |
+---------+
I'm specifying a regex capture group for whatever characters are between ( and %). Notice that I'm using \\ as the escape character so that I match the literal opening and closing parentheses.

Group and split records in postgres into several new column series

I have data of the form
-----------------------------|
6031566779420 | 25 | 163698 |
6031566779420 | 50 | 98862 |
6031566779420 | 75 | 70326 |
6031566779420 | 95 | 51156 |
6031566779420 | 100 | 43788 |
6036994077620 | 25 | 41002 |
6036994077620 | 50 | 21666 |
6036994077620 | 75 | 14604 |
6036994077620 | 95 | 11184 |
6036994077620 | 100 | 10506 |
------------------------------
and would like to create a dynamic number of new columns by treating each series of (25, 50, 75, 95, 100) and its corresponding values as a new series. The target output I'm looking for is:
--------------------------
| 25 | 163698 | 41002 |
| 50 | 98862 | 21666 |
| 75 | 70326 | 14604 |
| 95 | 51156 | 11184 |
| 100 | 43788 | 10506 |
--------------------------
I'm not sure what the SQL/Postgres operation I want is called, nor how to achieve it. In this case the data produces 2 new columns, but I'm trying to formulate a solution that has as many new columns as there are groups of data in the output of the original query.
[Edit]
Thanks for the references to array_agg; that looks like it would be helpful! I should've mentioned this earlier, but I'm using Redshift, which reports this version of Postgres:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1007
and it does not seem to support this function yet.
ERROR: function array_agg(numeric) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
Is crosstab the type of transformation I should be looking at? Or something else? Thanks again.
I've used array_agg() here:
select idx,array_agg(val)
from t
group by idx
This will produce a result like the one below:
idx array_agg
--- --------------
25 {163698,41002}
50 {98862,21666}
75 {70326,14604}
95 {11184,51156}
100 {43788,10506}
As you can see, the second column is an array of the two values that correspond to each idx. Note that the order of elements inside the array is not guaranteed unless you sort within the aggregate (e.g. array_agg(val ORDER BY ...), available in PostgreSQL 9.0+), which is why the 95 row above comes out swapped.
The following SELECT queries will give you the result in two separate columns.
Method 1:
SELECT idx,
       col[1] AS col1, -- first value in the array
       col[2] AS col2  -- second value in the array
FROM (
    SELECT idx, array_agg(val) AS col
    FROM t
    GROUP BY idx
) s
Method 2:
SELECT idx,
       (array_agg(val))[1] AS col1, -- first value in the array
       (array_agg(val))[2] AS col2  -- second value in the array
FROM t
GROUP BY idx
Result:
idx col1 col2
--- ------ -----
25 163698 41002
50 98862 21666
75 70326 14604
95 11184 51156
100 43788 10506
You can use the array_agg function. Assuming your columns are named A, B, C:
SELECT B, array_agg(C)
FROM table_name
GROUP BY B
This will get you the output in array form. It is as close as you can get to variable columns in a simple query. If you really need variable columns, consider defining a PL/pgSQL procedure to convert the array into columns.
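On stock PostgreSQL (though not on Redshift), the crosstab function from the tablefunc extension is the usual tool for this kind of pivot. A sketch, assuming the three columns are named A (series id), B (idx), and C (value), and that there are exactly two series:

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    'SELECT B, A, C FROM table_name ORDER BY 1, 2'
) AS ct(idx int, series1 numeric, series2 numeric);

The column definition list after AS is mandatory and fixed, which is exactly the variable-column limitation described above.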

SQL: group by one column, sort by another and transpose a third

I have the following table, which is actually a minimal example of the result of multiple joined tables. I would now like to group by person_ID and get all the value entries in one row, sorted by feature_ID.
person_ID | feature_ID | value
123 | 1 | 1.1
123 | 2 | 1.2
123 | 3 | 1.3
123 | 4 | 1.2
124 | 1 | 1.0
124 | 2 | 1.1
...
The result should be:
123 | 1.1 | 1.2 | 1.3 | 1.2
124 | 1.0 | 1.1 | ...
There should be an elegant SQL query solution, but I can neither come up with one nor find it.
For quick reconstruction, here is the example data:
create table example(person_ID integer, feature_ID integer, value float);
insert into example(person_ID, feature_ID, value) values
(123,1,1.1),
(123,2,1.2),
(123,3,1.3),
(123,4,1.2),
(124,1,1.0),
(124,2,1.1),
(124,3,1.2),
(124,4,1.4);
Edit: Every person has 6374 entries in the real-life application.
I am using a PostgreSQL 8.3.23 database, but I think this should be solvable with standard SQL.
Databases aren't much good at transposing. There is a nebulous column-growth issue at hand: how does the database deal with a variable number of columns? It's not a spreadsheet.
This kind of transposing is normally done in the report writer, not in SQL.
... or in a program, like in PHP.
A dynamic cross tab in SQL is possible only via a procedure; see:
https://www.simple-talk.com/sql/t-sql-programming/creating-cross-tab-queries-and-pivot-tables-in-sql/
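If a fixed number of value columns per row is acceptable, the array approach from the previous question works here too. PostgreSQL 8.3 predates the built-in array_agg (added in 8.4), but the classic array_accum custom aggregate from the PostgreSQL documentation fills the gap. A sketch under that assumption:

CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);

-- Pre-sort the rows so the array order follows feature_ID; 8.3 has no
-- ORDER BY inside aggregates, and ordering in a subquery is a common,
-- if not formally guaranteed, workaround.
SELECT person_ID, array_accum(value)
FROM (SELECT * FROM example ORDER BY person_ID, feature_ID) s
GROUP BY person_ID;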