Calculate average from JSON column

I have a table with a column of JSON data that I want to extract information from. Specifically I just want to get the average value.
Example of what I have:
id speed_data
391982 [{"speed":1.3,"speed":1.3,"speed":1.4,"speed":1.5...
391983 [{"speed":0.9,"speed":0.8,"speed":0.8,"speed":1.0...
Example of what I want:
id speed_data
391982 1.375
391983 0.875
Any suggestions on how to get this query to work?
select t.*, avg(x.speed)
from tbl t,
json_array_elements(a->'speed') x
order by random()
limit 1

Your json array is messed up, like @posz commented. It would have to be:
CREATE TABLE tbl (id int, speed_data json);
INSERT INTO tbl VALUES
(391982, '{"speed":[1.3,1.3,1.4,1.5]}')
, (391983, '{"speed":[0.9,0.8,0.8,1.0]}');
Your query is twisted in multiple ways, too. It would work like this in pg 9.3:
SELECT t.id, avg(x::text::numeric) AS avg_speed  -- no direct json -> numeric cast in 9.3, so go via text
FROM tbl t
, json_array_elements(speed_data->'speed') x
GROUP BY t.id;
SQL Fiddle.
In the upcoming pg 9.4 we can simplify this with the new json_array_elements_text() (also less error-prone in the cast):
SELECT t.id, avg(x::numeric) AS avg_speed
FROM tbl t
, json_array_elements_text(speed_data->'speed') x
GROUP BY t.id;
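With the sample rows above, both queries return the averages the question asked for:
 id     | avg_speed
--------+-----------
 391982 |     1.375
 391983 |     0.875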
More Details:
How to turn json array into postgres array?
Aside: It would be much more efficient to store this as a plain array (numeric[], not json) or in a normalized schema to begin with.
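To illustrate the aside, a minimal sketch with a plain numeric[] column (tbl_arr is a hypothetical name):
CREATE TABLE tbl_arr (id int, speed numeric[]);
INSERT INTO tbl_arr VALUES
  (391982, '{1.3,1.3,1.4,1.5}')
, (391983, '{0.9,0.8,0.8,1.0}');

SELECT t.id, avg(s) AS avg_speed
FROM tbl_arr t, unnest(t.speed) s  -- implicit LATERAL unnests each row's array
GROUP BY t.id;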


Loop through table and update a specific column

I have the following table:
Id | Category
---+-----------
 1 | some thing
 2 | value
This table contains a lot of rows, and what I'm trying to do is update all the Category values to capitalize the first letter of every word. For example, some thing should become Some Thing.
At the moment this is what I have:
UPDATE MyTable
SET Category = (SELECT UPPER(LEFT(Category,1))+LOWER(SUBSTRING(Category,2,LEN(Category))) FROM MyTable WHERE Id = 1)
WHERE Id = 1;
But there are two problems. The first is the capitalization itself, because it only works for one-word values (hello => Hello, but hello world => Hello world). The second is that I'd need to run this query X times following the Where Id = X logic. So my question is: how can I update X rows? I was thinking of a cursor, but I don't have much experience with them.
Here is a fiddle to play with.
You can split the words apart, apply the capitalization, then munge the words back together. No, you shouldn't be worrying about subqueries and Id because you should always approach updating a set of rows as a set-based operation and not one row at a time.
;WITH cte AS
(
SELECT Id, NewCat = STRING_AGG(CONCAT(
    UPPER(LEFT(value,1)),                  -- capitalize the first letter
    SUBSTRING(value,2,LEN(value))), ' ')   -- keep the rest of the word
WITHIN GROUP (ORDER BY CHARINDEX(value, Category)) -- put the words back in order
FROM
(
SELECT t.Id, t.Category, s.value
FROM dbo.MyTable AS t
CROSS APPLY STRING_SPLIT(Category, ' ') AS s
) AS x GROUP BY Id
)
UPDATE t
SET t.Category = cte.NewCat
FROM dbo.MyTable AS t
INNER JOIN cte ON t.Id = cte.Id;
This assumes your category doesn't have non-consecutive duplicates within it; for example, bora frickin bora would get messed up (meanwhile bora bora frickin would be fine). It also assumes a case insensitive collation (which could be catered to if necessary).
In Azure SQL Database you can use the new enable_ordinal argument to STRING_SPLIT() but, for now, you'll have to rely on hacks like CHARINDEX().
Updated db<>fiddle (thank you for the head start!)
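Not part of the original answer, but where the three-argument STRING_SPLIT() is available, a sketch of the same update using the ordinal output column instead of the CHARINDEX() hack (this also removes the duplicate-word caveat):
;WITH cte AS
(
SELECT Id, NewCat = STRING_AGG(CONCAT(
    UPPER(LEFT(value,1)),
    SUBSTRING(value,2,LEN(value))), ' ')
WITHIN GROUP (ORDER BY s.ordinal)            -- true word position
FROM dbo.MyTable AS t
CROSS APPLY STRING_SPLIT(Category, ' ', 1) AS s  -- 1 = enable_ordinal
GROUP BY Id
)
UPDATE t
SET t.Category = cte.NewCat
FROM dbo.MyTable AS t
INNER JOIN cte ON t.Id = cte.Id;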

Selecting most recent timestamp row and getting values from a column with a Variant DataType

I hope the title makes some sense, I'm open to suggestions if I should make it more readable.
I have a temp table in Snowflake called BI_Table_Temp. It has 2 columns: Load_DateTime, with datatype Timestamp_LTZ(9), and JSON_DATA, a Variant datatype that has nested records from a JSON file. I want to query this table, which I then plan to ingest into another table, but I want to make sure I always get the most recent Load_DateTime row.
I've tried this, which works, but it shows me the Load_DateTime column and I don't want that; I just want the values from the JSON_DATA rows that have the max Load_DateTime:
SELECT
MAX(Load_DateTime),
transactions.value:id::string as id,
transactions.value:value2::string as account_value,
transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
GROUP BY transactions.value
A simple option:
WITH data AS (
SELECT Load_DateTime
, transactions.value:id::string as id
, transactions.value:value2::string as account_value
, transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
), max_load AS (
SELECT MAX(Load_DateTime) Load_DateTime, id
FROM data
GROUP BY id
)
SELECT id
, account_value
, new_account_value
FROM data
JOIN max_load
USING (id, Load_DateTime)
Since transactions.value is a variant, I'm guessing that for GROUP BY transactions.value you really mean GROUP BY transactions.value:id.
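Not from the original answer, but as an alternative sketch: Snowflake's QUALIFY clause can keep just the latest row per id in one pass (ties on Load_DateTime return an arbitrary one of the tied rows):
SELECT transactions.value:id::string as id
, transactions.value:value2::string as account_value
, transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY transactions.value:id::string
    ORDER BY Load_DateTime DESC) = 1;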

Abbreviate a list in PostgreSQL

How can I abbreviate a list so that
WHERE id IN ('8893171511',
'8891227609',
'8884577292',
'886790275X',
.
.
.)
becomes
WHERE id IN (name of a group/list)
The list really would have to appear somewhere. From the point of view of your code being maintainable and reusable, you could represent the list in a CTE:
WITH id_list AS (
SELECT '8893171511' AS id UNION ALL
SELECT '8891227609' UNION ALL
SELECT '8884577292' UNION ALL
SELECT '886790275X'
)
SELECT *
FROM yourTable
WHERE id IN (SELECT id FROM id_list);
If you have a persistent need to do this, then maybe the CTE should become a bona fide table somewhere in your database.
Edit: Using the Horse's suggestion, we can tidy up the CTE to the following:
WITH id_list (id) AS (
VALUES
('8893171511'),
('8891227609'),
('8884577292'),
('886790275X')
)
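followed by the same lookup as before:
SELECT *
FROM yourTable
WHERE id IN (SELECT id FROM id_list);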
If the list is large, I would create a temporary table and store the list there.
That way you can ANALYZE the temporary table and get accurate estimates.
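A minimal sketch of that approach (id_list and yourTable are assumed names):
CREATE TEMP TABLE id_list (id text PRIMARY KEY);
INSERT INTO id_list VALUES
('8893171511'),
('8891227609'),
('8884577292'),
('886790275X');
ANALYZE id_list;  -- gather stats so the planner estimates the join well

SELECT y.*
FROM yourTable y
JOIN id_list USING (id);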
The temp table and CTE answers suggested will do.
Just wanted to bring another approach that will work if you use pgAdmin for querying (not sure about workbench) and represent your data in a "stringy" way.
-- stash the list in a custom session setting, then read it back with current_setting()
set setting.my_ids = '8893171511,8891227609';
select current_setting('setting.my_ids');
drop table if exists t;
create table t ( x text);
insert into t select 'some value';
insert into t select '8891227609';
-- split the setting back into an array and match against it
select *
from t
where x = any( string_to_array(current_setting('setting.my_ids'), ',')::text[]);

Hive - getting the column names count of a table

How can I get the Hive column count using HQL? I know we can use describe tablename to get the names of the columns. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
-- stream one empty row through the shell command hive -e "desc mytable";
-- each line of its output (one per column) becomes a row of the subquery,
-- so count(*) returns the number of columns
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;

Result: 5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof since not all combinations of column types can be combined into an array.
It also requires that the table contain at least 1 row.
Here is a more complex but also more robust solution (type-wise), though it also requires that the table contain at least 1 row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t; -- sed strips the struct's outer braces so str_to_map can count the pairs

How to get records based on multiple columns from a table

Consider the following table.
From the above table I want to select the middle BFS_SCORE per LN_LOAN_ID and BR_ID. There are some LN_LOAN_IDs with a single score.
As an example for the above table the output I need is as below.
Please let me know how this can be achieved.
To handle cases where there are two scores for a unique pair of (LN_LOAN_ID, BR_ID) you need a median, as there is no single middle value for BFS_SCORE.
Postgres solution:
Create a median aggregate function following the Postgres wiki:
CREATE OR REPLACE FUNCTION _final_median(NUMERIC[])
RETURNS NUMERIC AS
$$
SELECT AVG(val)
FROM (
SELECT val
FROM unnest($1) val
ORDER BY 1
LIMIT 2 - MOD(array_upper($1, 1), 2)       -- take 1 value for odd counts, 2 for even
OFFSET CEIL(array_upper($1, 1) / 2.0) - 1  -- skip to the middle
) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE AGGREGATE median(NUMERIC) (
SFUNC=array_append,      -- collect all values into an array
STYPE=NUMERIC[],
FINALFUNC=_final_median,
INITCOND='{}'
);
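A quick sanity check with made-up values:
SELECT median(val::numeric) AS m
FROM (VALUES (1), (2), (3), (4)) v(val);  -- even count: averages the two middle values, (2 + 3) / 2 = 2.5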
Then your query would look as simple as this:
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
But the tricky part comes with score_order. If a pair has an even number of scores and you actually compute a median rather than pick a middle value, the calculated score may not exist in any row, so score_order will be null. Other than that, join back to your table to retrieve it for the "middle" row:
select
t1.ln_loan_id, t1.bfs_score, t1.br_id, t2.score_order
from (
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
) t1
left join yourtable t2 on
t1.ln_loan_id = t2.ln_loan_id
and t1.br_id = t2.br_id
and t1.bfs_score = t2.bfs_score
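Not in the original answer, but on Postgres 9.4 or later the built-in ordered-set aggregate percentile_cont(0.5) computes the same continuous median without a custom aggregate:
select
ln_loan_id,
br_id,
percentile_cont(0.5) within group (order by bfs_score) as bfs_score
from yourtable
group by ln_loan_id, br_id;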