Extracting records from a Hive array

I have a Hive table into which data is loaded per tripid. Each trip carries an array of GPS points (gps_location_1, gps_location_2); one trip might have 10 GPS locations while another has 500. When I query the data, my results are as follows:
select tripid, size(gps) as count from trip;
tripid    | count
----------|------
1451f2b3d | 9
select tripid, gps.gps_location_1, gps.gps_location_2, size(gps) as count from trip;
+-----------+-------------------------------------------------------------------+-----------------------------------------------------------+-------+
| tripid    | gps_location_1                                                    | gps_location_2                                            | count |
+-----------+-------------------------------------------------------------------+-----------------------------------------------------------+-------+
| 1451f2b3d | [44.1145,44.1146,44.1147,44.1148,44.1148,44.1129,44.1127,44.1121] | [44.1148,44.1146,44.1146,44.1141,44.1138,44.1129,44.1127] | 9     |
+-----------+-------------------------------------------------------------------+-----------------------------------------------------------+-------+
I can see the first element of the array:
select tripid, gps[0].gps_location_1, gps[0].gps_location_2 from trip;
tripid gps_location_1 gps_location_2
1451f2b3d 44.1145 44.1148
The second element:
select tripid, gps[1].gps_location_1, gps[1].gps_location_2 from trip;
tripid gps_location_1 gps_location_2
1451f2b3d 44.1146 44.1146
The last element (Hive arrays are zero-indexed, so it sits at index size(gps)-1):
select tripid, gps[size(gps)-1].gps_location_1, gps[size(gps)-1].gps_location_2 from trip;
tripid gps_location_1 gps_location_2
1451f2b3d 44.1121 44.1127
I need to store each array element as its own row in a new target_trip table, i.e. loop through all the elements for each tripid in the trip table and insert them into the target table shown below. How can I achieve this?
tripid gps_location_1 gps_location_2
1451f2b3d 44.1145 44.1148
1451f2b3d 44.1146 44.1146
1451f2b3d 44.1147 44.1146
1451f2b3d 44.1148 44.1141
1451f2b3d 44.1129 44.1138
1451f2b3d 44.1127 44.1129
1451f2b3d 44.1121 44.1127

Use lateral view explode:
select tripid, coordinates.gps_location_1, coordinates.gps_location_2
from trip
lateral view outer explode(gps) s as coordinates
The explode() UDTF generates one row per array element. lateral view applies the UDTF to each row of the base table and then joins the resulting rows back to the input rows, forming a virtual table with the specified table alias.


SQL: Checking value counts of a column

I'd like to check if a column in a table has values with a small number of value counts.
Consider the following table as an example:
RowID   | Product
--------|--------
1       | A
2       | A
3       | B
...
200,000 | C
The following table is an aggregation of the table above:
Product | Count
--------|------
A       | 204
B       | 682
C       | 553
D       | 1402
E       | 30855
F       | 357
G       | 1
H       | 542
What I'd like to know about the column Product is whether any product has a count that is less than 5% of the total. If so, the SQL statement should return: 'Some values of this field have a small number of value counts'.
In other words: if [MinValueCount]/[Count] <= 0.05 then return 'Some values of this field have a small number of value counts', else null.
With the example above, I should get 'Some values of this field have a small number of value counts', as product G makes up less than 5% of the total count of products.
How should the SQL statement look?
Use two levels of aggregation. You can get the total using window functions:
select max('Some values of this field have a small number of value counts')
from (select product, count(*) as cnt,
             sum(count(*)) over () as total_cnt
      from t
      group by product
     ) t
where cnt < 0.05 * total_cnt;
The use of max() in the outer query is just to return at most one row. You could also use fetch or a similar clause (whatever your database supports):
select 'Some values of this field have a small number of value counts'
from (select product, count(*) as cnt,
             sum(count(*)) over () as total_cnt
      from t
      group by product
     ) t
where cnt < 0.05 * total_cnt
fetch first 1 row only;

PostgreSQL Group by array of daterange

I have a massive table with records that all have a date and a price:
id | date | price | etc...
And then I have a list of arbitrary date ranges, never of the same length:
ARRAY[
  daterange('2020-11-02','2020-11-05'),
  daterange('2020-11-15','2020-11-20')
]
How would I most efficiently go about summing and grouping the records by their existence in one of the ranges, like so:
range | sum
------------------------------------------
[2020-11-02,2020-11-05) | 125.55
[2020-11-15,2020-11-20) | 566.12
You can unnest the array, left join the table on dates that are contained in ranges, and finally aggregate:
select x.rg, sum(t.price) as sum_price
from unnest($1) as x(rg)
left join mytable t on x.rg @> t.date
group by x.rg;
$1 represents the array of dateranges that you want to pass to the query.

How to aggregate json fields when using GROUP BY clause in postgres?

I have the following table structure in my Postgres DB (v12.0)
id | pieces | item_id | material_detail
---|--------|---------|-----------------
1  | 10     | 2       | [{"material_id":1,"pieces":10},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
2  | 20     | 2       | [{"material_id":1,"pieces":40}]
3  | 30     | 3       | [{"material_id":1,"pieces":20},{"material_id":3,"pieces":30}]
I am using a GROUP BY query on these records, like below:
SELECT SUM(PIECES) FROM detail_table GROUP BY item_id HAVING item_id =2
With this I get the total pieces as 30. But how can I also get the total pieces from material_detail, grouped by material_id?
I want a result something like this:
pieces | material_detail
-------| ------------------
30 | [{"material_id":1,"pieces":50},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
As I am from MySQL background, I don't know how to achieve this with JSON fields in Postgres.
Note: material_detail column is of JSONB type.
You are aggregating on two different levels, and I can't think of a solution that wouldn't need two separate aggregation steps. To aggregate the material information, all arrays belonging to the item_id first have to be unnested before the pieces value can be summed per material_id; that result then has to be aggregated back into a JSON array.
with pieces as (
  -- the basic aggregation for the "detail pieces"
  select dt.item_id, sum(dt.pieces) as pieces
  from detail_table dt
  where dt.item_id = 2
  group by dt.item_id
), details as (
  -- normalize the material information and aggregate the pieces per material_id
  select dt.item_id,
         (m.detail -> 'material_id')::int as material_id,
         sum((m.detail -> 'pieces')::int) as pieces
  from detail_table dt
    cross join lateral jsonb_array_elements(dt.material_detail) as m(detail)
  where dt.item_id in (select item_id from pieces) --<< don't aggregate too much
  group by dt.item_id, material_id
), material as (
  -- now de-normalize the material aggregation back into a single JSON array
  -- for each item_id
  select item_id, jsonb_agg(to_jsonb(d) - 'item_id') as material_detail
  from details d
  group by item_id
)
-- join both results together
select p.item_id, p.pieces, m.material_detail
from pieces p
  join material m on m.item_id = p.item_id;

SQL: Merge two queries and insert a new column for a calculation

I have 2 tables, Transactions (attributes of interest: disponent_id, transaction_id) and Attachments (attributes of interest: disponent_id, filename).
The main goal is the following:
I want to group the transactions per disponent of the table "Transact" (transactions per disponent),
do the same with the table "Attach" (attachments per disponent),
and afterwards merge both and insert a new column, which shows the ratio of attachments per transaction (Attachments/Transactions):
(1)
Disponent | Transactions
----------|-------------
213456    | 35
...
(2)
Disponent | Attachments
----------|------------
213456    | 70
(3)
Disponent | Transactions | Attachments | Ratio
----------|--------------|-------------|------
213456    | 35           | 70          | 2
...
I've tried
SELECT Transact.disponent_id, COUNT(Transact.transaction_id) AS Transactionnumber
FROM Transact
GROUP BY Transact.disponent_id
UNION ALL
SELECT Attach.disponent_id, COUNT(Attach.filename) AS Filenumber
FROM Attach
GROUP BY Attach.disponent_id
But the result is only:
disponent_id | transactionnumber
234576 | 65
...
How can I insert the calculation and the attachment column?
I used your queries within a with clause, then wrote a new select statement with an inner join.
Check it out:
with wth0 as
(
  select
    Transact.disponent_id,
    count(Transact.transaction_id) as Transactionnumber
  from Transact
  group by Transact.disponent_id
),
wth1 as
(
  select Attach.disponent_id, count(Attach.filename) as Filenumber
  from Attach
  group by Attach.disponent_id
)
select
  wth0.disponent_id,
  wth0.Transactionnumber,
  wth1.Filenumber,
  wth1.Filenumber / wth0.Transactionnumber as Ratio
from wth0
inner join wth1
  on wth0.disponent_id = wth1.disponent_id;

Complex rank in SQL using Postgres

I'm in over my head with the SQL needed for a complex rank function. This is an app for a racing sport where I need to rank each Entry for a Timesheet based on the entry's :total_time.
The relevant models:
class Timesheet
has_many :entries
end
class Entry
belongs_to :timesheet
belongs_to :athlete
end
class Run
belongs_to :entry
end
An Entry's :total_time isn't stored in the database. It's a calculated column, runs.sum(:finish). I use the Postgres (9.3) rank() function to get the Entries for a given Timesheet and rank them by this calculated column.
def ranked_entries
  Entry.find_by_sql([
    "SELECT *, rank() OVER (ORDER BY total_time asc)
     FROM (
       SELECT Entries.id, Entries.timesheet_id, Entries.athlete_id,
              SUM(Runs.finish) AS total_time
       FROM Entries
       INNER JOIN Runs ON (Entries.id = Runs.entry_id)
       GROUP BY Entries.id) AS FinalRanks
     WHERE timesheet_id = ?", self.id])
end
So far so good. This returns my entry objects with a rank attribute which I can display on timesheet#show.
Now the tricky part. On a Timesheet, not every Entry will have the same number of runs. There is a cutoff (usually Top-20 but not always). This renders the rank() from Postgres inaccurate, because some Entries have a lower :total_time than the race winner simply because they didn't make the cutoff for the second heat.
My Question: Is it possible to do something like a rank() within a rank() to produce a table that looks like the one below? Or is there another preferred way? Thanks!
Note: I store times as integers, but I formatted them as the more familiar MM:SS in the simplified table below for clarity
| rank | entry_id | total_time |
|------|-----------|------------|
| 1 | 6 | 1:59.05 |
| 2 | 3 | 1:59.35 |
| 3 | 17 | 1:59.52 |
|......|...........|............|
| 20 | 13 | 56.56 | <- didn't make the top-20 cutoff, only has one run.
Let's create a table. (Get in the habit of including CREATE TABLE and INSERT statements in all your SQL questions.)
create table runs (
  entry_id integer not null,
  run_num integer not null
    check (run_num between 1 and 3),
  run_time interval not null
);
insert into runs values
(1, 1, '00:59.33'),
(2, 1, '00:59.93'),
(3, 1, '01:03.27'),
(1, 2, '00:59.88'),
(2, 2, '00:59.27');
This SQL statement will give you the totals in the order you want, but without ranking them.
with num_runs as (
  select entry_id, count(*) as num_runs
  from runs
  group by entry_id
)
select r.entry_id, n.num_runs, sum(r.run_time) as total_time
from runs r
inner join num_runs n on n.entry_id = r.entry_id
group by r.entry_id, n.num_runs
order by num_runs desc, total_time asc;
entry_id  num_runs  total_time
--------  --------  -----------
2         2         00:01:59.2
1         2         00:01:59.21
3         1         00:01:03.27
This statement adds a column for rank.
with num_runs as (
  select entry_id, count(*) as num_runs
  from runs
  group by entry_id
)
select
  rank() over (order by num_runs desc, sum(r.run_time) asc),
  r.entry_id, n.num_runs, sum(r.run_time) as total_time
from runs r
inner join num_runs n on n.entry_id = r.entry_id
group by r.entry_id, n.num_runs
order by rank asc;
rank  entry_id  num_runs  total_time
----  --------  --------  -----------
1     2         2         00:01:59.2
2     1         2         00:01:59.21
3     3         1         00:01:03.27