How to properly do conditional aggregation in Hive

Assume I have this data:
player_id stats
100 [{"position":"offense","wins":35},{"position":"defense","wins":17}]
200 [{"position":"offense","wins":85},{"position":"defense","wins":52}]
300 [{"position":"offense","wins":12},{"position":"defense","wins":98}]
And I want to display it as such:
player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98
The original data above is currently thrown into an ORC table using:
SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s
Which gets me:
player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98
Now, from this point, in MySQL I can just GROUP BY the player_id and CASE on the position, pulling the associated wins value into its own column when position = 'offense' or 'defense', then wrap each CASE in COALESCE() to prevent nulls from coming through. Super fast.
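For reference, here is a minimal sketch of that MySQL-style query (written with MAX around each CASE for portability; the table name and the 0 default in COALESCE are assumptions, not from the original post):

SELECT player_id
, COALESCE(MAX(CASE WHEN position = 'offense' THEN wins END), 0) AS offense_wins -- 0 default assumed
, COALESCE(MAX(CASE WHEN position = 'defense' THEN wins END), 0) AS defense_wins
FROM exploded_stats -- hypothetical MySQL table holding the exploded rows
GROUP BY player_id;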
In Hive, instead of relying on COALESCE I have to wrap each CASE in MIN or MAX, but the result is the same either way.
Here would be the primary way this data is queried:
SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id
Which would result in:
player_id offense_wins defense_wins
100 35 17
Now, in my real-world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table lists a given player_id some 700 times from the repeated lateral views.
The entire table is 300k rows, and the player_id in the real-world example is duplicated on this table just over 700 times.
Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.


How to select last element for each ID

I would like to select some elements from the last row of each id.
Here an example that I have :
id money
1 200
1 150
1 500
3 50
4 40
4 300
5 110
Here what I would like :
1 500
3 50
4 300
5 110
So as you can see, for each id I took the last row and the money value that corresponds to it.
I tried a GROUP BY id with ORDER BY id descending and LIMIT 1, but LIMIT is not available in PROC SQL in SAS, so it doesn't work.
Thanks in advance
Unlike SAS datasets, SQL tables represent unordered sets. In your case, it looks like you want the maximum value in the second column, in which case you can use aggregation:
proc sql;
select id, max(money)
from t
group by id;
quit;
If you actually mean the last row per id based on the ordering in the SAS dataset, I would suggest using a data step instead.
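A minimal sketch of that data step approach (assuming the dataset is named t and already carries the ordering you mean by "last"):

/* keep only the final row of each id group, per the dataset's order */
data last_per_id;
    set t;
    by id;      /* requires t to be sorted or grouped by id */
    if last.id; /* subsetting IF: keep the last row per id */
run;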

Separating an Oracle query with 1.8 million rows into 40,000-row blocks

I have a project where I am taking Documents from one system and importing them into another.
The first system stores the documents and their associated keywords. I have a query that returns the results, which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved, so this means 1.8 million rows (one per document).
I need to divide the returned results into blocks of 40,000 so I can import them in batches of 40,000 at a time, rather than as one long import.
I have the query that returns the results I need; I just need to know how to break it up for easier import. My apologies if I have included too little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
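Before running the 45 extracts, you can sanity-check the distribution with a quick aggregate (a sketch reusing the hypothetical some_table and id from above):

-- count rows per hash bucket to confirm they are roughly even
SELECT ora_hash(id, 44) AS bucket, COUNT(*) AS rows_in_bucket
FROM some_table
GROUP BY ora_hash(id, 44)
ORDER BY bucket;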
A preferred way to bucket the IDs is to use the NTILE analytic function.
I'll demonstrate this with a simplified example: a table with 18 rows that should be divided into four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note that the IDs are not consecutive, so simple arithmetic cannot be used. NTILE takes the requested number of buckets (4) as its parameter and calculates the chunk_id:
select id,
ntile(4) over (order by ID) as chunk_id
from tab
order by id;
ID CHUNK_ID
---------- ----------
1 1
2 1
3 1
7 1
8 1
9 2
10 2
15 2
16 2
17 2
18 3
19 3
20 3
21 3
23 4
24 4
25 4
26 4
18 rows selected.
Bucket sizes differ by at most one row, with the larger buckets coming first (here 5, 5, 4, 4).
If you want to calculate the ranges, use simple aggregation:
with chunk as (
select id,
ntile(4) over (order by ID) as chunk_id
from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
CHUNK_ID ID_FROM ID_TO
---------- ---------- ----------
1 1 8
2 9 17
3 18 21
4 23 26
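Those ranges can then drive the batched export, one chunk per run (a sketch; :id_from and :id_to are bind-variable placeholders):

-- extract a single chunk using the precomputed boundaries
SELECT *
FROM tab
WHERE id BETWEEN :id_from AND :id_to;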

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I am now planning to move this data to a partitioned table.
That means I will have 270 partitions. What is the best way to move this data into the partitioned table?
I don't want to run 270 queries, which would be a very costly operation, so I am looking for an optimized solution.
I have multiple tables like this, and I need to migrate all of them to partitioned tables.
Thanks,
I see three options.
Direct Extraction out of the original table:
Actions (how many queries to run) = Days [to extract] = 270
Full Scans (how much data is scanned, measured in full scans of the original table) = Days = 270
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00
Hierarchical (recursive) Extraction (described in Mosha's answer):
Actions = 2^ceil(log2(Days)) - 2 = 510
Full Scans = 2 * ceil(log2(Days)) = 18
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60
Clustered Extraction (I will describe it in a sec):
Actions = Days + 1 = 271
Full Scans = [always] 2
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
Summary
Method Actions Total Full Scans Total Cost
Direct Extraction 270 270 $459.00
Hierarchical(recursive) Extraction 510 18 $30.60
Clustered Extraction 271 2 $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, each query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost savings are quite visible: from $459 down to $31, almost 15 times lower!
But if you a) want to lower the cost even further, by yet another factor of 9 (making it about 135 times lower in total), and b) like having fun and more challenges, take a look at the third option.
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns, one column per day.
Each column will hold one serialized row from the original table for its respective day.
The total number of rows in this new table will equal the number of rows in the most "heavy" day.
This requires just one query (see the example below) with one full scan.
Step 2
After this new table is ready, we extract day by day, querying ONLY the respective column, and write into the final daily tables (the daily tables' schema is exactly the same as the original table's schema, and all those tables can be pre-created).
This requires 270 queries to be run, with scans approximately equivalent (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we need to de-serialize the row's value and parse it back into the original schema.
Very simplified example (using BigQuery Standard SQL here):
The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is kept extremely simple to keep the focus on the idea rather than on a particular implementation, which can differ from case to case (it mostly depends on the schema).
So, assume the original table (theTable) looks something like below:
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform table and write result into tempTable
SELECT
num,
MAX(IF(ts=1, ser, NULL)) AS ts_1,
MAX(IF(ts=2, ser, NULL)) AS ts_2,
MAX(IF(ts=3, ser, NULL)) AS ts_3,
MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
SELECT
ts,
CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
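As an aside, for a real schema a more robust alternative to hand-rolled concatenation is BigQuery's TO_JSON_STRING function, which can serialize an entire row, nested fields included (a sketch; de-serialization would then use the JSON functions instead of SPLIT):

#standardSQL
-- serialize each full row as JSON instead of concatenating fields by hand
SELECT
ts,
TO_JSON_STRING(t) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) AS num
FROM theTable t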
Step 2 – extracting rows for a specific day and writing the output to the respective daily table
Please note: in the example below we are extracting rows for ts = 2, which corresponds to column ts_2:
SELECT
CAST(r[OFFSET(0)] AS INT64) AS id,
r[OFFSET(1)] AS x,
CAST(r[OFFSET(2)] AS INT64) AS ts
FROM (
SELECT SPLIT(ts_2, "|") AS r
FROM tempTable
WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this down properly, so don't judge too harshly if something is missing. This is more of a directional answer, but at the same time the example is pretty reasonable, and if you have a plain, simple schema, almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is the serialization / de-serialization, but that's where the fun is, along with the extra $ savings.
I will add a fourth option to Mikhail's answer:
DML QUERY
Actions = 1 query to run
Full Scans = 1
Cost, $ = $5 x 0.34 = $1.70 (270 times cheaper than solution #1 \o/)
With the new DML feature of BigQuery you can convert a non-partitioned table to a partitioned one while doing only one full scan of the source table.
To illustrate my solution I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. Below is the table's schema:
name | type | description
_________________________________
id | INTEGER | ...
_________________________________
by | STRING | ...
_________________________________
author | STRING | ...
_________________________________
... | |
_________________________________
time_ts | TIMESTAMP | human-readable timestamp in UTC, YYYY-MM-DD hh:mm:ss (the column we will partition by)
_________________________________
... | |
_________________________________
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
I hope it helps :)
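Once the partitioned table is created, queries that constrain the partitioning column scan only the matching partitions rather than the whole table. A sketch (the dates are just example values):

#standardSQL
-- restricting time_ts lets BigQuery prune to a single daily partition
SELECT *
FROM my_dataset.comments_partitioned
WHERE time_ts >= TIMESTAMP('2015-01-01')
AND time_ts < TIMESTAMP('2015-01-02')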
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could have used the "bq partition" command. But with the data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries doing 270 full table scans, first split the table in half, then each half in half, etc. This way you will need to pay for 2 * ceil(log2(270)) = 2 * 9 = 18 full scans.
Once the conversion is done, all the temporary tables can be deleted to eliminate extra storage costs.
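A minimal sketch of the first split step (the table name and the midpoint date are hypothetical; ts is assumed to be the partition-key column):

#standardSQL
-- write the earlier half of the days to one temp table; run again with the
-- predicate flipped for the other half, then repeat on each half
SELECT *
FROM `my_project.my_dataset.original_table`
WHERE ts < TIMESTAMP('2017-01-01')

Each level of the hierarchy scans the full data twice (once per output half), and there are ceil(log2(270)) = 9 levels, which is where the 18 full scans come from.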

Analyze intervals for gaps

I have a table with roads that contain mileage of start/end of every road.
I need to analyze this data and produce a query that returns the same data plus extra rows containing the start/end mileage of the gaps between roads, with the name column filled with the value 'gap'.
Initial table:
id name kmstart kmend
1 road1 0 150
2 road2 150 200
3 road3 220 257
4 road4 260 290
Result query:
id name kmstart kmend
1 road1 0 150
2 road2 150 200
null gap 200 220
3 road3 220 257
null gap 257 260
4 road4 260 290
Try this query:
SELECT NULL AS id, 'gap' AS name, previous_kmend AS kmstart, kmstart AS kmend
FROM (
SELECT id, name, kmstart, kmend, LAG(kmend) OVER (ORDER BY kmstart, kmend) AS previous_kmend
FROM roads
)
WHERE previous_kmend < kmstart
UNION ALL
SELECT id, name, kmstart, kmend
FROM roads
ORDER BY kmstart, kmend
I just put together a quick test and it works for me.
It uses the LAG function to get the previous row's kmend value, and then returns a "gap" row if that value is less than the current record's kmstart. I wrote an article about the LAG function recently, so it was helpful to remember.
Is this what you're after?
Also, as the other commenters have mentioned, "name" isn't a good column name as it's a reserved word. I've left it here in the code so it's consistent with your question, though.
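For reference, a minimal setup matching the sample data, in case you want to reproduce the test (Oracle syntax assumed):

-- create and populate the roads table from the question
CREATE TABLE roads (id NUMBER, name VARCHAR2(20), kmstart NUMBER, kmend NUMBER);
INSERT INTO roads VALUES (1, 'road1', 0, 150);
INSERT INTO roads VALUES (2, 'road2', 150, 200);
INSERT INTO roads VALUES (3, 'road3', 220, 257);
INSERT INTO roads VALUES (4, 'road4', 260, 290);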

How to make one column fixed?

There is one scheme with different items inside it. The scenario is that when the user sends a SchemeID to the procedure, it should return the SchemeName (once) and all items inside the scheme, i.e. DescriptionOfItem, Quantity, Rate, Amount..., in this format:
SchemeName DescriptionOfItems Quantity Unit Rate Amount
Scheme01 Bulbs 2 M2 200 400
Titles 10 M3 300 3000
SolarPanels 2 M2 1000 2000
Bricks 50 M9 50 2500
Total 7900
My try works, but it repeats the SchemeName for each row and can't produce the total:
Select
Schemes.SchemeName,
ContractorsWorkDetails.ContractorsWorkDetailsItemDescription,
ContractorsWorkDetails.ContractorsWorkDetailsUnit,
ContractorsWorkDetails.ContractorsWorkDetailsItemQuantity,
ContractorsWorkDetails.ContractorsWorkDetailsItemRate,
ContractorsWorkDetails.ContractorsWorkDetailsAmount
From ContractorsWorkDetails
Inner Join Schemes
ON Schemes.pk_Schemes_SchemeID= ContractorsWorkDetails.fk_Schemes_ContractorsWorkDetails_SchemeID
Where ContractorsWorkDetails.fk_Schemes_ContractorsWorkDetails_SchemeID= 2
Update:
I tested the query as suggested below, but it gives a different kind of result.
You can get the total using grouping sets. I would advise you to keep the schema name on each row. If you want it filtered out on certain rows, then do that at the application layer.
Now, having said that, I think this will do what you want in SQL:
Select (case when GROUPING(cwd.ContractorsWorkDetailsItemDescription) = 1
then 'Total'
when row_number() over (partition by s.SchemeName
order by cwd.ContractorsWorkDetailsItemDescription
) = 1
then s.SchemeName else ''
end) as SchemeName,
cwd.ContractorsWorkDetailsItemDescription,
cwd.ContractorsWorkDetailsUnit,
cwd.ContractorsWorkDetailsItemQuantity,
cwd.ContractorsWorkDetailsItemRate,
SUM(cwd.ContractorsWorkDetailsAmount) as ContractorsWorkDetailsAmount
From ContractorsWorkDetails cwd Inner Join
Schemes s
ON s.pk_Schemes_SchemeID = cwd.fk_Schemes_ContractorsWorkDetails_SchemeID
Where cwd.fk_Schemes_ContractorsWorkDetails_SchemeID = 2
group by GROUPING SETS ((s.SchemeName,
cwd.ContractorsWorkDetailsItemDescription,
cwd.ContractorsWorkDetailsUnit,
cwd.ContractorsWorkDetailsItemQuantity,
cwd.ContractorsWorkDetailsItemRate
), s.SchemeName)
Order By GROUPING(cwd.ContractorsWorkDetailsItemDescription),
s.SchemeName, cwd.ContractorsWorkDetailsItemDescription;
The reason you don't want to do this in SQL is that the result set no longer has a relational structure: the ordering of the rows is important.
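As a minimal, standalone illustration of the GROUPING() flag used above (T-SQL; the data is made up):

-- GROUPING(x) is 1 on super-aggregate rows where x has been rolled up, else 0
SELECT x, SUM(y) AS total, GROUPING(x) AS is_total_row
FROM (VALUES ('a', 1), ('a', 2), ('b', 3)) v(x, y)
GROUP BY GROUPING SETS ((x), ());
-- returns ('a', 3, 0), ('b', 3, 0), (NULL, 6, 1)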