Hive Explode / Lateral View multiple arrays - hive

I have a hive table with the following schema:
COOKIE | PRODUCT_ID | CAT_ID | QTY
1234123 [1,2,3] [r,t,null] [2,1,null]
How can I normalize the arrays so that I get the following result?
COOKIE | PRODUCT_ID | CAT_ID | QTY
1234123 [1] [r] [2]
1234123 [2] [t] [1]
1234123 [3] null null
I have tried the following:
select concat_ws('|',visid_high,visid_low) as cookie
,pid
,catid
,qty
from table
lateral view explode(productid) ptable as pid
lateral view explode(catalogId) ptable2 as catid
lateral view explode(qty) ptable3 as qty
However, the result comes out as a Cartesian product.

I found a very good solution to this problem without using any UDF; posexplode works very well here:
SELECT COOKIE,
       ePRODUCT_ID,
       eCAT_ID,
       eQTY
FROM TABLE
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eQTY
WHERE seqp = seqc AND seqc = seqq;

You can use the numeric_range and array_index UDFs from Brickhouse (http://github.com/klout/brickhouse) to solve this problem. There is an informative blog post describing it in detail at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
Using those UDFs, the query would be something like:
select cookie,
array_index( product_id_arr, n ) as product_id,
array_index( catalog_id_arr, n ) as catalog_id,
array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;

You can do this by using posexplode, which, for each element, provides an integer from 0 up to n-1 indicating the element's position in the array. Then use this integer, call it pos (for position), to get the matching values from the other arrays using bracket notation, like this:
select
cookie,
n.pos as position,
n.prd_id as product_id,
cat_id[pos] as catalog_id,
qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;
This avoids using imported UDFs as well as joining the separately exploded arrays back together, which gives much better performance.

If you are using Spark 2.4 in PySpark, use arrays_zip together with posexplode:
from pyspark.sql.functions import arrays_zip, posexplode

df = (df
      .withColumn('zipped', arrays_zip('col1', 'col2'))
      .select('id', posexplode('zipped')))

I tried to work through your scenario... please try this code:
create table info(cookie string,productid int,catid string,qty string);
insert into table info
select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table
lateral view posexplode(productid) pro as myprod,pro
lateral view posexplode(categoryid) cate as mycat,cate
lateral view posexplode(qty) q as myqty,q
where myprod=mycat and mycat=myqty;
Note: in the statements above, if you put
select cookie,myprod,mycat,myqty from table
in place of
select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table
then the output will give you the index of each element in the productid, categoryid and qty arrays. Hope this helps.

Related

Can't find a way to improve my PostgreSQL query

In my PostgreSQL database I have 6 tables named storeAPrices, storeBprices etc., holding the same columns and indexes as follows:
item_code (string, primary key)
item_name (string, btree index)
is_whigthed (number: 0|1, btree index)
item_price (number)
I want to join each storePrices table to the others by item_code or by item_name similarity, but the "OR" should act as it does in a programming language (check the right side only if the left side is false).
Currently, my query has low performance.
select
*
FROM "storeAprices" sap
left JOIN LATERAL (
SELECT * FROM "storeBPrices" sbp
WHERE
similarity(sap.item_name,sbp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,sbp.item_name) DESC
limit 1
) bp ON case when sap.item_code = bp.item_code then true else sap.item_name % bp.item_name end
left JOIN LATERAL (
select * FROM "storeCPrices" scp
WHERE similarity(sap.item_name,scp.item_name) >= 0.45
ORDER BY similarity(sap.item_name,scp.item_name) desc
limit 1
) rp ON case when sap.item_code = rp.item_code then true else sap.item_name % rp.item_name end
This is part of my query, and it takes too long to respond. My data is not that large (15k items per table).
Also, I have another index on "is_whigthed" that I'm not sure how to use. (I don't want to set it as a variable because I want to get all "is_whigthed" results.)
Any suggestions?
OR should be faster than using CASE:
bp ON sap.item_code = bp.item_code OR sap.item_name % bp.item_name
Also, you can create a trigram index on the item_name columns, as described in the pg_trgm module docs, since you are using its % operator for similarity.
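For example, a minimal sketch (the index names are just illustrative; the table and column names follow the query above, and the pg_trgm extension must be installed):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram indexes so the % similarity operator can use an index scan
CREATE INDEX storeaprices_item_name_trgm_idx
    ON "storeAprices" USING gin (item_name gin_trgm_ops);
CREATE INDEX storebprices_item_name_trgm_idx
    ON "storeBPrices" USING gin (item_name gin_trgm_ops);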

Array field in BigQuery SQL returns an error when filtering

I have the follow query in BigQuery:
SELECT *
FROM `data`, UNNEST(deliveries.modalities.campaigns) as dmc
where
dmc.id = 4469
The struct of the field deliveries is:
deliveries RECORD REPEATED
-----items RECORD REPEATED
-----modalities RECORD REPEATED
----------campaigns RECORD REPEATED
---------------coparticipations RECORD REPEATED
---------------id
I want to filter on deliveries.modalities.campaigns.id, but my query doesn't work. Can anyone help me?
Another approach that you might consider is to use a CTE, as shown below:
with test_1 as (
select *
from `your-project.your-dataset.test_deliveries`, unnest (deliveries) d
JOIN unnest (d.modalities) m
)
select *
from test_1, unnest (campaigns) c
where c.id = 4469
Output:
My Test Schema:
The .jsonl file I loaded to create my sample data:
{"deliveries": [{"items": [1,2,3],"modalities": [{"campaigns": [{"coparticipations": [1,2,3],"id": 1234}]}]}]}
{"deliveries": [{"items": [2,3,4],"modalities": [{"campaigns": [{"coparticipations": [4,5,6],"id": 2345}]}]}]}
{"deliveries": [{"items": [3,4,5],"modalities": [{"campaigns": [{"coparticipations": [7,8,9],"id": 4469}]}]}]}
{"deliveries": [{"items": [4,5,6],"modalities": [{"campaigns": [{"coparticipations": [10,11,12],"id": 3456}]}]}]}

Delete elements from array

There is a table with a column called Cars; in this column I have an array [Audi, BMW, Toyota, ..., VW].
I want to update this table and set Cars without a few of the elements from this array (Toyota, ..., BMW).
How can I do that? I want to pass another array and delete the elements that match.
You can unnest the array, filter, and reaggregate:
select t.*,
(select array_agg(car)
from unnest(t.cars) car
where car not in ( . . . )
) new_cars
from t;
If you want to keep the original ordering:
select t.*,
(select array_agg(u.car order by n)
from unnest(t.cars) with ordinality u(car, n)
where u.car not in ( . . . )
) new_cars
from t
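Since the question asks to update the table in place, the same idea can be written as an UPDATE; a sketch assuming the table is t, the column is cars, and 'Toyota'/'BMW' stand in for the elements you want to drop:
update t
set cars = (select array_agg(u.car order by u.n)
            from unnest(t.cars) with ordinality u(car, n)
            where u.car not in ('Toyota', 'BMW'));
-- note: if every element is removed, array_agg() returns NULL rather than an empty array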
You could call array_remove several times:
SELECT array_remove(
array_remove(
ARRAY['Audi', 'BMW', 'Toyota', 'Opel', 'VW'],
'Audi'
),
'BMW'
);
array_remove
------------------
{Toyota,Opel,VW}
(1 row)
Maybe I can help using pandas in Python. Assuming you want to delete all the rows containing the elements you'd like to remove, and that df is your dataframe:
import pandas as pd

# rows whose cars value is one of the values to delete
vals_to_delete = df.loc[(df['cars'] == 'Audi') | (df['cars'] == 'VW')]
df = df.drop(vals_to_delete.index)
or you could also do
df1 = df.loc[(df['cars'] != 'Audi') & (df['cars'] != 'VW')]
In SQL, you could use
DELETE FROM table WHERE Cars IN ('Audi', 'VW');

Filter by clustering fields using a sub-select query

With Google Bigquery, I am querying a clustered table by applying a filter on the clustering field projectId, like so:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable`
WHERE
--projectId IN UNNEST((SELECT projectsArray FROM userProjects))
projectId IN ("mydata", "anotherproject")
AND _PARTITIONTIME >= "2019-03-20"
Clustering is applied correctly in the code snippet above, but when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)), clustering doesn't apply.
I've tried wrapping it in a UDF like this as well, which also doesn't work:
CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
item
);
...
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
As I understand from this, the execution path for sub-select queries is different to merely filtering on a scalar or array directly.
I expect a solution to exist where I can programmatically supply an array to filter on that will still allow me the cost benefit a clustered table provides.
In summary:
WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [Not OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [Not OK]
Any ideas?
My suggestion is to rewrite your query so that your nested SELECT is a temporary table (which you've already done) and then perform the filtering you require by using an INNER JOIN rather than a set membership test, so your query would become something like this:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable` as a
JOIN
userProjects as b
ON a.projectId = b.projectsArray
WHERE
  _PARTITIONTIME >= "2019-03-20"
I believe this will result in a query which does not scan the full partition if that field is clustered.
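Note that if projectsArray is an ARRAY column (which the IN UNNEST(...) in the question suggests), the join key has to be unnested first; a sketch of that variant, keeping the same table and column names:
WITH userProjects AS (
SELECT projectId
FROM projectsPerUser, UNNEST(projectsArray) AS projectId
WHERE userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable` AS a
JOIN
userProjects AS b
ON a.projectId = b.projectId
WHERE
_PARTITIONTIME >= "2019-03-20"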
FWIW, clustering works well for me with dynamic filters:
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1
1.8 sec elapsed, 364 MB processed
if instead I do
AND title IN (
SELECT DISTINCT prev
FROM `fh-bigquery.wikipedia_vt.clickstream_materialized`
WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
ORDER BY 1 LIMIT 3)
2.9 sec elapsed, 513.8 MB processed
If I go to v2 (not clustered), instead of v3:
FROM `fh-bigquery.wikipedia_v2.pageviews_2019`
2.6 sec elapsed, 9.6 GB processed
I'm not sure what's happening in your tables - but it might be interesting to revisit.

Aggregate values and Pivot

I am partly on my way to solving this, but have hit a stumbling block, which I think can be solved with pivot(s).
I have the following SQL query, combining two temporary table variables (I may change these to temporary tables, as I think performance may become a problem since they will be hit a large number of times):
SELECT MeterId, MeterDataOutput.BuildingId, MeterDataOutput.Value,
MeterDataOutput.TimeStamp, UtilityId, SnapshotId
FROM #MeterDataOutput as MeterDataOutput INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
This produces the following table:
I have then modified the query to group by BuildingId, SnapshotId, Timestamp and UtilityId, and applied the SUM() function to aggregate the Value field (dropping MeterId, as it's not required), as follows:
SELECT MeterDataOutput.BuildingId, SUM(MeterDataOutput.Value) AS Value, MeterDataOutput.TimeStamp, UtilityId, SnapshotId
FROM #MeterDataOutput as MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
GROUP BY MeterDataOutput.BuildingId, MeterDataOutput.TimeStamp, UtilityId, SnapshotId
This query then provides me with the following table:
Now the bit I'm having trouble with is transforming the UtilityId values into columns and placing the values from the Value field under each column, i.e.:
For reference, BuildingId, TimeStamp, SnapshotId and Value are variable. UtilityId value 6 is always 'Electricity', 7 is always 'Gas' and 8 is always 'Water'.
I'm actually starting to get the hang of this SQL lark :)
Maybe something like this:
SELECT
pvt.BuildingId,
pvt.SnapshotId,
pvt.TimeStamp,
pvt.[6] AS Electricity,
pvt.[7] AS Gas,
pvt.[8] AS Water
FROM
(
SELECT
MeterDataOutput.BuildingId,
MeterDataOutput.Value,
MeterDataOutput.TimeStamp,
UtilityId,
SnapshotId
FROM #MeterDataOutput as MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
) AS SourceTable
PIVOT
(
SUM(Value)
FOR UtilityId IN ([6],[7],[8])
) AS pvt
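An equivalent that avoids PIVOT uses conditional aggregation; a sketch against the same temp tables, with the UtilityId-to-name mapping taken from the question:
SELECT
    MeterDataOutput.BuildingId,
    MeterDataOutput.[TimeStamp],
    SnapshotId,
    SUM(CASE WHEN UtilityId = 6 THEN MeterDataOutput.Value END) AS Electricity,
    SUM(CASE WHEN UtilityId = 7 THEN MeterDataOutput.Value END) AS Gas,
    SUM(CASE WHEN UtilityId = 8 THEN MeterDataOutput.Value END) AS Water
FROM #MeterDataOutput AS MeterDataOutput
INNER JOIN #InsertOutput AS InsertOutput
    ON MeterDataOutput.BuildingId = InsertOutput.BuildingId
    AND MeterDataOutput.[Timestamp] = InsertOutput.[TimeStamp]
GROUP BY MeterDataOutput.BuildingId, MeterDataOutput.[TimeStamp], SnapshotId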