I am working in Hive. So far it's really great, but I have a problem regarding a query.
I have two tables called 'marked' and 'data' and want to extract data from both with one query.
First I want to extract the mindate from table 'marked' and count the entries in table 'data' between the mindate (obtained from 'marked') and the current date.
So I want to get one result containing the userID, the mindate, and the number of occurrences of this userID in the other table between the mindate and the current date.
I have been trying to get this query right for hours, but joins as I know them are not working. Can anybody help me out?
Thanks a lot!
UPDATE:
Sorry, I was in a bit of a hurry yesterday. My fault for leaving out some details.
About the schema:
The table 'marked' has just a few columns, eight in total. Here is the schema of this table:
"name": "Datetime",
"type": "long",
"logicalType": "timestamp-millis",
"name": "Hour",
"type": "string",
"name": "UserId64",
"type": "long"
"name": "MemberId",
"type": "int"
"name": "SegmentId",
"type": "int"
"name": "IsDailyUnique",
"type": "boolean"
"name": "IsMonthlyUnique",
"type": "boolean"
"name": "Value",
"type": "int"
The schema of the other table, called 'data', is a bit more involved since it contains more than 100 columns. To keep it simple, I outline just the important ones:
"name": "Datetime",
"type": "long",
"logicalType": "timestamp-millis",
"name": "Hour",
"type": "string",
"name": "UserId64",
"type": "long"
"type": "enum",
"name": "EventType",
"symbols": ["IMP", "CLICK", "PC_CONV", "PV_CONV"]
So if I run a query like the following, I get a list with the result:
select timestamp(datetime), hour, userid64, segmentid, isdailyunique,
ismonthlyunique, date from marked where userid64 = 8012570064195370898
and segmentid = 1878696 order by datetime desc;
The resulting table contains the data. Now I want to use the oldest date obtained there in my further query.
If we go to table data and do the following query
select timestamp(datetime), auctionid64, hour, eventtype,
mediacostdollarscpm, buyerspend, buyerbid, ecp, eap, isimp, isclick,
userid64, sellerid, publisherid, siteid, sitedomain, advertiserid,
advertiserfrequency, advertiserrecency, campaigngroupid, campaignid,
creativeid, creativefreq, creativerec, pixelid, dealid, dealtype,
custommodelid, custommodellastmodified, leafname, datetime from data
where userid64 = 8012570064195370898 and advertiserid = 327758 order
by datetime desc;
you will get the results as seen below
2016-08-09 19:33:45.0 5908114946988383281 17 PV_CONV
2016-08-07 19:17:13.0 5908114946988383281 17 IMP
2016-08-07 19:16:29.0 5454485145188351263 17 IMP
2016-08-07 18:52:40.0 1074433759230515153 16 IMP
2016-08-07 18:52:40.0 6991642005216308404 16 IMP
2016-08-07 18:52:13.0 5024645171257244072 16 IMP
2016-08-07 18:51:55.0 5371107932239703086 16 IMP
2016-08-07 18:51:55.0 7321752276741166764 16 IMP
2016-08-07 18:51:01.0 3459181835067844898 16 IMP
2016-08-07 18:50:42.0 6208818658549255015 16 IMP
2016-08-07 18:50:41.0 5373958128201701132 16 IMP
2016-08-07 14:34:07.0 8393280749656213703 12 IMP
The important line here is the second one; the line just above it carries the event type "PV_CONV".
What I want:
I want a query which generates a table containing:
userid
min date of the table marked
max date of the table data containing the event_type "IMP"
time difference between marked date and max date of the table data
and some other columns of the table data.
Is there any chance to get this without creating additional tables?
All the best and thanks
Peter
Since the table schema was not provided, I assumed the table schema below to answer your question.
Table- Marked:
UserID int,
mindate date
Table- Data:
UserID int,
data_date date
Considering UserID as the primary column to join the tables, here is the query:
SELECT D.UserID, M.mindate, count(D.data_date) from Marked M
join Data D on M.UserID = D.UserID
where M.mindate <= D.data_date and D.data_date <= from_unixtime(unix_timestamp())
group by D.UserID, M.mindate;
Depending on the 'Date' datatype in your table, the WHERE clause may need to be changed.
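Mapping this onto the actual schema from the update, a hedged sketch of the full request (min date from marked, max IMP date from data, and their difference in days) could look like the following. It assumes Datetime holds epoch milliseconds, as the timestamp-millis logicalType suggests; pulling further columns out of data alongside these aggregates would need an extra join back or window functions:

SELECT
    m.userid64,
    MIN(to_date(from_unixtime(CAST(m.datetime / 1000 AS BIGINT)))) AS min_marked_date,
    MAX(to_date(from_unixtime(CAST(d.datetime / 1000 AS BIGINT)))) AS max_imp_date,
    -- days between the oldest marked date and the newest IMP date
    DATEDIFF(MAX(to_date(from_unixtime(CAST(d.datetime / 1000 AS BIGINT)))),
             MIN(to_date(from_unixtime(CAST(m.datetime / 1000 AS BIGINT))))) AS day_diff
FROM marked m
JOIN data d ON m.userid64 = d.userid64
WHERE d.eventtype = 'IMP'
GROUP BY m.userid64;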
I have a BigQuery table with nested and repeated columns that I'm having trouble unnesting to columns instead of rows.
This is what the table looks like:
{
"dateTime": "Tue Mar 01 2022 11:11:11 GMT-0800 (Pacific Standard Time)",
"Field1": "123456",
"Field2": true,
"Matches": [
{
"ID": "abc123",
"FieldA": "asdfljadsf",
"FieldB": "0.99"
},
{
"ID": "def456",
"FieldA": "sdgfhgdkghj",
"FieldB": "0.92"
},
{
"ID": "ghi789",
"FieldA": "tfgjyrtjy",
"FieldB": "0.64"
}
]
},
What I want is to return a table with each one of the nested fields as an individual column so I have a clean dataframe to work with:
{
"dateTime": "Tue Mar 01 2022 11:11:11 GMT-0800 (Pacific Standard Time)",
"Field1": "123456",
"Field2": true,
"Matches_1_ID": "abc123",
"Matches_1_FieldA": "asdfljadsf",
"Matches_1_FieldB": "0.99",
"Matches_2_ID": "def456",
"Matches_2_FieldA": "sdgfhgdkghj",
"Matches_2_FieldB": "0.92",
"Matches_3_ID": "ghi789",
"Matches_3_FieldA": "tfgjyrtjy",
"Matches_3_FieldB": "0.64"
},
I tried using UNNEST as below; however, it creates new rows with only one set of additional columns, which is not what I'm looking for.
SELECT *
FROM table,
UNNEST(Matches) as items
Any solution for this? Thank you in advance.
Consider the approach below:
select * from (
select t.* except(Matches), match.*, offset + 1 as offset
from your_table t, t.Matches match with offset
)
pivot (min(ID) as id, min(FieldA) as FieldA, min(FieldB) as FieldB for offset in (1,2,3))
If applied to the sample data in your question, the output matches the desired flattened shape shown above.
If there are no matches, that entire row is not included in the output table, whereas I want to keep that record with the fields it does have. How can I modify this to keep all records?
Use LEFT JOIN instead, as below:
select * from (
select t.* except(Matches), match.*, offset + 1 as offset
from your_table t left join t.Matches match with offset
)
pivot (min(ID) as id, min(FieldA) as FieldA, min(FieldB) as FieldB for offset in (1,2,3))
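If the number of Matches per row is not known up front, one hedged option is to build the offset list dynamically and run the pivot through EXECUTE IMMEDIATE (a sketch assuming BigQuery scripting; your_table is the placeholder from the answer above):

DECLARE max_len INT64;
DECLARE offsets STRING;

-- length of the longest Matches array across all rows
SET max_len = (SELECT MAX(ARRAY_LENGTH(Matches)) FROM your_table);

-- e.g. "1,2,3" when max_len = 3
SET offsets = (
  SELECT STRING_AGG(CAST(pos AS STRING), ',')
  FROM UNNEST(GENERATE_ARRAY(1, max_len)) AS pos
);

EXECUTE IMMEDIATE FORMAT('''
SELECT * FROM (
  SELECT t.* EXCEPT(Matches), match.*, offset + 1 AS offset
  FROM your_table t LEFT JOIN t.Matches match WITH OFFSET
)
PIVOT (MIN(ID) AS id, MIN(FieldA) AS FieldA, MIN(FieldB) AS FieldB FOR offset IN (%s))
''', offsets);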
I'm using a PostgreSQL DB. I have a table named 'offers' which has a column 'validity' which contains the following data in JSON format:
[{"end_date": "2019-12-31", "program_id": "4", "start_date": "2019-10-27"},
{"end_date":"2020-12-31", "program_id": "6", "start_date": "2020-01-01"},
{"end_date": "2020-01-01", "program_id": "3", "start_date": "2019-10-12"}]
Now I want to get all records where 'validity' column contains:
program_id = 4 and end_date > current_date.
How do I write a SQL query or a knexjs query to achieve this?
Thanks in advance
You can use an EXISTS condition:
select o.*
from offers o
where exists (select *
from jsonb_array_elements(o.validity) as v(item)
where v.item ->> 'program_id' = '4'
and (v.item ->> 'end_date')::date > current_date)
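If offers is large, a hedged refinement (assuming validity is jsonb; the index name is illustrative): a jsonb containment pre-filter can use a GIN index, while the EXISTS still enforces the date condition:

create index if not exists offers_validity_gin on offers using gin (validity);

select o.*
from offers o
where o.validity @> '[{"program_id": "4"}]'   -- containment pre-filter; can use the GIN index
  and exists (select 1
              from jsonb_array_elements(o.validity) as v(item)
              where v.item ->> 'program_id' = '4'
                and (v.item ->> 'end_date')::date > current_date);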
In my table, there is a column of JSON type which contains an array of objects describing time offsets:
[
  {
    "type": "start",
    "time": 1.234
  },
  {
    "type": "end",
    "time": 50.403
  }
]
I know that I can extract these with JSON_EACH() and JSON_EXTRACT():
CREATE TEMPORARY TABLE Items(
id INTEGER PRIMARY KEY,
timings JSON
);
INSERT INTO Items(timings) VALUES
('[{"type": "start", "time": 12.345}, {"type": "end", "time": 67.891}]'),
('[{"type": "start", "time": 24.56}, {"type": "end", "time": 78.901}]');
SELECT
JSON_EXTRACT(Timings.value, '$.type'),
JSON_EXTRACT(Timings.value, '$.time')
FROM
Items,
JSON_EACH(timings) AS Timings;
This returns a table like:
start 12.345
end 67.891
start 24.56
end 78.901
What I really need though is to:
Find the timings of specific types. (Find the first object in the array that matches a condition.)
Take this data and select it as a column with the rest of the table.
In other words, I'm looking for a table that looks like this:
id start end
-----------------------------
0 12.345 67.891
1 24.56 78.901
I'm hoping for some sort of query like this:
SELECT
id,
JSON_EXTRACT(timings, '$.[type="start"].time'),
JSON_EXTRACT(timings, '$.[type="end"].time')
FROM Items;
Is there some way to use path in the JSON functions to select what I need? Or, some other way to pivot what I have in the first example to apply to the table?
One possibility:
WITH cte(id, json) AS
(SELECT Items.id
, json_group_object(json_extract(j.value, '$.type'), json_extract(j.value, '$.time'))
FROM Items
JOIN json_each(timings) AS j ON json_extract(j.value, '$.type') IN ('start', 'end')
GROUP BY Items.id)
SELECT id
, json_extract(json, '$.start') AS start
, json_extract(json, '$.end') AS "end"
FROM cte
ORDER BY id;
which gives
id start end
---------- ---------- ----------
1 12.345 67.891
2 24.56 78.901
Another one, which uses the window functions added in SQLite 3.25 and avoids creating intermediate JSON objects:
SELECT DISTINCT Items.id
, max(json_extract(j.value, '$.time'))
FILTER (WHERE json_extract(j.value, '$.type') = 'start') OVER ids AS start
, max(json_extract(j.value, '$.time'))
FILTER (WHERE json_extract(j.value, '$.type') = 'end') OVER ids AS "end"
FROM Items
JOIN json_each(timings) AS j ON json_extract(j.value, '$.type') IN ('start', 'end')
WINDOW ids AS (PARTITION BY Items.id)
ORDER BY Items.id;
The key is using the ON clause of the JOIN to limit the results to just the two objects in each array that you care about, and then merging those up-to-two rows per Items.id into one, via a couple of different approaches.
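A third hedged option, which mirrors "find the first object in the array that matches a condition" most directly, is a pair of correlated scalar subqueries; LIMIT 1 picks the first matching object per row:

SELECT
    id,
    (SELECT json_extract(j.value, '$.time')
     FROM json_each(timings) AS j
     WHERE json_extract(j.value, '$.type') = 'start'
     LIMIT 1) AS start,
    (SELECT json_extract(j.value, '$.time')
     FROM json_each(timings) AS j
     WHERE json_extract(j.value, '$.type') = 'end'
     LIMIT 1) AS "end"
FROM Items;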
I generated a BigQuery table using an existing BigTable table, and the result is a multi-nested dataset that I'm struggling to query from. Here's the format of an entry from that BigQuery table just doing a simple select * from my_table limit 1:
[
{
"rowkey": "XA_1234_0",
"info": {
"column": [],
"somename": {
"cell": [
{
"timestamp": "1514357827.321",
"value": "1234"
}
]
},
...
}
},
...
]
What I need is to be able to get all entries from my_table where the value of somename is X, for instance. There will be multiple rowkeys where the value of somename will be X and I need all the data from each of those rowkey entries.
OR
Alternatively, I could use a query where rowkey contains X, so as to get "XA_1234_0", "XA_1234_1", and so on. The "XA" and the "0" can change, but the middle numbers stay the same. I've tried doing a where rowkey like "$_1234_$", but the query runs for over a minute, which is way too long for some reason.
I am using standard SQL.
EDIT: Here's an example of a query I tried that didn't work (with error: Cannot access field value on a value with type ARRAY<STRUCT<timestamp TIMESTAMP, value STRING>>), but it best describes what I'm trying to achieve:
SELECT * FROM `my_dataset.mytable` where info.field_name.cell.value=12345
I want to get all records whose value in field_name equals some value.
From the sample Firebase Analytics dataset:
#standardSQL
SELECT *
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE EXISTS(
SELECT * FROM UNNEST(user_dim.user_properties)
WHERE key='powers' AND value.value.string_value='20'
)
LIMIT 1000
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.*
FROM `my_dataset.mytable` t,
UNNEST(info.somename.cell) c
WHERE c.value = '1234'
The above assumes the specific value can appear in each record just once; I hope this is true for your case.
If this is not the case, the below should do it:
#standardSQL
SELECT *
FROM `yourproject.yourdadtaset.yourtable`
WHERE EXISTS(
SELECT *
FROM UNNEST(info.somename.cell)
WHERE value = '1234'
)
which, I just realised, is pretty much the same as Felipe's version, just using your table/schema.
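As for the rowkey half of the question: in SQL LIKE the multi-character wildcard is % rather than $, and _ itself matches any single character, which is likely why the pattern above misbehaved. A hedged sketch using a regular expression, where _ is a literal underscore, would be the following; note it still scans the whole table, so it will not necessarily be faster:

#standardSQL
SELECT *
FROM `my_dataset.mytable`
WHERE REGEXP_CONTAINS(rowkey, r'_1234_')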
I want to create a table from the JSON below.
{
"store_nbr": "1234",
"sls_dt": "2014-01-01 00:00:00",
"Items": [{
"sku": "3456",
"sls_amt": "9.99",
"discounts": [{
"disc_nbr": "1",
"disc_amt": "0.99"
}, {
"disc_nbr": "2",
"disc_amt": "1.00"
}]
}]
}
Can anyone help me with what the structure of this JSON would be on BigQuery, and how I can retrieve data using a SQL query?
I am wondering what the structure of my table would be.
Try the below for BigQuery Standard SQL:
#standardSQL
WITH yourTable AS (
SELECT
1234 AS store_nbr,
DATE('2014-01-01 00:00:00') AS sls_dt,
[STRUCT(
3456 AS sku,
9.99 AS sls_amt,
[STRUCT<disc_nbr INT64, disc_amt FLOAT64>
(1, 0.99),
(2, 1.00)
] AS discounts
)] AS items
)
SELECT *
FROM yourTable
The structure of the table here (as it also appears in the Web UI) is: store_nbr and sls_dt at the top level, plus a repeated record items with fields sku and sls_amt, and a nested repeated record discounts with fields disc_nbr and disc_amt.
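In DDL terms, a hedged sketch of that structure (the project/dataset/table path is a placeholder) would be:

CREATE TABLE `yourproject.yourdataset.sales` (
  store_nbr INT64,
  sls_dt DATE,
  items ARRAY<STRUCT<
    sku INT64,
    sls_amt FLOAT64,
    discounts ARRAY<STRUCT<
      disc_nbr INT64,
      disc_amt FLOAT64
    >>
  >>
);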
How can I read values from it?
It really depends on what exactly, and how, you want to "read" out of this data!
For example, if you want to calculate the total discount for each sale, it can look as below:
#standardSQL
WITH yourTable AS (
SELECT
1234 AS store_nbr,
DATE('2014-01-01 00:00:00') AS sls_dt,
[STRUCT(
3456 AS sku, 9.99 AS sls_amt, [STRUCT<disc_nbr INT64, disc_amt FLOAT64>(1, 0.99), (2, 1.00)] AS discounts
)] AS items
)
SELECT
t.*,
(SELECT SUM(disc.disc_amt) FROM UNNEST(item.discounts) AS disc) AS total_discount
FROM yourTable AS t, UNNEST(items) AS item
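For the sample row above, this should yield total_discount = 0.99 + 1.00 = 1.99.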
I recommend you first complete your "exercise" with the table creation and actually get data into it, so that you can then ask specific questions about the query you want to build.
But that should be a new post, so you do not mix everything together into an all-in-one question; such all-in-one questions are usually not welcomed here on SO.