Hive Query Where Json Like or Equals - sql

I'm new to learning about Hive and Hadoop. There is a table that I've created, which references a certain location containing files.
CREATE DATABASE IF NOT EXISTS <dbname>
LOCATION '/user/<username>/hive/<dbname>.db';
USE <dbname>;
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (json STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS Parquet
LOCATION '/my-data/my/files';
This table has four columns: year, month, day, and json.
The json would look something like:
{
"t_id":"user.login",
"e_time":"2014-11-30T23:59:52Z",
"user_email_address":"someemail#email.com",
"la_id":"10",
"dbnum":16,
"remote_ip":"171.154.1.8",
"server_name":"some.server",
"protocol":"IMAPS",
"secure":true,
"result":"success"
}
A basic query, that works, looks something like this:
SELECT json FROM my_table WHERE year=2015 AND month=12 LIMIT 10;
What I would like is a WHERE clause that filters on the JSON fields listed above. I imagine it would look like the following, but it does not work:
SELECT get_json_object(mytable.json, '$.t_id') as whatever
FROM mytable
WHERE year=2015 AND month=12 AND json like '%user.login%' LIMIT 1;
Or better yet, be able to query based on the json like so:
SELECT COUNT(*)
FROM mytable
WHERE json.t_id = 'user.login'
AND json.someDate > ... and so on...
Any advice is appreciated.

try this query:
select b.t_id
from my_table a
lateral view json_tuple(a.json, 't_id') b as t_id
where a.year=2015 and a.month=12
limit 10;
You can pull another key out of json_tuple and use it in the where clause as well, e.g.:
select b.t_id
from my_table a
lateral view json_tuple(a.json, 't_id', 'result') b as t_id, result
where a.year=2015 and a.month=12 and b.result = 'true'
limit 10;
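If you prefer the get_json_object form from the question, here is a sketch of the equivalent filter (untested, using the table and field names above):
-- filter directly on a JSON field extracted with get_json_object
SELECT get_json_object(json, '$.t_id') AS t_id
FROM my_table
WHERE year = 2015
AND month = 12
AND get_json_object(json, '$.t_id') = 'user.login'
LIMIT 10;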

You need a JSON SerDe to read data in JSON format. You can create the table using the SerDe and then query it like a normal table.
-- Add jar file using "add jar /path-to/hive-json-serde-0.2.jar"
CREATE EXTERNAL TABLE states_json (state_short_name string, state_full_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/hduser/states.json';
states.json has data like {"state_short_name":"CA", "state_full_name":"California"}
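Once the SerDe-backed table exists, the JSON fields behave like ordinary columns, so a filter is just, e.g.:
SELECT state_full_name FROM states_json WHERE state_short_name = 'CA';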

Related

BigQuery SQL for JSON field returns no data

Yet again I find myself flummoxed by SQL over a JSON field in BigQuery.
This is the contents of a field called json_data - https://storage.googleapis.com/greyrock_storage/misc/freepik.json
The record has an id of 1675816490
This is my SQL:
SELECT
##JSON_EXTRACT(json_data, '$data.resources.boost.url_source') AS url_source,
JSON_VALUE(boost, "$.url_source") AS url_source,
FROM `my database` ,
UNNEST(JSON_QUERY_ARRAY(json_data.data)) AS data,
UNNEST(JSON_QUERY_ARRAY(data.resources)) AS resources,
UNNEST(JSON_QUERY_ARRAY(resources.boost)) AS boost
WHERE
id = 1675816490
I expected to see a list of all the values in the record for data.resources.boost.url_source BUT it returns 'There is no data to display.'
Try it like this:
SELECT JSON_VALUE(boosts.url_source) AS url_source
FROM `my database` AS a
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(a.json_data.data.resources.boost)) AS boosts
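For reference, here is a self-contained sketch of the same pattern with an inline JSON literal (hypothetical data, not the asker's file), assuming json_data is a JSON-typed column:
-- minimal reproduction: one level of UNNEST over the boost array
WITH t AS (
SELECT JSON '{"data":{"resources":{"boost":[{"url_source":"a"},{"url_source":"b"}]}}}' AS json_data
)
SELECT JSON_VALUE(boost.url_source) AS url_source
FROM t, UNNEST(JSON_QUERY_ARRAY(t.json_data.data.resources.boost)) AS boost;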

Hive: read table partitions defined in subselect

I have a Hive table which is partitioned by partitionDate field.
I can read the partition of my choice via a simple
select * from myTable where partitionDate = '2000-01-01'
My task is to specify the partition of my choice dynamically: first read it from some other table, and only then run the select against myTable. And of course, I want the power of partitions to be used.
I have written a query which looks like
select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate
The query works, but it looks like partitions are not used; the query runs too long.
I tried another approach:
select * from myTable where partitionDate in (select reportDate from thatTable)
... and again I see that the query runs too slowly.
Is there a way to implement this in Hive?
update: create table for myTable
CREATE TABLE `myTable`(
`theDate` string,
...)
PARTITIONED BY (
`partitionDate` string)
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='2',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...
'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',
'spark.sql.sources.schema.partCol.0'='partitionDate')
If you are running Hive on the Tez execution engine, try
set hive.tez.dynamic.partition.pruning=true;
Read more details and related configuration in Jira HIVE-7826,
and at the same time try rewriting the query as a LEFT SEMI JOIN:
select *
from myTable t
left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344
Or this one: https://stackoverflow.com/a/53279839/2700344
Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins
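The linked workarounds boil down to a two-step approach: fetch the partition value first, then substitute it as a literal so the compiler can prune. A minimal sketch, assuming the value is passed in as a Hive variable (the variable name dt is made up here):
-- invoked e.g. as: hive --hivevar dt=2000-01-01 -f query.hql
select * from myTable where partitionDate = '${hivevar:dt}';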

How do I use SQL Server to select into another table?

I am consolidating a web service, replacing multiple calls to the service with one call that contains all the data.
I have created a table:
CREATE TABLE InvResults
(
Invoices nvarchar(max),
InvoiceDetails nvarchar(max),
Products nvarchar(max)
);
I used nvarchar(max) because I don't know how complex the JSON will get at this time.
I need to do some sort of selects like this (this is pseudocode, not actual SQL):
SELECT
(SELECT *
INTO InvResults for Column Invoices
FROM MyInvoiceTable
WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoices')) AS invoices;
SELECT
(SELECT *
INTO InvResults for Column InvoiceDetails
FROM MyInvoiceDetailsTable
WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoicedetails')) AS invoicedetails;
I don't know how to format this and my google skills are failing me at this point. I understand that I probably want to use an UPDATE statement, but I'm not sure how to do this in combination with the rest of my requirements. I'm exploring How do I UPDATE from a SELECT in SQL Server? but I am still at a halt.
The end result should be a table "InvResults" that has 3 columns containing one row with results from Select statements as JSON. The column names should be defined the same as the json root objects.
INSERT INTO InvResults (Invoices, InvoiceDetails)
SELECT
(SELECT *
FROM MyInvoiceTable
WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoices'))
,
(SELECT *
FROM MyInvoiceDetailsTable
WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoicedetails'))
;
Because each SELECT ... FOR JSON returns only one row, the above works.
The third column is easily added, but that is left for you 😉
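For completeness, a sketch with the third column filled in (MyProductsTable is a made-up name following the question's pattern):
INSERT INTO InvResults (Invoices, InvoiceDetails, Products)
SELECT
(SELECT * FROM MyInvoiceTable WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoices')),
(SELECT * FROM MyInvoiceDetailsTable WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('invoicedetails')),
(SELECT * FROM MyProductsTable WHERE SomeColumns = 'someStuffvariable'
FOR JSON PATH, ROOT('products'));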

Hive - getting the column names count of a table

How can I get a Hive table's column count using HQL? I know we can use describe tablename to get the names of the columns. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;
5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof, since not all combinations of column types can be combined into an array.
It also requires that the table contain at least one row.
Here is a more complex but also more robust solution (type-wise), which still requires that the table contain at least one row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t;
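If you have direct read access to the metastore database, counting columns there avoids both limitations. A sketch against the standard metastore schema (MySQL flavor; table names may differ in your setup):
-- count the regular (non-partition) columns of a table
SELECT COUNT(*) AS column_count
FROM COLUMNS_V2 c
JOIN SDS s ON c.CD_ID = s.CD_ID
JOIN TBLS t ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'mytable';
Note this counts only the regular columns; partition columns are stored separately in PARTITION_KEYS.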

How to create temporary table in Google BigQuery

Is there any way to create a temporary table in Google BigQuery through:
SELECT * INTO <temp table>
FROM <table name>
same as we can create in SQL?
For complex queries, I need to create temporary tables to store my data.
2018 update - definitive answer with DDL
With BigQuery's DDL support you can create a table from the results of a query - and specify its expiration at creation time. For example, for 3 days:
#standardSQL
CREATE TABLE `fh-bigquery.public_dump.vtemp`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
) AS
SELECT corpus, COUNT(*) c
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
Docs: https://cloud.google.com/bigquery/docs/data-definition-language
2019 update -- With BigQuery scripting (Beta now), CREATE TEMP TABLE is officially supported. See public documentation here.
2018 update: https://stackoverflow.com/a/50227484/132438
Every query in BigQuery creates a temporary table with the results. It stays temporary unless you give the destination table a name, in which case you are in control of its lifecycle.
Use the API to see the temporary table name, or name your tables when querying.
2019 update -- as noted above, with BigQuery scripting CREATE TEMP TABLE is officially supported. For example:
CREATE TEMP TABLE Example
(
x INT64,
y STRING
);
INSERT INTO Example
VALUES (5, 'foo');
INSERT INTO Example
VALUES (6, 'bar');
SELECT *
FROM Example;
A similar effect can be had with WITH in the "New Standard SQL", though it is a named subquery rather than a true temporary table. See WITH clause.
An example given by Google:
WITH subQ1 AS (SELECT SchoolID FROM Roster),
subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;
To create a temporary table, use the TEMP or TEMPORARY keyword with the CREATE TABLE statement. Use of CREATE TEMPORARY TABLE requires a script, so it's best to start with a BEGIN statement.
BEGIN
CREATE TEMP TABLE <table_name> AS SELECT * FROM <table_name> WHERE <condition>;
END;
Example of creating temp tables in GCP BigQuery:
CREATE TABLE `project_ID_XXXX.Sales.superStore2011`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) AS
SELECT
Product_Name,Product_Category, SUM(profit) Total_Profit, FORMAT_DATE("%Y",Order_Date) AS Year
FROM
`project_ID_XXXX.Sales.superStore`
WHERE
FORMAT_DATE("%Y",Order_Date)="2011"
GROUP BY
Product_Name,Product_Category,Order_Date
ORDER BY
Year, Total_Profit DESC
LIMIT 5
It's 2022, and if you type the code to create a TEMP table in BigQuery's interactive window, it will not work. It will probably display an error like "Use of CREATE TEMPORARY TABLE requires a script or session", which vaguely hints that your interactive window should be tied to a session; there is official documentation on how to create sessions, etc.
The short and easy method for me was to go to the MORE menu of the BigQuery interactive window and select Query Settings.
In the dialog that appears (as of April 2022), enable Use session mode and SAVE.
That's it, enjoy your temporary tables :D
Take the SQL sample of
SELECT name,count FROM mydataset.babynames
WHERE gender = 'M' ORDER BY count DESC LIMIT 6 INTO mydataset.happyhalloween;
The easiest command line equivalent is
bq query --destination_table=mydataset.happyhalloween \
"SELECT name,count FROM mydataset.babynames WHERE gender = 'M' \
ORDER BY count DESC LIMIT 6"
See the documentation here:
https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery
I followed Google's official documentation while learning UDFs and encountered this error: use of create temporary table requires a script or session.
Erroneous script:
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
Solution:
BEGIN
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
END;
To create and store your data on the fly, you can use the optional _SESSION qualifier to create a temporary table.
CREATE TEMP TABLE _SESSION.tmp_01
AS
SELECT name FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
;
Here you can create the table from a complex query starting after AS; the temporary table is created at once and deleted after 24 hours.
To access the table,
select * from _SESSION.tmp_01;
Update September 2022:
As per the documentation, you can create a temporary table like:
CREATE TEMP TABLE continents(name STRING, visitors INT64)
AS
select geo.continent, count(distinct user_pseudo_id) as Continent_Visitors
FROM `firebaseProject.dataset.events_date`
group by geo.continent order by Continent_Visitors desc;
SELECT * FROM continents;
DROP TABLE continents;