How to group data from Hive by a specific partition? - hive

I have the following:
hive> show partitions TABLENAME;
pt=2012.07.28.08
pt=2012.07.28.09
pt=2012.07.28.10
pt=2012.07.28.11
hive> select pt,count(*) from TABLENAME group by pt;
OK
Why can't the group by get the data?

Check whether hive.mapred.mode is set to "strict". In strict mode, Hive does not allow a query on a partitioned table to scan all partitions without a partition filter. You can set it to nonstrict as below:
hive> set hive.mapred.mode=nonstrict;
I'm not sure whether this is why your query returned no results, but it is worth trying. Do share the results.
Note: You can check the default value for this parameter in hive-default.xml

You can always achieve the same using two select statements. For example:
Create table table1(
session_id string,
page_id string
)
partitioned by (metrics_date string);
Suppose we have loaded the table with 2 partitions:
hive> show partitions table1;
metrics_date=2012.07.28.08
metrics_date=2012.07.28.09
select * from table1 ;
1212121212 google.com 2012.07.28.08
1212121212 google.com 2012.07.28.09
Getting the number of rows per partition:
select metrics_date,count(*) from (
select * from table1 ) temp
group by metrics_date;

To get every row along with its per-partition count, you can use the query below.
SELECT pt,count(*) OVER (PARTITION BY pt) FROM TABLENAME;
This can be achieved with a window function using PARTITION BY.
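The difference between GROUP BY and COUNT(*) OVER (PARTITION BY ...) can be sketched with a small runnable example. This is only an illustration, not Hive itself: it uses Python's built-in sqlite3 module as a stand-in, and the table name events and its three rows are made up for the demo. It assumes the bundled SQLite is 3.25 or newer, since older versions lack window functions.

```python
import sqlite3

# In-memory SQLite stands in for Hive; "events" and its rows are hypothetical.
# Window functions require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id TEXT, pt TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("s1", "2012.07.28.08"), ("s2", "2012.07.28.08"), ("s3", "2012.07.28.09")],
)

# GROUP BY collapses each pt value into a single row.
grouped = conn.execute(
    "SELECT pt, COUNT(*) FROM events GROUP BY pt ORDER BY pt"
).fetchall()

# COUNT(*) OVER (PARTITION BY pt) keeps every row and attaches the count to it.
windowed = conn.execute(
    "SELECT session_id, pt, COUNT(*) OVER (PARTITION BY pt) "
    "FROM events ORDER BY session_id"
).fetchall()

print(grouped)
print(windowed)
```

The grouped query returns one row per pt value, while the windowed query returns all three rows, each carrying the count for its own pt.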

Related

Hive: read table partitions defined in subselect

I have a Hive table which is partitioned by partitionDate field.
I can read a partition of my choice with a simple query:
select * from myTable where partitionDate = '2000-01-01'
My task is to specify the partition of my choice dynamically, i.e. first I want to read it from some table, and only then run the select on myTable. And of course, I want the power of partitions to be used.
I have written a query which looks like
select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate
The query works, but it looks like partitions are not used: the query takes too long.
I tried another approach:
select * from myTable where partitionDate in (select reportDate from thatTable)
... and again I see that the query runs too slowly.
Is there a way to implement this in Hive?
Update: the CREATE TABLE statement for myTable:
CREATE TABLE `myTable`(
`theDate` string,
')
PARTITIONED BY (
`partitionDate` string)
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='2',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...
'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',
'spark.sql.sources.schema.partCol.0'='partitionDate')
If you are running Hive on the Tez execution engine, try
set hive.tez.dynamic.partition.pruning=true;
Read more details and related configuration in Jira HIVE-7826.
At the same time, try rewriting the query as a LEFT SEMI JOIN:
select *
from myTable t
left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344
Or this one: https://stackoverflow.com/a/53279839/2700344
Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

Hive - getting the column names count of a table

How can I get the number of columns in a Hive table using HQL? I know we can use describe tablename to get the names of the columns. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;
5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof, since not all combinations of column types can be combined into an array.
It also requires that the table contains at least 1 row.
Here is a more complex but more robust solution (with respect to types), which also requires that the table contains at least 1 row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t;

SQL query for removing non-unique row

I'm using PostgreSQL 9.2.
Suppose I have the following table:
id      name          definition
serial  varchar(128)  text
1       name1         definition1
..........................................
I need to write a query that removes rows with duplicate names so that every remaining row has a unique name. If two rows have the same name, their definitions are also the same.
Use the row_number() window function partitioned by name and remove all rows that have row_number() > 1.
Here is an example query: Deleting duplicates
DELETE FROM mytable dd
WHERE EXISTS (
SELECT *
FROM mytable ex
WHERE ex.name = dd.name
AND ex.id < dd.id
);
Why do you even let client applications add rows with duplicate names in the first place?
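To see what the DELETE ... WHERE EXISTS query above leaves behind, here is a minimal runnable sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL, so treat it as an approximation: the sample rows are made up, and the outer table is referenced by its own name instead of the dd alias to stay within syntax that SQLite is known to accept.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, definition TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (name, definition) VALUES (?, ?)",
    [
        ("name1", "definition1"),
        ("name2", "definition2"),
        ("name1", "definition1"),
        ("name1", "definition1"),
    ],
)

# Delete every row for which an older row (smaller id) with the same name exists,
# so only the lowest id per name survives.
conn.execute(
    """
    DELETE FROM mytable
    WHERE EXISTS (
        SELECT 1
        FROM mytable ex
        WHERE ex.name = mytable.name
          AND ex.id < mytable.id
    )
    """
)

remaining = conn.execute("SELECT id, name FROM mytable ORDER BY id").fetchall()
print(remaining)
```

Keeping the smallest id per name is an arbitrary choice here; since duplicate names carry identical definitions, any single survivor per name would do.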

How to create temporary table in Google BigQuery

Is there any way to create a temporary table in Google BigQuery through:
SELECT * INTO <temp table>
FROM <table name>
same as we can create in SQL?
For complex queries, I need to create temporary tables to store my data.
2018 update - definitive answer with DDL
With BigQuery's DDL support you can create a table from the results of a query and specify its expiration at creation time. For example, for 3 days:
#standardSQL
CREATE TABLE `fh-bigquery.public_dump.vtemp`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
) AS
SELECT corpus, COUNT(*) c
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
Docs: https://cloud.google.com/bigquery/docs/data-definition-language
2019 update -- With BigQuery scripting (Beta now), CREATE TEMP TABLE is officially supported. See public documentation here.
2018 update: https://stackoverflow.com/a/50227484/132438
Every query in BigQuery creates a temporary table with the results. It stays temporary unless you give a name to the destination table, in which case you are in control of its lifecycle.
Use the API to see the temporary table name, or name your tables when querying.
2019 update -- With BigQuery scripting, CREATE TEMP TABLE is officially supported. See public documentation here.
CREATE TEMP TABLE Example
(
x INT64,
y STRING
);
INSERT INTO Example
VALUES (5, 'foo');
INSERT INTO Example
VALUES (6, 'bar');
SELECT *
FROM Example;
Something like a temporary table can be created with WITH in the "New Standard SQL": a named subquery that exists only for the duration of a single statement. See WITH clause.
An example given by Google:
WITH subQ1 AS (SELECT SchoolID FROM Roster),
subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;
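One point worth checking is that names defined in a WITH clause live only inside the statement that defines them. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in for BigQuery, so it shows general SQL behavior rather than anything BigQuery-specific; the Roster and PlayerStats rows are made up.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; the table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Roster (SchoolID INTEGER)")
conn.execute("CREATE TABLE PlayerStats (OpponentID INTEGER)")
conn.executemany("INSERT INTO Roster VALUES (?)", [(50,), (51,)])
conn.executemany("INSERT INTO PlayerStats VALUES (?)", [(51,), (52,)])

# The named subqueries are usable inside this one statement.
rows = conn.execute(
    """
    WITH subQ1 AS (SELECT SchoolID FROM Roster),
         subQ2 AS (SELECT OpponentID FROM PlayerStats)
    SELECT * FROM subQ1
    UNION ALL
    SELECT * FROM subQ2
    """
).fetchall()

# A later statement cannot see subQ1: it was never stored as a table.
try:
    conn.execute("SELECT * FROM subQ1")
    cte_visible_later = True
except sqlite3.OperationalError:
    cte_visible_later = False

print(sorted(rows), cte_visible_later)
```

If the intermediate result has to be reused across several statements, a real table or a CREATE TEMP TABLE inside a script or session is the better fit.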
To create a temporary table, use the TEMP or TEMPORARY keyword with the CREATE TABLE statement. Use of CREATE TEMPORARY TABLE requires a script, so it's better to start with a BEGIN statement.
Begin
CREATE TEMP TABLE <table_name> as select * from <table_name> where <condition>;
End ;
Example of creating a table that expires after one day in GCP BigQuery:
CREATE TABLE `project_ID_XXXX.Sales.superStore2011`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) AS
SELECT
Product_Name,Product_Category, SUM(profit) Total_Profit, FORMAT_DATE("%Y",Order_Date) AS Year
FROM
`project_ID_XXXX.Sales.superStore`
WHERE
FORMAT_DATE("%Y",Order_Date)="2011"
GROUP BY
Product_Name,Product_Category,Order_Date
ORDER BY
Year, Total_Profit DESC
LIMIT 5
It's 2022, and if you type the code to create a TEMP table in BQ's interactive window, it will not work. It will probably display an error message.
The message hints that your interactive window should be tied to a session. There is official documentation on how to create sessions.
The short and easy method for me was to go to the MORE menu of the Google BigQuery interactive window and select Query Settings.
This opens the query settings panel (as of April 2022).
Enable Use session mode and click SAVE. That's it, enjoy your temporary tables :D
Take the SQL sample of
SELECT name,count FROM mydataset.babynames
WHERE gender = 'M' ORDER BY count DESC LIMIT 6 INTO mydataset.happyhalloween;
The easiest command line equivalent is
bq query --destination_table=mydataset.happyhalloween \
"SELECT name,count FROM mydataset.babynames WHERE gender = 'M' \
ORDER BY count DESC LIMIT 6"
See the documentation here:
https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery
I followed Google's official documentation while learning UDFs and encountered this error: use of create temporary table requires a script or session
Erroneous script:
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
Solution:
BEGIN
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
END;
To create and store your data on the fly, you can specify the optional _SESSION qualifier to create a temporary table.
CREATE TEMP TABLE _SESSION.tmp_01
AS
SELECT name FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
;
Here you can create the table from a complex query starting after 'AS'. The temporary table is created at once and will be deleted after 24 hours.
To access the table:
select * from _SESSION.tmp_01;
Update September 2022:
As per the documentation, you can create a temporary table like:
CREATE TEMP TABLE continents(name STRING, visitors INT64)
AS
select geo.continent, count(distinct user_pseudo_id) as Continent_Visitors
FROM `firebaseProject.dataset.events_date`
group by geo.continent order by Continent_Visitors desc;
SELECT * from continents;
Drop table continents;

Using not equal symbol in hive query

I need to use the '!=' symbol in my Hive query with partitions.
I tried something like
from sample_table
insert overwrite table sample1
partition (src='a')
select * where act=10
insert overwrite table sample1
partition (src!='a')
select * where act=20
But it shows an error at the '!=' symbol. How can I replace '!='?
Try using the rlike/regex function in Hive to specify the condition.
I think you can also use the not-equal operator <> instead of !=.
partition (src!='a'): what do you expect Hive to do, write the "select *" result into any partition other than "a"? You see, partition (src='a') means that you are writing the result of the following select statement into the table's partition named "a". "PARTITION (a=b)" is not a condition like "WHERE a=b"; it just specifies which partition to write to.
You just have to specify another partition name, so your query should look like:
from sample_table
insert overwrite table sample1
partition (src='a')
select * where act=10
insert overwrite table sample1
partition (src='b')
select * where act=20;
After that you should see 2 new partitions "a" and "b" in table "sample1" with some data from these select * where act=10 and select * where act=20 queries respectively.
May I know your Hive version?
Try using A <> B.
Description from the Hive docs:
NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B, otherwise FALSE.
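The NULL behavior in that description is easy to overlook, so here is a small runnable sketch of it. It uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that both follow the same standard three-valued logic for <>; the sample_table rows are made up. SQLite reports TRUE and FALSE as 1 and 0, and NULL as None.

```python
import sqlite3

# In-memory SQLite stands in for Hive; it returns TRUE/FALSE as 1/0 and NULL as None.
conn = sqlite3.connect(":memory:")

# Evaluate <> for unequal values, equal values, and a NULL operand.
not_equal, equal, with_null = conn.execute(
    "SELECT 1 <> 2, 1 <> 1, NULL <> 1"
).fetchone()

# Hypothetical rows: a WHERE filter keeps only rows where the comparison is TRUE,
# so the row with a NULL src is dropped along with the row where src = 'a'.
conn.execute("CREATE TABLE sample_table (src TEXT)")
conn.executemany("INSERT INTO sample_table VALUES (?)", [("a",), ("b",), (None,)])
kept = conn.execute("SELECT src FROM sample_table WHERE src <> 'a'").fetchall()

print(not_equal, equal, with_null, kept)
```

So a filter such as src <> 'a' silently skips rows whose src is NULL; if those rows are wanted, they need an explicit OR src IS NULL.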