I am trying to build a managed table (ORC formatted, bucketed, and with the table property 'transactional' set to true) on which I can run UPDATE/INSERT statements in Hive.
I am running this whole setup on AWS EMR, the Hive version is 2.4.3, and the default directory for storing the data is S3.
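For reference, the DDL is along these lines (a sketch; the table and column names here are placeholders, not my real schema):
CREATE TABLE my_txn_table (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS   -- bucketed, as required for ACID tables
STORED AS ORC                      -- ORC formatted
TBLPROPERTIES ('transactional'='true');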
I am able to populate the table from another external table.
However, I am getting select count(*) as zero and no output for select *.
I dropped the table, recreated it, and repopulated the data.
ANALYZE TABLE TABLE-NAME COMPUTE STATISTICS gives the proper output.
When Google Analytics data is exported to BigQuery, the data is put into so-called sharded tables, one for each day. They all start with ga_sessions_ followed by a date suffix.
I want to make a backup copy of these sharded tables.
How do I do that?
If you want to back up the tables to a Cloud Storage bucket you can try the following.
Query the table metadata to get the tables to export.
SELECT
  table_name
FROM
  `MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name LIKE 'ga_sessions_%'
Use the BigQuery export function to export the tables to the bucket.
-- If the tables are nested use JSON/Avro/Parquet,
-- but be aware of the data type conversions:
-- https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in
EXPORT DATA OPTIONS(
  uri='gs://bucket/folder/ga_sessions_<date>_*.json',
  format='JSON',
  overwrite=true,
  header=true,
  field_delimiter=';') AS
SELECT * FROM mydataset.ga_sessions_<date>
Put it together in a BQ script with a loop, using FORMAT to create the query and EXECUTE IMMEDIATE to run the query.
BEGIN
  DECLARE backup_date STRING DEFAULT CAST(CURRENT_DATE('UTC') AS STRING);
  FOR record IN
  (
    SELECT
      table_name
    FROM
      `MyDataSet.INFORMATION_SCHEMA.TABLES`
    WHERE
      table_name LIKE 'ga_sessions_%')
  DO
    EXECUTE IMMEDIATE
      FORMAT("""
        EXPORT DATA
        OPTIONS(
          uri=CONCAT('gs://your_backup_bucket/path/to/folder/', '%s', '/', '%s', '_*.json'),
          format='JSON',
          overwrite=true,
          header=true,
          field_delimiter=';')
        AS SELECT * FROM my_project.my_data_set.%s
      """, backup_date, record.table_name, record.table_name);
  END FOR;
END;
backup_date is used to create a 'folder' with the export date as a name in the bucket for the tables.
The * in the URI allows a table to be exported into multiple files. This only matters if the exported table is bigger than 1 GB (see here).
Set a lifecycle rule on your storage bucket to archive files after an appropriate time, or set it to the Archive class by default if it's only for backup purposes (accessed less than once a year, see storage classes).
Props to Tim Lou for this article on using table metadata.
This answer relates to creating the backup in BQ:
So a sharded table such as ga_sessions_ basically consists of many tables, one for each day. We need to copy all those tables separately.
The answer below is copied from the following article and all respect goes to the author:
https://medium.com/#Nayana22/playing-with-sharded-tables-in-bigquery-123e1ec5e453
So you can do as follows:
Get the list of all the days that you wish to copy by running the SQL command below. [You can modify this as per your need]
SELECT
  REPLACE(STRING_AGG(CONCAT('"', partition_name, '"') ORDER BY partition_name), ",", " ")
FROM
(
  SELECT
    DISTINCT date AS partition_name
  FROM
    `[Project_ID].[DATASET_NAME].ga_sessions_*`
  ORDER BY PARSE_DATE("%Y%m%d", date)
)
Output:
"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...
Now, go to the Google Cloud console, select the project and click on Activate Cloud Shell.
Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in an incorrect project then change it and move to step 4.
Create a bash script and specify the days that you just got from the above query.
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
Iterate through all the days available in tables variable and use BigQuery’s copy command to move a table from source to destination.
Syntax: [Note: There is a space between the source and destination table name]
bq cp -a source_table destination_table
Our script file looks like this:
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
for val in ${tables[*]}; do
bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val
done
How to validate whether all the tables are copied or not?
WITH first_ga_session AS (
  SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}'))) AS day
  FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
  WHERE table_id LIKE 'ga_sessions_backup_2%'
),
all_days AS (
  SELECT period
  FROM UNNEST(GENERATE_DATE_ARRAY((SELECT day FROM first_ga_session), CURRENT_DATE())) AS period
),
available_ga_sessions AS (
  SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}')) AS ga_day
  FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
  WHERE table_id LIKE 'ga_sessions_backup_2%'
)
SELECT A.period AS Day, B.ga_day AS Available_session
FROM all_days A
LEFT JOIN available_ga_sessions B
ON A.period = B.ga_day
WHERE B.ga_day IS NULL
The above query will give us all the days that are missing in the destination table.
My bash script ended up looking like this (note I'm using cloning with the --clone flag):
tables=("20220331" "20220401")
bq --location=eu mk --dataset destination_project_id:destination_dataset
for val in ${tables[*]}; do
echo source_project_id:source_dataset.ga_sessions_$val
bq cp --clone source_project_id:source_dataset.ga_sessions_$val destination_project_id:destination_dataset.ga_sessions_backup_$val
done
I have a query that I want to execute daily, with the results partitioned by the date it's executed. The results of this query should be appended to the same table.
Ideally I'd have something similar to the CREATE TABLE IF NOT EXISTS command, adding the data as a new partition to the existing table every day if the partition doesn't already exist, but I can't figure out how to integrate this into my query.
My query:
CREATE TABLE IF NOT EXISTS db_name.table_name
WITH (
external_location = 's3://my-query-results-location/',
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['date_executed'])
AS
SELECT
{columns_that_I_am_selecting_here_including_'date_executed'}
This creates a new table on the first day it's executed, but nothing happens on subsequent days. I'm assuming that's because CREATE TABLE IF NOT EXISTS sees that the table already exists and doesn't proceed with the rest of the logic.
Is there a way to modify my query to create a table for the first day executed and append the results by a new partition for each subsequent day?
I'm quite sure ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION would not apply to my use case here as I'm running a CTAS query.
You can simply use INSERT INTO existing_table SELECT....
Presumably your table is already partitioned, so include that partition column in the SELECT and Amazon Athena will automatically put the data in the correct directory.
For example, you might include the column like this: SELECT ... CURRENT_DATE as date_executed
See: INSERT INTO - Amazon Athena
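Putting it together, a minimal sketch (only date_executed and the destination table name come from the question; the other column names and the source are hypothetical):
INSERT INTO db_name.table_name
SELECT
  col_a,                          -- hypothetical columns produced by the daily query
  col_b,
  CURRENT_DATE AS date_executed   -- partition column last, matching the table layout
FROM source_table;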
I have an external table and now I want to add partitions to it. I have 224 unique city IDs, and I want to just write alter table my_table add partition (cityid) location /path;, but Hive complains that I don't provide a value for the city ID and that it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;. I don't want to run an alter table command for every city ID value, so how can I do it for all IDs in one go?
This is what hive command line looks like:
hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';
FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}
A partition at the physical level is a location (a separate location for each value, usually looking like key=value) containing data files. If you already have a partition directory structure with files, all you need is to create the partitions in the Hive metastore: point your table to the root directory using ALTER TABLE ... SET LOCATION, then use the MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS. This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
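For example, with the table and path from the question, and assuming the table is defined with PARTITIONED BY (cityid ...) and the directory already contains cityid=<value> subfolders, it would look roughly like this:
-- point the table at the root directory that holds the cityid=<value> subfolders
ALTER TABLE pavel.browserdata SET LOCATION '/user/maria_dev/data/cityidPartition';
-- create the partition metadata from the existing directories
MSCK REPAIR TABLE pavel.browserdata;
-- or, on Amazon EMR:
-- ALTER TABLE pavel.browserdata RECOVER PARTITIONS;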
If you only have a non-partitioned table with data in its location, then adding partitions will not work because the data needs to be reloaded. You need to:
Create another partitioned table and use insert overwrite to load partition data using dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table table2 partition(cityid)
select col1, ... colN,
cityid
from table1; -- partition columns should be last in the select
This is quite an efficient way to reorganize your data.
After this you can drop the source table and rename your target table.
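A sketch of that last step, using table1 as the source and table2 as the new partitioned table from the snippet above:
DROP TABLE table1;                    -- remove the old non-partitioned table
ALTER TABLE table2 RENAME TO table1;  -- give the partitioned table the original name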
What is an efficient way to export data with conditions from a Hive/Impala table into a file (the data would be huge, close to 10 GB)? The Hive table is stored as Parquet with Snappy compression, and the output file should be CSV.
The table is partitioned daily and data needs to be extracted on a daily basis. I would like to know whether
1) the Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
2) Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
would be efficient
Assuming tbl is your Hive Parquet table and condition is your filter condition.
CTAS command:
CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;
You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
You can get this file to your local file system if needed using:
hadoop fs -get /tmp/data
Please try to use Dynamic Partitioning for your Hive/Impala table to efficiently export the data conditionally.
Partition your table on the columns of interest, based on your queries, for best results.
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your Main Hive Table with partition columns, example :
CREATE TABLE employee (
  emp_id int,
  emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from Temporary table to your employee table (Main Table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: Export the data from Hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
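If a strictly comma-delimited file is needed, the directory export can also declare a row format (a sketch; this ROW FORMAT clause on INSERT OVERWRITE DIRECTORY requires Hive 0.11 or later):
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';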
Please refer to this link:
Dynamic Partition Concept
Hope this is useful.
I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder but when I query the table, it returns nothing. The table is structured in a way that it fits the data structure.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore; you need to run an ALTER statement to update it.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of doing repair operations on that table. So when we want to copy files into that directory through some process like ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create each new partition.
If we already know the directory structure of the partition that HIVE would create, we can simply place the data file in that location like '/user/cloudera/data/datehour=0909201401/data.txt' and run the statement as shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync up the partitions to the Hive metastore for the table "tb".