I need to pull records from a MySQL table with n columns and store them in hive with extra columns. Is there any way in sqoop to perform it?
Example:
The MySQL table has the fields id, name, and place.
The Hive table has the fields id, name, place, and contact_number (always NULL).
So when running the Sqoop import, I want to add an extra contact_number column in Hive, filled with NULL.
You can do this with Sqoop's --query option by selecting the extra column with NULL AS. Note that a free-form query must contain the $CONDITIONS placeholder in its WHERE clause and also needs a --target-dir.
sqoop import \
--query 'SELECT id, name, place, NULL AS contact_number FROM mysql_table WHERE $CONDITIONS' \
--connect jdbc:mysql://mysql.example.com/sqoop \
[any other options]
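If more than one placeholder column is needed, the query string can be assembled by a small helper. This is only a sketch: build_null_padded_query is a made-up function name, not part of Sqoop, and it assumes the column names need no quoting. The $CONDITIONS token is printed literally because Sqoop expects to find it in a free-form query.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sqoop): builds a free-form query that
# selects the source columns and pads the result with NULL placeholder columns.
build_null_padded_query() {
  local table=$1 source_cols=$2
  shift 2
  local select_list=$source_cols extra
  for extra in "$@"; do
    select_list+=", NULL AS $extra"
  done
  # Single quotes keep $CONDITIONS literal; Sqoop replaces it at run time.
  printf 'SELECT %s FROM %s WHERE $CONDITIONS\n' "$select_list" "$table"
}

# Print the query for the example table from the question.
query=$(build_null_padded_query mysql_table "id, name, place" contact_number)
echo "$query"
```

The resulting string would then be passed to sqoop import via --query "$query" together with the connection options.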
When Google Analytics is exported to BigQuery, the data is put in so called sharded tables, one for each day. They all start with ga_sessions_ followed by the suffix of a date.
I want to make a backup copy of these sharded tables.
How do I do that?
If you want to back up the tables to a Cloud Storage bucket, you can try the following.
Query the table metadata to get the tables to export.
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%'
Use the BigQuery export function to export the tables to the bucket.
-- If the tables are nested use json/avro/parquet
-- But be aware of the data type conversions:
-- https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/ga_sessions_<date>_*.json',
format='JSON',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM mydataset.ga_sessions_<date>
Put it together in a BQ script with a loop, using FORMAT to create the query and EXECUTE IMMEDIATE to run the query.
BEGIN
DECLARE backup_date STRING DEFAULT CAST(CURRENT_DATE('UTC') AS STRING);
FOR record IN
(
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%')
DO
EXECUTE IMMEDIATE
FORMAT("""
EXPORT DATA
OPTIONS(
uri=CONCAT('gs://your_backup_bucket/path/to/folder/', '%s','/', '%s','_*.json'),
format='JSON',
overwrite=true,
header=true,
field_delimiter=';')
AS SELECT * FROM my_project.my_data_set.%s
"""
, backup_date, record.table_name, record.table_name);
END FOR;
END;
backup_date is used to create a 'folder' with the export date as a name in the bucket for the tables.
The * in the URI allows a table to be exported into multiple files. This only matters if the exported table is bigger than 1 GB (see the BigQuery documentation on exporting data).
Set a lifecycle rule on your storage bucket to archive files after an appropriate time, or set it to archive by default if it's only for backup purposes (accessed less than once a year; see the Cloud Storage storage classes).
Props to Tim Lou for this article on using table meta data.
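If you would rather drive the export from a shell than from a BQ script, the same loop can be sketched with the bq extract command. The snippet below only prints the commands so they can be reviewed first. print_extract_commands is a made-up helper, and the dataset, bucket path and table names are placeholders, not values from a real project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) one `bq extract` command per table.
# Arguments: dataset, bucket path, backup date, then the table names.
print_extract_commands() {
  local dataset=$1 bucket_path=$2 backup_date=$3
  shift 3
  local table
  for table in "$@"; do
    # The URI is single-quoted in the output so the shell does not expand the *.
    printf "bq extract --destination_format=NEWLINE_DELIMITED_JSON %s.%s '%s/%s/%s_*.json'\n" \
      "$dataset" "$table" "$bucket_path" "$backup_date" "$table"
  done
}

# Placeholder names, for illustration only.
print_extract_commands my_data_set gs://your_backup_bucket/path/to/folder 2022-04-01 \
  ga_sessions_20220331 ga_sessions_20220401
```

The table names could come from the INFORMATION_SCHEMA query shown earlier. Once the printed commands look right, they can be run as they are.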
This answer relates to creating the backup in BQ:
A sharded table such as ga_sessions_ basically consists of many tables, one for each day. We need to copy all those tables separately.
The answer below is copied from the following article, and all credit goes to the author:
https://medium.com/#Nayana22/playing-with-sharded-tables-in-bigquery-123e1ec5e453
So you can do as follows:
Get the list of all the days that you wish to copy by running the SQL query below. [You can modify this as per your need]
SELECT
REPLACE(STRING_AGG(CONCAT('"', partition_name,'"') ORDER BY partition_name ), ","," ")
FROM
(
SELECT
DISTINCT date AS partition_name
FROM
`[Project_ID].[DATASET_NAME].ga_sessions_*`
ORDER BY PARSE_DATE("%Y%m%d", date)
)
Output:
"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...
Now, go to the Google Cloud console, select the project and click on Activate Cloud Shell.
Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in the wrong project, switch to the correct one before moving on to the next step.
Create a bash script and specify the days that you just got from the above query.
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
Iterate through all the days available in tables variable and use BigQuery’s copy command to move a table from source to destination.
Syntax: [Note: There is a space between the source and destination table name]
bq cp -a source_table destination_table
Our script file looks like this:
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
for val in ${tables[*]}; do
bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val
done
How to validate whether all the tables are copied or not?
WITH first_ga_session AS (
SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}'))) AS day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
),
all_days AS (
SELECT period
FROM
UNNEST(GENERATE_DATE_ARRAY((SELECT day from first_ga_session),
CURRENT_DATE())) AS period
),
available_ga_sessions AS (
SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6,6}')) AS ga_day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
)
SELECT A.period AS Day, B.ga_day AS Available_session
FROM all_days A
LEFT JOIN available_ga_sessions B
ON A.period = B.ga_day
WHERE B.ga_day IS NULL
The above query will give us all the days that are missing in the destination table.
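The same check can be done directly in the shell that ran the copy script, by comparing the list of days against the backup table names. A minimal sketch, assuming both lists are space-separated strings. missing_backups is a made-up helper, and in practice the existing table names would come from something like bq ls on the destination dataset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every expected day that has no
# ga_sessions_backup_<day> entry in the list of existing tables.
# Both arguments are space-separated strings.
missing_backups() {
  local expected=$1 existing=$2 day
  for day in $expected; do
    case " $existing " in
      *" ga_sessions_backup_$day "*) ;;  # backup table is present
      *) echo "$day" ;;                   # backup table is missing
    esac
  done
}

# Illustration with hard-coded lists.
expected="20180322 20180323 20180324"
existing="ga_sessions_backup_20180322 ga_sessions_backup_20180324"
missing_backups "$expected" "$existing"
```

Any day that gets printed can be fed back into the copy loop.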
My bash script ended up looking like this (note that I'm cloning the tables with the --clone flag):
tables=("20220331" "20220401")
bq --location=eu mk --dataset destination_project_id:destination_dataset
for val in ${tables[*]}; do
echo source_project_id:source_dataset.ga_sessions_$val
bq cp --clone source_project_id:source_dataset.ga_sessions_$val destination_project_id:destination_dataset.ga_sessions_backup_$val
done
I want to create a table whose schema is exactly the same as another table's. In other SQL engines, I think I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note that this does not preserve partitioning, table description, etc., however. Another option is to use the CLI (or API), making a copy of the table and then overwriting its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
To create a partitioned and/or clustered table the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0
I want to create a table in Hive using a select statement that takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there are no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example I am trying to do something like :
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This does not work, though. I know the alternative is to create the table structure with field names and the "FIELDS TERMINATED BY '|'" clause and then load the data.
But is there any way to combine the two into a single query that creates a table with filtered data from another table and also sets a field separator?
Put ROW FORMAT DELIMITED ... in front of AS SELECT, like this (change the query to your own):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
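Applied to the query from the question, the clause order can be sketched like this. build_ctas is a made-up helper that only prints the statement. It assumes the delimiter is a single character that is safe to embed in single quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a Hive CTAS statement with the ROW FORMAT
# clause placed before AS SELECT, which is the order Hive expects.
build_ctas() {
  local target=$1 delimiter=$2 select_stmt=$3
  printf "CREATE TABLE %s ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s' AS %s;\n" \
    "$target" "$delimiter" "$select_stmt"
}

# The table names and filter are the ones from the question.
stmt=$(build_ctas sample_db.out_table '|' \
  "SELECT * FROM sample_db.in_table WHERE country = 'Canada'")
echo "$stmt"
```

The printed statement can be pasted into the hive prompt or run with hive -e "$stmt".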
What is an efficient way to export data from a Hive/Impala table, with conditions, into a file? The data would be huge, close to 10 GB. The Hive table is stored as Parquet with Snappy compression, and the output file should be CSV.
The table is partitioned daily and the data needs to be extracted on a daily basis. I would like to know which of the following would be more efficient:
1) Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
2) Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
Assume tbl is your Hive Parquet table and condition is your filter condition.
CTAS command:
CREATE TABLE tbl_text ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/data' AS select * from tbl where condition;
You will find your CSV text files (delimited by ',') at /tmp/data in HDFS.
You can copy them to your local file system if needed using:
hadoop fs -get /tmp/data
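The fetched directory usually holds several part files rather than one CSV. A rough sketch of stitching them together locally follows. merge_part_files is a made-up helper, the demo file names are invented, and the output file is assumed to live outside the source directory. On the HDFS side, hadoop fs -getmerge does a similar job in one step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenates every regular file in a directory
# (in glob order) into a single output file.
merge_part_files() {
  local dir=$1 out=$2 part
  : > "$out"                      # start with an empty output file
  for part in "$dir"/*; do
    if [ -f "$part" ]; then
      cat "$part" >> "$out"
    fi
  done
}

# Demo with two invented part files in a temporary directory.
demo_dir=$(mktemp -d)
printf '1,emp1\n' > "$demo_dir/000000_0"
printf '2,emp2\n' > "$demo_dir/000001_0"
merge_part_files "$demo_dir" "$demo_dir.csv"
cat "$demo_dir.csv"
```

Hidden files such as local checksum files are skipped because the * glob does not match names starting with a dot.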
Use dynamic partitioning for your Hive/Impala table so that conditional exports are efficient.
Partition the table on the columns you filter by in your queries for the best results.
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your Main Hive Table with partition columns, example :
CREATE TABLE employee (
emp_id int,
emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from Temporary table to your employee table (Main Table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: export the data from hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
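For the export in step 5: by default Hive writes INSERT OVERWRITE DIRECTORY output as text with the fields separated by Ctrl-A (\001), so a small post-processing step is needed if CSV is the goal. A minimal sketch, assuming no field contains a comma or a newline. ctrl_a_to_csv is a made-up helper name and the sample row is invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns Hive's default Ctrl-A (\001) field separator
# into commas. It does no quoting, so it only suits data without embedded commas.
ctrl_a_to_csv() {
  tr '\001' ','
}

# Invented sample row in Hive's default text layout.
printf '1001\001emp1\001CALIFORNIA\n' | ctrl_a_to_csv
```

In practice the input would be the part files written to the output directory, piped through the function.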
Hope this is useful.
Can anyone tell the difference between create-hive-table & hive-import method? Both will create a hive table, but still what is the significance of each?
hive-import command:
The hive-import command automatically populates the metadata for the imported tables in the Hive metastore. If the table does not exist in Hive yet, Sqoop will simply create it based on the metadata fetched for your table or query. If the table already exists, Sqoop will import data into the existing table. If you’re creating a new Hive table, Sqoop will convert the data type of each column from your source table to a type compatible with Hive.
create-hive-table command:
Sqoop can generate a Hive table (using the create-hive-table command) based on a table from an existing relational data source. If the --create-hive-table option is set on an import, the job will fail if the target Hive table already exists. By default this property is false.
Using create-hive-table command involves three steps: importing data into HDFS, creating hive table and then loading the HDFS data into Hive. This can be shortened to one step by using hive-import.
During a hive-import, Sqoop will first do a normal HDFS import to a temporary location. After a successful import, Sqoop generates two queries: one for creating a table and another one for loading the data from a temporary location. You can specify any temporary location using either the --target-dir or --warehouse-dir parameter.
An example of the above description follows.
Using create-hive-table command:
Involves three steps:
Importing data from RDBMS to HDFS
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --split-by empid -m 1;
Creating hive table using create-hive-table command
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --fields-terminated-by ',';
Loading data into Hive
hive> load data inpath "employees" into table employees;
Loading data to table default.employees
Table default.employees stats: [numFiles=1, totalSize=70]
OK
Time taken: 2.269 seconds
hive> select * from employees;
OK
1001 emp1 101
1002 emp2 102
1003 emp3 101
1004 emp4 101
1005 emp5 103
Time taken: 0.334 seconds, Fetched: 5 row(s)
Using hive-import command:
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table departments --split-by deptid -m 1 --hive-import;
The difference is that create-hive-table creates a table in Hive based on the source table in the database but does NOT transfer any data. The command "import --hive-import" both creates the table in Hive and imports the data from the source table.