I have a partitioned table in Hive which is partitioned by "year, month", so my HDFS layout is
/user/hive/warehouse/demo.db/employee/year=2017/month=6
When I use "export" to export the table and "import" to create a new table, the result is that year and month are exchanged; the layout is
/user/hive/warehouse/demo.db/new_employee/month=6/year=2017
My Hive version is 1.2.2 and the queries are:
export table employee to '/user/hadoop/data';
import table new_employee from '/user/hadoop/data';
The partitions in Hive are no different from the original table (the one I exported). Even if I add a new partition, the directory layout is not changed; it works as '/month=7/year=6'.
So what went wrong? Thanks for the help!
Is there an issue in the way you are looking at the data? As long as you don't have that issue, it is not a problem. By the way, this is the right export command for exporting a partitioned table:
export table employee partition (year="2017", month="6") to 'hdfs_exports_location/employee';
import from 'hdfs_exports_location/employee';
Well, if you have more partitions on year and month (e.g. each year has 12 months of data), I think you may have to run a separate command for each month. I have not tried it; just try it with the above command and let us know how it goes.
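If a separate command per month is needed, a small shell loop can generate them. A dry-run sketch: the table name and export path are assumptions, and nothing is executed against Hive here — pipe the output to `hive -e` (or beeline) to actually run it:

```shell
# Print one EXPORT statement per month of 2017 (dry run only; the
# table name and 'hdfs_exports_location' path are placeholders).
year=2017
for month in 1 2 3 4 5 6 7 8 9 10 11 12; do
  echo "EXPORT TABLE employee PARTITION (year=\"${year}\", month=\"${month}\") TO 'hdfs_exports_location/employee/${year}_${month}';"
done
```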
I am using HIVE to load data into different partitions.
I am creating a table
CREATE TABLE IF NOT EXISTS X ... USING PARQUET PARTITIONED BY (Year, Month, Day)
LOCATION '...'
Afterwards I am performing a full load:
INSERT OVERWRITE TABLE ... PARTITION (Year, Month, Day)
SELECT ... FROM Y
Show partitions shows me all partitions correctly.
After the full load, I just want to always reload the current year dynamically:
INSERT OVERWRITE TABLE ... PARTITION (Year, Month, Day)
SELECT ... FROM Y WHERE Year = YEAR(CURRENT_DATE())
The issue I have is that HIVE deletes all PREVIOUS partitions, i.e. 2017 and 2018, and only 2019 persists. I assumed that HIVE would ONLY overwrite the partition for 2019, not all of them.
I suppose I am doing something wrong - any idea is welcome.
Try using "Insert into table" instead of "Insert overwrite table". It should solve your problem. :)
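A minimal sketch of that suggestion, reusing the shape of the statement from the question (the target table name is an assumption and the column list is elided as in the original):

```sql
-- INSERT INTO appends to the matching partitions instead of replacing
-- them, so the 2017 and 2018 partitions are left untouched.
INSERT INTO TABLE target PARTITION (Year, Month, Day)
SELECT ... FROM Y WHERE Year = YEAR(CURRENT_DATE());
```

Note that appending does not clear the old rows for the current year, so this fits a pure append workload rather than a reload.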
Okay, I got the solution once I studied the official Databricks guide more carefully.
Here is the answer:
The semantics are different based on the type of the target table.
Hive SerDe tables: INSERT OVERWRITE doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them at runtime. This matches Apache Hive semantics. For Hive SerDe tables, Spark SQL respects the Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode.
Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (e.g., PARTITION(a=1, b)) and then inserts all the remaining values. Since Databricks Runtime 3.2, the behavior of native data source tables can be changed to be consistent with Hive SerDe tables by changing the session-specific configuration spark.sql.sources.partitionOverwriteMode to DYNAMIC. The default mode is STATIC.
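In practice that means one session-level setting before the overwrite. A sketch in Spark SQL, assuming the table names from the question:

```sql
-- Switch native data source tables to Hive-style dynamic partition
-- overwrite for this session (the default mode is STATIC).
SET spark.sql.sources.partitionOverwriteMode = DYNAMIC;

-- Now only the partitions that receive data (the current year)
-- are replaced; 2017 and 2018 survive.
INSERT OVERWRITE TABLE target PARTITION (Year, Month, Day)
SELECT ... FROM Y WHERE Year = YEAR(CURRENT_DATE());
```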
The question is how to let Google BigQuery automatically create partitioned tables on a daily basis (one day -> one table, etc.)?
I've used the following command in the command line to create the table:
bq mk --time_partitioning_type=DAY testtable1
The table testtable1 appeared in the dataset, but how do I create tables for every day automatically?
From the partitioned table documentation, you need to run the command to create the table only once. After that, you specify the partition to which you want to write as the destination table of the query, such as testtable1$20170919.
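For example, a daily job could target the partition decorator directly. A dry-run sketch: the dataset, table, and query are assumptions, and the `echo` means nothing is submitted to BigQuery — drop it to actually run the job:

```shell
# Print the bq invocation that writes into one daily partition of the
# already-created table (dry run; names are placeholders).
partition=20170919
echo bq query --use_legacy_sql=false \
  --destination_table "mydataset.testtable1\$${partition}" \
  "'SELECT * FROM mydataset.staging_table'"
```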
Note: this is nearly a duplicate of this question with the distinction that in this case, the source table is date partitioned and the destination table does not yet exist. Also, the accepted solution to that question didn't work in this case.
I'm trying to copy a single day's worth of data from one date-partitioned table into a new date-partitioned table that I have not yet created. My hope is that BigQuery would simply create the date-partitioned destination table for me like it usually does for the non-date-partitioned case.
Using BigQuery CLI, here's my command:
bq cp mydataset.sourcetable\$20161231 mydataset.desttable\$20161231
Here's the output of that command:
BigQuery error in cp operation: Error processing job
'myproject:bqjob_bqjobid': Partitioning specification must be provided
in order to create partitioned table
I've tried doing something similar using the python SDK: running a select command on a date partitioned table (which selects data from only one date partition) and saving the results into a new destination table (which I hope would also be date partitioned). The job fails with the same error:
{u'message': u'Partitioning specification must be provided in order to
create partitioned table', u'reason': u'invalid'}
Clearly I need to add a partitioning specification, but I couldn't find any documentation on how to do so.
You need to create the partitioned destination table first (as per the docs):
If you want to copy a partitioned table into another partitioned
table, the partition specifications for the source and destination
tables must match.
So, just create the destination partitioned table before you start copying. If you can't be bothered specifying the schema, you can create the destination partitioned table like so:
bq mk --time_partitioning_type=DAY mydataset.temps
Then, use a query instead of a copy to write to the destination table. The schema will be copied with it:
bq query --allow_large_results --replace --destination_table 'mydataset.temps$20160101' 'SELECT * from `source`'
Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Storage for the year 2016. I stored all the data in one bucket, partitioned by year, month, and date. Here I want to create a table partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016 -- you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
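Those per-day query jobs could be scripted along these lines. A dry-run sketch: the dataset, table names, and column are assumptions, and only the commands are printed — remove the `echo` to actually submit them:

```shell
# One query job per day, each writing to its own partition of the
# destination table via the $YYYYMMDD decorator (dry run only).
for day in 01 02 03; do
  echo bq query --use_legacy_sql=false \
    --destination_table "mydataset.logs\$201605${day}" \
    "\"SELECT * FROM mydataset.raw_logs WHERE log_date = '2016-05-${day}'\""
done
```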
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into a separate table named YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them either using Table wildcard functions (Legacy SQL) or using a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create a Date-Partitioned Table (just one table - YourLogs) - but you will still need to load each daily file into its respective partition - see Creating and Updating Date-Partitioned Tables
After the table is loaded you can easily Query Date-Partitioned Tables
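A sketch of Option 2 with the bq CLI: the bucket path, table name, and source format are assumptions, and the `echo` makes it a dry run (remove it to execute for real):

```shell
# Create the date-partitioned table once, then load each daily file
# into its partition via the $YYYYMMDD decorator (dry run only).
d=20160501
echo bq mk --time_partitioning_type=DAY mydataset.YourLogs
echo bq load --source_format=NEWLINE_DELIMITED_JSON \
  "mydataset.YourLogs\$${d}" "gs://your-bucket/2016/05/01/*.json"
```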
Partitions for an External Table are not allowed as of now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.
I have the default db in Hive which contains 80 tables.
I have created one more database and I want to copy all the tables from the default DB to the new database.
Is there any way I can copy from one DB to the other DB without creating each table individually?
Please let me know if there is any solution.
Thanks in advance
I can think of a couple of options.
Use CTAS.
CREATE TABLE NEWDB.NEW_TABLE1 AS select * from OLDDB.OLD_TABLE1;
CREATE TABLE NEWDB.NEW_TABLE2 AS select * from OLDDB.OLD_TABLE2;
...
Use IMPORT feature of Hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
Hope this helps.
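For 80 tables, the CTAS option above is easiest to script. A dry-run sketch: the database and table names are placeholders — feed in the real list (e.g. from `hive -e 'SHOW TABLES IN default;'`) and pipe the output back into `hive -e` to execute:

```shell
# Generate one CTAS statement per table (dry run; nothing is
# executed against Hive, and the names are placeholders).
for t in table1 table2 table3; do
  echo "CREATE TABLE new_db.${t} AS SELECT * FROM default.${t};"
done
```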
create external table new_db.table like old_db.table location 'hdfs path of the table data';
If the table has partitions, then you have to add the partitions in new_db.table as well.
These are probably the fastest and simplest ways to copy / move tables from one db to another.
To move a table
Since Hive 0.14, you can use the following statement to move a table from one database to another in the same metastore:
alter table old_database.table_a rename to new_database.table_a;
The above statement will also move the table data on HDFS if table_a is a managed table.
To copy a table
You can always use CREATE TABLE <new_db>.<new_table> AS SELECT * FROM <old_db>.<old_table>; statements. But I believe this alternative method of copying the database using hdfs dfs -cp and then creating tables with LIKE can be a little faster if your tables are huge:
hdfs dfs -cp /user/hive/warehouse/<old_database>.db /user/hive/warehouse/<new_database>.db
And then in Hive:
CREATE DATABASE <new_database>;
CREATE TABLE <new_database>.<new_table> LIKE <old_database>.<old_table>;
You can approach it with one of the following options.
The syntax looks something like this:
EXPORT TABLE table_or_partition TO hdfs_path;
IMPORT [[EXTERNAL] TABLE table_or_partition] FROM hdfs_path [LOCATION [table_location]];
Some sample statements would look like:
EXPORT TABLE <table_name> TO 'location in hdfs';
USE test_db;
IMPORT FROM 'location in hdfs';
Export/Import can be applied on a per-partition basis as well:
EXPORT TABLE <table_name> PARTITION (loc="USA") TO 'location in hdfs';
The below import command imports to an external table instead of a managed one:
IMPORT EXTERNAL TABLE FROM 'location in hdfs' LOCATION '/location/of/external/table';