Load partitioned BigQuery table from partitioned ORC - google-bigquery

I want to create a BigQuery table partitioned by the mydate column from partitioned ORC files.
Files in GCS:
mydate=2021-04-01/*.orc
...
mydate=2021-04-30/*.orc
bq command:
bq load --source_format=ORC --time_partitioning_field mydate --time_partitioning_type DAY mydataset.mytable gs://mydata/*.orc
When I run this command I get this error: "The field specified for partitioning cannot be found in the schema". This happens because mydate is not a column in the ORC files; it only appears in the directory names.
How can I handle this?
Thanks for your help and have a nice day.

I think you can do this by providing a custom partition key schema, encoded in the source URI prefix (the source_uri_prefix field).
Links [1] and [2] below describe the partition schema detection modes and include examples.
[1] https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs#command-line-tool
[2] https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs
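As a rough sketch of what such a load could look like for the layout in the question: the snippet below only assembles the bq argument list in Python and prints it, without running anything. The --hive_partitioning_mode and --hive_partitioning_source_uri_prefix flags are the ones described in [1]. The helper name build_bq_load_args is made up for illustration, and whether the detected mydate key can be used directly as the time-partitioning field is an assumption worth checking against the docs.

```python
import shlex


def build_bq_load_args(table, uri_prefix, partition_field):
    """Assemble (but do not run) a bq load command for hive-partitioned ORC files.

    build_bq_load_args is a hypothetical helper, not part of any library.
    """
    prefix = uri_prefix.rstrip("/")
    return [
        "bq", "load",
        "--source_format=ORC",
        # AUTO asks BigQuery to infer partition key names and types from the paths.
        "--hive_partitioning_mode=AUTO",
        # Everything before the first key=value directory (here mydate=...).
        f"--hive_partitioning_source_uri_prefix={prefix}",
        # Assumption: the detected key can serve as the partitioning column.
        f"--time_partitioning_field={partition_field}",
        "--time_partitioning_type=DAY",
        table,
        f"{prefix}/*",
    ]


args = build_bq_load_args("mydataset.mytable", "gs://mydata/", "mydate")
# shlex.join quotes the wildcard so the shell does not expand it.
print(shlex.join(args))
```

The source URI is gs://mydata/* rather than gs://mydata/*.orc so that the mydate=... directories are part of the matched paths.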

Related

Hive partitioning for data on s3

Our data is stored under s3://bucket/YYYY/MM/DD/HH, and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data with AWS Athena just fine, but we also have a Hive query cluster that has trouble querying the data when partitioning is enabled.
This is what I am doing:
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
This doesn't work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,
but it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.
Given the fixed bucket paths of Firehose, I cannot modify the S3 paths. So my question is: what is the right partitioning scheme in the Hive DDL when the data path has no explicit column names such as year= or month=?
You can now specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
If you can't get folder names that follow the Hive naming convention, you will need to map every partition manually:
ALTER TABLE tableName ADD PARTITION (year='YYYY', month='MM', day='DD', hour='HH') LOCATION 's3://bucket/YYYY/MM/DD/HH'
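Mapping partitions manually gets tedious with hourly folders, so the statements are usually generated. Here is a minimal sketch that prints one ALTER TABLE statement per hour for a time range. The function name add_partition_statements is hypothetical, tableName and s3://bucket are the placeholders from above, and the dates are arbitrary examples.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, base_uri, start, end):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hour in [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        year, month, day, hour = current.strftime("%Y %m %d %H").split()
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
            f"LOCATION '{base_uri}/{year}/{month}/{day}/{hour}/';"
        )
        current += timedelta(hours=1)


# Example range: three hourly folders spanning a day boundary.
statements = list(add_partition_statements(
    "tableName", "s3://bucket",
    datetime(2021, 4, 1, 22), datetime(2021, 4, 2, 1),
))
for statement in statements:
    print(statement)
```

The printed statements can be saved to a file and run with the Hive CLI. IF NOT EXISTS makes it safe to rerun the same range.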

Google Bigquery: Partitioning specification needed for copying date partitioned table

Note: this is nearly a duplicate of this question with the distinction that in this case, the source table is date partitioned and the destination table does not yet exist. Also, the accepted solution to that question didn't work in this case.
I'm trying to copy a single day's worth of data from one date-partitioned table into a new date-partitioned table that I have not yet created. My hope was that BigQuery would simply create the date-partitioned destination table for me, as it usually does in the non-date-partitioned case.
Using BigQuery CLI, here's my command:
bq cp mydataset.sourcetable\$20161231 mydataset.desttable\$20161231
Here's the output of that command:
BigQuery error in cp operation: Error processing job
'myproject:bqjob_bqjobid': Partitioning specification must be provided
in order to create partitioned table
I've tried doing something similar using the python SDK: running a select command on a date partitioned table (which selects data from only one date partition) and saving the results into a new destination table (which I hope would also be date partitioned). The job fails with the same error:
{u'message': u'Partitioning specification must be provided in order to
create partitioned table', u'reason': u'invalid'}
Clearly I need to add a partitioning specification, but I couldn't find any documentation on how to do so.
You need to create the partitioned destination table first (as per the docs):
If you want to copy a partitioned table into another partitioned
table, the partition specifications for the source and destination
tables must match.
So, just create the destination partitioned table before you start copying. If you can't be bothered specifying the schema, you can create the destination partitioned table like so:
bq mk --time_partitioning_type=DAY mydataset.temps
Then, use a query instead of a copy to write to the destination table. The schema will be copied with it:
bq query --allow_large_results --replace --destination_table 'mydataset.temps$20160101' 'SELECT * from `source`'

Sqoop import hive ORC

All,
I have a question about Sqoop. I am sqooping around 2 TB of data for one table and then need to write an ORC table with it. What's the best way to achieve this?
1) Sqoop all the data into dir1 as text and write HQL to load it into an ORC table (here the script fails with a vertex issue)
2) Sqoop the data in chunks, then process and append each chunk into the Hive table (has anyone done this?)
3) Use a Sqoop Hive import to write all the data directly to a Hive ORC table
Which is the best way?
Option three is better. You don't need to create a Hive table, load the data into it, and then store it again in ORC format, which is a long process for 2 TB of data. It's better to let Sqoop push the data directly into a Hive table in ORC format. However, when you export data from the Hive table back to an RDBMS, you have to use sqoopserde.

BigQuery insert into a partitioned table from an existing table

I have two tables with the same schema, tab1 and tab1_partitioned, where the latter is partitioned by day.
I am trying to insert data into the partitioned table with the following command:
bq query --allow_large_results --replace --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
but I get the following error:
BigQuery error in query operation: Error processing job 'total-handler-133811:bqjob_r78379ac2513cb515_000001553afb7196_1': Provided Schema does not match Table
Both have exactly the same schema and I really don't understand why I am getting that error. Can someone shed some light on my issue?
In fact, I would prefer it if BigQuery supported the dynamic partition insert that Hive supports, but after a few days of searching it seems that is not possible :-/
The behavior you are seeing is due to how we treat write dispositions when using them with table partitions.
You should be able to append to the partition using a WRITE_APPEND disposition to get the query to go through.
bq query --allow_large_results --append_table --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
There are some complications to making it work with --replace, but we are looking into improved schema support for table partitions at this time.
Please let me know if this doesn't work for you. Thanks!
To answer the other part of your question about dynamic partitioning - we do plan to support richer flavors of partitioning and we believe that they will handle the majority of use cases.
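Until something like dynamic partitioning is available, one workaround is to run the append command above once per day. The sketch below only generates those commands; it does not run them. It assumes the ymd column and table names from the question, and backfill_commands is a made-up helper name.

```python
import shlex
from datetime import date, timedelta


def backfill_commands(source, destination, first_day, last_day):
    """Return one bq query command per day, each appending to that day's partition."""
    commands = []
    day = first_day
    while day <= last_day:
        ymd = day.strftime("%Y%m%d")
        commands.append(shlex.join([
            "bq", "query",
            "--allow_large_results", "--append_table", "--noflatten_results",
            # The $YYYYMMDD decorator targets a single day partition.
            "--destination_table", f"{destination}${ymd}",
            f"select * from {source} where ymd = {ymd}",
        ]))
        day += timedelta(days=1)
    return commands


commands = backfill_commands(
    "advertiser.development", "advertiser.development_partitioned",
    date(2016, 1, 1), date(2016, 1, 3),
)
for command in commands:
    print(command)
```

Because these commands append, running the same day twice would duplicate its rows, so each day should be loaded only once.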
FYI, I don't think it was always so, but there is now a way to copy data from non-partitioned to partitioned tables in bigquery just using DML from the bigquery UI. For example, if you have a date string in your origin table, of the form YYYY-MM-DD, you could run this to move the data to a partitioned table ...
create table my_dataset.my_table (sesh STRING, prod STRING) partition by DATE(_PARTITIONTIME);
insert into my_dataset.my_table (_PARTITIONTIME, sesh, prod) select CAST(PARSE_DATE('%Y-%m-%d', mydatestr) as TIMESTAMP), sesh, prod from my_dataset.my_orig_table;

hive 0.13 msck repair table only lists partitions not in metastore

I'm trying to use the Hive (0.13) msck repair table command to recover partitions, but it only lists the partitions that are not in the metastore instead of also adding them.
Here's the output of the command:
partitions not in metastore externalexample:CreatedAt=26 04%3A50%3A56 UTC 2014/profileLocation="Chicago"
Here's how I'm creating the external table:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Am I missing something?
I had a similar issue with the MSCK REPAIR TABLE listing the partitions that were not in the metastore but not actually adding them (and no error message).
I tried manually adding the partition with the ALTER TABLE ADD PARTITION command, and this gave me an error message, leading me to the root cause which was that the HDFS folder containing the 'missing' partition had been set up with incorrect permissions.
Once the permissions issue was resolved, then the MSCK REPAIR TABLE command worked correctly.
If you encounter this issue, it may be worthwhile to try adding it manually with the ALTER TABLE ADD PARTITION command. It may produce a useful error message that would help you determine the root cause of the problem.
Please make sure that the partition names in your table definition match the partition names on HDFS.
For example, in your table creation example, I see that you haven't defined any partitions at all.
I think you want to do something like this (note the use of PARTITIONED BY):
create external table ExternalExample(
tweetId BIGINT, username STRING, txt STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
PARTITIONED BY (CreatedAt STRING, profileLocation STRING)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Then on hdfs you should have the following folder structure:
/user/hue/exttable/CreatedAt=<someString>/profileLocation=<someString>/your-data-file
The partition names for MSCK REPAIR TABLE ExternalTable must be lowercase; only then will it add them to the Hive metastore. I faced a similar issue in Hive 1.2.1, where there was no support for ALTER TABLE ExternalTable RECOVER PARTITION. After spending some time debugging, I found that the partition names must be lowercase, i.e. /some_external_path/mypartition=01 is valid and /some_external_path/myPartition=01 is invalid.
Rename profileLocation to profilelocation or profile_location and test again; it should work.
My question is here: Not able to recover partitions through alter table in Hive 1.2
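If lowercase names are indeed required in your Hive version (this is as reported above, not something verified here), a quick check over the directory paths can flag the offending keys before running MSCK REPAIR TABLE. This is only a sketch: mixed_case_partition_keys is a made-up helper, and the partition values in the example path are placeholders.

```python
def mixed_case_partition_keys(path):
    """Return the partition keys in an HDFS-style path that are not all lowercase."""
    keys = []
    for segment in path.strip("/").split("/"):
        # Partition directories follow the key=value convention.
        key, sep, _value = segment.partition("=")
        if sep and key != key.lower():
            keys.append(key)
    return keys


# Placeholder values (2014, Chicago) stand in for real partition values.
path = "/user/hue/exttable/CreatedAt=2014/profileLocation=Chicago/your-data-file"
print(mixed_case_partition_keys(path))
print(mixed_case_partition_keys("/some_external_path/mypartition=01"))
```

An empty list means every key=value directory in the path already uses a lowercase key.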
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (manually by hadoop fs -put command), the metastore will not be aware of these partitions.
You need to add each partition with
ALTER TABLE ExternalExample ADD PARTITION
for every partition,
or, in short, you can run
MSCK REPAIR TABLE ExternalExample;
It will add any partitions that exist on HDFS but not in metastore to the metastore.
Ref https://issues.apache.org/jira/browse/HIVE-874
1) You need to specify partitions
2) Partition names must be all lowercase. See https://singhanuvrat.com/hive-partition-column-name-camelcase-bad-idea-b89796d4e741#.16d7uqfot
You might not be running as the hive user:
sudo -u hive hive -e "set hive.msck.path.validation=ignore; msck repair table T1"
set hive.msck.path.validation=ignore; (this is for tables with a large number of partitions.)
You are just missing the PARTITIONED BY (CreatedAt STRING, profileLocation STRING).