Hive doesn't identify manually created folders as partitions

I conducted an experiment. I have an external table partitioned by year, month, day, and hour. If I use INSERT OVERWRITE and specify a certain partition for the data to go to, it ends up creating the appropriate folder structure, e.g.
INSERT OVERWRITE TABLE default.testtable PARTITION(year = 2016, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM (select 'Test' as c1) as tbl;
This table has only one string column but that's not very important.
So the above statement creates the appropriate folder structure. But if I manually create a similar structure and fire a SELECT query, Hive doesn't return the data from the manually created folders. In terms of structure, I made sure the manually created folders look exactly the same as the auto-created ones, with a 0-size file at every level of the hierarchy. Is it because, whenever we insert data into a specific partition, Hive creates that partition (if it doesn't exist) and also stores the partition information in its metastore? That is the only thing that would be bypassed if I created the folder structure manually.

I just figured out that merely creating a folder manually won't make Hive start treating it as a partition. I have to force Hive to treat it as a partition using an ALTER TABLE ADD PARTITION statement:
ALTER TABLE default.testtable ADD IF NOT EXISTS PARTITION (year = 2016, month = 7, day=29, hour = 18);
After this, if I fire a SELECT statement on the table, I am able to see the data in that manually created folder.
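As a related note, if many such folders have to be registered at once and they follow Hive's key=value naming convention (e.g. year=2016/month=7/day=29/hour=18), MSCK REPAIR TABLE can discover them in bulk instead of one ALTER TABLE per partition. A minimal sketch against the table above, assuming that naming convention:
MSCK REPAIR TABLE default.testtable;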

Related

How to update table in Athena

I am creating a table 'A' from another table 'B' in Athena using a CREATE TABLE AS SELECT query. However, table 'B' is updated with new rows every hour. I want to know how I can update table A's data without dropping table A and creating it again.
I tried dropping the table and creating it again, but that seems to create a performance issue, since a new table is created every time. I want to insert into table A only the new rows that are added to table B.
Amazon Athena is a query engine, not a database.
When a query runs on a table, Athena uses the location of a table to determine where the data is stored in an Amazon S3 bucket. It then reads all files in that location (including sub-directories) and runs the query on that data.
Therefore, the easiest way to add data to Amazon Athena tables is to create additional files in that location in Amazon S3. The next time Athena runs a query, those files will be included as part of the referenced table. Even running the INSERT INTO command creates new files in that location. ("Each INSERT operation creates a new file, rather than appending to an existing file.")
If you wish to copy data from Table-B to Table-A, and you know a way to identify which rows to add (e.g. there is a column with a timestamp), you could use something like:
INSERT INTO table_a
SELECT * FROM table_b
WHERE timestamp_field > (SELECT MAX(timestamp_field) FROM table_a)
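If table_a does not exist yet, it could initially be populated with a CTAS statement before the incremental inserts above; a rough sketch, where the S3 path and Parquet format are placeholders and not something from the question:
CREATE TABLE table_a
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/table_a/'  -- hypothetical bucket path
) AS
SELECT * FROM table_b;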

Unable to load managed table with maptype column (complex datatype) from external table in hive

I have an external table with a complex datatype (map(string, array(struct))), and I'm able to select and query this external table without any issue.
However, if I try to load this data into a managed table, it runs forever. Is there a better approach to loading this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to its warehouse directory restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load the data into another managed table, you can simply copy the files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(id string,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>> ) ;
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This works most efficiently and is much faster than INSERT ... SELECT:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
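For completeness, the slower INSERT ... SELECT route mentioned above would look roughly like this sketch, reusing the table names from the example; it rewrites all the data through Hive, which is why the file copy is preferred:
use db;
-- rewrite the external table's rows into the managed table
INSERT INTO TABLE tbl_managed
SELECT id, list FROM TBL;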

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table (see the sketch after this list)
You can also remove a column by exporting your table data to Cloud Storage, deleting the data corresponding to the column (or columns) you want to remove, and then loading the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
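For the first option, a minimal sketch; the table and column names here are placeholders, not from the question:
SELECT * EXCEPT (column_to_drop)   -- column_to_drop is hypothetical
FROM `project.dataset.my_table`
Then set the same table (or a new destination table) as the query's destination with the appropriate write preference, as described in the edit below.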
There is a guide published for Manually Changing Table Schemas.
Edit:
In order to change a partitioned table to a non-partitioned table, you can use the Console to query your data and overwrite your current table or copy the result to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above query, you select the data from all of the table's partitions and create an extra column that shows which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and write your new table's name > under "Destination table write preference" check "Write if empty".
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > write the same table name as the one you want to overwrite > under "Destination table write preference" check "Overwrite table".

Hive - Is it mandatory to have '=' for external table to consider as partition

I am new to Hive and have a basic question:
I am trying to create an external table over the HDFS directory at location
/projects/score/output/scores_2020-06-30.gzip
but it is not being treated as a partition.
Does the developer need to rename the directory to "scores=yyyy-mm-dd" instead of "scores_yyyy-mm-dd.gzip",
like "/projects/score/output/scores=2020-06-30",
for it to be considered partitioned?
i.e. is it mandatory to have '=' in the directory name for an external table to treat it as a partition?
Or can I change something in the table below while creating it? Trying as below:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ (
...
)
PARTITIONED BY (scores STRING)
LOCATION '/projects/score/output/';
Thanks in advance!
You can define a partition on top of any location, even outside the table directory, using ALTER TABLE ADD PARTITION. A partition in HDFS is a directory, usually inside the table location but not necessarily. If it is inside the table directory, you can use MSCK REPAIR TABLE to attach existing subdirectories inside the table directory as partitions; it will scan the table location and add the partition metadata.
In your example the partition directory is missing; you have only the table directory with a file inside it. The filename does not matter in this case.
It is not absolutely mandatory to have the partition directory in the key=value format. MSCK REPAIR TABLE may not work in your Hive distribution in that case, but you can still add partitions using an ALTER TABLE ADD PARTITION ... LOCATION command on top of any directory.
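For instance, a sketch using the table and partition column from the question; the partition directory path here is illustrative (the directory would have to exist, with the data file moved into it):
ALTER TABLE XYZ ADD IF NOT EXISTS PARTITION (scores='2020-06-30')
LOCATION '/projects/score/output/2020-06-30';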
It may depend on the vendor. For example, on Qubole, ALTER TABLE RECOVER PARTITIONS (the EMR alternative to MSCK REPAIR TABLE) works fine with directories like '2020-06-30'.
By default, when inserting data using dynamic partitioning, Hive creates partition folders in the key=value format, but if you are creating partition directories with some other tool, 'value' alone as the partition folder name is okay. Just check whether MSCK REPAIR works in your case; if it does not, create key=value directories.
The name of the file(s) and the number of files inside the partition folder do not matter in this context.

External Table in Hive - Location

The table below returns no data when running a SELECT statement:
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my Hive table to point to a dynamic folder, so that when a MapReduce job puts a part file into a folder, Hive loads it into the table.
Is there any way the location can be made dynamic, like
/user/data/CSV/*/*/*/*/part-*
or would just /user/data/CSV/* do fine?
(The same code works fine when created as an internal table and loaded with the file path - hence there are no issues due to formatting.)
First off, your table definition is missing columns. Second, an external table location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
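A rough sketch of that approach, assuming a single placeholder column `line` (the question omits the real column list) and a date-style partition column; the paths mirror the ones in the question:
CREATE EXTERNAL TABLE foo_by_date (
line string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV';

ALTER TABLE foo_by_date ADD IF NOT EXISTS PARTITION (dt='2016-01-27')
LOCATION '/user/data/CSV/2016/1/27';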
Hive does not iterate through multiple folders.
Hence, for the above scenario,
I ran a command-line step that iterates through these multiple folders, cats (prints to the console) all the part files, and then puts the output at a single desired location (which Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct; I don't think a table can be created from multiple locations. Have you tried importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct. If you are using a CSV file to import your data, try delimiting with ','.
You can use ALTER TABLE statements to change the locations. In the example below, partitions are based on dates, and data is stored in time-dependent file locations. If I want to search many days, I have to add an ALTER TABLE statement for each location. This idea may extend to your situation quite well. You can create a script that generates the statements below using some other technology such as Python.
CREATE EXTERNAL TABLE foo (
)
PARTITIONED BY (`date` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (`date`='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (`date`='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (`date`='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (`date`='20160204') location '/user/data/CSV/20160204/data';
You can use as many ADD and DROP statements as you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one location.
You may also be able to leverage a
create table ... like
statement to create a schema like the one you have in another table, and then alter the table to point at the files you want.
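For example, a sketch reusing the table name from above; the new table name and location are placeholders:
CREATE EXTERNAL TABLE foo_2016 LIKE foo
LOCATION '/user/data/CSV/2016';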
I know this isn't exactly what you want and is more of a workaround. Good luck!