How to find the location of a partition in Hive? - sql

If I write a hive sql like
ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023';
How can I query the location of this partition later? I found there is some data at that location, but I can't query it with a Hive query like
SELECT data FROM tbl_name where dt=20131023;

Do a describe on the partition instead of the full table.
This will show the linked location if it's an external table.
describe formatted tbl_name partition (dt='20131023')

show table extended like 'tbl_name' partition (dt='20131023');
Show Tables/Partitions Extended
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use regular expression for table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If partition is present, it will output the given partition's file system information instead of table's file system information.

If you have multiple nested partitions, the syntax is:
describe formatted table_name partition (day=123,hour=2);

If you want to know the location of files you're reading, use
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE FROM <table> WHERE <part_name> = '<part_key>'
Then you get
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_0.snappy, 0
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_1.snappy, 0

This is the format of the command I use to get the exact HDFS location of a specific partition in a specific table:
show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
In the command above, the partition spec consists of three separate fields. Yours may have more or fewer.
See results below. Notice the "location:" field shows the HDFS folder location.
hive (nva_test)> show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
OK
tableName:flight_context_fused_record
owner:nva-prod
location:hdfs://hdp1-ha/tmp/vfisher/cms-context-acquisition-2019-06-13/FlightContextFusedRecord/2018/10/13/ZMP/P-DUK2nESsv
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string primary_key, string facility, string position, i32 dalr_channel, i64 start_time_unix_millis, i64 end_time_unix_millis, string foreign_key_to_audio_segment, struct<on_frequency_flight_list:list<struct<acid:string,ac_type:string>>,transfer_list:list<struct<primary_key:string,acid:string,data_id:string,ac_type:string,from_facility:string,from_position:string,transition_time:i64,transition_time_start:i64,transtition_time_end:i64,to_facility:string,to_position:string,source:string,source_info:string,source_time:i64,confidence:double,confidence_description:string,uuid:string>>,source_list:list<string>,domain:string,domains:list<string>> flight_context}
partitioned:true
partitionColumns:struct partition_columns { i32 date_key, string partition_id, string custom_partition_1}
totalNumberFiles:1
totalFileSize:247075687
maxFileSize:247075687
minFileSize:247075687
lastAccessTime:1561122938361
lastUpdateTime:1561071155639
The generic form of the command (taking out my specific values and putting in argument specifiers) looks like this:
show table extended like <your table name here> partition(<your partition spec here>);

You can simply do this:
DESC FORMATTED tablename PARTITION (yr_no='y2019');
OR
DESC EXTENDED tablename PARTITION (yr_no='y2019');

You can get the location of the Hive partitions on HDFS by running any of the following Hive commands.
DESCRIBE FORMATTED tbl_name PARTITION(dt=20131023);
SHOW TABLE EXTENDED LIKE tbl_name PARTITION(dt=20131023);
Alternatively, you can also get it by running the HDFS list command:
hdfs dfs -ls <your Hive store location>/<tablename>
Link: Hive show or list all partitions

You can get this info via the Hive Metastore Thrift protocol, e.g. with the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> with client as c:
...     partition = c.get_partition_by_name(db_name='default',
...                                         tbl_name='test_table_with_partitions',
...                                         part_name='dt=20210504')
...
>>> partition.sd.location
'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'
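For completeness, creating and opening the client might look like this (a sketch; the metastore host and the default Thrift port 9083 are assumptions you would adjust to your deployment):
>>> from hmsclient import hmsclient
>>> # assumed metastore endpoint; not the HDFS namenode address shown in the output above
>>> client = hmsclient.HMSClient(host='hdfs.master.host', port=9083)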

Related

select row from orc snappy table in hive

I have created a table employee_orc which is orc format with snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using the insert statement.
employee_orc table has 1000 records.
When I run the below query, it shows all the records
select * from employee_orc;
But when I run the below query, it shows zero results even though the records exist.
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think they are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
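If leading or trailing spaces do turn out to be the problem, a more exact check (a sketch, not verified against your data) would be:
select * from employee_orc where trim(emp_id) = 'EMP456';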
For my part, I don't understand why you want to specify a delimiter for ORC. Are you confusing CSV and ORC, or external vs. managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Redshift showing 0 rows for external table, though data is viewable in Athena

I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, so that means it can find the data, but it can't actually show anything. I'm guessing it's some mis-configuration somewhere, but I'm not sure what.
Some stuff that may be relevant (I anonymized some stuff):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;
create external table spectrum_staging.errors(
id varchar(100),
error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
Then run SELECT pg_last_query_id(); after your query to get the query number and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR to find further details about the query execution.
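For example (a sketch; <query_id> stands for the number returned by pg_last_query_id()):
select pg_last_query_id();
-- plug the id returned above into the system tables
select * from stl_s3client where query = <query_id> order by recordtime desc;
select * from stl_s3client_error where query = <query_id> order by recordtime desc;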
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog. Redshift Spectrum picks up all the tables that are in the Catalog.
What's probably going on there is that you somehow have two things with the same name and in one case it picks it up from the data catalog and in the other case it tries to use the external table.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
Was there ever a resolution for this? A year on, I have the same problem today.
Nothing stands out in terms of schema differences; an error exists, though:
select recordtime, file, process, errcode, linenum as line,
trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues. However, when querying the external table from Redshift the results were null.
Resolved by converting to parquet format as Spectrum cannot handle regular expression serialization.
See link below:
Redshift spectrum shows NULL values for all rows

Athena partition locations

I can view all the partitions on my table using
show partitions my_table
and I can see the location of a partition by using
describe formatted my_table partition (partition_col='value')
but I have a lot of partitions, and don't want to have to parse the output of describe formatted if it can be avoided.
Is there a way to get all partitions and their locations, in a single query?
There's no built-in or consistent way to get this information.
Assuming you know your partition column(s), you can get this information with a query like
select distinct partition_col, "$path" from my_table
The cheapest way to get the locations of the partitions of a table is to use the GetPartitions call from the Glue API. It will list all partitions, their values and locations. You can try it out using the AWS CLI tool like this:
aws glue get-partitions --region us-somewhere-1 --database-name your_database --table-name the_table
Using SQL like SELECT DISTINCT partition_col, "$path" FROM the_table could be expensive since Athena unfortunately scans the whole table to produce the output (it could have just looked at the table metadata but that optimization does not seem to exist yet).
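If you only want the values and locations from that call, a JMESPath filter trims the CLI output down (a sketch reusing the placeholder names above):
aws glue get-partitions --region us-somewhere-1 --database-name your_database --table-name the_table --query 'Partitions[].[Values, StorageDescriptor.Location]' --output text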
Using boto3 (as of version 1.12.9), the following returns the complete list:
import boto3

# db_name and table_name are assumed to be defined beforehand
glue_client = boto3.client("glue")
glue_paginator = glue_client.get_paginator("get_partitions")
pages_iter = glue_paginator.paginate(
    DatabaseName=db_name, TableName=table_name
)

# Collect each partition's values and storage location from every page
res = []
for page in pages_iter:
    for partition in page["Partitions"]:
        res.append(
            {
                "Values": partition["Values"],
                "Location": partition["StorageDescriptor"]["Location"],
            }
        )

hive can't change column type Invalid column reference [duplicate]

I have a table which has a partition of type int but which I want to convert to string. However, I can't figure out how to do this.
The table description is:
Col1 timestamp
Col2 string
Col3 string
Col4 string
Part_col int
# Partition information
# col_name data_type comment
Part_col int
The partitions I have created are Part_col=0, Part_col=1, ..., Part_col=23
I want to change them to Part_col='0' etc
I run this command in hive:
set hive.exec.dynamic.partitions = true;
Alter table tbl_name partition (Part_col=0) Part_col Part_col string;
I have also tried using "partition (Part_col)" to change all partitions at once.
I get the error "Invalid column reference Part_col"
I am using the example from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types for conversion of decimal columns but can't figure out what dec_column_name represents.
Thanks
A bit of digging revealed that there is a Hive JIRA for a command exactly for updating the partition column data type (https://issues.apache.org/jira/browse/HIVE-3672):
alter table {table_name} partition column ({column_name} {column_type});
According to the JIRA, the command was implemented, but it appears it was never documented on the Hive wiki.
I used it on my Hive 0.14 system and it worked as expected.
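Applied to the table in the question, that would be something like (an untested sketch):
alter table tbl_name partition column (Part_col string);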
I think you should redefine the table's schema so that your partition column is no longer an integer but a string.
What I recommend you to do is:
Make your table external (in case you defined it as a managed table). This way you can drop the table without removing the data in the directories.
Drop the table.
Create the table again with the new schema (partition column as a string).
Physically (the folder structure), the steps above will not make any difference to what you already had. The difference will be in the Hive metastore, specifically in the "virtual column" created when you make partitions.
Also, instead of writing queries like part_col = 1, you will now be able to write queries like part_col = '1'.
Try this and tell me how this goes.
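A rough sketch of those steps, using the schema from the question (the location is an assumption you would replace with the table's actual directory):
-- 1. Ensure dropping the table will not delete the data
alter table tbl_name set tblproperties ('EXTERNAL'='TRUE');
-- 2. Drop only the metadata
drop table tbl_name;
-- 3. Recreate with the partition column as a string
create external table tbl_name (
  Col1 timestamp,
  Col2 string,
  Col3 string,
  Col4 string
)
partitioned by (Part_col string)
location '/path/to/existing/table/dir';  -- assumption: the table's original HDFS directory
-- 4. Re-register the existing Part_col=0 ... Part_col=23 directories
msck repair table tbl_name;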

Hive non consistent input file

I have an inconsistent log file which I would like to partition with Hive using dynamic partitioning. File example:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defines\n
Sometimes the log file contains a line which starts with some word other than a date. Each line is delimited with \n.
I am running commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING,time STRING,severity STRING,message STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING,severity STRING,message STRING) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
As a result two dynamic partitions were created: date=20/06/13 and date=BandwidthGlobalSettings:Old
I would like to tell Hive to ignore lines that do not start with a date string.
How can I do this? Or maybe another solution exists?
Thanks.
I think you can write a UDF that uses a regular expression to keep only values in date format (e.g. 20/06/13) and discard all others like "BandwidthGlobalSettings:Old". You can use this UDF in your last query while inserting into the final table.
I hope this explanation helps.
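Alternatively, instead of a custom UDF, Hive's built-in RLIKE operator can do the filtering directly in the insert query (a sketch, assuming the date field always looks like 20/06/13):
FROM log_messages_temp pvs
INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date)
SELECT pvs.time, pvs.severity, pvs.message, pvs.date
WHERE pvs.date RLIKE '^\\d{2}/\\d{2}/\\d{2}$';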