Difference between last_modified_time and last_ddl_time in Hive?

In Hive, when we use the following command: SHOW CREATE TABLE TABLE_NAME;
it returns a list of metadata related to that particular table. Among that metadata, there are two fields I am confused about:
'last_modified_time'='1620814731',
'transient_lastDdlTime'='1620820769'
What is the underlying difference between these two properties?
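Both values are stored in TBLPROPERTIES as Unix epoch seconds; a minimal sketch to read them side by side with Hive's from_unixtime:
-- epoch values taken from the table properties above
SELECT from_unixtime(1620814731) AS last_modified_time,
       from_unixtime(1620820769) AS transient_lastDdlTime;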

Related

List wildcard tables in a BigQuery schema

I want to list the tables which can be used as wildcards in BigQuery.
My dataset has a list of tables similar to the following:
events_122022
events_122021
events_122020
...
...
events_112012
...
...
analytics_122022
analytics_122021
analytics_122020
...
...
analytics_112012
These tables are created dynamically, and I have no information on the table prefixes used.
Is there a way to find the list of prefixes which can be used as wildcards?
The result should be: [events_, analytics_]
My attempt:
Find the tables with similar DDL using the following SQL:
SELECT
  SUBSTR(ddl, STRPOS(ddl, '(')) as commonDDL,
  STRING_AGG(table_name) as table
FROM dataset.INFORMATION_SCHEMA.TABLES
GROUP BY SUBSTR(ddl, STRPOS(ddl, '('))
This gives the output as:
commonDDL          | table
(ID STRING, ...)   | events_122022, events_122021, ...
(NAME STRING, ...) | analytics_112022, analytics_112021, ...
Now, using a longest-common-prefix algorithm, I can find the required result.
(longest common prefix code here)
What other ways could we approach this problem?
I couldn't find anything in the BigQuery docs.
Note: I only have read-only permission for the BigQuery dataset.
What about finding your tables with some regex:
select table_name
from yourds.INFORMATION_SCHEMA.TABLES
where regexp_contains(table_name, "_[0-2]+") is true
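If you want the prefix list itself, a variation on the same idea (just a sketch, assuming every suffix is a trailing run of digits) is to strip the digits and deduplicate:
-- assumes table names end in an underscore followed only by digits
select distinct regexp_extract(table_name, r'^(.*_)[0-9]+$') as prefix
from yourds.INFORMATION_SCHEMA.TABLES
where regexp_contains(table_name, r'_[0-9]+$')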

BigQuery: get the name of the table as a column value

I have a dataset that contains several tables that have suffixes in their name:
table_src1_serie1
table_src1_serie2
table_src2_opt1
table_src2_opt2
table_src3_type1_v1
table_src3_type2_v1
table_src3_type2_v2
I know that I can use this type of query in BQ:
select * from `project.dataset.table_*`
to get all the rows from these different tables.
What I am trying to achieve is to have a column that contains, for instance, the type of source (src1, src2, src3).
Assuming the schema of all tables is the same, you can add the below to your SELECT list (for BigQuery Standard SQL):
SPLIT(_TABLE_SUFFIX, '_')[SAFE_OFFSET(0)] AS src
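For example, a full query along those lines (a sketch, assuming the project.dataset names from the question) would be:
-- table_* matches table_src1_serie1, table_src2_opt1, etc.
SELECT *, SPLIT(_TABLE_SUFFIX, '_')[SAFE_OFFSET(0)] AS src
FROM `project.dataset.table_*`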

Redshift showing 0 rows for external table, though data is viewable in Athena

I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, so that means it can find the data, but it can't actually show anything. I'm guessing it's some mis-configuration somewhere, but I'm not sure what.
Some stuff that may be relevant (I anonymized some stuff):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;
create external table spectrum_staging.errors(
id varchar(100),
error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
Then run SELECT pg_last_query_id(); after your query to get the query number and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR to find further details about the query execution.
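A sketch of that check (xyz stands for the id returned by pg_last_query_id()):
select pg_last_query_id();
-- xyz = the query id returned above
select * from stl_s3client where query = xyz;
select * from stl_s3client_error where query = xyz;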
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog. Redshift Spectrum picks up all the tables that are in the Catalog.
What's probably going on is that you somehow have two things with the same name: in one case it picks it up from the Data Catalog, and in the other case it tries to use the external table.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
Was there ever a resolution for this? A year on, I have the same problem today.
Nothing stands out in terms of schema differences; an error exists though:
select recordtime, file, process, errcode, linenum as line,
       trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues. However, when querying the external table from Redshift, the results were null.
Resolved by converting to Parquet format, as Spectrum cannot handle regular-expression serialization.
See link below:
Redshift spectrum shows NULL values for all rows
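One way to do that conversion from the Athena side is a CTAS into Parquet; a sketch, where the errors_regex source table name and the target S3 path are hypothetical:
-- errors_regex and the external_location path are placeholders
CREATE TABLE spectrum_db.errors_parquet
WITH (format = 'PARQUET', external_location = 's3://mybucket/errors_parquet/')
AS SELECT * FROM spectrum_db.errors_regex;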

How to get column names and types in Hive

I know of these:
To get the column names in a table, we can run:
show columns in <database>.<table_name>
To get a description of a table (including column_name, column_type and many other details):
describe [formatted] <database>.<table_name>
I know that I can use the above queries and filter the result to get the column names and types. But is there any direct command to get just the column names and types, like select columns, column_type...?
In Hive you could use:
DESCRIBE FORMATTED [DatabaseName].[TableName] [Column Name];
This gives you the column data type and some stats of that column.
DESCRIBE [DatabaseName].[TableName] [Column Name];
This just gives you the data type and comments if available for a specific column.
Hope this helps.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database, in most cases MySQL or Postgres. If you have access to the metastore database, you can run SELECT on the table TBLS to get details about tables and on COLUMNS_V2 to get details about columns.
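For example, against a MySQL-backed metastore the join looks roughly like this (a sketch; the exact metastore schema can differ between Hive versions):
-- 'your_database' and 'your_table' are placeholders
SELECT c.COLUMN_NAME, c.TYPE_NAME
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE d.NAME = 'your_database' AND t.TBL_NAME = 'your_table'
ORDER BY c.INTEGER_IDX;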

How to find the location of a partition in Hive?

If I write Hive SQL like
ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023';
how can I query this partition's location later? I found there is some data in the location, but I can't query it with Hive SQL like
SELECT data FROM tbl_name where dt=20131023;
Do a describe on the partition instead of the full table.
This will show the linked location if it's an external table.
describe formatted tbl_name partition (dt='20131023')
show table extended like 'tbl_name' partition (dt='20131023');
Show Tables/Partitions Extended
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use regular expression for table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If partition is present, it will output the given partition's file system information instead of table's file system information.
If you have multiple nested partitions, the syntax is:
describe formatted table_name partition (day=123,hour=2);
If you want to know the location of files you're reading, use
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE FROM <table> WHERE <part_name> = '<part_key>'
Then you get
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_0.snappy, 0
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_1.snappy, 0
This is the format of the command I use to get the exact HDFS location of a specific partition in a specific table:
show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
In the command above, the partition spec consists of three separate fields. Your example may have more or fewer.
See results below. Notice the "location:" field shows the HDFS folder location.
hive (nva_test)> show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
OK
tableName:flight_context_fused_record
owner:nva-prod
location:hdfs://hdp1-ha/tmp/vfisher/cms-context-acquisition-2019-06-13/FlightContextFusedRecord/2018/10/13/ZMP/P-DUK2nESsv
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string primary_key, string facility, string position, i32 dalr_channel, i64 start_time_unix_millis, i64 end_time_unix_millis, string foreign_key_to_audio_segment, struct<on_frequency_flight_list:list<struct<acid:string,ac_type:string>>,transfer_list:list<struct<primary_key:string,acid:string,data_id:string,ac_type:string,from_facility:string,from_position:string,transition_time:i64,transition_time_start:i64,transtition_time_end:i64,to_facility:string,to_position:string,source:string,source_info:string,source_time:i64,confidence:double,confidence_description:string,uuid:string>>,source_list:list<string>,domain:string,domains:list<string>> flight_context}
partitioned:true
partitionColumns:struct partition_columns { i32 date_key, string partition_id, string custom_partition_1}
totalNumberFiles:1
totalFileSize:247075687
maxFileSize:247075687
minFileSize:247075687
lastAccessTime:1561122938361
lastUpdateTime:1561071155639
The generic form of the command (taking out my specific values and putting in argument specifiers) looks like this:
show table extended like <your table name here> partition(<your partition spec here>);
You can simply do this:
DESC FORMATTED tablename PARTITION (yr_no='y2019');
OR
DESC EXTENDED tablename PARTITION (yr_no='y2019');
You can get the location of the Hive partitions on HDFS by running any of the following Hive commands.
DESCRIBE FORMATTED tbl_name PARTITION(dt=20131023);
SHOW TABLE EXTENDED LIKE tbl_name PARTITION(dt=20131023);
Alternatively, you can also get it by running the HDFS list command:
hdfs dfs -ls <your Hive store location>/<tablename>
Link: Hive show or list all partitions
You can get this info via the Hive Metastore Thrift protocol, e.g. with the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hdfs.master.host', port=9083)  # metastore host/port assumed; 9083 is the default
>>> with client as c:
...     partition = c.get_partition_by_name(db_name='default',
...                                         tbl_name='test_table_with_partitions',
...                                         part_name='dt=20210504')
...
>>> partition.sd.location
'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'