Does anybody know how to get a table's HDFS directory with a select statement in the Hive environment - hive

I used to use a statement like 'select T** from xxx' to get a table's HDFS directory location in Hive, but now I have forgotten it. Does anyone know it? Thanks!

I think you need DESCRIBE formatted.
The desired location:
Location: file:/tmp/warehouse/part_table/d=abc
DEMO
hive> DESCRIBE formatted part_table partition (d='abc');
OK
# col_name data_type comment
i int
# Partition Information
# col_name data_type comment
d string
# Detailed Partition Information
Partition Value: [abc]
Database: default
Table: part_table
CreateTime: Wed Mar 30 16:57:14 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
####### HERE IS THE LOCATION YOU WANT ########
Location: file:/tmp/warehouse/part_table/d=abc
Partition Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 1
rawDataSize 1
totalSize 2
transient_lastDdlTime 1459382234
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.334 seconds, Fetched: 35 row(s)
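If you only need the location itself, you can also grep it out of the DESCRIBE FORMATTED output from a script. A minimal sketch, assuming the hive CLI is on the PATH and using the table and partition from the demo above:

import subprocess

# Run DESCRIBE FORMATTED non-interactively and keep only the "Location:" line.
out = subprocess.run(
    ["hive", "-e", "DESCRIBE FORMATTED part_table PARTITION (d='abc');"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if line.strip().startswith("Location:"):
        # e.g. "Location:   file:/tmp/warehouse/part_table/d=abc"
        print(line.split(None, 1)[1])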

Related

TimescaleDB: efficiently select last row

I have a postgres database with the timescaledb extension.
My primary index is a timestamp, and I would like to select the latest row.
If I happen to know the latest row happened after a certain time, then I can use a query such as:
query = 'select * from prices where time > %(dt)s'
Here I specify a datetime, and execute the query using psycopg2:
import datetime
import psycopg2

# 2018-01-10 11:15:00
dt = datetime.datetime(2018, 1, 10, 11, 15, 0)

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query, {'dt': dt})
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()
    print('took {} ms'.format((end - beg).total_seconds() * 1e3))
The timing output:
took 2.296 ms
If, however, I don't know the time to input into the above query, I can use a query such as:
query = 'select * from prices order by time desc limit 1'
I execute the query in a similar fashion
with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query)
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()
    print('took {} ms'.format((end - beg).total_seconds() * 1e3))
The timing output:
took 19.173 ms
So that's more than 8 times slower.
I'm no expert in SQL, but I would have thought the query planner would figure out that "limit 1" combined with "order by primary index" equates to an O(1) operation.
Question:
Is there a more efficient way to select the last row in my table?
In case it is useful, here is the description of my table:
# \d+ prices
Table "public.prices"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
time | timestamp without time zone | | not null | | plain | |
AAPL | double precision | | | | plain | |
GOOG | double precision | | | | plain | |
MSFT | double precision | | | | plain | |
Indexes:
"prices_time_idx" btree ("time" DESC)
Child tables: _timescaledb_internal._hyper_12_100_chunk,
_timescaledb_internal._hyper_12_101_chunk,
_timescaledb_internal._hyper_12_102_chunk,
...
An efficient way to get last / first record in TimescaleDB:
First record:
SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time ASC LIMIT 1 ;
Last record:
SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time DESC LIMIT 1 ;
The question has already been answered, but I believe this might be useful for people who end up here.
Using first() and last() in TimescaleDB takes much longer.
Your first query can exclude all but the last chunk, while your second query has to look in every chunk, since there is no information to help the planner exclude chunks. So it's not an O(1) operation but an O(n) operation, with n being the number of chunks for that hypertable.
You could give that information to the planner by writing your query in the following form:
select * from prices WHERE time > now() - interval '1day' order by time desc limit 1
You might have to choose a different interval depending on your chunk time interval.
Starting with TimescaleDB 1.2, this is an O(1) operation if an entry can be found in the most recent chunk, and the explicit time constraint in the WHERE clause is no longer needed as long as you order by time and have a LIMIT.
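For completeness, here is the bounded variant with the same psycopg2 pattern as in the question. This is only a sketch: it assumes the params connection dictionary and the prices table from the question, and the one-day interval is an illustrative guess that should match your chunk time interval.

import datetime
import psycopg2

# The WHERE clause gives the planner enough information to exclude all but the newest chunks.
query = """
    SELECT * FROM prices
    WHERE time > now() - interval '1 day'
    ORDER BY time DESC
    LIMIT 1
"""

with psycopg2.connect(**params) as conn:  # params as defined in the question
    cur = conn.cursor()
    beg = datetime.datetime.now()
    cur.execute(query)
    rows = cur.fetchall()
    end = datetime.datetime.now()
    print('took {} ms'.format((end - beg).total_seconds() * 1e3))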
I tried to solve this problem in multiple ways: using last(), and trying to create indexes to get the last items faster. In the end, I just ended up creating another table where I store the first and the last item inserted into the hypertable, keyed by the WHERE condition, which is a relationship in my case.
The database writer updates this table as well when it inserts entries into the hypertable.
I get the first and last item with a simple B-tree lookup - no need to go to the hypertable at all.
Here is my SQLAlchemy code:
import datetime

import sqlalchemy as sa
from sqlalchemy import orm, text
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

# Base (the declarative base) and the Pair model are defined elsewhere in the application.


class PairState(Base):
    """Cache the timespan endpoints for intervals we are generating with hypertable.

    Getting the first / last row (timestamp) from hypertable is very expensive:
    https://stackoverflow.com/questions/51575004/timescaledb-efficiently-select-last-row

    Here data is denormalised per trading pair, and being updated when data is written to the database.
    Save some resources by not using true NULL values.
    """

    __tablename__ = "pair_state"

    # This table has a 1-to-1 relationship with Pair
    pair_id = sa.Column(sa.ForeignKey("pair.id"), nullable=False, primary_key=True, unique=True)
    pair = orm.relationship(
        Pair,
        backref=orm.backref(
            "pair_state",
            lazy="dynamic",
            cascade="all, delete-orphan",
            single_parent=True,
        ),
    )

    # First raw event in data stream
    first_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # Last raw event in data stream
    last_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # The last hypertable entry added
    last_interval_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    @staticmethod
    def create_first_event_if_not_exist(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Sets the first event value if it does not exist yet."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, first_event_at=ts).
            on_conflict_do_nothing()
        )

    @staticmethod
    def update_last_event(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_event_at for a named pair."""
        # Based on the original example of https://stackoverflow.com/a/49917004/315168
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_event_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_event_at": ts})
        )

    @staticmethod
    def update_last_interval(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_interval_at for a named pair."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_interval_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_interval_at": ts})
        )
Create a table where you store the latest timestamp after every insert, and use this timestamp in the query. It's the most efficient way for me:
SELECT <COLUMN> FROM <TABLE_NAME>, <TABLE_WITH_TIMESTAMPS> WHERE <TABLE_NAME>.time = <TABLE_WITH_TIMESTAMPS>.time;
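A minimal sketch of that idea, assuming a hypothetical single-row side table named latest_ts kept up to date by a trigger on the prices table from the question (the table, function, and trigger names are illustrative, not from the answer above):

import psycopg2

# Hypothetical side table plus trigger that keeps the newest inserted timestamp.
ddl = """
CREATE TABLE IF NOT EXISTS latest_ts (
    only_row bool PRIMARY KEY DEFAULT true CHECK (only_row),
    time     timestamp NOT NULL
);

CREATE OR REPLACE FUNCTION track_latest_ts() RETURNS trigger AS $$
BEGIN
    INSERT INTO latest_ts (time) VALUES (NEW.time)
    ON CONFLICT (only_row) DO UPDATE
        SET time = GREATEST(latest_ts.time, EXCLUDED.time);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS prices_latest_ts ON prices;
CREATE TRIGGER prices_latest_ts
    AFTER INSERT ON prices
    FOR EACH ROW EXECUTE FUNCTION track_latest_ts();  -- EXECUTE PROCEDURE on PostgreSQL < 11
"""

# The latest row then becomes an equality lookup instead of an ORDER BY over the hypertable.
query = "SELECT p.* FROM prices p JOIN latest_ts l ON p.time = l.time;"

with psycopg2.connect(**params) as conn:  # params as in the question above
    cur = conn.cursor()
    cur.execute(ddl)
    conn.commit()
    cur.execute(query)
    print(cur.fetchall())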

How to list HDFS location of all the partitions in a hive table?

Using the command:
describe formatted my_table partition my_partition
we are able to list the metadata including hdfs location of the partition my_partition in my_table. But how can we get an output with 2 columns:
Partition | Location
which would list all the partitions in my_table and their hdfs locations?
Query the metastore.
Demo
Hive
create table mytable (i int) partitioned by (dt date,type varchar(10))
;
alter table mytable add
partition (dt=date '2017-06-10',type='A')
partition (dt=date '2017-06-11',type='A')
partition (dt=date '2017-06-12',type='A')
partition (dt=date '2017-06-10',type='B')
partition (dt=date '2017-06-11',type='B')
partition (dt=date '2017-06-12',type='B')
;
Metastore (MySQL)
select  p.part_name
       ,s.location
from            metastore.DBS        as d
        join    metastore.TBLS       as t  on t.db_id  = d.db_id
        join    metastore.PARTITIONS as p  on p.tbl_id = t.tbl_id
        join    metastore.SDS        as s  on s.sd_id  = p.sd_id
where   d.name     = 'default'
    and t.tbl_name = 'mytable'
;
+----------------------+----------------------------------------------------------------------------------+
| part_name | location |
+----------------------+----------------------------------------------------------------------------------+
| dt=2017-06-10/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=A |
| dt=2017-06-11/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=A |
| dt=2017-06-12/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=A |
| dt=2017-06-10/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=B |
| dt=2017-06-11/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=B |
| dt=2017-06-12/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=B |
+----------------------+----------------------------------------------------------------------------------+
If you do not need the information in a nice tabular format, and you do not have access to the HMS database, you may want to run explain extended:
explain extended select * from default.mytable;
and then you can grep out the essential information, the partition values and the location.
root@ubuntu:/home/sathya# hive -e "explain extended select * from default.mytable;" | grep location
OK
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
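If you prefer to script the same idea without touching the metastore at all, you can combine SHOW PARTITIONS with DESCRIBE FORMATTED per partition. A rough sketch, assuming the hive CLI is available and the default.mytable table from the demo (the partition-spec parsing is simplistic and quotes every value as a string):

import subprocess

def hive(sql):
    """Run a statement through the hive CLI and return its stdout."""
    return subprocess.run(["hive", "-e", sql],
                          capture_output=True, text=True, check=True).stdout

table = "default.mytable"

# Each SHOW PARTITIONS line looks like: dt=2017-06-10/type=A
for part in hive("SHOW PARTITIONS {};".format(table)).split():
    # Turn dt=2017-06-10/type=A into dt='2017-06-10',type='A'
    spec = ",".join("{}='{}'".format(*kv.split("=", 1)) for kv in part.split("/"))
    desc = hive("DESCRIBE FORMATTED {} PARTITION ({});".format(table, spec))
    location = next(line.split(None, 1)[1]
                    for line in desc.splitlines()
                    if line.strip().startswith("Location:"))
    print(part, location, sep="\t")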
The best solution from my point of view is to get this info from Hive Metastore via Thrift protocol.
If you write code in Python, you can use the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hive.metastore.location', port=9083)
>>> with client as c:
... all_partitions = c.get_partitions(db_name='default',
... tbl_name='test_table_with_partitions',
... max_parts=24 * 365 * 3)
...
>>> print([{'dt': part.values[0], 'location': part.sd.location} for part in all_partitions])
[{'dt': '20210504',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'},
{'dt': '20210505',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210505'}]
If you have Airflow installed together with the apache.hive extra, you can create the hmsclient from the Airflow connection data quite easily:
hive_hook = HiveMetastoreHook()
with hive_hook.metastore as hive_client:
... your code goes here ...
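For example, the body of that with block could mirror the hmsclient call above (a sketch only; the import path shown is for Airflow 2.x with the apache-hive provider, and the table is the one created in the Hive CLI demo):

from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook  # Airflow 1.x: airflow.hooks.hive_hooks

hive_hook = HiveMetastoreHook()
with hive_hook.metastore as hive_client:
    partitions = hive_client.get_partitions(db_name='default',
                                            tbl_name='test_table_with_partitions',
                                            max_parts=24 * 365 * 3)
    for part in partitions:
        print(part.values, part.sd.location)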

Presto can't fetch content in HIVE table

My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a single pseudo-distributed machine. The services are not running on localhost but on a host name with a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In presto cli I can get the table in hive successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
but when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in Hive, the result shows there is 1 row in the table.
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch from Hive tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J

Can we get Column Name from specific Table in Google BigQuery?

Can we get the column names of a specific table in Google BigQuery?
Please let me know the query for this.
I tried these but could not get a result:
SELECT column_name FROM publicdata:samples.shakespeare
OR
SELECT schema FROM publicdata:samples.shakespeare
1. You can use the command-line tool (https://developers.google.com/bigquery/bq-command-line-tool#gettable):
bq show <project>:<dataset>.<table>
$ bq show publicdata:samples.shakespeare
tableId Last modified Schema
------------- ----------------- ------------------------------------
shakespeare 01 Sep 13:46:28 |- word: string (required)
|- word_count: integer (required)
|- corpus: string (required)
|- corpus_date: integer (required)
2.BigQuery Browser Tool : https://developers.google.com/bigquery/bigquery-browser-tool#examineschema
3.Or use BigQuery API: https://developers.google.com/bigquery/docs/reference/v2/tables/get
I got the result using Java:
Tables tableRequest = bigquery.tables();
Table table = tableRequest.get(projectName,datasetName,tableName).execute();
List<TableFieldSchema> fields = table.getSchema().getFields();
Use INFORMATION_SCHEMA to get column names with SQL:
SELECT column_name, data_type
FROM `bigquery-public-data.samples.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'shakespeare'
It gives you:
+-------------+-----------+
| column_name | data_type |
+-------------+-----------+
| word | STRING |
| word_count | INT64 |
| corpus | STRING |
| corpus_date | INT64 |
+-------------+-----------+
A sample using python in Jupyter:
SERVICE_ACCOUNT = 'sa_bq.json'
!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client
from google.cloud import bigquery
client_bq = bigquery.Client.from_service_account_json(SERVICE_ACCOUNT)
table = client_bq.get_table('bigquery-public-data.samples.shakespeare')
print(list(c.name for c in table.schema))
Without any queries, on the Classic UI, you can proceed as follow:
click on the blue down arrow on the left panel
Switch to project, then Display project...
on Project ID, write the name of the project (in your case you have publicdata:samples.shakespeare, your project is publicdata)
now, this project appears on the left panel
select the dataset (in your case it is samples)
select the table (in your case it is shakespeare)
finally, in the middle of the screen you should see three tabs: Schema, Details, Preview.
If I understand you correctly, you would like to do a tables.list or tables.get and not a jobs.query.
This is how it works in google apps script:
var results = BigQuery.Tables.list(projectId, datasetId, optionalArgs);
Or by the API:
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables
https://developers.google.com/bigquery/docs/reference/v2/tables/list
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId
https://developers.google.com/bigquery/docs/reference/v2/tables/get
Otherwise, you can run a query like SELECT * FROM [<dataset>.<table>] LIMIT 0 and write some procedure that looks at the column names.
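If you would rather make the tables.list / tables.get calls from Python than over raw HTTP, the google-cloud-bigquery client wraps both; a small sketch along the lines of the Jupyter example above, using the public Shakespeare sample:

from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# tables.list equivalent: enumerate the tables in a dataset
for t in client.list_tables("bigquery-public-data.samples"):
    print(t.table_id)

# tables.get equivalent: fetch one table and read its schema
table = client.get_table("bigquery-public-data.samples.shakespeare")
for field in table.schema:
    print(field.name, field.field_type)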

Postgres DB Size Command

What is the command to find the size of all the databases?
I am able to find the size of a specific database by using following command:
select pg_database_size('databaseName');
You can enter the following psql meta-command to get some details about a specified database, including its size:
\l+ <database_name>
And to get sizes of all databases (that you can connect to):
\l+
You can get the names of all the databases that you can connect to from the "pg_database" system catalog. Just apply the function to the names, as below.
select t1.datname AS db_name,
pg_size_pretty(pg_database_size(t1.datname)) as db_size
from pg_database t1
order by pg_database_size(t1.datname) desc;
If you intend the output to be consumed by a machine instead of a human, you can cut the pg_size_pretty() function.
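For example, a monitoring script that wants raw byte counts could run the same query without pg_size_pretty(); a minimal psycopg2 sketch (the connection settings are placeholders):

import psycopg2

# Raw sizes in bytes are easier to sort or graph than pg_size_pretty() strings.
query = """
    SELECT datname, pg_database_size(datname) AS size_bytes
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;
"""

with psycopg2.connect(host="localhost", dbname="postgres",
                      user="postgres", password="secret") as conn:  # placeholder credentials
    cur = conn.cursor()
    cur.execute(query)
    for name, size_bytes in cur.fetchall():
        print(name, size_bytes)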
-- Database Size
SELECT pg_size_pretty(pg_database_size('Database Name'));
-- Table Size
SELECT pg_size_pretty(pg_relation_size('table_name'));
Based on the answer here by Hendy Irawan:
Show database sizes:
\l+
e.g.
=> \l+
berbatik_prd_commerce | berbatik_prd | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 19 MB | pg_default |
berbatik_stg_commerce | berbatik_stg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 8633 kB | pg_default |
bursasajadah_prd | bursasajadah_prd | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 1122 MB | pg_default |
Show table sizes:
\d+
e.g.
=> \d+
public | tuneeca_prd | table | tomcat | 8192 bytes |
public | tuneeca_stg | table | tomcat | 1464 kB |
Only works in psql.
Yes, there is a command to find the size of a database in Postgres. It's the following:
SELECT pg_database.datname as "database_name", pg_size_pretty(pg_database_size(pg_database.datname)) AS size_in_mb FROM pg_database ORDER by size_in_mb DESC;
SELECT pg_size_pretty(pg_database_size('name of database'));
This will give you the total size of a particular database; however, I don't think you can do all databases within a server.
However you could do this...
DO
$$
DECLARE
    r RECORD;
    db_size TEXT;
BEGIN
    FOR r IN
        SELECT datname FROM pg_database
        WHERE datistemplate = false
    LOOP
        db_size := (SELECT pg_size_pretty(pg_database_size(r.datname)));
        RAISE NOTICE 'Database:% , Size:%', r.datname, db_size;
    END LOOP;
END;
$$
From the PostgreSQL wiki.
NOTE: Databases to which the user cannot connect are sorted as if they were infinite size.
SELECT d.datname AS Name, pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
ELSE 'No Access'
END AS Size
FROM pg_catalog.pg_database d
ORDER BY
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
THEN pg_catalog.pg_database_size(d.datname)
ELSE NULL
END DESC -- nulls first
LIMIT 20
The page also has snippets for finding the size of your biggest relations and largest tables.
Start pgAdmin, connect to the server, click on the database name, and select the statistics tab. You will see the size of the database at the bottom of the list.
Then if you click on another database, it stays on the statistics tab so you can easily see many database sizes without much effort. If you open the table list, it shows all tables and their sizes.
You can use the query below to find the size of all PostgreSQL databases.
The reference is taken from this blog.
SELECT
datname AS DatabaseName
,pg_catalog.pg_get_userbyid(datdba) AS OwnerName
,CASE
WHEN pg_catalog.has_database_privilege(datname, 'CONNECT')
THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(datname))
ELSE 'No Access For You'
END AS DatabaseSize
FROM pg_catalog.pg_database
ORDER BY
CASE
WHEN pg_catalog.has_database_privilege(datname, 'CONNECT')
THEN pg_catalog.pg_database_size(datname)
ELSE NULL
END DESC;
You can also check the on-disk size of the PostgreSQL data directory from the shell:
du -k /var/lib/postgresql/ | sort -n | tail