My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a single pseudo-distributed machine. The services are not bound to localhost but to the hostname, which has a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In the Presto CLI I can list the tables in Hive successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
But when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in Hive, the result shows there is 1 row in the table:
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch data from Hive tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J
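One way to confirm whether the table is HBase-backed is to look at its DDL. A minimal sketch, assuming the hive CLI is on the PATH and Python 3.7+ (the table name is taken from the question above):
import subprocess
# Print the table DDL and flag whether it references the HBase storage handler.
# SHOW CREATE TABLE exists since Hive 0.10, so Hive 0.12 is fine.
ddl = subprocess.run(
    ["hive", "-e", "SHOW CREATE TABLE default.k_business_d_"],
    capture_output=True, text=True, check=True,
).stdout
print(ddl)
print("HBase-backed:", "HBaseStorageHandler" in ddl)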
I have a Hive external table that is partitioned by the date and time it was inserted, for example 20200331_0505, which is in YYYYMMDD_HHMM format.
Currently there is only one partition:
> hdfs dfs -ls /path/to/external/table
-rw-r----- 2020-03-31 05:06 /path/to/external/table/_SUCCESS
drwxr-x--- 2020-03-31 05:06 /path/to/external/table/loaddate=20200331_0505
And if I run a hive query to find the partitions:
select distinct loaddate from table;
+----------------+
| loaddate |
+----------------+
| 20200331_0505 |
+----------------+
That is expected and what I want to see, but if I run this:
select * from table where loaddate=(select max(loaddate) from table);
Then I get this error:
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 3, vertexId=vertex_1585179445264_14095_4_00, diagnostics=[Vertex vertex_1585179445264_14095_4_00 [Map 3] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: <Table> initializer failed, vertex=vertex_1585179445264_14095_4_00 [Map 3], java.lang.RuntimeException: ORC split generation failed with exception: java.io.FileNotFoundException: File hdfs://path/to/external/table/loaddate=20200327_0513 does not exist.
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
So it is trying to load a partition that does not exist (20200327_0513). What could be causing this?
When you delete partitions, either directly with the rm command or with something like a SaveMode.Overwrite write, Hive is not alerted to the change in partitions, so you need to tell it that the partitions have changed. There are several ways to do this; the way I chose to fix it was:
msck repair table <table> sync partitions
I'm running a Python script to load data from a DataFrame into a SQL Table. However, the insert command is throwing this error:
(pyodbc.Error) ('HY000', '[HY000] ERROR 3587: Insufficient resources to execute plan on pool fastlane [Request exceeds session memory cap: 28357027KB > 20971520KB]\n (3587) (SQLExecDirectW)')
This is my code:
df.to_sql('TableName',engine,schema='trw',if_exists='append',index=False) #copying data from Dataframe df to a SQL Table
Can you do the following for me:
Run this command and share the output. MAXMEMORYSIZE, MEMORYSIZE and MAXQUERYMEMORYSIZE, plus PLANNEDCONCURRENCY, give you an idea of the (memory) budget at the time the query / COPY command was planned.
gessnerm@gessnerm-HP-ZBook-15-G3:~/1/fam/fam-ostschweiz$ vsql -x -c \
"select * from resource_pools where name='fastlane'"
-[ RECORD 1 ]------------+------------------
pool_id | 45035996273841188
name | fastlane
is_internal | f
memorysize | 0%
maxmemorysize |
maxquerymemorysize |
executionparallelism | 16
priority | 0
runtimepriority | MEDIUM
runtimeprioritythreshold | 2
queuetimeout | 00:05
plannedconcurrency | 2
maxconcurrency |
runtimecap |
singleinitiator | f
cpuaffinityset |
cpuaffinitymode | ANY
cascadeto |
Then, you should dig the actual SQL command that your Python script triggered out of the QUERY_REQUESTS system table. It should be in the format of:
COPY <_the_target_table_>
FROM STDIN DELIMITER ',' ENCLOSED BY '"'
DIRECT REJECTED DATA '<_bad_file_name_>'
or similar.
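If it is easier to check from Python than from vsql, here is a quick sketch that pulls the most memory-hungry recent requests out of QUERY_REQUESTS, reusing the same SQLAlchemy engine as the to_sql() call (the column names are the standard QUERY_REQUESTS ones, but verify them against your Vertica version):
import pandas as pd
# Most memory-hungry recent requests, including the statement text pyodbc sent.
recent = pd.read_sql(
    "SELECT start_timestamp, memory_acquired_mb, request "
    "FROM query_requests "
    "ORDER BY memory_acquired_mb DESC "
    "LIMIT 20",
    engine,  # the same engine used for df.to_sql() above
)
print(recent)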
Then: how big is the file (or files) you're trying to load in one go? If it's too big, then B.Muthamizhselvi is right: you'll need to portion the data volume you load.
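For the portioning itself, pandas can do it for you: to_sql() accepts a chunksize, so each batch stays well below the pool's session memory cap. A minimal sketch of the same call (the 50000-row batch size is an assumption to tune):
# Same call as in the question, but loading in batches instead of one huge statement.
df.to_sql(
    'TableName',
    engine,
    schema='trw',
    if_exists='append',
    index=False,
    chunksize=50000,  # rows per batch; tune to your row width and pool budget
)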
Can you also run:
vsql -c "SELECT EXPORT_OBJECTS('','<schema>.<table>',FALSE)"
... and share the output? It could well be that you have too many projections for the memory to be enough, or that you are sorting by too many columns.
Hope this helps for starters ...
I'm submitting a spark-sql job to Dataproc that simply displays some data from a table:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10"
When the data is returned and written to stdout I don't see column headings, only the raw data, whitespace-delimited. I'd really like the output to be formatted better; specifically, I'd like to see column headings. I tried this:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "SET hive.cli.print.header=true;select * from mydb.mytable limit 10"
But it had no effect.
Is there a way to get spark-sql to display column headings on dataproc?
If there is a way to get the data displayed like so:
+----+-------+
| ID | Name |
+----+-------+
| 1 | Jim |
| 2 | Ann |
| 3 | Simon |
+----+-------+
then that would be even better.
I have been performing some tests with a Dataproc cluster, and it looks like it is not possible to retrieve query results with the column names using Spark SQL. However, this is more of an Apache Spark SQL issue than a Dataproc one, so I have added that tag to your question too, so that it receives better attention.
If you get into the Spark SQL console in your Dataproc cluster (by SSHing into the master and typing spark-sql), you will see that the result of a SELECT query does not include the column names:
SELECT * FROM mytable;
18/04/17 10:31:51 INFO org.apache.hadoop.mapred.FileInputFormat: Total input files to process : 3
2 Ann
1 Jim
3 Simon
There's no change if you use SELECT ID FROM mytable; instead. Therefore, the issue is not in the gcloud dataproc jobs submit spark-sql command, but in the fact that Spark SQL does not provide this type of output.
If you do not necessarily have to use Spark SQL, you can try using Hive instead. Hive does provide the type of output you want (the column names plus prettier formatting):
user@project:~$ gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> -e "SELECT * FROM mytable;"
Job [JOB_ID] submitted.
Waiting for job output...
+-------------+---------------+--+
| mytable.id | mytable.name |
+-------------+---------------+--+
| 2 | Ann |
| 1 | Jim |
| 3 | Simon |
+-------------+---------------+--+
Job [JOB_ID] finished successfully.
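If you would rather keep Spark doing the work, another option is to submit a small PySpark job instead of spark-sql; DataFrame.show() prints the column names in the same boxed layout. A minimal sketch (the file name, cluster and region flags are placeholders):
# headers_demo.py -- submit with something like:
#   gcloud dataproc jobs submit pyspark headers_demo.py --cluster mycluster --region europe-west1
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("show-with-headers").enableHiveSupport().getOrCreate()
# show() prints the header row plus the +----+-------+ grid
spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show(truncate=False)
spark.stop()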
Using the command:
describe formatted my_table partition my_partition
we are able to list the metadata, including the HDFS location, of the partition my_partition in my_table. But how can we get an output with two columns:
Partition | Location
which would list all the partitions in my_table and their hdfs locations?
Query the metastore.
Demo
Hive
create table mytable (i int) partitioned by (dt date,type varchar(10))
;
alter table mytable add
partition (dt=date '2017-06-10',type='A')
partition (dt=date '2017-06-11',type='A')
partition (dt=date '2017-06-12',type='A')
partition (dt=date '2017-06-10',type='B')
partition (dt=date '2017-06-11',type='B')
partition (dt=date '2017-06-12',type='B')
;
Metastore (MySQL)
select  p.part_name
       ,s.location
from    metastore.DBS        as d
join    metastore.TBLS       as t on t.db_id  = d.db_id
join    metastore.PARTITIONS as p on p.tbl_id = t.tbl_id
join    metastore.SDS        as s on s.sd_id  = p.sd_id
where   d.name     = 'default'
  and   t.tbl_name = 'mytable'
;
+----------------------+----------------------------------------------------------------------------------+
| part_name | location |
+----------------------+----------------------------------------------------------------------------------+
| dt=2017-06-10/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=A |
| dt=2017-06-11/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=A |
| dt=2017-06-12/type=A | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=A |
| dt=2017-06-10/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-10/type=B |
| dt=2017-06-11/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-11/type=B |
| dt=2017-06-12/type=B | hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytable/dt=2017-06-12/type=B |
+----------------------+----------------------------------------------------------------------------------+
If it is not necessary to get the information in a nice tabular format, and you do not have access to the HMS database, you may want to run explain extended:
explain extended select * from default.mytable;
and then grep out the essential information: the partition values and the locations.
root@ubuntu:/home/sathya# hive -e "explain extended select * from default.mytable;" | grep location
OK
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-10/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-11/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=A
location hdfs://localhost:9000/user/hive/warehouse/mytable
location hdfs://localhost:9000/user/hive/warehouse/mytable/dt=2017-06-12/type=B
location hdfs://localhost:9000/user/hive/warehouse/mytable
The best solution from my point of view is to get this info from the Hive metastore via the Thrift protocol.
If you write code in Python, you can use the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hive.metastore.location', port=9083)
>>> with client as c:
... all_partitions = c.get_partitions(db_name='default',
... tbl_name='test_table_with_partitions',
... max_parts=24 * 365 * 3)
...
>>> print([{'dt': part.values[0], 'location': part.sd.location} for part in all_partitions])
[{'dt': '20210504',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'},
{'dt': '20210505',
'location': 'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210505'}]
If you have Airflow installed together with the apache.hive extra, you can create the hmsclient from Airflow connection data quite easily (HiveMetastoreHook lives in airflow.hooks.hive_hooks on Airflow 1.x and in airflow.providers.apache.hive.hooks.hive on Airflow 2.x):
hive_hook = HiveMetastoreHook()  # metastore host/port come from the Airflow connection
with hive_hook.metastore as hive_client:
    # same calls as in the hmsclient example above
    parts = hive_client.get_partitions(db_name='default', tbl_name='test_table_with_partitions', max_parts=24 * 365 * 3)
    print([(p.values[0], p.sd.location) for p in parts])
So we have a production database that is 32GB on a machine with 16GB of RAM. Thanks to caching this is usually not a problem at all. But whenever I start a pg_dump of the database, queries from the app servers start queueing up, and after a few minutes the queue runs away and our app grinds to a halt.
I'll be the first to acknowledge that we have query performance issues, and we're addressing those. Meanwhile, I want to be able to run pg_dump nightly, in a way that sips from the database and doesn't take our app down. I don't care if it takes hours. Our app doesn't run any DDL, so I'm not worried about lock contention.
Attempting to fix the problem, I'm running pg_dump with both nice and ionice. Unfortunately, this doesn't address the issue.
nice ionice -c2 -n7 pg_dump -Fc production_db -f production_db.sql
Even with ionice I still see the issue above. It appears that i/o wait and lots of seeks are causing the problem.
vmstat 1
shows that iowait hovers around 20-25% and sometimes spikes to 40%. Real CPU usage fluctuates between 2% and 5%, with occasional spikes to 70%.
I don't believe locks are a possible culprit. When I run this query:
select pg_class.relname,pg_locks.* from pg_class,pg_locks where pg_class.relfilenode=pg_locks.relation;
I only see locks which are marked granted = 't'. We don't typically run any DDL in production -- so locks don't seem to be the issue.
Here is output from a ps with the WCHAN column enabled:
PID WIDE S TTY TIME COMMAND
3901 sync_page D ? 00:00:50 postgres: [local] COPY
3916 - S ? 00:00:01 postgres: SELECT
3918 sync_page D ? 00:00:07 postgres: INSERT
3919 semtimedop S ? 00:00:04 postgres: SELECT
3922 - S ? 00:00:01 postgres: SELECT
3923 - S ? 00:00:01 postgres: SELECT
3924 - S ? 00:00:00 postgres: SELECT
3927 - S ? 00:00:06 postgres: SELECT
3928 - S ? 00:00:06 postgres: SELECT
3929 - S ? 00:00:00 postgres: SELECT
3930 - S ? 00:00:00 postgres: SELECT
3931 - S ? 00:00:00 postgres: SELECT
3933 - S ? 00:00:00 postgres: SELECT
3934 - S ? 00:00:02 postgres: SELECT
3935 semtimedop S ? 00:00:13 postgres: UPDATE waiting
3936 - R ? 00:00:12 postgres: SELECT
3937 - S ? 00:00:01 postgres: SELECT
3938 sync_page D ? 00:00:07 postgres: SELECT
3940 - S ? 00:00:07 postgres: SELECT
3943 semtimedop S ? 00:00:04 postgres: UPDATE waiting
3944 - S ? 00:00:05 postgres: SELECT
3948 sync_page D ? 00:00:05 postgres: SELECT
3950 sync_page D ? 00:00:03 postgres: SELECT
3952 sync_page D ? 00:00:15 postgres: SELECT
3964 log_wait_commit D ? 00:00:04 postgres: COMMIT
3965 - S ? 00:00:03 postgres: SELECT
3966 - S ? 00:00:02 postgres: SELECT
3967 sync_page D ? 00:00:01 postgres: SELECT
3970 - S ? 00:00:00 postgres: SELECT
3971 - S ? 00:00:01 postgres: SELECT
3974 sync_page D ? 00:00:00 postgres: SELECT
3975 - S ? 00:00:00 postgres: UPDATE
3977 - S ? 00:00:00 postgres: INSERT
3978 semtimedop S ? 00:00:00 postgres: UPDATE waiting
3981 semtimedop S ? 00:00:01 postgres: SELECT
3982 - S ? 00:00:00 postgres: SELECT
3983 semtimedop S ? 00:00:02 postgres: UPDATE waiting
3984 - S ? 00:00:04 postgres: SELECT
3986 sync_buffer D ? 00:00:00 postgres: SELECT
3988 - R ? 00:00:01 postgres: SELECT
3989 - S ? 00:00:00 postgres: SELECT
3990 - R ? 00:00:00 postgres: SELECT
3992 - R ? 00:00:01 postgres: SELECT
3993 sync_page D ? 00:00:01 postgres: SELECT
3994 sync_page D ? 00:00:00 postgres: SELECT
The easiest:
You can throttle pg_dump using pv.
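A minimal sketch of that pipeline, driven from Python so it can sit in the nightly job (pv's -L flag caps the throughput; the 5 MB/s limit and the output file name are placeholders to adapt):
import subprocess
# Equivalent to: pg_dump -Fc production_db | pv -q -L 5m > production_db.dump
with open("production_db.dump", "wb") as out:
    dump = subprocess.Popen(["pg_dump", "-Fc", "production_db"], stdout=subprocess.PIPE)
    throttle = subprocess.Popen(["pv", "-q", "-L", "5m"], stdin=dump.stdout, stdout=out)
    dump.stdout.close()  # so pg_dump sees SIGPIPE if pv exits early
    throttle.wait()
    dump.wait()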
The harder:
Change your backup procedure. Use for example:
psql -c "SELECT pg_start_backup('nightly_backup')"
rsync --checksum --archive /var/lib/pgsql /backups/pgsql
psql -c "SELECT pg_stop_backup()"
But take care: you also need continuous archiving set up for this to work, and all WAL files created during the backup must be stashed along with the data-file backup.
Even harder:
You can set up a replicated database (using, for example, log shipping) on an additional cheap disk and back up the replica instead of the production database. Even if it falls behind by some transactions, it will eventually catch up. But check that the replica is reasonably up to date before starting the backup.
Your ps output has multiple UPDATE statements in the "waiting" state, which still says locks to me (your lock test query aside). I'm pretty sure you wouldn't see "waiting" in the ps output otherwise. Can you check whether this query shows anything during the issue:
SELECT * FROM pg_stat_activity WHERE waiting;
(You didn't say what version of PostgreSQL you are running so I'm not sure if this will work.)
If there's anything in there (that is, with waiting = TRUE), then it's a lock/transaction problem.