Why do '\\_' and '_' have the same effect in Hive's split function? - hive

I just found out that in a Hive script, splitting a string with delimiter '_' and with '\_' has the same effect. However, underscore is not a special character in Hive. Any idea why?
0: jdbc:hive2://hadoopzk10-phx2.prod.uber.int> select split('2_1122', '_')[0];
INFO : Compiling command(queryId=hive_20220328223849_deeabdc9-ead3-43af-ab13-17de9c3d9cf5): select split('2_1122', '_')[0]
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20220328223849_deeabdc9-ead3-43af-ab13-17de9c3d9cf5); Time taken: 0.106 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20220328223849_deeabdc9-ead3-43af-ab13-17de9c3d9cf5): select split('2_1122', '_')[0]
INFO : Completed executing command(queryId=hive_20220328223849_deeabdc9-ead3-43af-ab13-17de9c3d9cf5); Time taken: 0.036 seconds
INFO : OK
+------+--+
| _c0 |
+------+--+
| 2 |
+------+--+
1 row selected (0.152 seconds)
0: jdbc:hive2://hadoopzk10-phx2.prod.uber.int> select split('2_1122', '\\_')[0];
INFO : Compiling command(queryId=hive_20220328223906_1cf0c657-bc91-44e2-8cea-4f989ae53d9f): select split('2_1122', '\\_')[0]
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20220328223906_1cf0c657-bc91-44e2-8cea-4f989ae53d9f); Time taken: 0.101 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20220328223906_1cf0c657-bc91-44e2-8cea-4f989ae53d9f): select split('2_1122', '\\_')[0]
INFO : Completed executing command(queryId=hive_20220328223906_1cf0c657-bc91-44e2-8cea-4f989ae53d9f); Time taken: 0.033 seconds
INFO : OK
+------+--+
| _c0 |
+------+--+
| 2 |
+------+--+
1 row selected (0.146 seconds)
I ask this question just out of curiosity. Feel free to share your thoughts.
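For what it's worth, Hive's split() treats the delimiter as a Java regular expression, and in Java regexes (as in Python's re module) a backslash before a non-alphanumeric character is a legal escape that simply matches that character literally. A minimal Python sketch of the same behavior, assuming the regex-escape rules carry over as described:

```python
import re

# '_' is not a regex metacharacter, so it matches itself.
print(re.split('_', '2_1122'))    # ['2', '1122']

# '\\_' in a Hive string literal reaches the regex engine as '\_'.
# Escaping a non-alphanumeric character matches it literally,
# so the two patterns are equivalent.
print(re.split(r'\_', '2_1122'))  # ['2', '1122']
```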


Setting up an alert for Long Running Pipelines in ADF v2 using Kusto Query

I have a pipeline in ADF V2 which generally takes 3 hours to run, but sometimes it takes more than 3 hours. I want to set up an alert using Azure Log Analytics (Kusto query) if the pipeline runs for more than 3 hours. I have written a query, but it only shows a result once the pipeline has succeeded or failed. I want an alert while the pipeline is still in progress and has been running for more than 3 hours.
My query is
ADFPipelineRun
| where PipelineName == "XYZ"
| where (End - Start) > 3h
| project information = 'Expected Time : 3 Hours, Pipeline took more than 3 hours' ,PipelineName,(End - Start)
Could you please help me to solve this issue?
Thanks in Advance.
Lalit
Updated:
Please change your query like below:
ADFPipelineRun
| where PipelineName == "pipeline11"
| top 1 by TimeGenerated
| where Status in ("Queued","InProgress")
| where (now() - Start) > 3h //please change the time to 3h in your case
| project information = 'Expected Time : 3 Hours, Pipeline took more than 3 hours' ,PipelineName,(now() - Start)
Explanation:
The pipeline has statuses like Succeeded, Failed, Queued and InProgress. If the pipeline is currently running and not completed, its status must be one of the two: Queued, InProgress.
So we just need to get the latest record by using top 1 by TimeGenerated, then check whether its status is Queued or InProgress (in the query, that is where Status in ("Queued","InProgress")).
Finally, we check whether it has been running for more than 3 hours by using where (now() - Start) > 3h.
I tested it myself and it works. Please let me know if you still have any issues.

Insufficient Resources error while inserting into SQL table using Vertica

I'm running a Python script to load data from a DataFrame into a SQL Table. However, the insert command is throwing this error:
(pyodbc.Error) ('HY000', '[HY000] ERROR 3587: Insufficient resources to execute plan on pool fastlane [Request exceeds session memory cap: 28357027KB > 20971520KB]\n (3587) (SQLExecDirectW)')
This is my code:
df.to_sql('TableName',engine,schema='trw',if_exists='append',index=False) #copying data from Dataframe df to a SQL Table
Can you do the following for me:
Run this command and share the output. MAXMEMORYSIZE, MEMORYSIZE and MAXQUERYMEMORYSIZE, plus PLANNEDCONCURRENCY, give you an idea of the (memory) budget at the time the query / copy command was planned.
gessnerm#gessnerm-HP-ZBook-15-G3:~/1/fam/fam-ostschweiz$ vsql -x -c \
"select * from resource_pools where name='fastlane'"
-[ RECORD 1 ]------------+------------------
pool_id | 45035996273841188
name | fastlane
is_internal | f
memorysize | 0%
maxmemorysize |
maxquerymemorysize |
executionparallelism | 16
priority | 0
runtimepriority | MEDIUM
runtimeprioritythreshold | 2
queuetimeout | 00:05
plannedconcurrency | 2
maxconcurrency |
runtimecap |
singleinitiator | f
cpuaffinityset |
cpuaffinitymode | ANY
cascadeto |
Then, you should dig out of the QUERY_REQUESTS system table the actual SQL command that your Python script triggered. It should be in the format of:
COPY <_the_target_table_>
FROM STDIN DELIMITER ',' ENCLOSED BY '"'
DIRECT REJECTED DATA '<_bad_file_name_>'
or similar.
Then: how big is the file / are the files you're trying to load in one go? If too big, then B.Muthamizhselvi is right: you'll need to load the data in smaller portions.
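One hedged way to portion the load from the Python side is the chunksize argument of pandas' to_sql, which splits the insert into batches instead of one bulk statement. A minimal sketch, with an in-memory SQLite engine standing in for the real Vertica connection and all names illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for the real Vertica connection from the question.
engine = create_engine('sqlite://')

df = pd.DataFrame({'a': range(10000)})

# chunksize splits the insert into batches of 1,000 rows each,
# so no single statement has to buffer the whole DataFrame.
df.to_sql('TableName', engine, if_exists='append', index=False, chunksize=1000)
```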
Can you also run:
vsql -c "SELECT EXPORT_OBJECTS('','<schema>.<table>',FALSE)"
... and share the output? It could well be that you have too many projections for the memory to be enough, or that you are sorting by too many columns.
Hope this helps for starters ...

Presto can't fetch content in HIVE table

My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a pseudo-distributed machine. The services are not running on localhost but on a host name with a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In presto cli I can get the table in hive successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
but when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in Hive, the result shows there is 1 row in the table.
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch from Hive tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J

Issue when issuing a SELECT query to a Hive table in Prestodb

I am able to connect to my Hive metastore, and a DESCRIBE works:
DESCRIBE sample_07;
Query 20131113_025614_00005_af2fx, RUNNING, 1 node, 2 splits
Column | Type | Null | Partition Key
-------------+---------+------+---------------
code | varchar | true | false
description | varchar | true | false
total_emp | bigint | true | false
salary | bigint | true | false
(4 rows)
However, a SELECT does not work:
select * from sample_07;
2013-11-12T16:54:58.611-0800 DEBUG query-scheduler-7 com.facebook.presto.execution.QueryStateMachine Query 20131113_005458_00004_af2fx is PLANNING
Query 20131113_005458_00004_af2fx failed: java.io.IOException: Failed on local exception: com.facebook.presto.hadoop.shaded.com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: "sandbox.hortonworks.com/xx.xx.2.15"; destination host is: "sandbox.hortonworks.com":8020;
presto:default> 2013-11-12T16:56:04.771-0800 ERROR Stage-20131113_005458_00004_af2fx.1-219 com.facebook.presto.execution.SqlStageExecution Error while starting stage 20131113_005458_00004_af2fx.1 ~[guava-15.0.jar:na]
at com.facebook.presto.hive.HiveSplitIterable$HiveSplitQueue.computeNext(HiveSplitIterable.java:433) ~[na:na]
at com.facebook.presto.hive.HiveSplitIterable$HiveSplitQueue.computeNext(HiveSplitIterable.java:392) ~[na:na]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-15.0.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-15.0.jar:na]
As you can tell, I am using Hortonworks' sandbox, so maybe that's the issue? Or is it choking on the IP address? I'm not completely sure I understand the problem.
cheers,
Matt
Your error message suggests that you are not running Presto against CDH4 but against Hortonworks Sandbox which I believe is Hadoop 2.2.0. There are known incompatibilities at this point. See this thread on the Presto Google Group for more information: https://groups.google.com/forum/#!topic/presto-users/lVLvMGP1sKE

Throttling i/o in postgres's pg_dump?

So we have a production database that is 32GB on a machine with 16GB of RAM. Thanks to caching this is usually not a problem at all. But whenever I start a pg_dump of the database, queries from the app servers start queueing up, and after a few minutes the queue runs away and our app grinds to a halt.
I'll be the first to acknowledge that we have query performance issues, and we're addressing those. Meanwhile, I want to be able to run pg_dump nightly, in a way that sips from the database and doesn't take our app down. I don't care if it takes hours. Our app doesn't run any DDL, so I'm not worried about lock contention.
Attempting to fix the problem, I'm running pg_dump with both nice and ionice. Unfortunately, this doesn't address the issue.
nice ionice -c2 -n7 pg_dump -Fc production_db -f production_db.sql
Even with ionice I still see the issue above. It appears that i/o wait and lots of seeks are causing the problem.
vmstat 1
Shows me that iowait hovers around 20-25% and spikes to 40% sometimes. Real CPU % fluctuates between 2-5% and spikes to 70% sometimes.
I don't believe locks are a possible culprit. When I run this query:
select pg_class.relname,pg_locks.* from pg_class,pg_locks where pg_class.relfilenode=pg_locks.relation;
I only see locks which are marked granted = 't'. We don't typically run any DDL in production -- so locks don't seem to be the issue.
Here is output from a ps with the WCHAN column enabled:
PID WIDE S TTY TIME COMMAND
3901 sync_page D ? 00:00:50 postgres: [local] COPY
3916 - S ? 00:00:01 postgres: SELECT
3918 sync_page D ? 00:00:07 postgres: INSERT
3919 semtimedop S ? 00:00:04 postgres: SELECT
3922 - S ? 00:00:01 postgres: SELECT
3923 - S ? 00:00:01 postgres: SELECT
3924 - S ? 00:00:00 postgres: SELECT
3927 - S ? 00:00:06 postgres: SELECT
3928 - S ? 00:00:06 postgres: SELECT
3929 - S ? 00:00:00 postgres: SELECT
3930 - S ? 00:00:00 postgres: SELECT
3931 - S ? 00:00:00 postgres: SELECT
3933 - S ? 00:00:00 postgres: SELECT
3934 - S ? 00:00:02 postgres: SELECT
3935 semtimedop S ? 00:00:13 postgres: UPDATE waiting
3936 - R ? 00:00:12 postgres: SELECT
3937 - S ? 00:00:01 postgres: SELECT
3938 sync_page D ? 00:00:07 postgres: SELECT
3940 - S ? 00:00:07 postgres: SELECT
3943 semtimedop S ? 00:00:04 postgres: UPDATE waiting
3944 - S ? 00:00:05 postgres: SELECT
3948 sync_page D ? 00:00:05 postgres: SELECT
3950 sync_page D ? 00:00:03 postgres: SELECT
3952 sync_page D ? 00:00:15 postgres: SELECT
3964 log_wait_commit D ? 00:00:04 postgres: COMMIT
3965 - S ? 00:00:03 postgres: SELECT
3966 - S ? 00:00:02 postgres: SELECT
3967 sync_page D ? 00:00:01 postgres: SELECT
3970 - S ? 00:00:00 postgres: SELECT
3971 - S ? 00:00:01 postgres: SELECT
3974 sync_page D ? 00:00:00 postgres: SELECT
3975 - S ? 00:00:00 postgres: UPDATE
3977 - S ? 00:00:00 postgres: INSERT
3978 semtimedop S ? 00:00:00 postgres: UPDATE waiting
3981 semtimedop S ? 00:00:01 postgres: SELECT
3982 - S ? 00:00:00 postgres: SELECT
3983 semtimedop S ? 00:00:02 postgres: UPDATE waiting
3984 - S ? 00:00:04 postgres: SELECT
3986 sync_buffer D ? 00:00:00 postgres: SELECT
3988 - R ? 00:00:01 postgres: SELECT
3989 - S ? 00:00:00 postgres: SELECT
3990 - R ? 00:00:00 postgres: SELECT
3992 - R ? 00:00:01 postgres: SELECT
3993 sync_page D ? 00:00:01 postgres: SELECT
3994 sync_page D ? 00:00:00 postgres: SELECT
The easiest:
You can throttle pg_dump using pv.
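A minimal sketch of what that pipeline could look like, using pv's -L rate limit (the 1 MB/s limit and file names are illustrative, not a recommendation):

```shell
# Rate-limit the dump to ~1 MB/s so it only sips from the disk.
# -q suppresses pv's progress meter; tune -L to your hardware.
pg_dump -Fc production_db | pv -q -L 1m > production_db.dump
```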
The harder:
Change your backup procedure. Use for example:
psql -c "SELECT pg_start_backup('nightly_backup')"
rsync --checksum --archive /var/lib/pgsql /backups/pgsql
psql -c "SELECT pg_stop_backup()"
But take care: you also need continuous archiving set up for this to work, and all WAL files created during the backup must be stashed along with the data-file backup.
Even harder:
You can set up a replicated database (using, for example, log shipping) on an additional cheap disk and back up the replica instead of the production database. Even if it falls behind by some transactions, it will eventually catch up. But check that the replica is reasonably up to date before starting the backup.
Your ps output has multiple UPDATE statements in a "waiting" state, which still suggests locks to me (your lock test query aside). I'm pretty sure you wouldn't see "waiting" in the ps output otherwise. Can you check whether this query shows anything during the issue:
SELECT * FROM pg_stat_activity WHERE waiting;
(You didn't say what version of PostgreSQL you are running so I'm not sure if this will work.)
If there's anything in there (that is, rows with waiting = TRUE), then it's a lock/transaction problem.