Write the output of a script to a file in Hive

I have a script with a set of 5 queries. I would like to execute the script and write the output to a file. What command should I give from the Hive CLI?
Thanks

Sample queries file (3 queries):
ramisetty#aspire:~/my_tmp$ cat queries.q
show databases; --query1
use my_db; --query2
INSERT OVERWRITE LOCAL DIRECTORY './outputLocalDir' --query3
select * from students where branch = "ECE"; --query3
Run Hive:
ramisetty#aspire:~/my_tmp$ hive
hive (default)> source ./queries.q;
--output of Q1 on console-----
Time taken: 7.689 seconds
--output of Q2 on console -----
Time taken: 1.689 seconds
____________________________________________________________
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201401251835_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201401251835_0004
Kill Command = /home/ramisetty/VJDATA/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201401251835_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-25 19:06:56,689 Stage-1 map = 0%, reduce = 0%
2014-01-25 19:07:05,868 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:14,047 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:15,059 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
MapReduce Total cumulative CPU time: 2 seconds 70 msec
Ended Job = job_201401251835_0004
Copying data to local directory outputLocalDir
Copying data to local directory outputLocalDir
2 Rows loaded to outputLocalDir
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.07 sec HDFS Read: 525 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
firstname secondname dob score branch
Time taken: 32.44 seconds
Output file:
cat ./outputLocalDir/000000_0
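If the goal is simply to capture everything the script prints into a local file, a shell-level redirect also works; a minimal sketch, reusing the queries.q file above (results.txt is an assumed output name):

hive -S -f ./queries.q > ./results.txt

Here -f executes the query file and -S (silent mode) suppresses the progress logging, so the file receives only the query results.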

Related

Unable to load large pandas dataframe to pyspark

I've been trying to join two large pandas dataframes with PySpark using the code below, varying the executor cores allocated to the application to measure PySpark's scalability (strong scaling).
import gc
import math
import time

import pandas as pd
from numpy.random import default_rng
from pyspark.sql import SparkSession

r = 1000000000  # 1Bn rows
it = 10
w = 256
unique = 0.9
TOTAL_MEM = 240
TOTAL_NODES = 14

max_val = int(r * unique)  # integer upper bound for rng.integers
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2))
frame_data1 = rng.integers(0, max_val, size=(r, 2))
print("data generated", flush=True)

df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print("data loaded", flush=True)

procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM * 0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)

spark = SparkSession \
    .builder \
    .appName(f'join {r} {w}') \
    .master('spark://node:7077') \
    .config('spark.executor.memory', f'{int(mem*0.6)}g') \
    .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g') \
    .config('spark.cores.max', w) \
    .config('spark.driver.memory', '100g') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .getOrCreate()

sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print("data loaded to spark", flush=True)

try:
    for i in range(it):
        t1 = time.time()
        out = sdf0.join(sdf1, on='col0', how='inner')
        count = out.count()
        t2 = time.time()
        print(f"timings {r} {w} {i} {(t2 - t1) * 1000:.0f} ms, {count}", flush=True)
        del out
        del count
        gc.collect()
finally:
    spark.stop()
Cluster:
I am using a standalone Spark cluster of 15 nodes, each with 48 cores and 240 GB RAM. The master and the driver code run on node1, while the other 14 nodes run workers with maximum memory allocated.
In the Spark session, I reserve 90% of the total memory for the executors, splitting it 60% to the JVM and 40% to PySpark.
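For scale (a worked estimate, assuming the 64-bit integers that rng.integers produces by default): each frame is r rows × 2 columns × 8 bytes ≈ 16 GB, so the driver holds roughly 32 GB of pandas data before createDataFrame even begins serializing it out to the executors.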
Issue:
When I run the above program, I can see the executors being assigned to the app, but it doesn't move forward, even after 60 minutes. For a smaller row count (10M) this worked without a problem.
Driver output
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/cylonflow/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Negative initial size: -589934400
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
Any help on this is much appreciated.

TEZ mapper resource request

We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches very different numbers of mappers; see the Map 3 phase below. On the first run it requested 305 mappers, and on another run it requested 4534. (Please ignore the KILLED status; I killed the query manually.) Why does this happen, and how can we change it to be based on the underlying data size instead?
Run 1
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 container KILLED 5 0 0 5 0 0
Map 3 container KILLED 305 0 0 305 0 0
Map 5 container KILLED 16 0 0 16 0 0
Map 6 container KILLED 1 0 0 1 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 00/06 [>>--------------------------] 0% ELAPSED TIME: 14.16 s
----------------------------------------------------------------------------------------------
Run 2
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 5 5 0 0 0 0
Map 3 container KILLED 4534 0 0 4534 0 0
Map 5 .......... container SUCCEEDED 325 325 0 0 0 0
Map 6 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 container KILLED 333 0 0 333 0 0
Reducer 4 container KILLED 796 0 0 796 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/06 [=>>-------------------------] 5% ELAPSED TIME: 527.16 s
----------------------------------------------------------------------------------------------
This article explains the process by which Tez allocates resources: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

If Tez grouping is enabled for the splits, then a generic grouping logic is run on these splits to group them into larger splits. The idea is to strike a balance between how parallel the processing is and how much work is being done in each parallel process.
First, Tez tries to find out the resource availability in the cluster for these tasks. For that, YARN provides a headroom value (and in the future other attributes may be used). Let's say this value is T.
Next, Tez divides T by the resource per task (say M) to find out how many tasks can run in parallel at once (i.e., in a single wave): W = T/M.
Next, W is multiplied by a wave factor (from configuration: tez.grouping.split-waves) to determine the number of tasks to be used. Let's say this value is N.
If there are a total of X splits (input shards) and N tasks, then this would group X/N splits per task. Tez then estimates the size of data per task based on the number of splits per task.
If this value is between tez.grouping.max-size and tez.grouping.min-size, then N is accepted as the number of tasks. If not, N is adjusted to bring the data per task in line with the max/min, depending on which threshold was crossed.
For experimental purposes, tez.grouping.split-count can be set in configuration to specify the desired number of groups. If this config is specified, the above logic is ignored and Tez tries to group splits into the specified number of groups. This is best effort.
After this the grouping algorithm is executed. It groups splits by node locality, then rack locality, while respecting the group size limits.
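In short, the mapper count is driven by the headroom T available at submission time, not by the input size, which is why the same query can request 305 mappers on a busy cluster and 4534 on an idle one. For illustration with assumed numbers: if YARN reports T = 400 GB of headroom and each task asks for M = 4 GB, then W = 100 tasks fit in one wave, and a wave factor of 1.7 gives N = 170 tasks. To tie grouping to data size instead, narrow the allowed per-task data range so the min/max adjustment step dominates; a minimal sketch, where the byte values are illustrative assumptions rather than recommendations:

set tez.grouping.min-size=134217728;   -- ~128 MB of input per task (assumed value)
set tez.grouping.max-size=1073741824;  -- ~1 GB of input per task (assumed value)

With min and max close together, the adjustment step above pushes N toward total input size divided by the target split size, largely independent of cluster headroom.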

Make Hive return the value only

I want Hive to return only the value, without the extra information about query processing!
hive> select max(temp) from temp where dtime like '2014-07%' ;
Query ID = hduser_20170608003255_d35b8a43-8cc5-4662-89ce-9ee5f87d3ba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1496864651740_0008, Tracking URL = http://localhost:8088/proxy/application_1496864651740_0008/
Kill Command = /home/hduser/hadoop/bin/hadoop job -kill job_1496864651740_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-06-08 00:33:01,955 Stage-1 map = 0%, reduce = 0%
2017-06-08 00:33:08,187 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.13 sec
2017-06-08 00:33:14,414 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.91 sec
MapReduce Total cumulative CPU time: 5 seconds 910 msec
Ended Job = job_1496864651740_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.91 sec HDFS Read: 853158 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 910 msec
OK
44.4
Time taken: 20.01 seconds, Fetched: 1 row(s)
I want it to return only the value, which is 44.4.
Thanks in advance...
You can put the result into a variable in a shell script. The max_temp variable will then contain only the result:
max_temp=$(hive -e " set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';")
echo "$max_temp"
You can also use -S (silent mode):
hive -S -e "select max(temp) from temp where dtime like '2014-07%';"
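The two approaches combine naturally: -S keeps the progress logging out of the way while the command substitution captures just the value (a sketch reusing the same query):

max_temp=$(hive -S -e "set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';")
echo "$max_temp"   # prints 44.4 only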

Is there a way to get database name in a hive UDF

I am writing a Hive UDF.
I have to get the name of the database (the one the function is deployed in). Then I need to access a few files from HDFS depending on the database environment. Can you please tell me which function can help with running an HQL query from a Hive UDF?
Write the UDF class and prepare a jar file:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyHiveUdf extends UDF {
    // Prefixes the given text with the database name passed in by the query.
    public Text evaluate(String text, String dbName) {
        if (text == null) {
            return null;
        } else {
            return new Text(dbName + "." + text);
        }
    }
}
Use this UDF inside a Hive query as shown below, passing current_database() as the second argument:
hive> use mydb;
OK
Time taken: 0.454 seconds
hive> ADD jar /root/MyUdf.jar;
Added [/root/MyUdf.jar] to class path
Added resources: [/root/MyUdf.jar]
hive> create temporary function myUdfFunction as 'com.hiveudf.strmnp.MyHiveUdf';
OK
Time taken: 0.018 seconds
hive> select myUdfFunction(username,current_database()) from users;
Query ID = root_20170407151010_2ae29523-cd9f-4585-b334-e0b61db2c57b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0004, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-07 15:11:11,376 Stage-1 map = 0%, reduce = 0%
2017-04-07 15:11:19,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.12 sec
MapReduce Total cumulative CPU time: 3 seconds 120 msec
Ended Job = job_1491484583384_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.12 sec HDFS Read: 21659 HDFS Write: 381120 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 120 msec
OK
mydb.user1
mydb.user2
mydb.user3
Time taken: 2.137 seconds, Fetched: 3 row(s)
hive>

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

While trying to overwrite a Hive managed table into an external table, I am getting the error below.
Query ID = hdfs_20150701054444_576849f9-6b25-4627-b79d-f5defc13c519
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435327542589_0049, Tracking URL = http://ip-XXX.XX.XX.XX.ec2.internal:8088/proxy/application_1435327542589_0049/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435327542589_0049
Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
2015-07-01 05:44:11,676 Stage-0 map = 0%, reduce = 0%
Ended Job = job_1435327542589_0049 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec