I am writing a Hive UDF.
I need to get the name of the database the function is deployed in, and then access a few files from HDFS depending on that database environment. Can you please tell me which function can help with running an HQL query from a Hive UDF?
Write the UDF class and prepare a jar file:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyHiveUdf extends UDF {
    // Prefixes the input text with the database name passed in from the query.
    public Text evaluate(String text, String dbName) {
        if (text == null) {
            return null;
        } else {
            return new Text(dbName + "." + text);
        }
    }
}
Use this UDF inside a Hive query as shown below:
hive> use mydb;
OK
Time taken: 0.454 seconds
hive> ADD jar /root/MyUdf.jar;
Added [/root/MyUdf.jar] to class path
Added resources: [/root/MyUdf.jar]
hive> create temporary function myUdfFunction as 'com.hiveudf.strmnp.MyHiveUdf';
OK
Time taken: 0.018 seconds
hive> select myUdfFunction(username,current_database()) from users;
Query ID = root_20170407151010_2ae29523-cd9f-4585-b334-e0b61db2c57b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0004, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-07 15:11:11,376 Stage-1 map = 0%, reduce = 0%
2017-04-07 15:11:19,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.12 sec
MapReduce Total cumulative CPU time: 3 seconds 120 msec
Ended Job = job_1491484583384_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.12 sec HDFS Read: 21659 HDFS Write: 381120 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 120 msec
OK
mydb.user1
mydb.user2
mydb.user3
Time taken: 2.137 seconds, Fetched: 3 row(s)
hive>
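The question also mentions reading files from HDFS depending on the database. Below is a minimal sketch of doing that from inside the UDF's jar with the Hadoop FileSystem API; the /env/<dbName>/config.txt path layout is a hypothetical example, and the configuration is assumed to be picked up from the cluster classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConfigReader {
    // Reads the first line of an HDFS file whose location depends on the
    // database name. The /env/<dbName>/config.txt layout is illustrative.
    public static String readFirstLine(String dbName) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/env/" + dbName + "/config.txt");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            return reader.readLine();
        }
    }
}

A method like this can then be called from evaluate(), with the database name supplied by current_database() as in the query above.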
In a .NET Core 5 Web API project I have a job scheduler which updates something in the database.
I want to run that job twice a day, at 12 AM and 12 PM. What is the cron expression for that?
How can I run the Quartz job scheduler twice a day?
Here is the scheduler startup code:
public async Task StartAsync(CancellationToken cancellationToken)
{
    Scheduler = await _schedulerFactory.GetScheduler(cancellationToken);
    Scheduler.JobFactory = _jobFactory;

    var job2 = new JobSchedule(jobType: typeof(MCBJob),
                               cronExpression: "0 0 0/12 * * ");

    var mcbJob = CreateJob(job2);
    var mcbTrigger = CreateTrigger(job2);

    await Scheduler.ScheduleJob(mcbJob, mcbTrigger, cancellationToken);
    await Scheduler.Start(cancellationToken);
}
You can separate values with , to specify individual values.
https://en.wikipedia.org/wiki/Cron#CRON_expression
4 -> 4
0-4 -> 0,1,2,3,4
*/4 -> 0,4,8,12,...,52,56
0,4 -> 0,4
We can build the schedule now. Note that Quartz cron expressions have a leading seconds field and require ? in the day-of-week (or day-of-month) position:

0 0 0,12 * * ?
| | |    | | any day of the week
| | |    | every month
| | |    every day of the month
| | at hours 0 and 12
| at minute 0
at second 0
You can use https://crontab.guru/ to build a cron expression interactively (note that it uses the five-field Unix format, without the Quartz seconds field).
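For completeness, here is a minimal runnable sketch of the corrected expression, written against the Java Quartz API (the Quartz.NET API used in the question mirrors it closely); the job class and identity name are illustrative:

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class TwiceDailySchedule {
    // A placeholder job; in the question this would be MCBJob.
    public static class McbJob implements Job {
        public void execute(JobExecutionContext context) {
            System.out.println("Job ran at " + new java.util.Date());
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(McbJob.class)
                .withIdentity("mcbJob")
                .build();
        // Quartz cron has a leading seconds field and needs '?' in the
        // day-of-week position: fire at 00:00 and 12:00 every day.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 0,12 * * ?"))
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}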
Maybe this is helpful in your case:
Visit http://www.cronmaker.com/
CronMaker is a simple website that helps you build cron expressions. It uses the open-source Quartz scheduler, and the generated expressions are based on the Quartz cron format.
I want Hive to return only the value, without the other output about the processing.
hive> select max(temp) from temp where dtime like '2014-07%' ;
Query ID = hduser_20170608003255_d35b8a43-8cc5-4662-89ce-9ee5f87d3ba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1496864651740_0008, Tracking URL = http://localhost:8088/proxy/application_1496864651740_0008/
Kill Command = /home/hduser/hadoop/bin/hadoop job -kill job_1496864651740_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-06-08 00:33:01,955 Stage-1 map = 0%, reduce = 0%
2017-06-08 00:33:08,187 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.13 sec
2017-06-08 00:33:14,414 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.91 sec
MapReduce Total cumulative CPU time: 5 seconds 910 msec
Ended Job = job_1496864651740_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.91 sec HDFS Read: 853158 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 910 msec
OK
44.4
Time taken: 20.01 seconds, Fetched: 1 row(s)
I want it to return only the value, which is 44.4.
Thanks in advance...
You can put the result into a variable in a shell script. The max_temp variable will contain only the result:
max_temp=$(hive -e " set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';")
echo "$max_temp"
You can also use -S (silent mode), which suppresses the progress information:
hive -S -e "select max(temp) from temp where dtime like '2014-07%';"
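If you need the value inside a program rather than a shell script, the same approach can be driven from code. Here is a minimal Java sketch (assuming the hive CLI is on the PATH; the query is the one from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HiveQueryValue {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "hive", "-S", "-e",
                "select max(temp) from temp where dtime like '2014-07%';");
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            // stderr (any progress logs) stays separate, so with -S the
            // first line of stdout is the bare value, e.g. 44.4
            String maxTemp = r.readLine();
            System.out.println(maxTemp);
        }
        p.waitFor();
    }
}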
While trying to overwrite a Hive managed table into an external table, I am getting the error below.
Query ID = hdfs_20150701054444_576849f9-6b25-4627-b79d-f5defc13c519
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435327542589_0049, Tracking URL = http://ip-XXX.XX.XX.XX ec2.internal:8088/proxy/application_1435327542589_0049/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435327542589_0049
Hadoop job information for Stage-0: number of mappers: 0; number of reducers: 0
2015-07-01 05:44:11,676 Stage-0 map = 0%, reduce = 0%
Ended Job = job_1435327542589_0049 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I have 3 text files in HDFS which I am reading using Spark SQL and registering as tables. After that I perform about 5-6 operations, including joins, group by, etc., and this whole process takes barely 6-7 seconds (source file size: 3 GB, with almost 20 million rows).
As the final step of my computation, I expect only 1 record in my final RDD, named acctNPIScr in the code snippet below.
My question is that when I try to print this RDD, either by registering it as a table and printing its records, or via acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println), it takes a very long time, almost 1.5 minutes, to print 1 record.
Can someone please tell me if I am doing something wrong in the printing? What is the best way to print the final result from a SchemaRDD?
.....
val acctNPIScr = sqlContext.sql("SELECT party_id, sum(npi_int)/sum(device_priority_new) as npi_score FROM AcctNPIScoreTemp group by party_id")
acctNPIScr.registerTempTable("AcctNPIScore")
val endtime = System.currentTimeMillis()
logger.info("Total sql Time :" + (endtime - st)) // this time is hardly 5 secs
println("start printing")
val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
//acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime)) // print one record is taking almost 1.5 minute
The key here is that all Spark transformations are lazy. Only actions cause the transformations to be evaluated.
// this time is hardly 5 seconds
This is because you haven't forced the transformations to be evaluated yet. That doesn't happen until map(t => "Score: " + t(1)).collect() runs: collect is an action, so your entire dataset is processed at that point.
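Plain Java streams behave the same way, which makes for a compact illustration of the principle: intermediate operations (like Spark transformations) do nothing until a terminal operation (like a Spark action) runs. A self-contained sketch, not Spark code:

import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        Stream<String> scores = Stream.of("a", "b", "c")
                .map(s -> {                  // "transformation": not executed yet
                    System.out.println("mapping " + s);
                    return "Score: " + s;
                });
        System.out.println("nothing mapped yet");
        scores.forEach(System.out::println); // "action": now the map runs
    }
}

So the 5 seconds measured earlier covers only building the query plan; the 1.5 minutes is the cost of actually processing the 3 GB of input, which collect triggers.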
I have a script with a set of 5 queries. I would like to execute the script and write the output to a file. What command should I give from the Hive CLI?
Thanks
Sample queries file (3 queries):
ramisetty#aspire:~/my_tmp$ cat queries.q
show databases; --query1
use my_db; --query2
INSERT OVERWRITE LOCAL DIRECTORY './outputLocalDir' --query3
select * from students where branch = "ECE"; --query3
Run Hive and source the file:
ramisetty#aspire:~/my_tmp$ hive
hive (default)> source ./queries.q;
--output of Q1 on console-----
Time taken: 7.689 seconds
--output of Q2 on console -----
Time taken: 1.689 seconds
____________________________________________________________
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201401251835_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201401251835_0004
Kill Command = /home/ramisetty/VJDATA/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201401251835_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-25 19:06:56,689 Stage-1 map = 0%, reduce = 0%
2014-01-25 19:07:05,868 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:14,047 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:15,059 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
MapReduce Total cumulative CPU time: 2 seconds 70 msec
Ended Job = job_201401251835_0004
Copying data to local directory outputLocalDir
Copying data to local directory outputLocalDir
2 Rows loaded to outputLocalDir
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.07 sec HDFS Read: 525 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
firstname secondname dob score branch
Time taken: 32.44 seconds
Output file:
cat ./outputLocalDir/000000_0