HQL failed with SQLException - Hive

When I run
Select * from table_a
using my Hive IDE, I receive:
SQLException pa current sql input files exceed the maximum limit. Please optimize. Maxfile-->100000
I could find nothing when I Googled this error.

Seems like your table has exceeded the maximum number of allowed HDFS files.
Go to hive (or beeline) and run the following command; by default the value is 100000:
set hive.exec.max.created.files;
To fix the issue, you need to check your INSERT queries and understand why they are creating so many files.
However, for the time being, you can concatenate some of the partitions (better to do it for all of them) using the following command:
alter table dbname.tblName partition (col_name=value) concatenate;
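As a quick stopgap, these session-level settings may also help; the values below are illustrative assumptions, not recommendations, and raising the limit only hides the underlying small-files problem:
-- raise the per-job limit for the current session only
set hive.exec.max.created.files=200000;
-- and/or let Hive merge small output files at the end of the job
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;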

Related

Why does a Java OutOfMemoryError occur when selecting fewer columns in a Hive query?

I have two Hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode', with all columns included in the result. However, the following query causes an error:
select content from ode limit 5;
Where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be a lot cheaper, so why does it cause a memory issue, and how can I fix it?
When you select the whole table, Hive triggers a fetch task instead of a MapReduce job, which involves no parsing (it is like calling hdfs dfs -cat ... | head -5).
As far as I can see, in your case the Hive client tries to run the map locally.
You can choose one of two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.
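For reference, a hedged sketch of both options (the accepted values of hive.fetch.task.conversion depend on your Hive version, and the heap size is only an illustrative assumption):
-- Option 1: disable fetch-task conversion so the query runs as a regular remote job
set hive.fetch.task.conversion=none;
select content from ode limit 5;
-- Option 2 (shell, before starting the Hive client): give the local client more heap
-- export HADOOP_CLIENT_OPTS="-Xmx2g"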

Drop Hive table partition through Pig script

Currently we drop the table daily and run the script that loads the data into the tables. The script takes 3-4 hours, during which the data is not available. So our aim now is to keep the old Hive data available to analysts until the new data load completes.
I am achieving this in an HQL script by loading daily data into Hive tables partitioned on load_year, load_month and load_day, and dropping yesterday's data by dropping its partition.
But what is the option for a Pig script to achieve the same? Can we alter the table through a Pig script? I don't want to execute a separate HQL script to drop the partition after Pig.
Thanks
Since HDP 2.3 you can use HCatalog commands inside Pig scripts. Therefore, you can use the HCatalog command to drop a Hive table partition. The following is an example of dropping a Hive partition:
-- Set the correct hcat path
set hcat.bin /usr/bin/hcat;
-- Drop a table partition or execute any other HCatalog command
sql ALTER TABLE midb1.mitable1 DROP IF EXISTS PARTITION(activity_id = "VENTA_ALIMENTACION",transaction_month = 1);
Another way is to use sh command execution inside the Pig script. However, I had some problems escaping special characters in the ALTER commands, so the first option is the best one in my opinion.
Regards,
Roberto Tardío

How to use Vertica's COPY LOCAL as an SQL statement from MATLAB on Windows

I'm trying to insert around 80 million records created using MATLAB into a Vertica database table. I wanted to know if we can call the COPY LOCAL statement in MATLAB as a regular SQL statement using exec(conn, sql). For testing purposes, I tried with a .dat file having around 4 million records, as follows:
sqlstmnt = 'COPY schema.table_name (FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL ''/my/file/full/path/test1.dat''';
results = exec(conn,sqlstmnt);
But it gave an error in results.Message like:
[Vertica]JDBC A ResultSet was expected but not generated from query "COPY schema.table_name(FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL '/my/file/full/path/test1.dat'". Query not executed.
I have the data in the '.dat' file in the order in which the columns are mentioned in COPY LOCAL.
I could not find any helpful resource explaining this error.
I have this test1.dat file which I'm able to insert using COPY from vsql, but since I run my code in MATLAB with many iterations, each iteration producing about a million records, I would want to insert them during each iteration. Any help will be really great.
The COPY command returns a ResultSet that includes the amount of loaded data. I see two main options:
1) results = exec(conn,sqlstmnt);
2) results = runsqlscript(conn,'nameOfSQLScriptthatIncludeTheCopyCommand.sql')
I hope you will find it useful
Thanks
I just finished reviewing your input sample data.
I see a major problem with the mapping of the input csv to the target table.
The main issues are:
1) Lines are broken into 2 lines (you should prefer having one sample per line and avoid breaking it into 2 lines)
Eg : "1,20150101,0,2,2573,2714,1,8.147237e-01
50,48,49,54,45,48,51,-28 12:11:46"
2) When you define data types on the Vertica table, e.g. timestamp, the data in the csv must match them (what you have is "-28 12:11:46", which will not work)
After you fix all these issues, make sure you test it using vsql, then go and try it with MATLAB.
I hope you will find it useful.
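When testing in vsql, something along these lines can help surface the bad rows; the delimiter and the rejected/exceptions file paths are assumptions to adjust to your data:
COPY schema.table_name (FK_CUSTOMER_ID, FK_RUN_START_DATE_ID, FK_RUN_END_DATE_ID, FK_TRAVEL_ID, FK_ORIGIN_ID, FK_DEST_ID, FK_SEGMENT_ID, SEGMENT_PERCENTAGE, LAST_UPDATED)
FROM LOCAL '/my/file/full/path/test1.dat'
DELIMITER ','
REJECTED DATA '/tmp/test1_rejected.txt'
EXCEPTIONS '/tmp/test1_exceptions.txt';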

How can I clear the stl_load_errors table in Redshift?

Is there a way to clear out the contents of the stl_load_errors table in Amazon's Redshift?
I am running batch processes to COPY into Redshift and it would be convenient if I could view the entire stl_load_errors in one go without having to filter by a time range.
When I attempt to DELETE FROM stl_load_errors, I get "ERROR: cannot delete from a system table"
When I attempt to TRUNCATE stl_load_errors, I get "ERROR: permission denied: "stl_load_errors" is a system catalog"
Nope, you can't delete from that table.
It's worth noting that Redshift will automatically clear down that table over time, i.e., it doesn't hold all load errors forever.
You can't delete from stl_load_errors, but if you COPY from S3 you can filter your SELECT on stl_load_errors by filename.
For example:
select * from stl_load_errors where filename like 's3://BUCKET/PREFIX_OF_PATH%'
The stl_load_errors table drops old data (usually after about a week), so you don't need to worry about disk space.
You can use the query below to get all errors for your COPY command.
SELECT err.userid,
err.process,
err.recordtime,
err.pid,
err.errcode,
err.file,
err.linenum,
err.context,
err.error
FROM stl_error err,
stv_recents rec
WHERE rec.pid=err.pid
AND rec.status='running'
AND rec.query LIKE 'COPY%';
Edit the 'COPY%' pattern in the query above to match your command.

Best equivalent of SQL Server UPDATE command in Hive

What is the best (least expensive) equivalent of the SQL Server UPDATE ... SET command in Hive?
For example, consider the case in which I want to convert the following query:
UPDATE TABLE employee
SET visaEligibility = 'YES'
WHERE experienceMonths > 36
to equivalent Hive query.
I'm assuming you have a table without partitions, in which case you should be able to run the following command:
INSERT OVERWRITE TABLE employee SELECT employeeId, employeeName, experienceMonths, salary, CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility FROM employee;
There are other ways, but they are much more convoluted; I think the way Bejoy described is the most efficient.
(source: Bejoy KS blog)
Note that if you have to do this on a partitioned table (which is likely if you have a lot of data), you would probably need to overwrite your partition when doing this.
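For a partitioned table, a hedged sketch of overwriting one affected partition (the partition column and value here are assumptions):
INSERT OVERWRITE TABLE employee PARTITION (load_day='2016-01-01')
SELECT employeeId, employeeName, experienceMonths, salary,
       CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility
FROM employee
WHERE load_day='2016-01-01';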
You can create an external table and use INSERT OVERWRITE LOCAL DIRECTORY; if you want to change column values, you can use CASE WHEN, IF, or other conditional operators. Then copy the output files back to the HDFS location, as sketched below.
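A rough sketch of that approach (the directory, delimiter and target HDFS path are assumptions, and ROW FORMAT on directory inserts requires Hive 0.11+):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/employee_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT employeeId, employeeName, experienceMonths, salary,
       CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility
FROM employee;
-- then copy the files back under the table's location, e.g.
-- hdfs dfs -put -f /tmp/employee_out/* /user/hive/warehouse/employee/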
You can upgrade your Hive to 0.14.0.
Starting from 0.14.0, Hive supports the UPDATE operation.
To use it, the Hive tables need to be created so that they support the ACID output format, and additional properties need to be set in hive-site.xml.
How to do CRUD operations in Hive
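For completeness, a minimal sketch of the ACID route on Hive 0.14+; the table definition is illustrative and the property values should be tuned for your cluster:
-- properties typically set in hive-site.xml (or via SET while experimenting):
--   hive.support.concurrency=true
--   hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
--   hive.compactor.initiator.on=true
--   hive.compactor.worker.threads=1
--   hive.enforce.bucketing=true  (needed on older Hive releases)
CREATE TABLE employee_acid (
  employeeId INT,
  employeeName STRING,
  experienceMonths INT,
  salary DOUBLE,
  visaEligibility STRING
)
CLUSTERED BY (employeeId) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE employee_acid SET visaEligibility = 'YES' WHERE experienceMonths > 36;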