I am trying to insert data into Hive (NON-ACID) table using hive-jdbc connection. It works if I execute a single SQL query in a 'statement'. If I try to batch the SQL using 'addBatch', I get an error 'method not supported'. I am using hive-jdbc 2.1 and HDP 2.3. Is there a way to batch multiple SQL into a single 'statement' using hive-jdbc?
As Ben mentioned, the addBatch() method is not supported in Hive JDBC.
You can insert multiple rows in one statement, for example:
String batchInsertSql = "insert into name_age values (?,?),(?,?)";
PreparedStatement preparedStatement = connection.prepareStatement(batchInsertSql);
// first row
preparedStatement.setString(1, "tom");
preparedStatement.setInt(2, 10);
// second row
preparedStatement.setString(3, "sam");
preparedStatement.setInt(4, 20);
preparedStatement.execute();
Unfortunately, Hive-JDBC only declares addBatch() to satisfy the JDBC interface; there is no real implementation:
public void addBatch() throws SQLException {
    // TODO Auto-generated method stub
    throw new SQLException("Method not supported");
}
Try this, works for me:
INSERT INTO <table_name> VALUES ("Hello", "World"), ("This", "works"), ("Be", "Awesome")
This will run as a single MapReduce job, so it saves time as well.
It will create three rows with the given values.
Use a StringBuilder to loop over the values, appending to the query string, and then execute the resulting string.
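For illustration, a minimal sketch of that StringBuilder approach (the connection and table are assumed to exist; inlining values like this is only safe for trusted data, otherwise prefer the parameterized form above):
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Sketch: build one multi-row INSERT from in-memory (name, age) rows.
static void insertAll(Connection connection, List<Object[]> rows) throws SQLException {
    StringBuilder sql = new StringBuilder("insert into name_age values ");
    for (int i = 0; i < rows.size(); i++) {
        if (i > 0) sql.append(", ");
        sql.append("('").append(rows.get(i)[0]).append("', ").append(rows.get(i)[1]).append(")");
    }
    try (Statement stmt = connection.createStatement()) {
        stmt.execute(sql.toString());
    }
}
// e.g. insertAll(conn, List.of(new Object[]{"tom", 10}, new Object[]{"sam", 20}));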
I am trying to run a raw SQL query, because I need to insert a particular id when injecting data into the SQL database.
I set the IDENTITY_INSERT flag in C#: SET IDENTITY_INSERT [MyDb-Dev].[dbo].[companies] ON
and when I run the query, it complains that the flag is not set properly:
Exception thrown: 'Microsoft.Data.SqlClient.SqlException' in Microsoft.EntityFrameworkCore.Relational.dll
Cannot insert explicit value for identity column in table 'companies' when IDENTITY_INSERT is set to OFF.
I tried :
removing [MyDb-Dev]. from the first query but still the same
running setIdentityInsert("companies", "ON"); TWICE and it never triggers any exception
Here is my code (it never throws any exception, so I guess it works):
private void setIdentityInsert(string table, string value)
{
    try
    {
        var sql = @"SET IDENTITY_INSERT [MyDb-Dev].[dbo].[" + table + "] " + value;
        _context.Database.ExecuteSqlRaw(sql);
        _logger.LogInformation(sql);
    }
    catch (Exception e)
    {
        _logger.LogWarning(e.Message);
    }
}
How can I tell whether the SET IDENTITY_INSERT query worked correctly?
Why would that query run without affecting the flag?
Thanks for your help.
The reason why this fails is a different session. Each request is its own session, and SET IDENTITY_INSERT ON is valid only for the current session. You will have to rework it using string construction, appending your INSERT alongside the SET IDENTITY_INSERT ON in a single command, as sketched below.
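For illustration, here is that single-command idea sketched in plain JDBC against SQL Server (connection details, table, and columns are hypothetical): because all three statements travel in one batch on one connection, they share a session and the flag is still ON when the INSERT executes.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: send the flag change and the INSERT as one batch so they share
// a session. The URL, credentials, table, and columns are hypothetical.
String url = "jdbc:sqlserver://localhost;databaseName=MyDb-Dev";
try (Connection con = DriverManager.getConnection(url, "user", "password");
     Statement st = con.createStatement()) {
    st.execute(
        "SET IDENTITY_INSERT [dbo].[companies] ON; " +
        "INSERT INTO [dbo].[companies] (Id, Name) VALUES (42, 'Acme'); " +
        "SET IDENTITY_INSERT [dbo].[companies] OFF;");
}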
OK, Entity Framework is really not the best partner in this story.
One has to call
_context.Database.OpenConnection();
before the first query, so that subsequent commands reuse the same open connection (and hence the same session).
I was wondering if there was a way to get EFUtilities running at the same time EFProfiler is running.
I appreciate the profiler would not show the bulk insert, due to it being done outside the confines of the DbContext. At the moment, I cannot run batch jobs, as the profiler has the connection wrapped. It runs fine when not enabled.
The exception I am getting is thus:
A first chance exception of type 'System.InvalidOperationException'
occurred in EntityFramework.Utilities.dll
Additional information: No provider supporting the InsertAll operation
for this datasource was found
The inner exception is null.
This is because EFUtilities automatically finds the correct provider. But when the connection is wrapped this is no longer possible.
InsertAll looks like this.
public void InsertAll<TEntity>(IEnumerable<TEntity> items, DbConnection connection = null, int? batchSize = null)
To use the SqlProvider (which is actually the only provider out of the box), you can create a new SqlConnection() and pass that to InsertAll.
So basically you would need to do this:
using (var db = new YourContext())
using (var con = new SqlConnection(YourConnectionString))
{
    EFBatchOperation.For(db, db.PartialTestClass1).InsertAll(partials, con);
}
Now, maybe you are doing more and want both parts to run under the same transaction. In that case you can wrap that code block in a TransactionScope.
QUESTION: Can you use multiple statements and result sets that operate simultaneously over the same connection, in a single-threaded (non-multithreaded) program?
I only found this related question, but the answer is not conclusive:
JDBC Statement/PreparedStatement per connection
The answer explains the relationship between ResultSet and Statement, which is known to me.
Given that, you cannot have multiple result sets per statement.
The answer says that you can have multiple result sets per connection, but it does not mention any sources.
I'm asking if it's possible to loop over the first result set and, using the same connection that generated it, open another result set and loop over it inside the iteration. And where is the documentation that defines this behavior?
The situation that interests me looks like this; the statements perform tasks simultaneously inside the loop:
Connection con = Factory.getDBConn(user, pss, endpoint, etc);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("SELECT TEXT FROM dba");
while (rs.next()) {
    rs.getInt(....);
    rs.getInt(....);
    rs.getInt(....);
    rs.getInt(....);

    // second statement and result set on the same connection
    Statement stmt2 = con.createStatement();
    ResultSet rs2 = stmt2.executeQuery("Select ......");
    while (rs2.next()) {
        ....
    }
    rs2.close();
    stmt2.close();

    // third statement: an update on the same connection
    Statement stmt3 = con.createStatement();
    stmt3.executeUpdate("Insert Into table xxx ......");
    stmt3.close();
}
To clarify a bit more: with the execution of the update in stmt3, you could get an error like this:
java.sql.SQLException: There is an open result set on the current connection, which must be closed prior to executing a query.
So with some drivers you cannot interleave statements on the same connection while a result set is still open.
If I understand correctly, you need to work with two (or more) result sets simultaneously within a single method.
It is possible, and it works well. But you have to remember a few things:
Everything you do on each result set is handled by a single Connection, unless you open new connections for each Statement (and ResultSet).
If you need a multithreaded process, I suggest you create a Connection for each thread (or use a connection pool), as sketched below; if you use a single connection in a multithreaded process, your program will hang or crash, since every SQL statement goes through a single connection, and every new statement has to wait until the previous one has finished.
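A minimal sketch of that connection-per-thread advice (the JDBC URL, credentials, and query are hypothetical):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one Connection per worker, so no two threads ever share one.
ExecutorService pool = Executors.newFixedThreadPool(4);
for (int i = 0; i < 4; i++) {
    pool.submit(() -> {
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pss");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT TEXT FROM dba")) {
            while (rs.next()) {
                // process rows independently of the other threads
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
}
pool.shutdown();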
Besides that, your question needs some clarification. What do you really need to do?
A ResultSet object is automatically closed when the Statement object that generated it is closed, re-executed, or used to retrieve the next result from a sequence of multiple results.
http://docs.oracle.com/javase/6/docs/api/java/sql/ResultSet.html
SQL Server is a database that does support multiple record sets, so you can execute a couple of queries in a single stored procedure, for example:
SELECT * FROM employees
SELECT * FROM products
SELECT * FROM depts
You can then move between the record sets. At least I know you can do this in .NET, for example:
using (var conn = new SqlConnection("connstring"))
using (var command = new SqlCommand("SPName", conn))
{
    conn.Open();
    command.CommandType = CommandType.StoredProcedure;
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // Process all records from first result set
        }
        reader.NextResult();
        while (reader.Read())
        {
            // Process all records from 2nd result set
        }
        reader.NextResult();
        while (reader.Read())
        {
            // Process all records from 3rd result set
        }
    }
}
I am assuming that Java supports a similar mechanism.
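It does: in JDBC, Statement.execute() plus getMoreResults() walks the multiple result sets returned by one call. A minimal sketch (the procedure name and connection details are hypothetical):
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Sketch: iterate over several result sets produced by one stored procedure.
try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pss");
     CallableStatement cs = con.prepareCall("{call SPName}")) {
    boolean isResultSet = cs.execute();
    while (true) {
        if (isResultSet) {
            try (ResultSet rs = cs.getResultSet()) {
                while (rs.next()) {
                    // process the current result set
                }
            }
        } else if (cs.getUpdateCount() == -1) {
            break; // no more results of either kind
        }
        isResultSet = cs.getMoreResults();
    }
}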
I am trying to write to a PostgreSQL database table from MATLAB. I have got the connection working using JDBC and created the table, but I am getting a BatchUpdateException when I try to insert a record.
The MATLAB query to insert the data is:
user_table = 'rm_user';
colNames = {'user_id'};
data = {longRecords(iterator)};
fastinsert(conn, user_table, colNames, data);
The exception says:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO rm_user (user_id) VALUES ( '4') was aborted. Call getNextException to see the cause.
But I don't know how to call getNextException from MATLAB.
Any ideas what's causing the problem or how I can find out more about the exception?
EDIT
Turns out I was looking at documentation for a newer version of MATLAB than mine. I have changed from fastinsert to insert and it is now working. However, I'm still interested in knowing if there is a way I could use getNextException from MATLAB.
This should work:
try
    user_table = 'rm_user';
    colNames = {'user_id'};
    data = {longRecords(iterator)};
    fastinsert(conn, user_table, colNames, data);
catch err
    err.getNextException()
end
Alternatively, just look at the caught error; it should contain the same information.
Also, MATLAB has a function lasterr which will give you the last error without a catch statement. The function is deprecated, but you can find documentation for replacements at the link provided.
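For reference, getNextException comes from JDBC: the PostgreSQL driver chains the real cause onto the BatchUpdateException. In plain Java the chain is walked like this (a sketch; it assumes a java.sql.Statement with pending batched inserts):
import java.sql.BatchUpdateException;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: surface the real cause behind a JDBC batch failure.
static void runBatch(Statement stmt) {
    try {
        stmt.executeBatch();
    } catch (BatchUpdateException e) {
        SQLException next = e.getNextException();
        while (next != null) {
            System.err.println(next.getMessage()); // the driver's real error
            next = next.getNextException();
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}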
I am using STS + Grails 1.3.7 and doing a batch insertion of thousands of instances of a domain class.
It is very slow because Hibernate sends the SQL statements one JDBC call at a time instead of combining them into one batch.
How can I make them into one large statement?
What you can do is flush the Hibernate session every 20 inserts, like this:
int cpt = 0
mycollection.each {
    cpt++
    if (cpt % 20 == 0) {
        it.save(flush: true)
    } else {
        it.save()
    }
}
Flushing the Hibernate session executes the pending SQL statements every 20 inserts.
This is the easiest method, but you can find a more interesting way to do it in Tomas Lin's blog, where he explains exactly what you want to do: http://fbflex.wordpress.com/2010/06/11/writing-batch-import-scripts-with-grails-gsql-and-gpars/
Using the withTransaction() method on the domain classes makes the inserts much faster for batch scripts. You can build up all of the domain objects in one collection, then insert them in one block.
For example:
Player.withTransaction {
    for (p in players) {
        p.save()
    }
}
You can see this line in the Hibernate documentation:
Hibernate disables insert batching at the JDBC level transparently if you use an identity identifier generator.
When I changed the type of generator, it worked.
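For illustration, a minimal sketch of switching an identifier generator from IDENTITY to a sequence in a JPA/Hibernate entity, which leaves JDBC insert batching enabled (the entity and sequence names are hypothetical; in a Grails domain class the equivalent is the id generator mapping):
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class Player {
    // IDENTITY forces Hibernate to disable JDBC insert batching; a sequence
    // generator lets ids be fetched ahead of the batched inserts.
    @Id
    @SequenceGenerator(name = "player_seq", sequenceName = "player_seq", allocationSize = 50)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "player_seq")
    private Long id;
}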