Kafka to Flink to Hive - writes failing

I am trying to sink data to Hive via Kafka -> Flink -> Hive using the following code snippet:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<GenericRecord> stream = readFromKafka(env);

private static final TypeInformation[] FIELD_TYPES = new TypeInformation[]{
        BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO
};

JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
        .setDrivername("org.apache.hive.jdbc.HiveDriver")
        .setDBUrl("jdbc:hive2://hiveconnstring")
        .setUsername("myuser")
        .setPassword("mypass")
        .setQuery("INSERT INTO testHiveDriverTable (key,value) VALUES (?,?)")
        .setBatchSize(1000)
        .setParameterTypes(FIELD_TYPES)
        .build();

DataStream<Row> rows = stream.map((MapFunction<GenericRecord, Row>) st1 -> {
    Row row = new Row(2);
    row.setField(0, st1.get("SOME_ID"));
    row.setField(1, st1.get("SOME_ADDRESS"));
    return row;
});

sink.emitDataStream(rows);
env.execute("Flink101");
But I am getting the following error:
Caused by: java.lang.RuntimeException: Execution of JDBC statement failed.
at org.apache.flink.api.java.io.jdbc.JDBCOutputFormat.flush(JDBCOutputFormat.java:219)
at org.apache.flink.api.java.io.jdbc.JDBCSinkFunction.snapshotState(JDBCSinkFunction.java:43)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:356)
... 12 more
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveStatement.executeBatch(HiveStatement.java:381)
at org.apache.flink.api.java.io.jdbc.JDBCOutputFormat.flush(JDBCOutputFormat.java:216)
... 17 more
I checked the hive-jdbc driver and it seems that executeBatch is not supported:
public class HiveStatement implements java.sql.Statement {
    ...
    @Override
    public int[] executeBatch() throws SQLException {
        throw new SQLFeatureNotSupportedException("Method not supported");
    }
    ...
}
Is there any way we can achieve this using the JDBC driver?
Thanks in advance.

Hive's JDBC implementation is not complete yet. Your problem is tracked by this issue.
You could try to patch Flink's JDBCOutputFormat to not use batching, by replacing upload.addBatch with upload.execute in JDBCOutputFormat.java:202 and removing the call to upload.executeBatch in JDBCOutputFormat.java:216. The downside is that you will issue a dedicated SQL query for every record, which might slow things down.
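A minimal sketch of what that patch could look like, assuming a 1.3/1.4-era JDBCOutputFormat where upload is the PreparedStatement; the method bodies are simplified and the field-binding logic is elided, so treat this as an illustration rather than the exact Flink source:

// Sketch of a patched JDBCOutputFormat (simplified; not the exact Flink source).
@Override
public void writeRecord(Row row) throws IOException {
    try {
        // ... bind each field of 'row' to the PreparedStatement as before ...
        upload.execute();      // was: upload.addBatch(); executes one INSERT per record
    } catch (SQLException e) {
        throw new IOException("Writing record to JDBC statement failed.", e);
    }
}

void flush() throws IOException {
    // was: upload.executeBatch();
    // The Hive JDBC driver throws "Method not supported" for executeBatch(),
    // and with per-record execute() there is nothing left to flush here.
}

Since every record is now written with its own INSERT statement, throughput against Hive will be limited by per-query overhead, which is the trade-off mentioned above.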

Related

Apache Beam: use runtime options to read BigQuery table data

I am trying to build an Apache Beam Dataflow pipeline that reads data from a BigQuery table, as well as from a BigQuery SQL query, and applies some transformations to it.
I am trying to build a template with a few runtime parameters, but the template build is failing with the following error:
java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=tableDate, default=null}
at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.get (ValueProvider.java:254)
The pipeline options are defined as
public interface BeamOptions extends DataflowPipelineOptions {
    void setTableDate(ValueProvider<String> value);
    ValueProvider<String> getTableDate();

    void setTableSuffix(ValueProvider<String> value);
    ValueProvider<String> getTableSuffix();
}
The pipeline is defined as follows:
public static void main(String[] args) {
    BeamOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(BeamOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    String someQuery = "select * from table1 where date = replace_date";
    // here trying to use a date passed in from the runtime options
    String updatedQuery = someQuery.replaceAll("replace_date", options.getTableDate().get());

    // throwing the error while building the template..
    PCollection<TableRow> sqlRows = pipeline.apply(
            "extract-bq-data",
            BigQueryIO.readTableRows().fromQuery(updatedQuery));

    // using the runtime tableSuffix also throws the error while building the template..
    TableReference tableSpec =
            new TableReference()
                    .setProjectId("xyz")
                    .setDatasetId("dataset")
                    .setTableId("tableSuffix".concat(options.getTableSuffix().get()));
    PCollection<TableRow> tableRows = pipeline.apply(
            "Read from BigQuery table",
            BigQueryIO.readTableRows().from(tableSpec));
}
Is there any way to pass runtime options and use them to read data from BigQuery?

Java - Insert a single row at a time into Google BigQuery?

I am creating an application where, every time a user clicks on an article, I need to capture the article data and the user data to calculate the reach of every article and be able to run analytics on the reach data.
My application is on App Engine.
When I check the documentation for inserts into BigQuery, most of it points towards bulk inserts in the form of jobs or streams.
Question:
Is it even good practice to insert into BigQuery one row at a time every time a user action is initiated? If so, could you point me to some Java code that does this effectively?
There are limits on the number of load jobs and DML queries (1,000 per day), so you'll need to use streaming inserts for this kind of application. Note that streaming inserts are different from loading data from a Java stream.
TableId tableId = TableId.of(datasetName, tableName);
// Values of the row to insert
Map<String, Object> rowContent = new HashMap<>();
rowContent.put("booleanField", true);
// Bytes are passed in base64
rowContent.put("bytesField", "Cg0NDg0="); // 0xA, 0xD, 0xD, 0xE, 0xD in base64
// Records are passed as a map
Map<String, Object> recordsContent = new HashMap<>();
recordsContent.put("stringField", "Hello, World!");
rowContent.put("recordField", recordsContent);

InsertAllResponse response =
        bigquery.insertAll(
                InsertAllRequest.newBuilder(tableId)
                        .addRow("rowId", rowContent)
                        // More rows can be added in the same RPC by invoking .addRow() on the builder
                        .build());
if (response.hasErrors()) {
    // If any of the insertions failed, this lets you inspect the errors
    for (Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
        // inspect row error
    }
}
(From the example at https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery-stream-data-java)
Note especially that a failed insert does not always throw an exception. You must also check the response object for errors.
Is it even good practice to insert into BigQuery one row at a time every time a user action is initiated?
Yes, it's pretty typical to stream event data to BigQuery for analytics. You could get better performance if you buffer multiple events into the same streaming insert request to BigQuery, but one row at a time is definitely supported.
A simplified version of Google's example:
Map<String, Object> row1Data = new HashMap<>();
row1Data.put("booleanField", true);
row1Data.put("stringField", "myString");

Map<String, Object> row2Data = new HashMap<>();
row2Data.put("booleanField", false);
row2Data.put("stringField", "myOtherString");

TableId tableId = TableId.of("myDatasetName", "myTableName");
InsertAllResponse response =
        bigQuery.insertAll(
                InsertAllRequest.newBuilder(tableId)
                        .addRow("row1Id", row1Data)
                        .addRow("row2Id", row2Data)
                        .build());
if (response.hasErrors()) {
    // If any of the insertions failed, this lets you inspect the errors
    for (Map.Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
        // inspect row error
    }
}
You can use the Cloud Logging API to write one row at a time.
https://cloud.google.com/logging/docs/reference/libraries
Sample code from the documentation:
public class QuickstartSample {
    /** Expects a new or existing Cloud log name as the first argument. */
    public static void main(String... args) throws Exception {
        // Instantiates a client
        Logging logging = LoggingOptions.getDefaultInstance().getService();

        // The name of the log to write to
        String logName = args[0]; // "my-log";

        // The data to write to the log
        String text = "Hello, world!";

        LogEntry entry =
                LogEntry.newBuilder(StringPayload.of(text))
                        .setSeverity(Severity.ERROR)
                        .setLogName(logName)
                        .setResource(MonitoredResource.newBuilder("global").build())
                        .build();

        // Writes the log entry asynchronously
        logging.write(Collections.singleton(entry));

        System.out.printf("Logged: %s%n", text);
    }
}
In this case you need to create a sink from the log data; the messages will then be redirected to the BigQuery table.
https://cloud.google.com/logging/docs/export/configure_export_v2
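As a rough sketch (assuming the google-cloud-logging Java client; the project, dataset, sink name and filter below are placeholders), creating such a sink programmatically could look like this:

import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Sink;
import com.google.cloud.logging.SinkInfo;
import com.google.cloud.logging.SinkInfo.Destination.DatasetDestination;

public class CreateBigQuerySink {
    public static void main(String... args) {
        Logging logging = LoggingOptions.getDefaultInstance().getService();

        // Route entries from the log written above into a BigQuery dataset.
        // "my-project", "my_dataset", "my-bq-sink" and the filter are placeholder values.
        SinkInfo sinkInfo =
                SinkInfo.newBuilder("my-bq-sink", DatasetDestination.of("my-project", "my_dataset"))
                        .setFilter("logName=projects/my-project/logs/my-log")
                        .build();
        Sink sink = logging.create(sinkInfo);
        System.out.println("Created sink: " + sink.getName());
    }
}

Alternatively, the sink can be created in the Cloud Console or with gcloud logging sinks create, as described in the export documentation linked above.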

BigQuery in Dataflow fails to load data from Cloud Storage: JSON object specified for non-record field

I have a Dataflow pipeline running locally on my machine, writing to BigQuery. BigQuery in this batch job requires a temporary location, and I have provided one in my Cloud Storage bucket. The relevant parts are:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
        .setTempLocation("gs://folder/temp");
Pipeline p = Pipeline.create(options);
....

List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("uuid").setType("STRING"));
fields.add(new TableFieldSchema().setName("start_time").setType("TIMESTAMP"));
fields.add(new TableFieldSchema().setName("end_time").setType("TIMESTAMP"));
TableSchema schema = new TableSchema().setFields(fields);

session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
        .apply(BigQueryIO.Write
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .to("myproject:db.table"));
Where for FormatAsTableRowFn I have:
static class FormatAsTableRowFn extends DoFn<KV<String, String>, TableRow>
        implements RequiresWindowAccess {
    @Override
    public void processElement(ProcessContext c) {
        TableRow row = new TableRow()
                .set("uuid", c.element().getKey())
                // include a field for the window timestamp
                .set("start_time", ((IntervalWindow) c.window()).start().toInstant())  // NOTE: I tried both with and without
                .set("end_time", ((IntervalWindow) c.window()).end().toInstant());     // .toInstant, receiving the same error
        c.output(row);
    }
}
If I print out row.toString(), I get legitimate timestamps:
{uuid=00:00:00:00:00:00, start_time=2016-09-22T07:34:38.000Z, end_time=2016-09-22T07:39:38.000Z}
When I run this code, Java says: Failed to create the load job beam_job_XXX
Manually inspecting the temp folder in GCS, the objects look like:
{"mac":"00:00:00:00:00:00","start_time":{"millis":1474529678000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false},"end_time":{"millis":1474529978000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false}}
Looking at the failed job report in BigQuery, the error says:
JSON object specified for non-record field: start_time (error code: invalid)
This is very strange, because I am pretty sure I said this is a TIMESTAMP, and I am 100% sure my schema in BigQuery conforms to the TableSchema in the SDK. (NOTE: setting withCreateDisposition...CREATE_IF_NEEDED yields the same result.)
Could someone please tell me how I need to remedy this to get the data inside BigQuery?
Don't use Instant objects. Try using milliseconds/seconds instead.
https://cloud.google.com/bigquery/data-types
A positive number specifies the number of seconds since the epoch.
So, something like this should work:
.getMillis() / 1000
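Applied to the FormatAsTableRowFn above, the change would look roughly like this (a sketch, assuming the window bounds are Joda-Time Instants, as in the Dataflow SDK):

// Sketch: emit the window bounds as epoch seconds instead of Joda Instant objects.
TableRow row = new TableRow()
        .set("uuid", c.element().getKey())
        .set("start_time", ((IntervalWindow) c.window()).start().getMillis() / 1000)
        .set("end_time", ((IntervalWindow) c.window()).end().getMillis() / 1000);
c.output(row);

This avoids the Instant being serialized as a nested JSON object (the millis/chronology structure visible in the temp files above), which is what BigQuery rejects for a TIMESTAMP column.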

In-memory H2 database, insert not working in SpringBootTest

I have a Spring Boot application which I wish to test.
Below are the details of my files.
application.properties
PRODUCT_DATABASE_PASSWORD=
PRODUCT_DATABASE_USERNAME=sa
PRODUCT_DATABASE_CONNECTION_URL=jdbc:h2:file:./target/db/testdb
PRODUCT_DATABASE_DRIVER=org.h2.Driver
RED_SHIFT_DATABASE_PASSWORD=
RED_SHIFT_DATABASE_USERNAME=sa
RED_SHIFT_DATABASE_CONNECTION_URL=jdbc:h2:file:./target/db/testdb
RED_SHIFT_DATABASE_DRIVER=org.h2.Driver
spring.datasource.platform=h2
Configuration class
@SpringBootConfiguration
@SpringBootApplication
@Import({ProductDataAccessConfig.class, RedShiftDataAccessConfig.class})
public class TestConfig {
}
Main Test Class
@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest(classes = {TestConfig.class, ConfigFileApplicationContextInitializer.class},
        webEnvironment = SpringBootTest.WebEnvironment.NONE)
public class MainTest {

    @Autowired(required = true)
    @Qualifier("dataSourceRedShift")
    private DataSource dataSource;

    @Test
    public void testHourlyBlock() throws Exception {
        insertDataIntoDb(); // data successfully inserted
        SpringApplication.run(Application.class, new String[]{}); // no data found
    }
}
Data access in Application.class:
try (Connection conn = dataSourceRedShift.getConnection();
     Statement stmt = conn.createStatement()) {
    // access inserted data
}
Please help!
PS: for the Spring Boot application the test beans are being picked up, so bean instantiation is definitely not the problem. I think I am missing some properties.
I do not use Hibernate in my application, and the data disappears even within the same application context (child context), i.e. when I run a Spring Boot application which reads the data inserted earlier.
Problem solved: removing spring.datasource.platform=h2 from application.properties made my H2 data persist.
But I still wish to know how H2 is starting automatically.

Access remote objects with an RMI client by creating an initial context and performing a lookup

I'm trying to look up the PublicRepository class from an EJB on a WebLogic 10 server. This is the relevant piece of code:
/**
 * RMI/IIOP clients should use this narrow function
 */
private static Object narrow(Object ref, Class c) {
    return PortableRemoteObject.narrow(ref, c);
}

/**
 * Lookup the EJBs home in the JNDI tree
 */
private static PublicRepository lookupHome() throws NamingException {
    // Lookup the beans home using JNDI
    Context ctx = getInitialContext();
    try {
        Object home = ctx.lookup("cea");
        return (PublicRepository) narrow(home, PublicRepository.class);
    } catch (NamingException ne) {
        System.out.println("The client was unable to lookup the EJBHome. Please make sure ");
        System.out.println("that you have deployed the ejb with the JNDI name "
                + "cea" + " on the WebLogic server at " + "iiop://localhost:7001");
        throw ne;
    }
}

private static Context getInitialContext() throws NamingException {
    try {
        // Get an InitialContext
        Properties h = new Properties();
        h.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        h.put(Context.PROVIDER_URL, "iiop://localhost:7001");
        return new InitialContext(h);
    } catch (NamingException ne) {
        System.out.println("We were unable to get a connection to the WebLogic server at " + "iiop://localhost:7001");
        System.out.println("Please make sure that the server is running.");
        throw ne;
    }
}
However, I'm getting a ClassCastException:
Exception in thread "main" java.lang.ClassCastException
at com.sun.corba.se.impl.javax.rmi.PortableRemoteObject.narrow(Unknown Source)
at javax.rmi.PortableRemoteObject.narrow(Unknown Source)
at vrd.narrow(vrd.java:67)
at vrd.lookupHome(vrd.java:80)
at vrd.main(vrd.java:34)
Caused by: java.lang.ClassCastException: weblogic.corba.j2ee.naming.ContextImpl
... 5 more
Am I correct in using the above code to retrieve a class to be used in my client application? How can I get rid of the cast exception?
The simple thing to do would be to store the result of 'narrow' in a java.lang.Object and then see what type it is...
The error means you've looked up a Context rather than a bound object. In other words, you looked up "cea" instead of something like "cea/Bean". It's analogous to using a FileInputStream on a directory.
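A minimal sketch of the corrected lookup, assuming the bean is actually bound under a child name such as "cea/PublicRepositoryBean" (a placeholder; the real JNDI name depends on your deployment descriptors):

// The stack trace shows lookup("cea") returns weblogic.corba.j2ee.naming.ContextImpl,
// i.e. a naming context rather than the EJB home, so narrow() to PublicRepository fails.
// Look up the object bound under that context instead; "cea/PublicRepositoryBean" is a
// placeholder - check the server's JNDI tree for the real binding.
Context ctx = getInitialContext();
Object home = ctx.lookup("cea/PublicRepositoryBean");
PublicRepository repository =
        (PublicRepository) PortableRemoteObject.narrow(home, PublicRepository.class);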
I was using the wrong JNDI name, hence it couldn't retrieve the object. Thanks everyone for looking.