How to catch any exceptions thrown by BigQueryIO.Write and rescue the data which is failed to output? - google-bigquery

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each data contains a table ID where the data itself will be saved.
There are various factors that writing to BigQuery fails:
Table ID format is wrong.
Dataset does not exist.
Dataset does not allow the pipeline to access.
Network failure.
When one of the failures occurs, a streaming job will retry the task and stall. I tried using WriteResult.getFailedInserts() in order to rescue the bad data and avoid stalling, but it did not work well. Is there any good way?
Here is my code:
public class StarterPipeline {
private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);
public class MyData implements Serializable {
String table_id;
}
public interface MyOptions extends PipelineOptions {
#Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
#Validation.Required
ValueProvider<String> getInputTopic();
void setInputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<MyData> input = p
.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
.via((String text) -> new Gson().fromJson(text, MyData.class)));
WriteResult writeResult = input
.apply("WriteToBigQuery", BigQueryIO.<MyData>write()
.to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
#Override
public TableDestination apply(ValueInSingleWindow<MyData> input) {
MyData myData = input.getValue();
return new TableDestination(myData.table_id, null);
}
})
.withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
add(new TableFieldSchema().setName("table_id").setType("STRING"));
}}))
.withFormatFunction(new SerializableFunction<MyData, TableRow>() {
#Override
public TableRow apply(MyData myData) {
return new TableRow().set("table_id", myData.table_id);
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));
writeResult.getFailedInserts()
.apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
LOG.info(row.get("table_id").toString());
}
}));
p.run();
}
}

There is no easy way to catch exceptions when writing to output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery. However, there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example, you have the failed insert retry policy set to never retry. You can set the policy to always retry. This is only effective during something like an intermittent network failure (4th bullet point).
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition configuration should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue to the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.

Related

Streaming objects from S3 Object using Spring Aws Integration

I am working on a usecase where I am supposed to poll S3 -> read the stream for the content -> do some processing and upload it to another bucket rather than writing the file in my server.
I know I can achieve it using S3StreamingMessageSource in Spring aws integration but the problem I am facing is that I do not know on how to process the message stream received by polling
public class S3PollerConfigurationUsingStreaming {
#Value("${amazonProperties.bucketName}")
private String bucketName;
#Value("${amazonProperties.newBucket}")
private String newBucket;
#Autowired
private AmazonClientService amazonClient;
#Bean
#InboundChannelAdapter(value = "s3Channel", poller = #Poller(fixedDelay = "100"))
public MessageSource<InputStream> s3InboundStreamingMessageSource() {
S3StreamingMessageSource messageSource = new S3StreamingMessageSource(template());
messageSource.setRemoteDirectory(bucketName);
messageSource.setFilter(new S3PersistentAcceptOnceFileListFilter(new SimpleMetadataStore(),
"streaming"));
return messageSource;
}
#Bean
#Transformer(inputChannel = "s3Channel", outputChannel = "data")
public org.springframework.integration.transformer.Transformer transformer() {
return new StreamTransformer();
}
#Bean
public S3RemoteFileTemplate template() {
return new S3RemoteFileTemplate(new S3SessionFactory(amazonClient.getS3Client()));
}
#Bean
public PollableChannel s3Channel() {
return new QueueChannel();
}
#Bean
IntegrationFlow fileStreamingFlow() {
return IntegrationFlows
.from(s3InboundStreamingMessageSource(),
e -> e.poller(p -> p.fixedDelay(30, TimeUnit.SECONDS)))
.handle(streamFile())
.get();
}
}
Can someone please help me with the code to process the stream ?
Not sure what is your problem, but I see that you have a mix of concerns. If you use messaging annotations (see #InboundChannelAdapter in your config), what is the point to use the same s3InboundStreamingMessageSource in the IntegrationFlow definition?
Anyway it looks like you have already explored for yourself a StreamTransformer. This one has a charset property to convert your InputStreamfrom the remote S3 resource to the String. Otherwise it returns a byte[]. Everything else is up to you what and how to do with this converted content.
Also I don't see reason to have an s3Channel as a QueueChannel, since the start of your flow is pollable anyway by the #InboundChannelAdapter.
From big height I would say we have more questions to you, than vise versa...
UPDATE
Not clear what is your idea for InputStream processing, but that is really a fact that after S3StreamingMessageSource you are going to have exactly InputStream as a payload in the next handler.
Also not sure what is your streamFile(), but it must really expect InputStream as an input from the payload of the request message.
You also can use the mentioned StreamTransformer over there:
#Bean
IntegrationFlow fileStreamingFlow() {
return IntegrationFlows
.from(s3InboundStreamingMessageSource(),
e -> e.poller(p -> p.fixedDelay(30, TimeUnit.SECONDS)))
.transform(Transformers.fromStream("UTF-8"))
.get();
}
And the next .handle() will be ready for String as a payload.

Custom Liquibase executor combining JdbcExecutor and LoggingExecutor

I'm looking for a way to record and write all those SQL statements to an output
file which get executed while running a Liquibase migration against an empty
target database.
The idea behind this is to speed up the initialization phase of integration tests
against a test database by simply reading and executing the SQL statements from the
generated file for subsequent tests.
I had no luck using updateSQL due to different handling of changesets with
pre-conditions (e.g. changeSetExecuted resolves to true for "update" but false for
"updateSQL").
Another approach was to run the Liquibase migration first, then writing a temporary
changelog file using GenerateChangeLogCommand which is finally used by another Liquibase
instance to produce an SQL update file.
While this approach works, it a) feels a bit hacky-ish, b) the end result is not the same
as running the migration directly.
Anyway, what I've come up with is a custom implementation of JdbcExecutor which incorporates
a LoggingExecutor. The implementation looks as follows:
#LiquibaseService(skip = true)
public class LoggingJdbcExecutor extends JdbcExecutor {
private LoggingExecutor loggingExecutor;
public LoggingJdbcExecutor(Database database, Writer writer) {
loggingExecutor = new LoggingExecutor(this, writer, database);
setDatabase(database);
}
#Override
public void execute(SqlStatement sql, List<SqlVisitor> sqlVisitors) throws DatabaseException {
super.execute(sql, sqlVisitors);
loggingExecutor.execute(sql, sqlVisitors);
}
#Override
public int update(SqlStatement sql, List<SqlVisitor> sqlVisitors) throws DatabaseException {
final int result = super.update(sql, sqlVisitors);
loggingExecutor.update(sql, sqlVisitors);
return result;
}
#Override
public void comment(String message) throws DatabaseException {
super.comment(message);
loggingExecutor.comment(message);
}
}
This executor gets injected into Liquibase before update() is invoked as follows:
final String path = configuration.getUpdateSqlExportFile();
ExecutorService.getInstance().setExecutor(liquibase.getDatabase(), new LoggingJdbcExecutor(
liquibase.getDatabase(), new FileWriter(path)
));
Is this approach reasonable and future proof ? While it seems to work I'm not sure if maybe I'm
missing something and there's a better way.
Thanks

HBase secondary index with observer coprocessor, .put on the index table results in recursion

In HBase database I want to create a secondary index by using additional "linking" table. I have followed the example given in this answer: Create secondary index using coprocesor HBase
I am not very well familiar with the entire concept of HBase, and I had read some examples on the issue of creating secondary indexes. I am attaching the coprocessor to single table only, like this:
disable 'Entry2'
alter 'Entry2', METHOD => 'table_att', 'COPROCESSOR' => '/home/user/hbase/rootdir/hcoprocessors.jar|com.acme.hobservers.EntryParentIndex||'
enable 'Entry2'
The source code of it, is as follows:
public class EntryParentIndex extends BaseRegionObserver{
private static final Log LOG = LogFactory.getLog(CoprocessorHost.class);
private HTablePool pool = null;
private final static String INDEX_TABLE = "EntryParentIndex";
private final static String SOURCE_TABLE = "Entry2";
#Override
public void start(CoprocessorEnvironment env) throws IOException {
pool = new HTablePool(env.getConfiguration(), 10);
}
#Override
public void prePut(
final ObserverContext<RegionCoprocessorEnvironment> observerContext,
final Put put,
final WALEdit edit,
final boolean writeToWAL)
throws IOException {
try {
final List<KeyValue> filteredList = put.get(Bytes.toBytes ("data"),Bytes.toBytes("parentId"));
byte [] id = put.getRow(); //Get the Entry ID
KeyValue kv=filteredList.get( 0 ); //get Entry PARENT_ID
byte[] parentId=kv.getValue();
HTableInterface htbl = pool.getTable(Bytes.toBytes(INDEX_TABLE));
//create row key for the index table
byte[] p1=concatTwoByteArrays(parentId, ":".getBytes()); //Insert a semicolon between two UUIDs
byte[] rowkey=concatTwoByteArrays(p1, id);
Put indexput = new Put(rowkey);
//The following call is setting up a strange? recursion, resulting
//...in thesame prePut method invoken again and again. Interestingly
//...the recursion limits itself up to 6 times. The actual row does not
//...get inserted into the INDEX_TABLE
htbl.put(indexput);
htbl.close();
}
catch ( IllegalArgumentException ex) { }
}
#Override
public void stop(CoprocessorEnvironment env) throws IOException {
pool.close();
}
public static final byte[] concatTwoByteArrays(byte[] first, byte[] second) {
byte[] result = Arrays.copyOf(first, first.length + second.length);
System.arraycopy(second, 0, result, first.length, second.length);
return result;
}
}
This executes when I perform put on the SOURCE_TABLE.
There is a comment in the code (please seek it): "The following call is setting up a strange".
I set a debugging print in the log confirming that the prePut method is being executed only on the SOURCE_TABLE, and never on the INDEX_TABLE. Yet I don't understand why this strange recursion is happening despite in the coprocessor I only execute one put on the INDEX_TABLE.
I have also confirmed that the put action on the source table is again only one.
I have fixed my problem. It came out to be that I was adding multiple times the same observer mistakenly thinking that it is getting lost after Hbase restart.
Also the reason why the .put call to the INDEX_TABLE was not working is because I did not set a value to it, but only a rowkey, mistakenly thinking that this is possible. Yet HBase did not throw any excepiton whatsoever, just did not perform the PUT, no info given, which may be confusing for newcommers to this technology.

Can the KnowledgeAgent be used to automatically write the KnowledgeBase to a file so it can be used externally?

i'm working at a little drools project and i have following problem:
- when i read the knowledgepackages from drools via the knowledgeAgent it takes a long time to load((now i know that building the knowledgeBase in general and especially when loading packages from guvnor is very intense ))
so I'm trying to serialize the KnowledgeBase to a file which is located locally on the system - on the one hand because loading the kBase from a local file is much much faster - and for the other so that i can use the KnowledgeBase for other applications The Problem with this is, that while using the KnowledgeAgent to load the KnowledgeBase the first time, the base will be updated by the Agent automatically
BUT: whilst the Base is updated, my local file will not be updated too
So I'm wondering how to handle/get the changeNotification from my KnowledgeAgent so i can call a method to serialize my KnowledgeBase ?
Is this somehow possible? basically i just want to update my local knowledgeBase file, everytime someone edits a rule in governor, so that my local file is always up to date.
If it isn't possible, or a really bad solution to begin with, what is the recommended / best way to go about it?
Please endure my english and the question itself, if you cant really make out what i want to accomplish or if my request is actually not a good solution or the question itself is redundant, im rather new to java and a total noob when it comes to drools.
Down below is the code:
public class DroolsConnection {
private static KnowledgeAgent kAgent;
private static KnowledgeBase kAgentBase;
public DroolsConnection(){
ResourceFactory.getResourceChangeNotifierService().start();
ResourceFactory.getResourceChangeScannerService() .start();
}
public KnowledgeBase readKnowledgeBase( ) throws Exception {
kAgent = KnowledgeAgentFactory.newKnowledgeAgent("guvnorAgent");
kAgent .applyChangeSet( ResourceFactory.newFileResource(CHANGESET_PATH));
kAgent.monitorResourceChangeEvents(true);
kAgentBase = kAgent.getKnowledgeBase();
serializeKnowledgeBase(kAgentBase);
return kAgentBase;
}
public List<EvaluationObject> runAgainstRules( List<EvaluationObject> objectsToEvaluate,
KnowledgeBase kBase ) throws Exception{
StatefulKnowledgeSession knowSession = kBase.newStatefulKnowledgeSession();
KnowledgeRuntimeLogger knowLogger = KnowledgeRuntimeLoggerFactory.newFileLogger(knowSession, "logger");
for ( EvaluationObject o : objectsToEvaluate ){
knowSession.insert( o );
}
knowSession.fireAllRules();
knowLogger .close();
knowSession.dispose();
return objectsToEvaluate;
}
public KnowledgeBase serializeKnowledgeBase(KnowledgeBase kBase) throws IOException{
OutputStream outStream = new FileOutputStream( SERIALIZE_BASE_PATH );
ObjectOutputStream oos = new ObjectOutputStream( outStream );
oos.writeObject ( kBase );
oos.close();
return kBase;
}
public KnowledgeBase loadFromSerializedKnowledgeBase() throws Exception {
KnowledgeBase kBase = KnowledgeBaseFactory.newKnowledgeBase();
InputStream is = new FileInputStream( SERIALIZE_BASE_PATH );
ObjectInputStream ois = new ObjectInputStream( is );
kBase = (KnowledgeBase) ois.readObject();
ois.close();
return kBase;
}
}
thanks for your help in advance!
best regards,
Marenko
In order to keep your local kbase updated you could use a KnowledgeAgentEventListener to know when its internal kbase gets updated:
kagent.addEventListener( new KnowledgeAgentEventListener() {
public void beforeChangeSetApplied(BeforeChangeSetAppliedEvent event) {
}
public synchronized void afterChangeSetApplied(AfterChangeSetAppliedEvent event) {
}
public void beforeChangeSetProcessed(BeforeChangeSetProcessedEvent event) {
}
public void afterChangeSetProcessed(AfterChangeSetProcessedEvent event) {
}
public void beforeResourceProcessed(BeforeResourceProcessedEvent event) {
}
public void afterResourceProcessed(AfterResourceProcessedEvent event) {
}
public void knowledgeBaseUpdated(KnowledgeBaseUpdatedEvent event) {
//THIS IS THE EVENT YOU ARE INTERESTED IN
}
public void resourceCompilationFailed(ResourceCompilationFailedEvent event) {
}
} );
You still need to handle concurrently accesses on your local kbase though.
By the way, since you are not using 'newInstance' configuration option, the agent will create a new instance of a kbase each time a change-set is applied. So, make sure you serialize the kagent's internal kbase (kagent.getKnowledgeBase()) instead of the reference you have in your app.
Hope it helps,

NHibernate database versioning: object level schema and data upgrades

I would like to approach database versioning and automated upgrades in NHibernate from a different direction than most of the strategies proposed out there.
As each object is defined by an XML mapping, I would like to take size and checksum for each mapping file/ configuration and store that in a document database (raven or something) along with a potential custom update script. If no script is found, use the NHibernate DDL generator to update the object schema. This way I can detect changes, and if I need to make DML changes in addition to DDL, or perform a carefully ordered transformation, I can theoretically do so in a controlled, testable manner. This should also maintain a certain level of persistence-layer agnosticism, although I'd imagine the scripts would still necessarily be database system-specific.
The trick would be, generating the "old" mapping files from the database and comparing them to the current mapping files. I don't know if this is possible. I also don't know if I'm missing anything else that would make this strategy prohibitively impractical.
My question, then: how practical is this strategy, and why?
what i did to solve just that problem
version the database in a table called SchemaVersion
query the table to see if schema is up to date (required version stored in DAL), if yes goto 6.
get updatescript with version == versionFromBb from resources/webservices/...
run the script which also alters the schemaversion to the new version
goto 2.
run app
to generate the scripts i have used 2 options
support one rdbms: run SchemaUpdate to export into file and add DML statements manually
support multiple rdbms: use Nhibernate class Table to generate at runtime ddl to add/alter/delete tables and code which uses a session DML
Update:
"what method did you use to store the current version"
small example
something like this
public static class Constants
{
public static readonly Version DatabaseSchemaVersion = new Version(1, 2, 3, 4);
}
public class DBMigration
{
private IDictionary<Version, Action> _updates = new Dictionary<Version, Action>();
private Configuration _config;
private Dialect _dialect;
private IList<Action<ISession>> _actions = new List<Action<ISession>>(16);
private string _defaultCatalog;
private string _defaultSchema;
private void CreateTable(string name, Action<Table> configuretable)
{
var table = new Table(name);
configuretable(table);
string createTable = table.SqlCreateString(_dialect, _config.BuildMapping(), _defaultCatalog, _defaultSchema);
_actions.Add(session => session.CreateSQLQuery(createTable).ExecuteUpdate());
}
private void UpdateVersionTo(Version version)
{
_actions.Add(session => { session.Get<SchemaVersion>(1).Value = version; session.Flush(); });
}
private void WithSession(Action<session> action)
{
_actions.Add(action);
}
public void Execute(Configuration config)
{
_actions.Clear();
_defaultCatalog = config.Properties[NH.Environment.DefaultCatalog];
_defaultSchema = config.Properties[NH.Environment.DefaultSchema];
_config = config;
_dialect = Dialect.GetDialect(config.Properties);
using (var sf = _config.BuildSessionFactory())
using (var session = sf.OpenSession())
using (var tx = session.BeginTransaction())
{
Version dbVersion = session.Get<SchemaVersion>(1).Value;
while (dbVersion < Constants.DatabaseSchemaVersion)
{
_actions.Clear();
_updates[dbVersion].Invoke(); // init migration, TODO: error handling
foreach (var action in _actions)
{
action.Invoke(session);
}
tx.Commit();
session.Clear();
dbVersion = session.Get<SchemaVersion>(1).Value;
}
}
}
public DBMigration()
{
_updates.Add(new Version(1, 0, 0, 0), UpdateFromVersion1);
_updates.Add(new Version(1, 0, 1, 0), UpdateFromVersion1);
...
}
private void UpdateFromVersion1()
{
AddTable("Users", table => table.AddColumn(...));
WithSession(session => session.CreateSqlQuery("INSERT INTO ..."));
UpdateVersionTo(new Version(1,0,1,0));
}
...
}