When running a SPARQL UPDATE query (DELETE/INSERT), the following message appears in the GraphDB Workbench:
Cannot retrieve Value with ID of 133705167
This happened after a large update query finished successfully.
Any ideas as to why this message occurred?
The GraphDB database engine represents RDF data in two parallel collections: the entity pool and the statement collections. The entity pool is a dictionary-like structure mapping RDF values to internal identifiers. The statement collections are the actual indexes storing the subject-predicate-object-context data, organised as a paged structure.
The exception Cannot retrieve Value with ID of 133705167 indicates that the statement collections contain an internal identifier whose value cannot be retrieved from the entity pool. A few possible scenarios can lead to such an inconsistency:
The database collections were manually edited or copied while the database was running
You hit an unknown database bug
I recommend that you first scan your repository image with the storage tool; it will detect and report all data inconsistencies. Once you have determined the full scope of the issue, you can recover your repository content by dumping it to an RDF file with the export function. All internal IDs without an RDF value will be skipped.
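For the export step, a minimal sketch over the RDF4J repository API (which GraphDB repositories expose) might look like the following; the repository URL and output file name are placeholders:

import java.io.FileOutputStream;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

// Dump all statements of the repository to a Turtle file.
HTTPRepository repo = new HTTPRepository("http://localhost:7200/repositories/myrepo");  // placeholder URL
repo.init();  // RDF4J 3+; older releases use initialize()
try (RepositoryConnection conn = repo.getConnection();
     FileOutputStream out = new FileOutputStream("repo-dump.ttl")) {
    conn.export(Rio.createWriter(RDFFormat.TURTLE, out));
}
repo.shutDown();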
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to maintain low costs:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp+json columns, I would add one partitioning column and up to four clustering columns as well (BigQuery allows at most four clustering columns). You could eventually even use yearly suffixed tables. This way you have several dimensions to scan only a limited number of rows for rematerialization.
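For illustration, a minimal sketch of creating such a partitioned and clustered table with the BigQuery Java client; the dataset, table, and column names are placeholders:

import java.util.Arrays;
import com.google.cloud.bigquery.*;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

// Placeholder schema: a timestamp, two clustering candidates, and the raw JSON payload as a string.
Schema schema = Schema.of(
    Field.of("event_ts", StandardSQLTypeName.TIMESTAMP),
    Field.of("event_type", StandardSQLTypeName.STRING),
    Field.of("user_id", StandardSQLTypeName.STRING),
    Field.of("payload", StandardSQLTypeName.STRING));

StandardTableDefinition definition = StandardTableDefinition.newBuilder()
    .setSchema(schema)
    .setTimePartitioning(TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
        .setField("event_ts")                                // partition by the timestamp column
        .build())
    .setClustering(Clustering.newBuilder()
        .setFields(Arrays.asList("event_type", "user_id"))   // cluster on the most-filtered columns
        .build())
    .build();

bigquery.create(TableInfo.of(TableId.of("my_dataset", "events"), definition));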
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events either to Dataflow or Pub/Sub, then process them there and write to BigQuery with a new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, i.e. you can rewrite the same table with a query that selects only the columns you want to keep. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter the columns that you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). With Dynamic Destinations you can also specify the schema on the fly based on the fields in your JSON (a sketch follows below).
Get the failed insert records from the Dynamic Destinations write and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read that file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
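For what it's worth, a minimal sketch of the Dynamic Destinations write step in Beam's Java SDK; the events collection, the my_project:my_dataset prefix, the event_type field, and the schemaFor(...) helper are all placeholders for whatever your pipeline actually uses:

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// "events" is assumed to be the PCollection<TableRow> produced by the flatten/filter step above.
events.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to(new DynamicDestinations<TableRow, String>() {
            @Override
            public String getDestination(ValueInSingleWindow<TableRow> element) {
                // route each row by its event type
                return (String) element.getValue().get("event_type");
            }
            @Override
            public TableDestination getTable(String eventType) {
                return new TableDestination("my_project:my_dataset." + eventType, null);
            }
            @Override
            public TableSchema getSchema(String eventType) {
                return schemaFor(eventType);  // hypothetical lookup of the schema you maintain per event type
            }
        })
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));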
I am testing out NiFi to replace our current ingestion setup, which imports data from multiple MySQL shards of a table and stores it in HDFS.
I am using GenerateTableFetch and ExecuteSQL to achieve this.
Each incoming flow file will have a database.name attribute which is being used by DBCPConnectionPoolLookup to select the relevant shard.
The issue is: let's say I have 2 shards to pull data from, shard_1 and shard_2, for table accounts, and I also have updated_at as the Maximum Value Column; it is not storing state for table#updated_at per shard. There is only 1 entry per table in state.
When I check in Data Provenance, I see the shard_2 flowfile getting dropped without being passed to ExecuteSQL. My guess is that this happens because the shard_1 query gets executed first, and then when the shard_2 query comes, its records are checked against shard_1's updated_at; since that returns empty, it drops the file.
Has anyone faced this issue? Or am I missing something?
The ability to choose different databases via DBCPConnectionPoolLookup was added after the scheme to store state in the database fetch processors (e.g. QueryDatabaseTable, GenerateTableFetch). Also, getting the database name differs between RDBMS drivers; it might be in the DatabaseMetaData or ResultSetMetaData, possibly via getCatalog() or getSchema(), or neither.
I have written NIFI-5590 to cover this improvement.
CacheLoader: use case
One of the main use cases for GemFire is as a fast-running cache that holds the most recent data (for example, the last month), while all remaining data sits in a back-end database. I mean that GemFire data which is one month old is overflowed to the database after one month.
However, when a user is looking for data that is older than one month, we need to go to the database and get it.
A CacheLoader is suitable for doing this operation on cache misses, fetching the data from the database. Regarding the CacheLoader, I believe a cache miss is triggered only when we do a get operation on a key and that key is missing.
What I do not understand is this: when the data gets overflowed to the back-end, I believe no reference to it exists in GemFire any more. Also, a user may not know the key; instead of doing a get operation on the key, he might need to execute an OQL query on fields other than the key.
How will a cache miss be triggered when I don't know the key?
Then how does the CacheLoader fit into the overall solution?
Geode will not invoke a CacheLoader during a Query operation.
From the Geode documentation:
The loader is called on cache misses during get operations, and it populates the cache with the new entry value in addition to returning the value to the calling thread.
(Emphasis is my own)
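For context, a CacheLoader is just a callback registered on the region; a minimal sketch, assuming a hypothetical Order value type and OrderDao for the back-end database, could look like this:

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

public class OrderLoader implements CacheLoader<String, Order> {
    private final OrderDao orderDao;   // hypothetical DAO over the back-end database

    public OrderLoader(OrderDao orderDao) {
        this.orderDao = orderDao;
    }

    @Override
    public Order load(LoaderHelper<String, Order> helper) throws CacheLoaderException {
        // Invoked only on a cache miss during a get on the region, never during an OQL query.
        return orderDao.findById(helper.getKey());   // returning null means "not found" to the caller
    }

    @Override
    public void close() {
        // release database resources if necessary
    }
}

The loader is attached to the region (for example via RegionFactory.setCacheLoader), so it only fires for Region.get misses; a query on non-key fields has to reach the back-end database through some other path.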
So, I have read that using internal tables increases the performance of the program and that we should perform as few operations on DB tables as possible. But I have started working on a project that does not use internal tables at all.
Some details:
It is a scanner that adds or removes products in/from a store. First the primary key is checked (to see if that type of product exists) and then the product is added or removed. We use INSERT INTO and DELETE FROM to add/remove the products directly in the DB table.
I have not asked why they do not use internal tables because I do not have a better solution so far.
Here’s what I have so far: insert all products into an internal table and place the deleted products in another internal table.
FORM update.
  MODIFY zop_db_table FROM TABLE gt_table.                       " add all new products
  LOOP AT gt_deleted INTO gs_deleted.
    DELETE FROM zop_db_table WHERE index_nr = gs_deleted-index_nr.
  ENDLOOP.                                                       " delete products
ENDFORM.
But when can I perform this update?
I could set a ‘Save’ button to perform the update, but then there is the risk that the user forgets to save large amounts of data, or drops the scanner and shuts it down, or similar situations. So this is clearly not a good solution.
My final question is: Is there a (good) way to implement internal tables in a project like this?
Internal tables should be used for data processing, like lists or arrays in other languages (C#, Java, ...). From a performance and system-load perspective it is preferable to first load all the data you need into an internal table and then process that internal table, instead of loading individual records from the database.
But that is mostly true for reporting, which is probably the most common type of custom ABAP program. You often see developers use SELECT ... ENDSELECT statements, which in effect loop over a database table, transferring row after row to the report, one at a time. That is extremely slow compared to reading all records at once into an itab and then looping over the itab. More than once I've cut the execution time of a report down to a fraction just by eliminating roundtrips to the database.
If you have a good reason to read from the database or update records immediately, you should do so. If you can safely delay updates and deletes to a point in time where you can process all of them together, without risking inconsistencies, I'd consider that an improvement. But if there is a good reason (like consistency or data loss) to update immediately, do it.
Update: as @vwegert mentioned regarding the SELECT ... ENDSELECT statement, it doesn't actually create an individual database query for each row. The database interface of the application server optimizes the query, transferring rows in bulk to the application server. From there the records are transported to the ABAP report one by one (because the report only has a work area to store a single row), which has a significant performance impact, especially for queries with large result sets. A SELECT ... INTO TABLE can transport all rows directly to the ABAP report (as long as there is enough memory to hold them), since the internal table can hold all those records in the report.
We are evaluating Avro vs. Thrift for storage. At this point Avro seems to be our choice; however, the documentation states that the schema is stored alongside the data when serialized. Is there a way to avoid this? Since we are in charge of both producing and consuming the data, we want to see if we can avoid serializing the schema. Also, is the serialized data with the schema much larger than just the data without the schema?
A little late to the party, but you don't actually need to store the actual schema with each and every record. You do, however, need a way to get back to the original schema from each record's serialized format.
Thus, you could use a schema store plus a custom serializer that writes the Avro record content and the schema ID. On read, you read back that schema ID, retrieve the schema from the schema store, and then use that schema to rehydrate the record content. Bonus points for using a local cache if your schema store is remote.
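A minimal sketch of that idea using Avro's generic API; SchemaStore here is a stand-in for whatever registry you use, and the 4-byte id prefix is just one possible framing:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

class SchemaAwareSerializer {

    interface SchemaStore {                 // stand-in for your schema registry
        Schema findById(int schemaId);
    }

    byte[] serialize(int schemaId, GenericRecord record, Schema schema) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array());      // 4-byte schema id header
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();                                        // id + schema-less Avro body
    }

    GenericRecord deserialize(byte[] bytes, SchemaStore store) throws Exception {
        int schemaId = ByteBuffer.wrap(bytes, 0, 4).getInt();            // read the id back
        Schema schema = store.findById(schemaId);                        // fetch (and ideally cache) the schema
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, 4, bytes.length - 4, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}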
This is exactly the approach that Oracle's NoSQL DB takes to managing schemas in a storage-efficient manner (it's also available under the AGPL license).
Full disclosure: I am not currently and have never been employed by Oracle or Sun, nor have I worked on the above store. Just came across it recently :)
I'm pretty sure you will always need the schema to be stored with the data. This is because Avro will use it when reading and writing to the .avro file.
According to http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html:
You apply a schema to the value portion of an Oracle NoSQL Database record using Avro bindings. These bindings are used to serialize values before writing them, and to deserialize values after reading them. The usage of these bindings requires your applications to use the Avro data format, which means that each stored value is associated with a schema.
As far as the size difference goes, you only have to store the schema once, so in the grand scheme of things it doesn't make that much of a difference. My schema takes up 105.5 KB (and that is a really large schema; yours shouldn't be that large) and each serialized value takes up 3.3 KB. I'm not sure what the difference would be for just the raw JSON of the data, but according to that link I posted:
Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size.
But I believe that may just be for single, simple values.
This is on HDFS for me btw.
Thanks JGibel. Our data will eventually end up in HDFS, and the object container file format does ensure that the schema is only written once, as a header on the file.
For uses other than HDFS, I was under the wrong impression that the schema would be attached to every encoded record, but that's not the case: you need the schema to deserialize the data, but the serialized data does not have to have the schema string attached to it.
E.g.
// Serialize a single record to raw Avro binary -- no schema is written to the output.
DatumWriter<TransactionInfo> eventDatumWriter = new SpecificDatumWriter<TransactionInfo>(TransactionInfo.class);
TransactionInfo t1 = getTransaction();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
eventDatumWriter.setSchema(t1.getSchema());
eventDatumWriter.write(t1, encoder);
encoder.flush();
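For completeness, a minimal sketch of the corresponding read-back (the reader supplies the schema itself, here via the generated TransactionInfo class, so nothing schema-related needs to travel with the bytes):

DatumReader<TransactionInfo> eventDatumReader = new SpecificDatumReader<TransactionInfo>(TransactionInfo.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(baos.toByteArray(), null);
TransactionInfo t2 = eventDatumReader.read(null, decoder);  // rehydrated record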