RavenDB: Document Refresh feature does not run at or after the time specified by the @refresh flag

I need to mark documents as expired after some time, and therefore I am trying to use the @refresh feature to re-run a subscription and compute my 'expired' flag. I know there is the 'Document expiration' feature, but that one deletes the data, which I don't want.
I have turned on the Refresh feature in the database settings and added an @refresh UTC datetime to the metadata of the required documents. For example, I manually added this document:
{
    "Name": "My data",
    "@metadata": {
        "@collection": "Testing",
        "@refresh": "2021-04-30T07:41:35.4845961Z"
    }
}
It looks like I am facing non-deterministic behavior: sometimes the refresh is processed, sometimes not. I tried different combinations of times, set both through code and via Raven Studio.
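For reference, this is roughly how I set it through code (a minimal sketch against the RavenDB.Client session API; the TestingDoc class, URL, database name and times are illustrative):
// Minimal sketch: store a document and ask RavenDB to refresh it one minute from now.
// Requires: using System; using Raven.Client.Documents;
using (var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },
    Database = "RavendbProject"
}.Initialize())
using (var session = store.OpenSession())
{
    var doc = new TestingDoc { Name = "My data" };
    session.Store(doc);

    var metadata = session.Advanced.GetMetadataFor(doc);
    metadata["@refresh"] = DateTime.UtcNow.AddMinutes(1).ToString("O"); // ISO 8601, UTC

    session.SaveChanges();
}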
The refresh interval is configured, yet the Studio still shows the document as due to refresh "in less than a minute" and nothing happens.
I am using:
a Community license (Document Refresh is not mentioned here, but I don't see it mentioned for any other license either)
community license extensions
tried several versions of RavenDB with the same result (5.1.7 looked more promising as it worked for some time, but after a while it stopped):
4.2.111 server/studio version in Docker on Windows 10
5.1.7 server/studio version
C# RavenDB.Client 5.1.6
I did not find a related issue in the bug tracker:
https://issues.hibernatingrhinos.com/issues/RavenDB?q=document%20refresh
Any ideas what to check or what might be the cause?
EDIT: After turning on logging to the console I found an error log entry. It looks like this:
RavendbProject, Raven.Server.Documents.Expiration.ExpiredDocumentsCleaner, Failed to refresh documents on RavendbProject which are older than 05/17/2021 09:48:47, EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object.
RavendbProject | at Sparrow.Server.ByteStringContext`1.From(String value, ByteStringType type, ByteString& str) in C:\Builds\RavenDB-Stable-5.1\51024\src\Sparrow.Server\ByteString.cs:line 1297
RavendbProject | at Raven.Server.Documents.DocumentPutAction.PutDocument(DocumentsOperationContext context, String id, String expectedChangeVector, BlittableJsonReaderObject document, Nullable`1 lastModifiedTicks, String changeVector, DocumentFlags flags, NonPersistentDocumentFlags nonPersistentFlags) in C:\Builds\RavenDB-Stable-5.1\51024\src\Raven.Server\Documents\DocumentPutAction.cs:line 190
Also worth mentioning: my document was stored in a cluster-wide transaction, so I can see the corresponding flag in that document's metadata:
"#flags": "FromClusterTransaction",
My current suspicion is that one of these documents prevented the other documents from being refreshed. After deleting the cluster-transaction document, the other documents in the collection were refreshed.

The bug is related to documents that were added via a cluster-wide transaction; the workaround for now is to not use cluster transactions for these documents, as sketched below.
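A minimal sketch of the workaround, assuming these documents don't strictly need cluster-wide guarantees (SessionOptions and TransactionMode come from the 5.x client; TestingDoc is a placeholder):
// Workaround sketch: write the document through a regular (single-node) session rather than
// a cluster-wide transaction, so it does not get the FromClusterTransaction flag.
// Requires: using Raven.Client.Documents.Session;
using (var session = store.OpenSession(new SessionOptions
{
    // SingleNode is the default; it is spelled out here only to contrast with
    // TransactionMode.ClusterWide, which triggers the refresh bug described above.
    TransactionMode = TransactionMode.SingleNode
}))
{
    session.Store(new TestingDoc { Name = "My data" });
    session.SaveChanges();
}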
I have opened an issue in the bug tracker:
https://issues.hibernatingrhinos.com/issue/RavenDB-16710

Related

Apache Beam Java 2.26.0: BigQueryIO 'No rows present in the request'

Since the Beam 2.26.0 update we ran into errors in our Java SDK streaming data pipelines. We have been investigating the issue for quite some time now but are unable to track down the root cause. When downgrading to 2.25.0 the pipeline works as expected.
Our pipelines are responsible for ingestion, i.e., consume from Pub/Sub and ingest into BigQuery. Specifically, we use the PubSubIO source and the BigQueryIO sink (streaming mode). When running the pipeline, we encounter the following error:
{
    "code" : 400,
    "errors" : [ {
        "domain" : "global",
        "message" : "No rows present in the request.",
        "reason" : "invalid"
    } ],
    "message" : "No rows present in the request.",
    "status" : "INVALID_ARGUMENT"
}
Our initial guess was that the pipeline's logic was somehow bugged, causing the BigQueryIO sink to fail. After investigation, we concluded that the PCollection feeding the sink does indeed contain correct data.
Earlier today I was looking in the changelog and noticed that the BigQueryIO sink received numerous updates. I was specifically worried about the following changes:
BigQuery’s DATETIME type now maps to Beam logical type org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME
Java BigQuery streaming inserts now have timeouts enabled by default. Pass --HTTPWriteTimeout=0 to revert to the old behavior
With respect to the first update, I made sure to remove all DATETIME values from the resulting TableRow objects. Even so, the error still stands.
For the second change, I'm unsure how to pass the --HTTPWriteTimeout=0 flag to the pipeline. How is this best achieved?
Any other suggestions as to the root cause of this issue?
Thanks in advance!
We have finally been able to fix this issue and rest assured it has been a hell of a ride. We basically debugged the entire BigQueryIO connector and came to the following conclusions:
The TableRow objects being forwarded to BigQuery used to contain enum values. Because these are not serializable, an empty payload was forwarded to BigQuery. In my opinion, this error should be made more explicit (and why was this suddenly changed anyway?).
The issue was solved by adding the @Value annotation (com.google.api.client.util.Value) to each enum entry.
The same TableRow object also contained values of type byte[]. These values were injected into a BigQuery column of the BYTES type. While this previously worked without explicitly computing a base64 encoding, it now yields errors.
The issue was solved by computing a base64 ourselves (this setup is also discussed in the following post).
--HTTPWriteTimeout is a pipeline option. You can set it the same way you set the runner, etc. (typically on the command line).
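For example, when launching the pipeline you would append it to the other arguments (the jar name and runner below are purely illustrative; only the --HTTPWriteTimeout flag comes from the changelog quoted above):
java -jar my-pipeline-bundle.jar --runner=DataflowRunner --HTTPWriteTimeout=0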

Issues pulling change log using python

I am trying to query and pull changelog details using python.
The below code returns the list of issues in the project.
issued = jira.search_issues('project= proj_a', maxResults=5)
for issue in issued:
    print(issue)
I am then trying to pass the issues obtained above:
issues = jira.issue(issue, expand='changelog')
changelog = issues.changelog
projects = jira.project(project)
I get the below error on trying the above:
JIRAError: JiraError HTTP 404 url: https://abc.atlassian.net/rest/api/2/issue/issue?expand=changelog
text: Issue does not exist or you do not have permission to see it.
Could anyone advise as to where I am going wrong or what permissions I need?
Please note: if I pass a specific issue_id in the above code it works just fine, but I am trying to pass a list of issue_ids.
You can already receive all the changelog data from the search_issues() method, so you don't have to fetch the changelog by iterating over the issues and making another API call for each one. Check out the code below for examples of how to work with the changelog.
issues = jira.search_issues('project= proj_a', maxResults=5, expand='changelog')
for issue in issues:
    print(f"Changes from issue: {issue.key} {issue.fields.summary}")
    print(f"Number of Changelog entries found: {issue.changelog.total}")  # number of changelog entries (careful, each entry can have multiple field changes)
    for history in issue.changelog.histories:
        print(f"Author: {history.author}")  # person who did the change
        print(f"Timestamp: {history.created}")  # when did the change happen?
        print("\nListing all items that changed:")
        for item in history.items:
            print(f"Field name: {item.field}")  # field to which the change happened
            print(f"Changed to: {item.toString}")  # new value, item.to might be better in some cases depending on your needs.
            print(f"Changed from: {item.fromString}")  # old value, item.from might be better in some cases depending on your needs.
        print()
    print()
Just to explain what you did wrong before when iterating over each issue: you have to use the issue.key, not the issue-resource itself. When you simply pass the issue, it won't be handled correctly as a parameter in jira.issue(). Instead, pass issue.key:
for issue in issues:
    print(issue.key)
    myIssue = jira.issue(issue.key, expand='changelog')

Can't replace mongo document

I am attempting to save documents to a MongoDB cluster (sharded replica sets) and am having a strange issue. I am using pymongo 2.7.2 and TokuMX 1.5 (MongoDB 2.4.10).
When I attempt to save (overwrite) existing documents I am getting an exception that looks like the document I am saving is too large:
doc = db.collection.find_one()
db.collection.save(doc)
pymongo.errors.OperationFailure: BSONObj size: 18798961 (0x71D91E01) is invalid. Size must be between 0 and 16793600(16MB) First element: op: "u"
However this works fine:
doc = db.collection.find_one()
db.collection.remove({'_id': doc['_id']})
db.collection.save(doc)
The document in question is about 9 MB, so it looks like when I attempt to replace the document, something is doubling its size and exceeding the 16 MB limit.
Any ideas as to what could cause this behavior?
Apparently this is a known issue with TokuMX. Oplog entries are twice the size of the document, so replacing a 9 MB document results in an 18 MB oplog entry, which raises the exception.
The solution would be to limit document writes to less than 8 MB so that oplog entries never exceed 16 MB.
I think this is a side effect of how save is implemented in PyMongo.
Under the hood, if the document has an _id then save(doc) is turned into update(doc, doc). That is where the doubling comes into play, since the query + update together are 18 MB.
When you removed the _id you changed the save(doc) into an insert(doc) of a new document with a new _id. I don't think that is what you wanted.
Rather than use save, I would recommend constructing a query with just the _id field from the original document and doing the update call manually. I would even go so far as to say you should file a Jira ticket to get PyMongo to do this for you.
HTH,
Rob.

ravendb remove multiple versions of the same document

I have a database with the versioning bundle turned on, so the documents were saved like:
user/1/revision/1, user/1/revision/2, etc.
But what I didn't expect is that a search returns all the versions of the same user, or of whatever other document I deal with.
I tried to restore this DB to a new database with the versioning bundle turned on and off, and I still get all the versions in the search results.
I run the search like this:
session.Query<Entity>().Search(x=>x.Name, query, options: SearchOptions.And, escapeQueryOptions: EscapeQueryOptions.AllowPostfixWildcard)
I'm not sure; maybe I should use some specific parameters to work with the latest document version only?
UPDATE:
What I did so far:
reinstalled RavenDB and installed it as a service (it was already running as a service, I just made sure I didn't break anything)
imported data from old database to new database
deleted all the indexes related to the entity
I still get all the revisions in my search results. Also, my Raven.Server.config doesn't have anything related to bundles. My Raven build is 2750, which seems to be the latest recommended production release.
UPDATE 2: When I try to import data into the new database from the old dump I get the following error:
Client side exception:
System.Exception: Server Error:
/bulk_docs
Raven.Abstractions.Exceptions.OperationVetoedException: PUT vetoed by Raven.Bundles.Versioning.Triggers.VersioningPutTrigger because: Modifying a historical revision is not allowed
at Raven.Database.DocumentDatabase.AssertPutOperationNotVetoed(String key, RavenJObject metadata, RavenJObject document, TransactionInformation transactionInformation)
at Raven.Database.DocumentDatabase.<>c__DisplayClass4b.b__43(IStorageActionsAccessor actions)
at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action)
at Raven.Database.DocumentDatabase.Put(String key, Etag etag, RavenJObject document, RavenJObject metadata, TransactionInformation transactionInformation)
at Raven.Database.Extensions.CommandExtensions.Execute(ICommandData self, DocumentDatabase database, BatchResult batchResult)
at Raven.Database.DocumentDatabase.ProcessBatch(IList`1 commands)
at Raven.Database.DocumentDatabase.<>c__DisplayClass10c.b__108(IStorageActionsAccessor actions)
at Raven.Storage.Esent.TransactionalStorage.ExecuteBatch(Action`1 action, EsentTransactionContext transactionContext)
at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action)
at Raven.Database.DocumentDatabase.Batch(IList`1 commands)
at Raven.Database.Server.Responders.DocumentBatch.Batch(IHttpContext context)
at Raven.Database.Server.HttpServer.DispatchRequest(IHttpContext ctx)
at Raven.Database.Server.HttpServer.HandleActualRequest(IHttpContext ctx)
Any ideas how to fix it?
Revisions shouldn't be indexed. As long as the versioning bundle is active on the database and the revision documents have the metadata key Raven-Document-Revision-Status with the value Historical, they should be ignored by all indexes.
Check that the bundle is active on that DB and that the metadata mentioned above exists.
This holds true for 2.0 and 2.5, and IIRC for 1.0 as well.
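If it helps to verify the second point, this is roughly how a revision's metadata can be inspected from the 2.x .NET client (a sketch; the revision id is illustrative and the API details may differ slightly between client versions):
// Sketch: load one of the revision documents and check for the Historical marker
// that tells the indexes to skip it.
using (var session = store.OpenSession())
{
    var revision = session.Load<Entity>("user/1/revision/1"); // illustrative revision id
    var metadata = session.Advanced.GetMetadataFor(revision);
    var status = metadata.Value<string>("Raven-Document-Revision-Status");
    Console.WriteLine(status); // expected to be "Historical" for revision documents
}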

RavenDB, RavenHQ and Appharbor - document size error with very first document

I have a completely empty RavenHQ database that's linked to my AppHarbor application. The amount of space the database is currently using is 1.1 MB out of an available 25 MB for my bronze account. The database previously had records in it, but I have deleted them using "delete collection" in the management studio.
The very first time I call session.Store(myobject), and BEFORE I call .SaveChanges(), I get the following error.
System.InvalidOperationException: Url: "/docs/Raven/Hilo/AccItems"
Raven.Database.Exceptions.OperationVetoedException: PUT vetoed by Raven.Bundles.Quotas.Triggers.DatabaseSizeQoutaForDocumetsPutTrigger because: Database size is 45,347 KB, which is over the allowed quota of 25,600 KB. No more documents are allowed in.
Now, the document is definitely not that big, so I don't know what this error can mean, especially as I don't think I've even hit the database at that point since I haven't closed the session by calling SaveChanges(). Any ideas? Here's the code itself.
XDocument doc = XDocument.Parse(rawXml);
var accItems = ExtractItemsFromFeed(doc);
using (IDocumentSession session = _store.OpenSession())
{
    var dbItems = session.Query<AccItem>().ToList();
    foreach (var item in accItems)
    {
        var existingRecord = dbItems.SingleOrDefault(x => x.Source == item.Source && x.SourceId == item.SourceId);
        if (existingRecord == null)
        {
            session.Store(item);
            _logger.Info("Saved new item {0}.", item.ShortName);
        }
        else
        {
            existingRecord.ShortName = item.ShortName;
            _logger.Info("Updated item {0}.", item.ShortName);
        }
        session.SaveChanges();
    }
}
Any other comments about the style of this code would be most welcome, as I was unsure of the best way to approach the "update existing item or create if it isn't there" scenario.
The answer here was as follows.
RavenHQ support found that the database was indeed oversized, but it seemed that the size reported in the Appharbor-branded RavenHQ control panel was incorrect. I had filled up the database way over the limit with a previous faulty version of the code posted above, so the error message I received was actually correct.
Fixing this problem without paying to upgrade the database wasn't straightforward, as it's not possible to shrink the database. As I also wasn't able to delete my single AppHarbor/RavenHQ database or create another one, that left me with the choice of creating an entirely new AppHarbor application or registering directly with RavenHQ for a new account. I chose the latter. The RavenHQ-branded control panel is slightly different from the AppHarbor one, in that it has the ability to create and delete databases.
So to summarize: there doesn't seem to be any benefit to using RavenHQ as an add-on to AppHarbor; you might as well go and get a proper free RavenHQ account.