Trimming down or capping max version history for Apache Jackrabbit JCR

In an application that uses Apache Jackrabbit as its JCR backend, there are some versionable nodes in the repository (mostly text content). It seems that the version history for those nodes just grows "in perpetuity".
What is the best/correct way to configure a maximum number of versions in the version history (per workspace?) for future history entries, AND to "trim down" existing version histories to an arbitrary maximum number of entries per history?

We just set a property with the limit on the number of versions we want to keep, then delete the excess.
Here's an untested idea of how you could do it...
public void deleteOldVersions(Session session, String absPath, int limit)
        throws RepositoryException {
    VersionManager versionManager = session.getWorkspace().getVersionManager();
    VersionHistory versionHistory = versionManager.getVersionHistory(absPath);
    VersionIterator versionIterator = versionHistory.getAllLinearVersions(); // oldest first, starting with the root version
    long numberToDelete = versionIterator.getSize() - limit;
    while (numberToDelete-- > 0) {
        String versionName = versionIterator.nextVersion().getName();
        if ("jcr:rootVersion".equals(versionName)) {
            continue; // the root version cannot be removed
        }
        versionHistory.removeVersion(versionName);
    }
}
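For context, a hypothetical caller might cap the history right after checking in a new version. The node path, property name and limit below are invented for the example, and, as noted, the whole approach is untested:
// Sketch only: assumes /content/article is an existing mix:versionable node.
VersionManager vm = session.getWorkspace().getVersionManager();
vm.checkout("/content/article");
session.getNode("/content/article").setProperty("text", "updated body");
session.save();
vm.checkin("/content/article");
// Keep only the 10 most recent versions of this node.
deleteOldVersions(session, "/content/article", 10);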

Batch read from DBs

I'm a bit confused about how Go's database/sql package reads large datasets into memory. In this previous Stack Overflow question - How to set fetch size in golang? - there seem to be conflicting ideas on whether batching of large datasets on read happens or not.
I am writing a Go binary that connects to different remote DBs based on the input params given, fetches results, and subsequently converts them to a CSV file. Suppose I have a query that returns a lot of rows, say 20 million. Loading this all at once into memory would be prohibitively expensive. Does the library batch the results automatically and only load the next batch into memory on rows.Next()?
If the database/sql package does not handle it, are there options in the various driver packages?
https://github.com/golang/go/issues/13067 - From this issue and discussion, I understand that the general idea is to have the driver packages handle this. As mentioned in the issue and also in this blog https://oralytics.com/2019/06/17/importance-of-setting-fetched-rows-size-for-database-query-using-golang/, I found out that Go's Oracle driver package has an option that I can pass for batching, but I am not able to find an equivalent in the other driver packages.
To summarize:
Does database/sql batch read results automatically?
If yes, then my 2nd and 3rd questions do not matter.
If no, are there options that I can pass to the various driver packages to set the batch size, and where can I find what these options are? I have already tried looking at the pgx docs and cannot find anything there that sets a batch size.
Is there any other way to batch reads, like a prepared statement with configuration specifying the batch size?
Some clarifications:
My question is: when a query returns a large dataset, is the entire dataset loaded into memory, or is it batched internally by some code that is called downstream of rows.Next()?
From what I can see, there is a chunk reader that gets created with a default 8 KB size and is used to chunk the data. Are there cases where this does not happen? Or are the results from the DB always chunked?
Is the 8 KB buffer size that the chunk reader uses configurable in any way?
For more clarity, I am adding the existing Java code; this is what I am looking to rewrite in Go.
private static final int RESULT_SIZE = 10000;

private void generate() {
    ... //connection and other code...
    Statement stmt = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(RESULT_SIZE);
    ResultSet resultset = stmt.executeQuery(dataQuery);
    String fileInHome = getFullFileName(filePath, manager, parentDir);
    rsToCSV(resultset, new BufferedWriter(new FileWriter(fileInHome)));
}

private void rsToCSV(ResultSet rs, BufferedWriter os) throws SQLException {
    ResultSetMetaData metaData = rs.getMetaData();
    int columnCount = metaData.getColumnCount();
    try (PrintWriter pw = new PrintWriter(os)) {
        readHeaders(metaData, columnCount, pw);
        if (rs.next()) {
            readRow(rs, metaData, columnCount, pw);
            while (rs.next()) {
                pw.println();
                readRow(rs, metaData, columnCount, pw);
            }
        }
    }
}
The stmt.setFetchSize(RESULT_SIZE) call sets the number of rows fetched from the database per round trip, and the result set is then processed row by row into a CSV.

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo.
How can I do this using the Apache Beam BigQueryIO API?
This is possible using a feature recently added to BigQueryIO in Apache Beam.
PCollection<Foo> foos = ...;
foos.apply(BigQueryIO.<Foo>write().to(new SerializableFunction<ValueInSingleWindow<Foo>, TableDestination>() {
    @Override
    public TableDestination apply(ValueInSingleWindow<Foo> value) {
        Foo foo = value.getValue();
        // Also available: value.getWindow(), getTimestamp(), getPane()
        String tableSpec = ...;
        String tableDescription = ...;
        return new TableDestination(tableSpec, tableDescription);
    }
}).withFormatFunction(new SerializableFunction<Foo, TableRow>() {
    @Override
    public TableRow apply(Foo foo) {
        return ...;
    }
}).withSchema(...));
Depending on whether the input PCollection<Foo> is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table depending on amount of data), or it will use the BigQuery streaming inserts API.
The most flexible version of the API uses DynamicDestinations, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.
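For illustration only, here is a rough sketch of what a DynamicDestinations-based write could look like; the Foo.getType() accessor, the table naming scheme and the one-column schema are all invented for the example:
foos.apply(BigQueryIO.<Foo>write()
    .to(new DynamicDestinations<Foo, String>() {
        @Override
        public String getDestination(ValueInSingleWindow<Foo> element) {
            // Route each element by some per-element key (hypothetical Foo.getType()).
            return element.getValue().getType();
        }
        @Override
        public TableDestination getTable(String type) {
            return new TableDestination(
                "my-project:my_dataset.foo_" + type, "Table for Foo type " + type);
        }
        @Override
        public TableSchema getSchema(String type) {
            // Each destination can have its own schema.
            return new TableSchema().setFields(Collections.singletonList(
                new TableFieldSchema().setName("value").setType("STRING")));
        }
    })
    .withFormatFunction(foo -> new TableRow().set("value", foo.toString())));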
Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can yourself combine to implement more complex use cases - see files in the source directory.
This feature will be included in the first stable release of Apache Beam and in the next release of the Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from GitHub.
As of Beam 2.12.0, this feature is available in the Python SDK as well. It is marked as experimental, so you will have to pass --experiments use_beam_bq_sink to enable it. You'd do something like so:
def get_table_name(element):
    if meets_some_condition(element):
        return 'mytablename1'
    else:
        return 'mytablename2'

p = beam.Pipeline(...)
my_input_pcoll = p | ReadInMyPCollection()
my_input_pcoll | beam.io.gcp.bigquery.WriteToBigQuery(table=get_table_name)
The new sink supports a number of other options, which you can review in the pydoc

App Folder files not visible after un-install / re-install

I noticed this in the debug environment where I have to do many re-installs in order to test persistent data storage, initial settings, etc... It may not be relevant in production, but I mention this anyway just to inform other developers.
Any files created by an app in its App Folder are not 'visible' to queries after manual un-install / re-install (from IDE, for instance). The same applies to the 'Encoded DriveID' - it is no longer valid.
It is probably 'by design' but it effectively creates 'orphans' in the app folder until manually cleaned via 'drive.google.com > Manage Apps > [yourapp] > Options > Delete hidden app data'. It also creates problems if an app relies on finding files by metadata, title, ... since these seem to be gone. As I said, not a production problem, but it can create some frustration during development.
Can any of the friendly Googlers confirm this? Is there any other way to get to these files after a re-install?
Try this approach:
Use requestSync() in onConnected() as:
@Override
public void onConnected(Bundle connectionHint) {
    super.onConnected(connectionHint);
    Drive.DriveApi.requestSync(getGoogleApiClient()).setResultCallback(syncCallback);
}
Then, in its callback, query the contents of the drive using:
final private ResultCallback<Status> syncCallback = new ResultCallback<Status>() {
    @Override
    public void onResult(@NonNull Status status) {
        if (!status.isSuccess()) {
            showMessage("Problem while retrieving results");
            return;
        }
        query = new Query.Builder()
                .addFilter(Filters.and(Filters.eq(SearchableField.TITLE, "title"),
                        Filters.eq(SearchableField.TRASHED, false)))
                .build();
        Drive.DriveApi.query(getGoogleApiClient(), query)
                .setResultCallback(metadataCallback);
    }
};
Then, in its callback, if found, retrieve the file using:
final private ResultCallback<DriveApi.MetadataBufferResult> metadataCallback =
        new ResultCallback<DriveApi.MetadataBufferResult>() {
            @SuppressLint("SetTextI18n")
            @Override
            public void onResult(@NonNull DriveApi.MetadataBufferResult result) {
                if (!result.getStatus().isSuccess()) {
                    showMessage("Problem while retrieving results");
                    return;
                }
                MetadataBuffer mdb = result.getMetadataBuffer();
                for (Metadata md : mdb) {
                    Date createdDate = md.getCreatedDate();
                    DriveId driveId = md.getDriveId();
                    readFromDrive(driveId); // read each match while its DriveId is in scope
                }
            }
        };
Job done!
Hope that helps!
It looks like Google Play services has a problem. (https://stackoverflow.com/a/26541831/2228408)
For testing, you can do it by clearing Google Play services data (Settings > Apps > Google Play services > Manage Space > Clear all data).
Or, at this time, you need to implement it by using Drive SDK v2.
I think you are correct that it is by design.
By inspection I have concluded that until an app places data in the AppFolder, Drive does not sync it down to the device, however much you try to hassle it. Therefore it is impossible to check for the existence of AppFolder content placed by another device or by a prior installation. I'd assume that this was to try and create a consistent clean install.
I can see that there are a couple of strategies to work around this:
1) Place dummy data in the AppFolder and then sync and recheck (see the sketch below).
2) Accept that in the first instance there is the possibility of duplicates: since you cannot access the existing file, by definition you will create a new copy, and use custom metadata to come up with a scheme to differentiate like-named files and choose which one you want to keep (essentially, implement your conflict/merge strategy across the two different files).
I've done the second: I have an update number to compare data from different devices and decide which version I want, and hence whether to upload, download or leave alone. As my data is an SQLite DB, I also have some code to only sync once updates have settled down; I deliberately consider people updating two devices at once foolish, and the results are consistent but undefined as to which will win.
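For the first strategy, a minimal sketch of dropping a dummy file into the App Folder, using the same GoogleApiClient-based Drive API as the answer above, might look like this (the file title and the fileCallback are assumptions):
Drive.DriveApi.newDriveContents(getGoogleApiClient())
        .setResultCallback(new ResultCallback<DriveApi.DriveContentsResult>() {
            @Override
            public void onResult(@NonNull DriveApi.DriveContentsResult result) {
                if (!result.getStatus().isSuccess()) {
                    showMessage("Problem while creating dummy contents");
                    return;
                }
                // Hypothetical placeholder file just to force the App Folder into existence.
                MetadataChangeSet changeSet = new MetadataChangeSet.Builder()
                        .setTitle("dummy.txt")
                        .setMimeType("text/plain")
                        .build();
                Drive.DriveApi.getAppFolder(getGoogleApiClient())
                        .createFile(getGoogleApiClient(), changeSet, result.getDriveContents())
                        .setResultCallback(fileCallback); // then requestSync() and re-query as above
            }
        });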

A SHA value representing the status of a git repository

Is there a SHA value that represents the status of a git repository in its current state?
Such a SHA value would be updated each time the object database is updated or the references change, if git has something like this.
In other words, a SHA value that represents the current version of the whole repository.
Here is my code for calculating a SHA over the current references. Is there a git-level API or interface for this?
private string CalcBranchesSha(bool includeTags = false)
{
    var sb = new StringBuilder();
    sb.Append(":HEAD");
    if (_repository.Head.Tip != null)
        sb.Append(_repository.Head.Tip.Sha);
    sb.Append(';');
    foreach (var branch in _repository.Branches.OrderBy(s => s.Name))
    {
        sb.Append(':');
        sb.Append(branch.Name);
        if (branch.Tip != null)
            sb.Append(branch.Tip.Sha);
    }
    sb.Append(';');
    if (includeTags)
    {
        foreach (var tag in _repository.Tags.OrderBy(s => s.Name))
        {
            sb.Append(':');
            sb.Append(tag.Name);
            if (tag.Target != null)
                sb.Append(tag.Target.Sha);
        }
    }
    return sb.ToString().CalcSha();
}
Is there a SHA value that represents the status of a git repository in its current state?
In git parlance, the status of a repository usually refers to the paths that have differences between the working directory, the index and the current HEAD commit.
However, it doesn't look like you're after this. From what I understand, you're trying to calculate a checksum that represents the state of the current repository (ie. what your branches and tags point to).
Regarding the code, it may be improved in some ways to get a more precise checksum:
Account for all the references (besides refs/tags and refs/heads) in the repository (think refs/stash, refs/notes or the generated backup references in refs/original when one rewrites the history of a repository).
Consider disambiguating symbolic from direct references (i.e. HEAD->master->08a4217 should lead to a different checksum than a detached HEAD pointing directly at 08a4217).
Below is a modified version of the code which deals with the two points above, by leveraging the Refs namespace:
private string CalculateRepositoryStateSha(IRepository repo)
{
    var sb = new StringBuilder();

    sb.Append(":HEAD");
    sb.Append(repo.Refs.Head.TargetIdentifier);
    sb.Append(';');

    foreach (var reference in repo.Refs.OrderBy(r => r.CanonicalName))
    {
        sb.Append(':');
        sb.Append(reference.CanonicalName);
        sb.Append(reference.TargetIdentifier);
        sb.Append(';');
    }

    return sb.ToString().CalcSha();
}
Please keep in mind the following limitations:
This doesn't consider changes to the index or the workdir (e.g. the same checksum will be returned whether a file has been staged or not).
There are ways to create objects in the object database without modifying the references. Those kinds of changes won't be reflected by the code above. One possible hack-ish way to deal with this could be to also append the number of objects in the object database (i.e. repo.ObjectDatabase.Count()) to the StringBuilder, but that may hinder overall performance, as this will enumerate all the objects each time the checksum is calculated.
Is there a git-level API or interface?
I don't know of any equivalent native function in git (although a similar result may be achieved through some scripting). There's nothing native in the libgit2 or Libgit2Sharp APIs.

Alfresco: unable to backup alf_data

I am an Alfresco 3.3c user with an instance holding more than 4 million objects. I'm starting to have problems with backups, because backing up the alf_data/contentstore folder, even in incremental mode, takes too long (it always needs to analyze all of those files for changes).
I've noticed that alf_data/contentstore is organized internally by year. Could I assume that the older years (e.g. 2012) are no longer changed? (If yes, I can just create an exception and exclude those dirs from the backup process, obviously after a prior full backup.)
Thanks, kind regards.
Yes, you can assume that no objects will be created (and items are never updated) in old directories within your content store, although items may be removed by the repository's cleanup jobs after being deleted from Alfresco's trash can.
This is the section from org.alfresco.repo.content.filestore.FileContentStore which generates a new content URL. You can easily see that it always uses the current date and time.
/**
 * Creates a new content URL. This must be supported by all
 * stores that are compatible with Alfresco.
 *
 * @return Returns a new and unique content URL
 */
public static String createNewFileStoreUrl()
{
    Calendar calendar = new GregorianCalendar();
    int year = calendar.get(Calendar.YEAR);
    int month = calendar.get(Calendar.MONTH) + 1;  // 0-based
    int day = calendar.get(Calendar.DAY_OF_MONTH);
    int hour = calendar.get(Calendar.HOUR_OF_DAY);
    int minute = calendar.get(Calendar.MINUTE);
    // create the URL
    StringBuilder sb = new StringBuilder(20);
    sb.append(FileContentStore.STORE_PROTOCOL)
      .append(ContentStore.PROTOCOL_DELIMITER)
      .append(year).append('/')
      .append(month).append('/')
      .append(day).append('/')
      .append(hour).append('/')
      .append(minute).append('/')
      .append(GUID.generate()).append(".bin");
    String newContentUrl = sb.toString();
    // done
    return newContentUrl;
}
Actually, no, you can't, because if the file was modified/updated in Alfresco, the filesystem path doesn't change. Remember, you can hot-backup the contentstore dir (not the Lucene index folder), and it's not necessary to check every single file for consistency. Just launch a shell/batch script executing a copy without checks, or use a tool like xxcopy.
(I'm talking about node properties, not the node content)