Jackrabbit query for version labels taking 40 minutes time - jcr

I need to track the history of files and need to show history to a user based on the selected node. I am using Apache jackrabbit to get the data for a particular version label. I am using the following query:
SELECT versioned.[jcr:uuid]
FROM [nt:frozenNode] AS versioned
INNER JOIN [nt:version] AS version
ON ISCHILDNODE(versioned,version) INNER JOIN [nt:versionLabels] as node
ON node.[20170921114713] = version.[jcr:uuid]
But my version DB has 133129 records. Query is taking 35 minutes to execute. Please let me know how can I achieve best performance time. Anybody having similar requirement and implemented with good performance, please let me know. Thanks in advance.

Found that query is taking too much of time because of huge data in forzen node. Also getting GC over limit .
Used API to fetch the data. now time decreased to 90seconds.
TreeTraverser traverse = new TreeTraverser(node, null, TreeTraverser.InclusionPolicy.LEAVES);
int count = 0;
for (Node r : traverse) {
count++;
Node parent = r.getParent();
String test = parent.getPath();
VersionHistory history1 = session.getWorkspace().getVersionManager().getVersionHistory(test);
HashSet<String> test1 = printHistory1(history1);
map.put(test, test1);
}

Related

Read very slow from a database

I use spring boot with spring data jpa, hibernate and oracle.
Actually, I my table I have arount 10 millions of record, I need to do some operation, write info to a file and after delete the record.
It's a basic sql query
select * from zzz where status = 2;
I done a test without doing operation and delete record
long start = System.nanoTime();
int page = 0;
Pageable pageable = PageRequest.of(page, LIMIT);
Page<Billing> pageBilling = billingRepository.findAllByStatus(pageable);
while (true) {
for (Billing: pageBilling .getContent()) {
//process
//write to file
//delete element
}
if (!pageBilling .hasNext()) {
break;
}
pageable = pageBilling .nextPageable();
pageBilling = billingRepository.findAllByStatus(pageable);
}
long end = System.nanoTime();
long microseconds = (end - start) / 1000;
System.out.println(microseconds + " to write");
Result it's bad, with a limit of 10 000, that took 157 minutes, with 100 000 28 minutes, with millions 19 minutes.
It's there a better solution to increase performance?
The following are likely to improve the performance significantly:
You should not iterate past the first page. Instead, delete the processed data and select the first page again. Actually you don't need a page for that you can encode the limit in the method name. Selecting late pages is rather inefficient.
The process of loading, processing and deleting one batch of items should be in a separate transaction. Otherwise the EntityManager will hold all the entities ever loaded which will make things really slow.
If that still isn't sufficient yet you may look into the following:
Inspect the SQL executed. Does it look sensible? If not consider switching to JdbcTemplate or NamedParameterJdbcTemplate with a query method that takes a RowCallbackHandler you should be able to load and process all rows with a single select statement and at the end to process one delete statement to remove all rows. This requires that the status that you use for filtering does not change in the mean time.
How do the execution plans look like? If they seem of inspect your indices.

Using deepstream List for tens of thousands unique values

I wonder if it's a good/bad idea to use deepstream record.getList for storing a lot of unique values, for example, emails or any other unique identifiers. The main purpose is to be able to answer a question quickly whether we already have, say, a user with such email (email in use) or another record by specific unique field.
I made few experiments today and got two problems:
1) when I tried to populate the list with few thousands values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went off. I was able to fix it by adding more memory to the server node process with this flag
--max-old-space-size=5120
it doesn't look fine but allowed me to make a list with more than 5000 items.
2) It wasn't enough for my tests so I precreated the list with 50000 items and put the data directly to rethinkdb table and got another issue on getting the list or modifing it:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps but I believe it's only matter of time when one of those errors appear in production when the list size reaches proper value. I don't know really whether it's nodejs, javascript, deepstream or rethinkdb issue. That's all in general made me think that I try to use deepstream List wrong way. Please, let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of recordnames - the actual data would be stored in the record itself, the list would only manage the order of the records.
Having said that, there are two open Github issues to improve performance for very long lists by sending more efficient deltas and by introducing a pagination option
Interesting results in regards to memory though, definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );
// Sends 10.000 messages
for( var i = 0; i < 10000; i++ ) {
myList.addEntry( 'something-' + i );
}
// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
entries.push( 'something-' + i );
}
myList.setEntries( entries );

How to use LINQ to Entities to make a left join using a static value

I've got a few tables, Deployment, Deployment_Report and Workflow. In the event that the deployment is being reviewed they join together so you can see all details in the report. If a revision is going out, the new workflow doesn't exist yet new workflow is going into place so I'd like the values to return null as the revision doesn't exist yet.
Complications aside, this is a sample of the SQL that I'd like to have run:
DECLARE #WorkflowID int
SET #WorkflowID = 399 -- Set to -1 if new
SELECT *
FROM Deployment d
LEFT JOIN Deployment_Report r
ON d.FSJ_Deployment_ID = r.FSJ_Deployment_ID
AND r.Workflow_ID = #WorkflowID
WHERE d.FSJ_Deployment_ID = 339
The above in SQL works great and returns the full record if viewing an active workflow, or the left side of the record with empty fields for revision details which haven't been supplied in the event that a new report is being generated.
Using various samples around S.O. I've produced some Entity to SQL based on a few multiple on statements but I feel like I'm missing something fundamental to make this work:
int Workflow_ID = 399 // or -1 if new, just like the above example
from d in context.Deployments
join r in context.Deployment_Reports.DefaultIfEmpty()
on
new { d.Deployment_ID, Workflow_ID }
equals
new { r.Deployment_ID, r.Workflow_ID }
where d.FSJ_Deployment_ID == fsj_deployment_id
select new
{
...
}
Is the SQL query above possible to create using LINQ to Entities without employing Entity SQL? This is the first time I've needed to create such a join since it's very confusing to look at but in the report it's the only way to do it right since it should only return one record at all times.
The workflow ID is a value passed in to the call to retrieve the data source so in the outgoing query it would be considered a static value (for lack of better terminology on my part)
First of all don't kill yourself on learning the intricacies of EF as there are a LOT of things to learn about it. Unfortunately our deadlines don't like the learning curve!
Here's examples to learn over time:
http://msdn.microsoft.com/en-us/library/bb397895.aspx
In the mean time I've found this very nice workaround using EF for this kind of thing:
var query = "SELECT * Deployment d JOIN Deployment_Report r d.FSJ_Deployment_ID = r.Workflow_ID = #WorkflowID d.FSJ_Deployment_ID = 339"
var parm = new SqlParameter(parameterName="WorkFlowID" value = myvalue);
using (var db = new MyEntities()){
db.Database.SqlQuery<MyReturnType>(query, parm.ToArray());
}
All you have to do is create a model for what you want SQL to return and it will fill in all the values you want. The values you are after are all the fields that are returned by the "Select *"...
There's even a really cool way to get EF to help you. First find the table with the most fields, and get EF to generated the model for you. Then you can write another class that inherits from that class adding in the other fields you want. SQL is able to find all fields added regardless of class hierarchy. It makes your job simple.
Warning, make sure your filed names in the class are exactly the same (case sensitive) as those in the database. The goal is to make a super class model that contains all the fields of all the join activity. SQL just knows how to put them into that resultant class giving you strong typing ability and even more important use-ability with LINQ
You can even use dataannotations in the Super Class Model for displaying other names you prefer to the User, this is a super nice way to keep the table field names but show the user something more user friendly.

twiiter4j when to STOP when no more tweets available?

So, I've figured out how to be able to get more than 100 tweets, thanks to How to retrieve more than 100 results using Twitter4j
However, when do I make the script stop and print stop when maximum results have been reached? For example, I set
int numberOfTweets = 512;
And, it finds just 82 tweets matching my query.
However, because of:
while (tweets.size () < numberOfTweets)
it still continues to keep on querying over and over until I max out my rate limit of 180 requests per 15 seconds.
I'm really a novice at java, so I would really appreciate if you could show me how to resolve this by modifying the first answer script at How to retrieve more than 100 results using Twitter4j
Thanks in advance!
You only need to modify things in the try{} block. One solution is to check whether the ID of the last tweet you found on the previous loop(previousLastID) in the while is the same as the ID of the last tweet (lastID) in the new batch collected (newTweets). If it is, it means the new batch's elements already exist in the previous array, and that that we have reached the end of possible tweets for this hastag.
try {
QueryResult result = twitter.search(query);
List<Status> newTweets = result.getTweets();
long previousLastID = lastID;
for (Status t: newTweets)
if (t.getId() < lastID) lastID = t.getId();
if (previousLastID == lastID) {
println("Last batch (" + tweets.size() + " tweets) was the same as first. Stopping the Gathering process");
break;
}

DQL query to return all files in a Cabinet in Documentum?

I want to retrieve all the files from a cabinet (called 'Wombat Insurance Co'). Currently I am using this DQL query:
select r_object_id, object_name from dm_document(all)
where folder('/Wombat Insurance Co', descend);
This is ok except it only returns a maximum of 100 results. If there are 5000 files in the cabinet I want to get all 5000 results. Is there a way to use pagination to get all the results?
I have tried this query:
select r_object_id, object_name from dm_document(all)
where folder('/Wombat Insurance Co', descend)
ENABLE (RETURN_RANGE 0 100 'r_object_id DESC');
with the intention of getting results in 100 file increments, but this query gives me an error when I try to execute it. The error says this:
com.emc.documentum.fs.services.core.CoreServiceException: "QUERY" action failed.
java.lang.Exception: [DM_QUERY2_E_UNRECOGNIZED_HINT]error:
"RETURN_RANGE is an unknown hint or is being used incorrectly."
I think I am using the RETURN_RANGE hint correctly, but maybe I'm not. Any help would be appreciated!
I have also tried using the hint ENABLE(FETCH_ALL_RESULTS 0) but this still only returns a maximum of 100 results.
To clarify, my question is: how can I get all the files from a cabinet?
You have already accepted an answer which is using DFS.
Since your are playing with DFC, these information might help you.
DFS:
If you are using DFS, you have to aware about the number of concurrent sessions that you can consume with DFS.
I think it is 100 or 150.
DFC:
Actually there is a limit that you can fetch via DFC (I'm not sure with DFS).
Go to your DFC application(webtop or da or anything) and check the dfc.properties file.
# Maximum number of results to retrieve by a query search.
# min value: 1, max value: 10000000
#
dfc.search.max_results = 100
# Maximum number of results to retrieve per source by a query search.
# min value: 1, max value: 10000000
#
dfc.search.max_results_per_source = 400
dfc.properties.full or similar file is there and you can verify these values according to your system.
And I'm talking about the ContentServer side, not the client side dfc.properties file.
If you use ENABLE (RETURN_TOP) hint with DFC, there are 2 ways to fetch the results from the ContentServer.
Object based
Row based
You have to configure this by using the parameter return_top_results_row_based in the server.ini file.
All of these changes for the documentum server side, not for your DFC/DQL client.
Aha, I've figured it out. Using DFS with Java (an abstraction layer on top of DFC) you can set the starting index for query results:
String queryStr = "select r_object_id, object_name from dm_document(all)
where folder('/Wombat Insurance Co', descend);"
PassthroughQuery query = new PassthroughQuery();
query.setQueryString(queryStr);
query.addRepository(repositoryStr);
QueryExecution queryEx = new QueryExecution();
queryEx.setCacheStrategyType(CacheStrategyType.DEFAULT_CACHE_STRATEGY);
queryEx.setStartingIndex(currentIndex); // set start index here
OperationOptions operationOptions = null;
// will return 100 results starting from currentIndex
QueryResult queryResult = queryService.execute(query, queryEx, operationOptions);
You can just increment the currentIndex variable to get all results.
Well, the hint is being used incorrectly. Start with 1, not 0.
There is no built-in limit in DQL itself. All results are returned by default. The reason you get only 100 results must have something to do with the way you're using DFC (or whichever other client you are using). Using IDfCollection in the following way will surely return everything:
IDfQuery query = new DfQuery("SELECT r_object_id, object_name "
+ "FROM dm_document(all) WHERE FOLDER('/System', DESCEND)");
IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
int i = 0;
while (coll.next()) i++;
System.out.println("Number of results: " + i);
In a test environment (CS 6.7 SP1 x64, MS SQL), this outputs:
Number of results: 37162
Now, there's proof. Using paging is however a good idea if you want to improve the overall performance in your application. As mentioned, start counting with the number 1:
ENABLE(RETURN_RANGE 1 100 'r_object_id DESC')
This way of paging requires that sorting be specified in the hint rather than as a DQL statement. If all you want is the first 100 records, try this hint instead:
ENABLE(RETURN_TOP 100)
In this case sorting with ORDER BY will work as you'd expect.
Lastly, note that adding (all) will not only find all documents matching the specified qualification, but all versions of every document. If this was your intention, that's fine.
I've worked with DFC API (with Java) for a while but I don't remember any default limit on queries, IIRC we've always got all of the documents, there weren't any limit. Actually (according to my notes) we have to set the limit explicitly with, for example, enable (return_top 2000). (As far I know the syntax might be depend on the DBMS behind EMC Documentum.)
Just a guess: check your dfc.properties file.