Scaling Repositories horizontally on common x86 hardware - repository

I was wondering if you guys had any tips on which repository implementation has good clustering and horizontal scaling characteristics on common hardware?
The problem is that we have to implement a preservation system on top of a repository that is able to ingest and manage LOTS of heterogeneous data (> 500 TB) with big files (> 50 GB).
It seems Fedora Commons can only be clustered by using a distributed filesystem. Apache Jackrabbit can be clustered, but its DataStore (for large binary data) has to be the same for all nodes in a clustered environment. Do you guys have any tips on which repository systems I should check out?

Give ModeShape a try. It is a JCR 2.0 implementation that can be configured to use an Infinispan data grid as its backing store, and ModeShape is also easily clustered (it uses JGroups, which is the same communication library used in the clustering features in Infinispan and JBoss Application Server, among many others).

Related

Gemfire versus BigMemory Go

Can GemFire be used like BigMemory Go as an L2 cache provider with Hibernate? We are using Hibernate XML files, not annotations. The application makes lots of redundant Hibernate calls, so I am trying to see whether GemFire could be integrated as an L2 cache and serve as an off-heap caching solution.
Prior to Pivotal GemFire 9.0.x (e.g. Pivotal GemFire 8.2.x and earlier) GemFire had support for Hibernate L2 Cache; see here.
However, this was pulled in Pivotal GemFire 9 due to a lack of support on maintaining the feature and keeping it up-to-date with the latest versions of Hibernate.
SIDE NOTE:
I am not sure if you are aware of this... but Pivotal GemFire was released to the Apache Software Foundation (ASF) as the Apache Geode open source project (April, 2015) and became a TLP last year (~October 2016). Therefore, Apache Geode is the open source core for Pivotal GemFire, especially as of Pivotal GemFire 9.0.
I mention this because the work/code is not lost, it is mostly a WIP. See...
https://issues.apache.org/jira/browse/GEODE-1972
I see that the feature branch (i.e. feature/GEODE-1972) does NOT exist yet.
There was discussion about this on the Geode dev list...
http://apache.markmail.org/thread/uvuzoohkfplkg46u
So, it probably just needs some interest, and maybe some help or contributions from the community, to move this along. A good opportunity to get involved and have an impact.
Cheers,
John

Migrate 100+ virtual machines from on-prem to azure

Apologies if this is the wrong platform for this question.
If I want to migrate 100 VMs to Azure VMs, what do I need to consider, and how can I migrate them?
This is not a comprehensive answer but some things to consider are:
- Start with a thorough inventory of the VMs to migrate. Issues to watch out for include:
  - Any unsupported OS versions, including 32-bit.
  - Large numbers of attached drives.
  - Disk drives > 1 TB.
  - Gen 2 VHDs.
  - Application and network interdependencies which need to be maintained.
  - Specific performance requirements (e.g. any VMs that would need Azure premium storage, SSD drives etc.).
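As a rough illustration of the inventory step, here is a minimal Python sketch that flags VMs matching some of the issues listed above. The `VmRecord` fields, the sample inventory, and the `MAX_ATTACHED_DISKS` cut-off are all hypothetical; they are not taken from any Azure tool or official limit, so adjust them to whatever your inventory export actually contains.

```python
from dataclasses import dataclass, field

# Thresholds below are illustrative assumptions, not official Azure limits.
MAX_DISK_GB = 1024          # flag disks larger than 1 TB
MAX_ATTACHED_DISKS = 16     # hypothetical cut-off for "large numbers" of drives


@dataclass
class VmRecord:
    """One row of a hypothetical VM inventory export."""
    name: str
    os_bits: int                 # 32 or 64
    hyperv_generation: int       # 1 or 2
    disk_sizes_gb: list = field(default_factory=list)


def migration_warnings(vm: VmRecord) -> list:
    """Return human-readable warnings for one VM, based on the checklist."""
    warnings = []
    if vm.os_bits == 32:
        warnings.append("32-bit OS")
    if vm.hyperv_generation == 2:
        warnings.append("Gen 2 VHD")
    if len(vm.disk_sizes_gb) > MAX_ATTACHED_DISKS:
        warnings.append("many attached disks")
    if any(size > MAX_DISK_GB for size in vm.disk_sizes_gb):
        warnings.append("disk larger than 1 TB")
    return warnings


# Sample data, made up for the example.
inventory = [
    VmRecord("web01", 64, 1, [127, 500]),
    VmRecord("legacy-db", 32, 2, [127, 2048]),
]
for vm in inventory:
    print(vm.name, migration_warnings(vm) or "no issues flagged")
```

Interdependencies and performance requirements are harder to check mechanically, so this sketch leaves them out.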
In developing a migration strategy some important considerations are:
- How much downtime can you tolerate? To minimize downtime look at solutions like Azure Site Recovery which supports rapid switchover. If downtime is more flexible there are more offline migration tools and scripts available.
- Understand whether to move to the new Azure Resource Manager or the Service Management deployment model. See https://azure.microsoft.com/en-us/documentation/articles/resource-group-overview/.
- Which machines to move first (pick the simplest, with the fewest dependencies).
- Consider cases where it may be easier to migrate the data or application to a new VM rather than migrate the VM itself.
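One possible way to pick which machines to move first is to count how many dependency links touch each VM and start with the least connected ones. The sketch below assumes you already have a dependency map; the VM names and the `migration_order` helper are hypothetical, and counting links is only one simple heuristic for "simplest".

```python
from collections import Counter

# Hypothetical dependency map: VM name -> names of VMs it talks to.
dependencies = {
    "web01": ["app01"],
    "app01": ["db01", "cache01"],
    "db01": [],
    "cache01": [],
    "report01": [],
}


def migration_order(deps: dict) -> list:
    """Sort VMs by how many dependency links touch them (fewest first)."""
    links = Counter({name: len(targets) for name, targets in deps.items()})
    for targets in deps.values():
        for target in targets:
            links[target] += 1      # count inbound links too
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(deps, key=lambda name: (links[name], name))


print(migration_order(dependencies))
```

Here the isolated `report01` comes first and the heavily connected `app01` comes last.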
A good forum to ask specific migration questions is: Microsoft Azure Site Recovery
Appending to sendmarsh's reply
The things you will have to consider are:
Version of the virtual environment, i.e. VMware or Hyper-V.
OS version, RAM size, OS disk size, OS disk count, number of disks, capacity of each disk, format of hard disk, number of processor cores, number of NICs, processor architecture, network configuration such as IP addresses, and generation type if the environment is Hyper-V.
I could have missed a few more things, like checking whether VMware Tools is installed. Some configurations are not supported; for example, an iSCSI disk will not be supported. Microsoft does not support all naming conventions for the machines, so be careful when setting the name, as that might affect things later.
A full list of prerequisites is at:
[1]: https://azure.microsoft.com/en-us/documentation/articles/site-recovery-best-practices/#azure-virtual-machine-requirements
Update: Using PowerShell to automate the migration would make your life easier.

Set up distributed index using Hibernate Search and Lucene

Our application is using Hibernate Search for indexing some of its data. The application is running on two JBoss EAP 6.2 application servers for load distribution and failover. We need changes made on one machine to be immediately visible on the other. The index is a central part of the application and needs to be consistent with the database data. Completely rebuilding it takes a long time so it is important that it remains intact even in the case of a server crash. Also, the index is expected to grow too large to keep all of it in memory.
Our current solution is to use the standard filesystem directory with a shared filesystem (NFS) and the JGroups backend to ensure that only one server writes to a given index at any time. This works more or less, but sometimes we have problems with index updates taking very long (up to 20 seconds) or failing completely. Due to some other reasons we need to migrate away from the currently used file system, so we are evaluating alternatives for the current setup.
One thing we tried is the Infinispan directory with a file cache store for persistence, but we had some problems there regarding OutOfMemoryErrors (see also my post in the Infinispan forums https://developer.jboss.org/thread/253732). Also, performance was still not acceptable in our first tests (about 3 seconds for an index update with two clustered servers set up on my developer machine), though that may be due to configuration issues.
I think this is not such an uncommon requirement, but I couldn't find much information on best practices to implement it.
Who has experiences with similar setups? Does the Infinispan directory work for you? Can anybody suggest a working configuration or how to proceed to arrive at one? What alternatives have you tried and which work?
You need to be careful about which versions are being used. The Infinispan version bundled with JBoss EAP is not intended for storing the Lucene index (i.e. it has not been tested as extensively for that purpose as for others).
When JBoss EAP 6.2 was released, the bundled Infinispan was considered good to go for the internal needs of the application server, but as you might have discovered, the feature of index storage was having at least some performance issues.
In recent developments of Infinispan we applied many improvements to the index storage feature, fixing some bugs and getting very significant performance improvements out of it. I hope you would be willing to try Infinispan 7.2.0.Beta1.
All of these improvements are also being backported to JBoss Data Grid; version 6.5 will make them available as a supported product. Note that storing a Hibernate Search index wasn't supported before; it is going to be a new feature of JDG 6.5.
Modules from JDG 6.5 will be compatible with JBoss EAP; you'll just have to make sure you use the Infinispan build provided by JDG and not the one meant for internal usage of EAP.
Performance improvements are still being worked on. It's much better already, especially compared to that older version, but we won't stop working on it yet, so if you could try the latest bleeding-edge versions of Infinispan 7.2.x (another release is scheduled for tomorrow), I'd highly appreciate your feedback to keep pushing it.

Is Redis a better option for SignalR scale out over SQL Server, and do each support failover?

In David Fowler's blog, SQL Server has been added to the list of scale out providers for service bus.
I am in the process of implementing Redis on our Windows servers. Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?
If so, how does the Windows version of Redis implement fail-over?
Redis is ~200x faster than SQL, mainly because it's in-memory and its protocol is designed for speed.
If that helps, Redis Cloud is now offered on Windows Azure, and HA is a built-in capability of the service.
Disclosure - I'm the Co-Founder & CTO of Garantia Data, the company behind the Redis Cloud service.
Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?
It will be faster than SQL Server since it's optimized for in-memory operations; however, speed isn't its only advantage. Its support for advanced data structures offers a great deal of flexibility when dealing with various scenarios.
If so, how does the Windows version of Redis implement fail-over?
There is a link in the download section to an unofficial Windows port of Redis, which, however, isn't meant for production use. The official version of Redis supports replication, and Sentinel provides automatic failover, but it's hard to say what state these features are in on the Windows port. In general I wouldn't recommend running Redis on a Windows machine; instead, use a virtual machine with a Linux distro and run it there.

Does Alfresco 3.4e CE support clustering?

I have just a simple question!
Does Alfresco 3.4e Community Edition support clustering?
If yes, what are the supported clustering methods (e.g. is JGroups supported)?
Regards,
It will work with Community, yes. There are a few little bits in Enterprise that'll make the setup and monitoring easier, which coupled with the support you get may mean you'd be better off going to Enterprise if you can.
You should probably start with this presentation to get you through the basics of Alfresco clustering. Once you've understood that, you likely want to read the Alfresco documentation on Setting up high availability systems which covers the concepts, initial cluster config, setting up JGroups etc.
You may also find it useful to read this guide on the Alfresco Wiki for instructions on setting it up, including how to configure JGroups as part of that process, if you haven't already.