We have an ehcache cluster that distributes a cache to several WildFly servers over RMI, using manual peer discovery. The cache is replicated synchronously for historical reasons. If one server in the pool becomes unreachable, e.g. the box hangs or the network goes down, ehcache seems to wait a very long time before giving up (about 128 seconds, timed with a stopwatch). If the WildFly server itself is down, there is no delay in updating the replicated cache.
We have the socketTimeoutMillis property set to 5000 on every node in the cluster in ehcache.xml, but it doesn't seem to help:
<cacheManagerPeerProviderFactory
    class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
    properties="peerDiscovery=manual,
                rmiUrls=server1:41234|server2:41234"
    propertySeparator="," />
<cacheManagerPeerListenerFactory
    class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
    properties="port=41234,socketTimeoutMillis=5000"/>
...
<cache name="sso.userDetails"
       maxElementsInMemory="20000"
       eternal="true"
       timeToIdleSeconds="0"
       timeToLiveSeconds="0"
       overflowToDisk="false">
  <cacheEventListenerFactory
      class="MISCore.sso.ZSCMonitorCacheEventListenerFactory"
      properties="replicateAsynchronously=true,
                  replicatePuts=true,
                  replicateUpdates=true,
                  replicateUpdatesViaCopy=true,
                  replicateRemovals=true,
                  asynchronousReplicationIntervalMillis=1000"
      propertySeparator=","/>
  <cacheEventListenerFactory
      class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
      properties="replicateAsynchronously=false"/>
  <bootstrapCacheLoaderFactory
      class="net.sf.ehcache.distribution.RMIBootstrapCacheLoaderFactory"/>
</cache>
The (partial) exception I see is:
2023-10-03 13:22:42,475 DEBUG [net.sf.ehcache.distribution.ManualRMICacheManagerPeerProvider] (default task-1) Looking up rmiUrl //server1:41234/sso.userDetails through exception Connection refused to host: server1; nested exception is:
java.net.ConnectException: Connection timed out (Connection timed out). This may be normal if a node has gone offline. Or it may indicate network connectivity difficulties: java.rmi.ConnectException: Connection refused to host: server1; nested exception is:
java.net.ConnectException: Connection timed out (Connection timed out)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:623)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:343)
at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:116)
at java.rmi.Naming.lookup(Naming.java:101)
at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127)
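Reading the trace, the timeout happens in the TCP connect made by Naming.lookup, which as far as I can tell goes through the JVM's default RMI socket factory rather than anything configured via socketTimeoutMillis. The roughly 128 seconds would also be consistent with an OS-level connect timeout (on Linux, the default SYN retry settings give up after about 127 seconds), although that is an assumption about the servers' OS. Below is a minimal sketch, not verified against this cluster, of a JVM-wide RMI socket factory with an explicit connect timeout. ConnectTimeoutRmiSocketFactory and install are hypothetical names, not part of ehcache:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Hypothetical helper (not an ehcache class): bounds how long an outgoing
// RMI connection attempt may block before failing.
class ConnectTimeoutRmiSocketFactory extends RMISocketFactory {
    private final int connectTimeoutMillis;

    ConnectTimeoutRmiSocketFactory(int connectTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Fails with SocketTimeoutException after connectTimeoutMillis
            // instead of waiting for the OS-level connect timeout.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        // Server side is left at the default behaviour.
        return new ServerSocket(port);
    }

    // Assumption: this is called once at startup, before any RMI traffic.
    static void install(int connectTimeoutMillis) throws IOException {
        RMISocketFactory.setSocketFactory(
                new ConnectTimeoutRmiSocketFactory(connectTimeoutMillis));
    }
}
```

RMISocketFactory.setSocketFactory can only be called once per JVM and throws if a factory is already installed, so inside WildFly this could clash with anything else that sets one. I don't know whether this is the recommended approach for ehcache.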
Any suggestions are welcome.