Very long timeout in ehcache when pushing synchronous update over RMI to server that is down


We've got an ehcache cluster that distributes a cache to several Wildfly servers using RMI replication with manual peer discovery. We replicate the cache synchronously for historical reasons. If one server in the pool is unavailable, e.g. if the box hangs or the network goes down, then a cache update seems to block on a very long timeout (about 128 seconds, timed with a stopwatch). If only the Wildfly server itself is down and the box is still up, there is no delay in updating the replicated cache.

We've got the socketTimeoutMillis property set to 5000 in ehcache.xml on all the nodes in the cluster, but it doesn't seem to help:

    <cacheManagerPeerProviderFactory class=
                          "net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
                          properties="peerDiscovery=manual,
                          rmiUrls=server1:41234|server2:41234"
                          propertySeparator="," />

    <cacheManagerPeerListenerFactory
            class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
            properties="port=41234,socketTimeoutMillis=5000"/>

...

    <cache name="sso.userDetails"
           maxElementsInMemory="20000"
           eternal="true"
           timeToIdleSeconds="0"
           timeToLiveSeconds="0"
           overflowToDisk="false">
        <cacheEventListenerFactory
                class="MISCore.sso.ZSCMonitorCacheEventListenerFactory"
                properties="replicateAsynchronously=true,
                            replicatePuts=true,
                            replicateUpdates=true,
                            replicateUpdatesViaCopy=true,
                            replicateRemovals=true,
                            asynchronousReplicationIntervalMillis=1000"
                propertySeparator=","/>
        <cacheEventListenerFactory
                class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
                properties="replicateAsynchronously=false"/>
        <bootstrapCacheLoaderFactory
                class="net.sf.ehcache.distribution.RMIBootstrapCacheLoaderFactory"/>
    </cache>

The (partial) exception I see is:

2023-10-03 13:22:42,475 DEBUG [net.sf.ehcache.distribution.ManualRMICacheManagerPeerProvider] (default task-1) Looking up rmiUrl //server1:41234/sso.userDetails through exception Connection refused to host: server1; nested exception is: 
    java.net.ConnectException: Connection timed out (Connection timed out). This may be normal if a node has gone offline. Or it may indicate network connectivity difficulties: java.rmi.ConnectException: Connection refused to host: server1; nested exception is: 
    java.net.ConnectException: Connection timed out (Connection timed out)
    at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:623)
    at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
    at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
    at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:343)
    at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:116)
    at java.rmi.Naming.lookup(Naming.java:101)
    at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127)
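
One reading of the trace above, offered as a guess rather than a confirmed diagnosis: the timeout is hit inside `Naming.lookup` while opening the TCP connection to the RMI registry. That lookup may not go through the socket factory that `socketTimeoutMillis` configures, in which case the connect attempt would fall back to the operating system's TCP connect timeout. A minimal sketch of one possible workaround is to install a JVM-wide `RMISocketFactory` with a bounded connect timeout. The class name `TimeoutRmiSocketFactory` and the `install` helper are hypothetical and not part of ehcache; the 5000 ms value simply mirrors the setting above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

/**
 * Hypothetical helper (not part of ehcache): an RMI socket factory whose
 * client sockets give up connecting after a fixed number of milliseconds.
 */
public class TimeoutRmiSocketFactory extends RMISocketFactory {

    private final int connectTimeoutMillis;

    public TimeoutRmiSocketFactory(int connectTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Bound the connect phase instead of waiting for the OS default.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        // Server side is left at the default behaviour.
        return new ServerSocket(port);
    }

    /**
     * Installs the factory if no global RMI socket factory has been set yet.
     * Assumption: nothing else in the JVM needs its own global factory.
     */
    public static void install(int connectTimeoutMillis) throws IOException {
        if (RMISocketFactory.getSocketFactory() == null) {
            RMISocketFactory.setSocketFactory(
                    new TimeoutRmiSocketFactory(connectTimeoutMillis));
        }
    }
}
```

`TimeoutRmiSocketFactory.install(5000)` would need to run once, before the cache manager starts. The global factory can only be set once per JVM and applies to every RMI client connection in the Wildfly instance that does not use its own socket factory, so this would need testing before relying on it.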

Any suggestions are welcome.
