I am debugging a problem with one of our Ignite servers, which had been running in Kubernetes for some months without issue.
Client nodes started failing to come up because they were failing a readiness probe against the Ignite server. When I checked the Ignite server, its logs were full of [PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available. The start of these messages seems to coincide with when the client nodes began failing.
I tried scaling the Ignite server down and then back up, but this did not resolve the issue. Logs from the restart are below.
Furthermore, when I looked at the contents of the /ignite/work/log directory, it contained a large number of lock files for the logs. Normally I only see a lock for the current log file, not for previous ones.
My questions are:
- Why are locks for older logs continuing to persist and how can I stop this from happening?
- How can I clean up the Ignite server? Scaling down the pods does not help. Would manually deleting the lock files fix the issue?
# ls -l /ignite/work/log/*.lck | wc -l
3846
Edit: I have also noticed that the number of lock files is far greater than the number of logs:
# ls -l ./*.log |wc -l
35
Example lock:
-rw-r--r-- 1 root root 0 Nov 17 10:32 ignite-f543b237.0.log.lck
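As a rough check (assuming each lock file is always just the log file name with .lck appended, as in the example above), this counts the .lck files whose corresponding .log no longer exists:
# cd /ignite/work/log
# for f in *.lck; do [ -e "${f%.lck}" ] || echo "$f"; done | wc -l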
>>> __________ ________________
>>> / _/ ___/ |/ / _/_ __/ __/
>>> _/ // (7 7 // / / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.11.1#20211220-sha1:eae1147d
>>> 2021 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: http://ignite.apache.org
[21:20:08,116][INFO][main][IgniteKernal] Config URL: file:/ignite/config/ignite_server_config.xml
[21:20:08,140][INFO][main][IgniteKernal] IgniteConfiguration [igniteInstanceName=null, pubPoolSize=8, svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, dataStreamerPoolSize=8, utilityCachePoolSize=8, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=8, buildIdxPoolSize=1, ign
iteHome=/opt/ignite/apache-ignite, igniteWorkDir=/ignite/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6e38921c, nodeId=165f2068-01fd-4356-ae65-f199e33bf7cd, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, m
etricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, addressFilter=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, sk
ipAddrsRandomization=false], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi [connectGate=org.apache.ignite.spi.communication.tcp.internal.ConnectGateway@6ad82709, ctxInitLatch=java.util.concurrent.CountDownLatch@510f3d34[Count = 1], stop
ping=false, clientPool=null, nioSrvWrapper=null, stateProvider=null], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@7817fd62, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@24313fcc, addrRslvr=null, encryptionSpi=org.apache.ign
ite.spi.encryption.noop.NoopEncryptionSpi@7d20d0b, tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@77f1baf5, clientMode=false, rebalanceThreadPoolSize=1, rebalanceTimeout=10000, rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, txCfg=TransactionConfiguration [txSerEnabled=false, dflt
Isolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED
, p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=10000, sysWorkerBlockedTimeout=null, clientFailureDetectionTimeout=30000, metricsLogFreq=60000, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSiz
e=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=1, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=8, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration
[seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=0, concLvl=0, dfltDataRegConf=DataRegionConfigu
ration [name=default, maxSize=54081806336, initSize=268435456, swapPath=null, pageEvictionMode=DISABLED, pageReplacementMode=CLOCK, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, persistenceEnabled=true, checkpointPageBufSize=0, lazyMemoryAl
location=true, warmUpCfg=null], dataRegions=null, storagePath=null, checkpointFreq=180000, lockWaitTime=10000, checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=67108864, walPath=/ignite/wal, walArchivePath=/ignite/walarchive, metricsEnable
d=false, walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@66d18979, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, w
alAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null, walPageCompression=DISABLED, walPageCompressionLevel=null, dfltWarmUpCfg=null, encCfg=org.apache.ignite.configuration.EncryptionConfiguration@609cd4d8, defragmentationThreadPoo
lSize=4], snapshotPath=snapshots, activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true, autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpe
nCursorsPerConn=128, threadPoolSize=8, selectorCnt=4, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, maxActiveComputeTasksPe
rConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=null, commFailureRslvr=null, sqlCfg=SqlConfiguration [longQryWarnTimeout=3000, dfltQryTimeout=0, sqlQryHistSize=1000, validationEnabled=false], asyncContinuationExecutor=null]
[21:20:08,141][INFO][main][IgniteKernal] Daemon mode: off
[21:20:08,141][INFO][main][IgniteKernal] OS: Linux 5.15.0-58-generic amd64
[21:20:08,141][INFO][main][IgniteKernal] OS user: root
[21:20:08,144][INFO][main][IgniteKernal] PID: 1
[21:20:08,144][INFO][main][IgniteKernal] Language runtime: Java Platform API Specification ver. 1.8
[21:20:08,144][INFO][main][IgniteKernal] VM information: OpenJDK Runtime Environment 1.8.0_212-b04 IcedTea OpenJDK 64-Bit Server VM 25.212-b04
[21:20:08,147][INFO][main][IgniteKernal] VM total memory: 4.0GB
[21:20:08,148][INFO][main][IgniteKernal] Remote Management [restart: off, REST: on, JMX (remote: off)]
[21:20:08,148][INFO][main][IgniteKernal] Logger: JavaLogger [quiet=true, config=null]
[21:20:08,148][INFO][main][IgniteKernal] IGNITE_HOME=/opt/ignite/apache-ignite
[21:20:08,148][INFO][main][IgniteKernal] VM arguments: [-XX:+AggressiveOpts, -DIGNITE_WAL_MMAP=false, -DIGNITE_UPDATE_NOTIFIER=false, -XX:+UseG1GC, -Xmx4g, -XX:+DisableExplicitGC, -Xms4g, -XX:+AlwaysPreTouch, -XX:+ScavengeBeforeFullGC, -DIGNITE_HOME=/opt/ignite/apache-ignite]
[21:20:08,149][INFO][main][IgniteKernal] System cache's DataRegion size is configured to 40 MB. Use DataStorageConfiguration.systemRegionInitialSize property to change the setting.
[21:20:08,149][INFO][main][IgniteKernal] Configured caches [in 'sysMemPlc' dataRegion: ['ignite-sys-cache']]
[21:20:08,149][WARNING][main][IgniteKernal] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[21:20:08,153][INFO][main][IgniteKernal] 3-rd party licenses can be found at: /opt/ignite/apache-ignite/libs/licenses
[21:20:08,315][INFO][main][IgnitePluginProcessor] Configured plugins:
[21:20:08,315][INFO][main][IgnitePluginProcessor] ^-- None
[21:20:08,315][INFO][main][IgnitePluginProcessor]
[21:20:08,324][INFO][main][FailureProcessor] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[21:20:08,770][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[21:20:08,770][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[21:20:08,785][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[21:20:08,832][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[21:20:08,833][INFO][main][IgniteKernal] Security status [authentication=off, sandbox=off, tls/ssl=off]
[21:20:08,908][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=165f2068-01fd-4356-ae65-f199e33bf7cd]
[22:01:05,356][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2], reason: No locks available
[00:10:06,798][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[00:34:41,360][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[03:01:07,293][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[03:25:41,852][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[05:52:07,788][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[06:16:42,348][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[08:43:08,270][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[09:07:42,832][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[11:34:08,764][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[11:58:43,329][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[14:25:09,266][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[14:49:43,829][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
Edit:
After scaling the number of Ignite servers down and back up again, I stopped seeing any logs indicating that the server was starting up; there were just more messages saying it was unable to acquire a lock.
I tried manually launching a server on the pod, but it just stalls on startup and does not progress:
./ignite.sh ../config/nkw-ignite.xml -v
Ignite Command Line Startup, ver. 2.11.1#20211220-sha1:eae1147d
2021 Copyright(C) Apache Software Foundation
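While the manual start is stalled like this, it may help to see which locks the kernel currently knows about under /ignite/work/db (the exact lock file name can vary, so list the directory first):
# ls -l /ignite/work/db
# cat /proc/locks
If the lock is held by a process in another pod that mounts the same volume, the PID shown in /proc/locks may not be resolvable from inside this pod.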
First of all, according to your logs this node was started in server mode (clientMode=false in the IgniteConfiguration).
So it looks like the following messages:
[PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
and the ignite-xxx.0.log.lck files are two different things.
According to the very first message, Ignite tried to acquire an exclusive lock on /ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2 because the current instance was started in server mode, but it failed to do so.
Most likely this PDS folder is already in use by a different instance. Do you share the same persistence volume between different pods?
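One way to check this from outside the pods, assuming /ignite/work is backed by a PersistentVolumeClaim (the claim name and namespace below are placeholders):
$ kubectl get pvc -n <namespace>
$ kubectl describe pvc <ignite-work-pvc> -n <namespace>
The Access Modes field (ReadWriteOnce vs ReadWriteMany) and, on recent kubectl versions, the Used By list in the describe output show whether more than one pod can mount, or currently mounts, the same volume.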
So first of all, identify whether this volume is shared. If it is, try to find the other instance that may have locked this directory. You should find a lock file in the $IGNITE_WORK/db/$NODE_CONSISTENT_ID folder. If the file is there but no other running node has this consistent ID, you could try removing the file and restarting the problematic node.
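If you do go down the route of removing a stale lock file, a cautious sketch from inside the pod (this assumes the lock file is simply named lock inside the node folder from your logs; check the ls output first and only remove it once you are sure no other node is using this folder):
# ls -l /ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2
# fuser -v /ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2/lock
# rm /ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2/lock
Keep in mind that fuser inside this pod may show nothing if the lock is held by a process in another pod, so an empty result alone does not prove the file is stale.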