Doris-0.14.7 FE多次意外终止
longfei113 发布于2021-05 浏览:6857 回复:4
0
收藏

一、Doris版本

version info
Version: 0.14.7
Git: file:///root/doris/doris-0.14.7-src@Unknown
Build Info: root@b9d633f7f390
Build Time: Wed, 07 Apr 2021 03:49:18 UTC

二、集群部署

集群共6个节点,各节点角色如下。其中fe节点JVM配置为8GB。

xx.xx.205.2 fe(follower),创建集群时是fe master角色
xx.xx.205.7 fe(follower)+be
xx.xx.205.50 fe(follower)+be
xx.xx.205.51 be
xx.xx.205.52 be
xx.xx.205.14 be

三、问题描述

集群xx.xx.205.2、xx.xx.205.7 FE节点近期发生过2次意外宕机,fe进程终止了。

只有xx.xx.205.50进程还存活并可以访问http server,但是无法提供读写服务。

四、fe日志信息

最近一次fe宕机发生在2021-05-25,相关节点日志提供如下,首先是原master节点 xx.xx.305.2的fe.out、fe.log日志片段,然后是现在存活的第3个fe节点xx.xx.305.50的fe.out、fe.log日志片段。

xx.xx.305.2 fe.out

May 20, 2021 10:02:25 AM com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy proxy
INFO: Use Simple share channel pool to create protobuf RPC proxy with interface interface org.apache.doris.rpc.PBackendService
2021-05-20 10:02:31 ERROR ResourceLeakDetector:317 - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
        io.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:223)
        io.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:201)
        io.netty.handler.codec.MessageAggregator.decode(MessageAggregator.java:271)
        io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328)
        io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302)
        io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
        io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
        io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        java.lang.Thread.run(Thread.java:748)
May 20, 2021 10:04:35 AM com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy proxy
INFO: Use Simple share channel pool to create protobuf RPC proxy with interface interface org.apache.doris.rpc.PBackendService
May 20, 2021 10:04:36 AM com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy proxy
INFO: Use Simple share channel pool to create protobuf RPC proxy with interface interface org.apache.doris.rpc.PBackendService
May 20, 2021 10:04:36 AM com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy proxy
INFO: Use Simple share channel pool to create protobuf RPC proxy with interface interface org.apache.doris.rpc.PBackendService
May 20, 2021 10:04:36 AM com.baidu.jprotobuf.pbrpc.client.ProtobufRpcProxy proxy
INFO: Use Simple share channel pool to create protobuf RPC proxy with interface interface org.apache.doris.rpc.PBackendService
[2021-05-25 08:03:27] write bdb failed. will exit. journalId: 1760442, bdb database Name: 1734838

xx.xx.305.2 fe.log

2021-05-25 01:06:52,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 01:06:53,063 ERROR (heartbeat mgr|20) [BDBJEJournal.write():166] catch an exception when writing to database. sleep and retry. journal id 1753065
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -17023838  VLSN: 33,607,823, initiated at: 01:06:51.  Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 2000ms. FeederState=31.16.205.2_9010_1617961030158(1)[MASTER]
Current feeds:
 31.16.205.7_9010_1617961064994: feederVLSN=33,607,824 replicaTxnEndVLSN=33,607,821
 31.16.205.50_9010_1617961999894: feederVLSN=33,607,824 replicaTxnEndVLSN=33,607,821

        at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.operationEnd(Txn.java:1803) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.put(Database.java:1506) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.put(Database.java:1556) ~[je-7.3.7.jar:7.3.7]
        at org.apache.doris.journal.bdbje.BDBJEJournal.write(BDBJEJournal.java:159) [palo-fe.jar:3.4.0]
        at org.apache.doris.persist.EditLog.logEdit(EditLog.java:849) [palo-fe.jar:3.4.0]
        at org.apache.doris.persist.EditLog.logHeartbeat(EditLog.java:1265) [palo-fe.jar:3.4.0]
        at org.apache.doris.system.HeartbeatMgr.runAfterCatalogReady(HeartbeatMgr.java:154) [palo-fe.jar:3.4.0]
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) [palo-fe.jar:3.4.0]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]
2021-05-25 01:06:53,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: HDD. skip
2021-05-25 01:06:53,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 01:06:54,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: HDD. skip
2021-05-25 01:06:54,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 01:06:55,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: HDD. skip
2021-05-25 01:06:55,350 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 01:06:55,419 INFO (Thread-43|89) [ReportHandler.taskReport():319] begin to handle task report from backend 10008
2021-05-25 01:06:55,419 INFO (thrift-server-pool-6|155) [ReportHandler.handleReport():174] receive report from be 10008. type: task, current queue size: 1
2021-05-25 01:06:55,419 INFO (Thread-43|89) [ReportHandler.taskReport():359] finished to handle task report from backend 10008, diff task num: 0. cost: 0 ms
2021-05-25 01:06:56,351 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: HDD. skip
2021-05-25 01:06:56,351 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 01:06:57,225 INFO (Routine load scheduler|31) [RoutineLoadScheduler.process():74] there are 0 job need schedule
....
.....
......
2021-05-25 08:03:27,258 INFO (Thread-43|89) [ReportHandler.taskReport():319] begin to handle task report from backend 10008
2021-05-25 08:03:27,258 INFO (Thread-43|89) [ReportHandler.taskReport():359] finished to handle task report from backend 10008, diff task num: 0. cost: 0 ms
2021-05-25 08:03:27,258 INFO (thrift-server-pool-12|182) [ReportHandler.handleReport():174] receive report from be 10008. type: task, current queue size: 1
2021-05-25 08:03:27,472 WARN (Thread-42|81) [BDBJEMetricHandler.write():117] write metric data into bdb error, key:31.16.205.2:8030_quantile0.98_1621900995000
com.sleepycat.je.rep.InsufficientAcksException: (JE 7.3.7) Transaction: -17132024  VLSN: 33,821,069, initiated at: 08:03:25.  Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 1. Missing replica acks: 1. Timeout: 2000ms. FeederState=31.16.205.2_9010_1617961030158(1)[MASTER]
Current feeds:
 31.16.205.7_9010_1617961064994: feederVLSN=33,821,070 replicaTxnEndVLSN=33,821,067
 31.16.205.50_9010_1617961999894: feederVLSN=33,821,070 replicaTxnEndVLSN=33,821,067

        at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:226) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.commit(Txn.java:772) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.commit(Txn.java:625) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Txn.operationEnd(Txn.java:1803) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.put(Database.java:1506) ~[je-7.3.7.jar:7.3.7]
        at org.apache.doris.metric.collector.BDBJEMetricHandler.write(BDBJEMetricHandler.java:115) ~[palo-fe.jar:3.4.0]
        at org.apache.doris.metric.collector.BDBJEMetricHandler.writeDouble(BDBJEMetricHandler.java:109) ~[palo-fe.jar:3.4.0]
        at org.apache.doris.metric.collector.MetricCollector.parseFeMetricJsonAndWriteMetric(MetricCollector.java:257) ~[palo-fe.jar:3.4.0]
        at org.apache.doris.metric.collector.MetricCollector.writeMetric(MetricCollector.java:105) ~[palo-fe.jar:3.4.0]
        at org.apache.doris.metric.collector.MetricCollector.lambda$init$0(MetricCollector.java:77) ~[palo-fe.jar:3.4.0]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
2021-05-25 08:03:27,695 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: HDD. skip
2021-05-25 08:03:27,695 INFO (tablet scheduler|25) [BeLoadRebalancer.selectAlternativeTabletsForCluster():84] cluster is balance: default_cluster with medium: SSD. skip
2021-05-25 08:03:27,718 INFO (Routine load scheduler|31) [RoutineLoadScheduler.process():74] there are 0 job need schedule
2021-05-25 08:03:27,784 ERROR (heartbeat mgr|20) [BDBJEJournal.write():189] write bdb failed. will exit. journalId: 1760442, bdb database Name: 1734838

 

xx.xx.305.50 fe.out

[2021-05-20 10:01:01] notify new FE type transfer: UNKNOWN
[2021-05-20 10:01:01] notify new FE type transfer: FOLLOWER
2021-05-20 10:07:51 ERROR ResourceLeakDetector:317 - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
        io.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:223)
        io.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:201)
        io.netty.handler.codec.MessageAggregator.decode(MessageAggregator.java:271)
        io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328)
        io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302)
        io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
        io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
        io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        java.lang.Thread.run(Thread.java:748)
[2021-05-25 08:03:29] notify new FE type transfer: UNKNOWN
[2021-05-25 08:03:32] notify new FE type transfer: FOLLOWER
[2021-05-25 08:10:38] notify new FE type transfer: UNKNOWN

xx.xx.305.50 fe.log

2021-05-25 01:06:51,064 INFO (replayer|57) [Catalog.replayJournal():2467] replayed journal id is 1753064, replay to journal id is 1753065
2021-05-25 01:06:52,064 WARN (replayer|57) [BDBJournalCursor.next():148] Catch an exception when get next JournalEntity. key:1753065
com.sleepycat.je.LockTimeoutException: (JE 7.3.7) Lock expired. Locker 1994034537 -1_replayer_ReplicaThreadLocker: waited for lock on database=1734838 LockAddr:1402590721 LSN=0x2ce/0x32a94e type=READ grant=WAIT_NEW timeoutMillis=1000 startTime=1621876011064 endTime=1621876012064
Owners: [1134441671 -17023838_ReplayThread_ReplayTxn" type="WRITE"/>]
Waiters: []

        at com.sleepycat.je.txn.LockManager.makeTimeoutException(LockManager.java:1117) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.LockManager.waitForLock(LockManager.java:606) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.LockManager.lock(LockManager.java:345) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:124) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.txn.ReplicaThreadLocker.lockInternal(ReplicaThreadLocker.java:63) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Locker.lock(Locker.java:499) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3585) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3316) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLNAndCheckDefunct(CursorImpl.java:2138) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.searchExact(CursorImpl.java:1950) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.searchExact(Cursor.java:4194) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:4055) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.search(Cursor.java:3857) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.getInternal(Cursor.java:1284) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.get(Database.java:1271) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.get(Database.java:1330) ~[je-7.3.7.jar:7.3.7]
        at org.apache.doris.journal.bdbje.BDBJournalCursor.next(BDBJournalCursor.java:108) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog.replayJournal(Catalog.java:2477) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog$3.runOneCycle(Catalog.java:2266) [palo-fe.jar:3.4.0]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]
2021-05-25 01:06:52,066 WARN (replayer|57) [Catalog.replayJournal():2495] replay journal cost too much time: 1002 replayedJournalId: 1753064
2021-05-25 01:06:52,069 INFO (replayer|57) [Catalog.replayJournal():2467] replayed journal id is 1753064, replay to journal id is 1753065
2021-05-25 01:06:53,070 WARN (replayer|57) [BDBJournalCursor.next():148] Catch an exception when get next JournalEntity. key:1753065
com.sleepycat.je.LockTimeoutException: (JE 7.3.7) Lock expired. Locker 939294473 -1_replayer_ReplicaThreadLocker: waited for lock on database=1734838 LockAddr:1402590721 LSN=0x2ce/0x32a94e type=READ grant=WAIT_NEW timeoutMillis=1000 startTime=1621876012069 endTime=1621876013069
...
....
.....
2021-05-25 08:10:15,269 WARN (replayer|57) [Catalog.replayJournal():2495] replay journal cost too much time: 1001 replayedJournalId: 1760538
2021-05-25 08:10:15,270 INFO (replayer|57) [Catalog.replayJournal():2467] replayed journal id is 1760538, replay to journal id is 1760539
2021-05-25 08:10:16,270 WARN (replayer|57) [BDBJournalCursor.next():148] Catch an exception when get next JournalEntity. key:1760539
com.sleepycat.je.LockTimeoutException: (JE 7.3.7) Lock expired. Locker 1722487307 -1_replayer_ReplicaThreadLocker: waited for lock on database=1760443 LockAddr:222546656 LSN=0x2d3/0x33529e type=READ grant=WAIT_NEW timeoutMillis=1000 startTime=1621901415270 endTime=1621901416270
Owners: [612770246 -17133270_ReplayThread_ReplayTxn" type="WRITE"/>]
Waiters: []

        at com.sleepycat.je.txn.LockManager.makeTimeoutException(LockManager.java:1117) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.LockManager.waitForLock(LockManager.java:606) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.LockManager.lock(LockManager.java:345) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:124) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.rep.txn.ReplicaThreadLocker.lockInternal(ReplicaThreadLocker.java:63) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.txn.Locker.lock(Locker.java:499) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3585) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:3316) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.lockLNAndCheckDefunct(CursorImpl.java:2138) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.dbi.CursorImpl.searchExact(CursorImpl.java:1950) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.searchExact(Cursor.java:4194) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:4055) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.search(Cursor.java:3857) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Cursor.getInternal(Cursor.java:1284) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.get(Database.java:1271) ~[je-7.3.7.jar:7.3.7]
        at com.sleepycat.je.Database.get(Database.java:1330) ~[je-7.3.7.jar:7.3.7]
        at org.apache.doris.journal.bdbje.BDBJournalCursor.next(BDBJournalCursor.java:108) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog.replayJournal(Catalog.java:2477) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog$3.runOneCycle(Catalog.java:2266) [palo-fe.jar:3.4.0]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]
2021-05-25 08:10:16,271 WARN (replayer|57) [Catalog.replayJournal():2495] replay journal cost too much time: 1001 replayedJournalId: 1760538
2021-05-25 08:10:16,272 INFO (replayer|57) [Catalog.replayJournal():2467] replayed journal id is 1760538, replay to journal id is 1760539
2021-05-25 08:10:38,574 WARN (REPLICA 31.16.205.50_9010_1617961999894(3)|44) [Catalog.notifyNewFETypeTransfer():2341] notify new FE type transfer: UNKNOWN
2021-05-25 08:10:38,574 INFO (stateListener|56) [Catalog$4.runOneCycle():2364] begin to transfer FE type from FOLLOWER to UNKNOWN
2021-05-25 08:10:38,574 WARN (stateListener|56) [Catalog.transferToNonMaster():1344] FOLLOWER to UNKNOWN, still offer read service
2021-05-25 08:10:38,575 INFO (stateListener|56) [Catalog$4.runOneCycle():2450] finished to transfer FE type to UNKNOWN
2021-05-25 08:11:22,821 INFO (tablet stat mgr|22) [TabletStatMgr.runAfterCatalogReady():78] finished to get tablet stat of all backends. cost: 9 ms
2021-05-25 08:11:22,821 INFO (tablet stat mgr|22) [TabletStatMgr.runAfterCatalogReady():123] finished to update index row num of all databases. cost: 0 ms
2021-05-25 08:14:49,342 WARN (replayer|57) [Catalog.setCanRead():2312] meta out of date. current time: 1621901689341, synchronized time: 1621901389340, has log: false, fe type: UNKNOWN
2021-05-25 08:14:54,343 WARN (replayer|57) [Catalog.setCanRead():2312] meta out of date. current time: 1621901694343, synchronized time: 1621901389340, has log: false, fe type: UNKNOWN
2021-05-25 08:14:59,345 WARN (replayer|57) [Catalog.setCanRead():2312] meta out of date. current time: 1621901699345, synchronized time: 1621901389340, has log: false, fe type: UNKNOWN
2021-05-25 08:15:04,346 WARN (replayer|57) [Catalog.setCanRead():2312] meta out of date. current time: 1621901704346, synchronized time: 1621901389340, has log: false, fe type: UNKNOWN
2021-05-25 08:15:09,348 WARN (replayer|57) [Catalog.setCanRead():2312] meta out of date. current time: 1621901709348, synchronized time: 1621901389340, has log: false, fe type: UNKNOWN
收藏
点赞
0
个赞
共4条回复 最后由IamStrangers回复于2021-05
#5IamStrangers回复于2021-05
#4 longfei113回复
还有一个问题请教,不知道是否跟本帖的问题有关。 就是FE和BE混部时,元数据目录和数据目录在同一个磁盘上,会不会有影响? 是否一定要严格分开在不同的磁盘?
展开

没有硬性限制,但是非常不建议。因为数据盘本身的写入压力较大,坏盘率相对较高,一旦坏盘,会同时硬性到数据和元数据。所以建议分盘存储。

0
#4longfei113回复于2021-05
#2 IamStrangers回复
当前请根据官网元数据恢复操作进行故障恢复。 之后可以尝试在 fe.conf 中添加 bdbje_lock_timeout_second=5 这个参数,再观察是否还有问题。 这个问题可能是FE的某些操作长时间占用锁,导致元数据写操作响应时间过长导致的,在后续版本中有相关的优化。  
展开

还有一个问题请教,不知道是否跟本帖的问题有关。

就是FE和BE混部时,元数据目录和数据目录在同一个磁盘上,会不会有影响? 是否一定要严格分开在不同的磁盘?

0
#3longfei113回复于2021-05
#2 IamStrangers回复
当前请根据官网元数据恢复操作进行故障恢复。 之后可以尝试在 fe.conf 中添加 bdbje_lock_timeout_second=5 这个参数,再观察是否还有问题。 这个问题可能是FE的某些操作长时间占用锁,导致元数据写操作响应时间过长导致的,在后续版本中有相关的优化。  
展开

收到,谢谢解答!

0
#2IamStrangers回复于2021-05

当前请根据官网元数据恢复操作进行故障恢复。

之后可以尝试在 fe.conf 中添加 bdbje_lock_timeout_second=5 这个参数,再观察是否还有问题。

这个问题可能是FE的某些操作长时间占用锁,导致元数据写操作响应时间过长导致的,在后续版本中有相关的优化。

 

1
快速回复
TOP
切换版块