Stream Load import error
_依然爱你丶m posted in 2021-05 | Views: 5571 | Replies: 6

Doris version: 0.14x

Flink loads data into the Doris BE nodes in real time via Stream Load. Maximum batch size: 5000 rows. Load interval: 2 s. The following error is returned frequently.

What typically causes the error message marked in red below (the "Message" field of the response)? Thanks.

{
"TxnId": 11169477,
"Label": "audit_20210520_150710_bd4c6dc578ad452f8b1c644ba7977611",
"Status": "Fail",
"Message": "already stopped, skip waiting for close. cancelled/!eos: : 1/0",
"NumberTotalRows": 0,
"NumberLoadedRows": 0,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 799,
"LoadTimeMs": 185,
"BeginTxnTimeMs": 3,
"StreamLoadPutTimeMs": 4,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 178,
"CommitAndPublishTimeMs": 0
}
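For readers handling this response on the client side, here is a minimal Python sketch of checking the returned JSON before treating a batch as loaded. The function name `check_stream_load` is hypothetical, and treating "Publish Timeout" as a success is an assumption based on the documented Stream Load status values; adjust it to your own retry policy.

```python
import json


def check_stream_load(body: str) -> dict:
    """Parse a Stream Load response body and raise if the load did not succeed.

    Assumption: "Success" and "Publish Timeout" both mean the data was committed.
    """
    result = json.loads(body)
    status = result.get("Status")
    if status not in ("Success", "Publish Timeout"):
        # Keep TxnId and Label in the error: they are needed later to find
        # the matching lines in be.INFO.
        raise RuntimeError(
            "stream load failed: status=%s txn_id=%s label=%s message=%s"
            % (status, result.get("TxnId"), result.get("Label"), result.get("Message"))
        )
    return result
```

Logging the TxnId and Label on failure makes it much easier to match a failed batch to the BE log lines discussed in the replies.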

6 replies in total. Last reply by IamStrangers in 2021-05
#7 IamStrangers replied in 2021-05
Replying to #6 (13671653088):

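A size of 0 probably indicates an empty version, for example when a load contained no rows for this bucket. It is worth checking whether the data is skewed here.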
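To get a rough sense of how many versions are empty, the rowset strings from the compaction status output in #6 can be counted with a short Python sketch. It assumes the size is always the last field of each entry (a bare `0` for an empty version), which matches the output pasted in this thread but is not a documented format. The function name is hypothetical.

```python
import re

# Entries look like "[306625-306625] 0 <rowset id> 0" or "[2-306622] 1 <rowset id> 57.63 MB".
ROWSET_RE = re.compile(r"^\[(\d+)-(\d+)\]\s")


def count_empty_versions(rowsets):
    """Return (empty, total) for a list of rowset description strings."""
    empty = total = 0
    for entry in rowsets:
        if not ROWSET_RE.match(entry):
            continue  # skip anything that is not a rowset entry
        total += 1
        # Assumption: the trailing field is the size, and "0" means empty.
        if entry.rsplit(" ", 1)[-1] == "0":
            empty += 1
    return empty, total


sample = [
    "[306623-306623] 1 02000000010fbdc33f4cba592a86fb317931083d333d1daa 37.18 KB",
    "[306624-306624] 1 02000000010fbdc83f4cba592a86fb317931083d333d1daa 20.04 KB",
    "[306625-306625] 0 02000000010fbe523f4cba592a86fb317931083d333d1daa 0",
    "[306626-306626] 0 02000000010fbe9f3f4cba592a86fb317931083d333d1daa 0",
]
print(count_empty_versions(sample))  # (2, 4)
```

Comparing this ratio across the tablets of one partition gives a quick hint of skew: if a few tablets hold almost all non-empty versions, the bucketing key is probably unevenly distributed.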

#6 13671653088 replied in 2021-05
I0522 01:31:52.280251  7858 compaction.cpp:134] succeed to do cumulative compaction. tablet=324693.42239599.094bf4335219664a-63f07497380fc9b4, output_version=2-261632, segments=42. elapsed time=6.6938s.
I0522 01:31:56.484786  7858 compaction.cpp:134] succeed to do cumulative compaction. tablet=306392.1830401504.954fe37981907cd0-c198e7c5a81e5ea5, output_version=2-318482, segments=16. elapsed time=3.83519s.
I0522 01:32:01.131608  7854 compaction.cpp:134] succeed to do cumulative compaction. tablet=2561770.1719526158.0340de4927ba0d10-37f77a2193a0bab8, output_version=2-228407, segments=18. elapsed time=8.79973s.
I0522 01:32:02.287439  7858 compaction.cpp:134] succeed to do cumulative compaction. tablet=305722.1544667288.9147650786c3941d-fa8b7321dd772389, output_version=2-229191, segments=46. elapsed time=5.53432s.
I0522 01:32:08.327064  7858 compaction.cpp:134] succeed to do cumulative compaction. tablet=2561738.1719526158.8d484f17609a7ae6-d4a986ffc24f0dad, output_version=2-228411, segments=1. elapsed time=5.74763s.
I0522 01:32:10.300772  7856 compaction.cpp:134] succeed to do cumulative compaction. tablet=306412.1830401504.314304b7328ee487-77bc61313070c792, output_version=2-318486, segments=81. elapsed time=8.89613s.
I0522 01:32:17.102450  7856 compaction.cpp:134] succeed to do cumulative compaction. tablet=324709.42239599.2c4205899c6ecf52-3541753b6e4041ac, output_version=179273-261642, segments=44. elapsed time=6.46553s.
I0522 01:32:17.923872  7856 compaction.cpp:134] succeed to do cumulative compaction. tablet=309323.756207201.1f45e25b2843db08-b532f260e1f98b97, output_version=2-45600, segments=1. elapsed time=0.515178s.
I0522 01:32:24.051213  7858 compaction.cpp:134] succeed to do cumulative compaction. tablet=2561800.1202678566.414803dd224c9634-06cd498c40ea66ac, output_version=559939-701558, segments=33. elapsed time=15.4279s.
I0522 01:32:29.120107  7856 compaction.cpp:134] succeed to do cumulative compaction. tablet=325157.1004740614.c44d6be49ee7eb3f-1d085191370659ac, output_version=492314-1027607, segments=4. elapsed time=10.9195s.

 

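The log above shows the current version compaction activity on the cluster.

There is almost no base compaction (BC), only cumulative compaction (CC).

I plan to apply the following two parameters. Are there other parameters or values you would recommend adjusting?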

echo "compaction_task_num_per_disk=5" >> /etc/doris/be/conf/be.conf

echo "max_cumulative_compaction_num_singleton_deltas=500" >> /etc/doris/be/conf/be.conf

Also, when checking the compaction status I found a large number of versions with a size of 0. What causes this? Did those loads fail?

{
    "cumulative policy type": "SIZE_BASED",
    "cumulative point": 2,
    "last cumulative failure time": "1970-01-01 08:00:00.000",
    "last base failure time": "2021-05-21 11:09:30.533",
    "last cumulative success time": "2021-05-22 10:14:35.965",
    "last base success time": "1970-01-01 08:00:00.000",
    "rowsets": [
        "[0-1] 0 DATA NONOVERLAPPING 020000000052a3cbbe411e9752c02807a111f63a03e1b191 0",
        "[2-460677] 1 DATA NONOVERLAPPING 02000000009ac4bccf4d44429c2a166ddd9dc897b274a199 57.90 MB"
    ],
    "stale_rowsets": [
        "[2-306622] 1 02000000010fb8543f4cba592a86fb317931083d333d1daa 57.63 MB",
        "[306623-306623] 1 02000000010fbdc33f4cba592a86fb317931083d333d1daa 37.18 KB",
        "[306624-306624] 1 02000000010fbdc83f4cba592a86fb317931083d333d1daa 20.04 KB",
        "[306625-306625] 0 02000000010fbe523f4cba592a86fb317931083d333d1daa 0",
        "[306626-306626] 0 02000000010fbe9f3f4cba592a86fb317931083d333d1daa 0",
        "[306627-306627] 0 02000000010fbf1e3f4cba592a86fb317931083d333d1daa 0",
        "[306628-306628] 0 02000000010fbf4d3f4cba592a86fb317931083d333d1daa 0",
        "[306629-306629] 0 02000000010fbf723f4cba592a86fb317931083d333d1daa 0",
        "[306630-306630] 0 02000000010fc0333f4cba592a86fb317931083d333d1daa 0",
        "[306631-306631] 0 02000000010fc0723f4cba592a86fb317931083d333d1daa 0",
        "[306632-306632] 0 02000000010fc0993f4cba592a86fb317931083d333d1daa 0",
        "[306633-306633] 0 02000000010fc0c03f4cba592a86fb317931083d333d1daa 0",
        "[306634-306634] 0 02000000010fc0f93f4cba592a86fb317931083d333d1daa 0",
        "[306635-306635] 0 02000000010fc1333f4cba592a86fb317931083d333d1daa 0",
        "[306636-306636] 0 02000000010fc1563f4cba592a86fb317931083d333d1daa 0",
        "[306637-306637] 0 02000000010fc23b3f4cba592a86fb317931083d333d1daa 0",
        "[306638-306638] 0 02000000010fc2543f4cba592a86fb317931083d333d1daa 0",
        "[306639-306639] 0 02000000010fc2893f4cba592a86fb317931083d333d1daa 0",
        "[306640-306640] 0 02000000010fc29f3f4cba592a86fb317931083d333d1daa 0",
        "[306641-306641] 1 02000000010fc2ef3f4cba592a86fb317931083d333d1daa 5.90 KB",
        "[306642-306642] 0 02000000010fc3093f4cba592a86fb317931083d333d1daa 0",
        "[306643-306643] 0 02000000010fc37d3f4cba592a86fb317931083d333d1daa 0",
        "[306644-306644] 0 02000000010fc3a03f4cba592a86fb317931083d333d1daa 0",
        "[306645-306645] 0 02000000010fc3dd3f4cba592a86fb317931083d333d1daa 0",
#5 IamStrangers replied in 2021-05

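tablet_id=2527704, txn_id=11178426, err=-215 most likely means that too many data versions have accumulated. Doris currently limits a tablet to 500 versions, controlled by the BE parameter max_tablet_version_num. The limit exists to stop versions from piling up indefinitely when loads are so frequent that compaction cannot keep up with writes.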

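This kind of problem usually requires lowering the load frequency and waiting for compaction to work through the existing versions. Doris monitoring also includes compaction score metrics.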
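As a back-of-the-envelope illustration of why the load frequency matters, the sketch below estimates how long a tablet takes to reach the version limit. It assumes every load adds one version to the tablet; `compacted_per_s` is a hypothetical parameter for the net number of versions compaction removes per second, not a Doris setting.

```python
def seconds_until_version_limit(load_interval_s, version_limit=500, compacted_per_s=0.0):
    """Estimate seconds until a tablet accumulates `version_limit` versions.

    Returns None when compaction removes versions at least as fast as loads add them.
    """
    produced_per_s = 1.0 / load_interval_s  # one new version per load (assumption)
    net_per_s = produced_per_s - compacted_per_s
    if net_per_s <= 0:
        return None
    return version_limit / net_per_s


# One load every 2 s with compaction stalled: 500 / 0.5 = 1000 s, about 17 minutes.
print(seconds_until_version_limit(2))  # 1000.0
```

Under these assumptions, doubling the interval to 4 s while loading twice as many rows per batch doubles the headroom without reducing throughput.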

Alternatively, run show tablet 2527704 and then execute the show proc statement it returns to see the version count of each replica (the versionCount column).
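The two-step lookup above could be scripted roughly as follows. This is a sketch only: it takes any DB-API style cursor connected to the FE over the MySQL protocol, and the column names `DetailCmd`, `BackendId`, and `VersionCount` are assumptions about the result sets that should be checked against your Doris version. The function name is hypothetical.

```python
def replica_version_counts(cursor, tablet_id):
    """Return [(backend_id, version_count), ...] for every replica of a tablet."""
    # Step 1: SHOW TABLET returns a row that contains the follow-up SHOW PROC command.
    cursor.execute("SHOW TABLET %d" % int(tablet_id))
    columns = [d[0] for d in cursor.description]
    row = cursor.fetchone()
    if row is None:
        raise LookupError("tablet %s not found" % tablet_id)
    detail_cmd = row[columns.index("DetailCmd")]  # assumed column name

    # Step 2: run the returned SHOW PROC statement and read one row per replica.
    cursor.execute(detail_cmd)
    columns = [d[0] for d in cursor.description]
    be_idx = columns.index("BackendId")        # assumed column name
    count_idx = columns.index("VersionCount")  # assumed column name
    return [(r[be_idx], int(r[count_idx])) for r in cursor.fetchall()]
```

Replicas whose count is close to max_tablet_version_num are the ones likely to reject further writes until compaction catches up.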

#4 13671653088 replied in 2021-05

- AverageThreadTokens: 0.00
- FragmentCpuTime: 492.766us
- MemoryLimit: 2.00 GB
- PeakMemoryUsage: 1.21 MB
- PeakReservation: 0
- PeakUsedReservation: 0
- RowsProduced: 1
BlockMgr:

- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0.000ns
- TotalEncryptionTime: 0.000ns
- TotalIntegrityCheckTime: 0.000ns
- TotalReadBlockTime: 0.000ns
OlapTableSink:(Active: 21.307ms, non-child: 100.00%)
- CloseWaitTime: 20.370ms
- ConvertBatchTime: 0.000ns
- MaxAddBatchExecTime: 17.989ms
- NonBlockingSendTime: 1.409ms
- NonBlockingSendWorkTime: 298.548us
- SerializeBatchTime: 23.098us
- NumberBatchAdded: 10
- NumberNodeChannels: 10
- OpenTime: 744.186us
- RowsFiltered: 0
- RowsRead: 1
- RowsReturned: 1
- SendDataTime: 13.745us
- WaitMemLimitTime: 0.000ns
- TotalAddBatchExecTime: 39.369ms
- ValidateDataTime: 3.116us
BROKER_SCAN_NODE (id=0):(Active: 38.553us, non-child: 0.19%)
- BytesDecompressed: 0
- BytesRead: 284.00 B
- DecompressTime: 0.000ns
- FileReadTime: 5.751us
- MaterializeTupleTime(*): 11.861us
- NumDiskAccess: 0
- PeakMemoryUsage: 1.02 MB
- RowsRead: 1
- RowsReturned: 1
- RowsReturnedRate: 25.94 K/sec
- TotalRawReadTime(*): 32.899us
- TotalReadThroughput: 0.00 /sec
- WaitScannerTime: 0.000ns
W0520 15:51:04.718528 2596 stream_load_executor.cpp:90] fragment execute failed, query_id=d94952cdffe4ff08-131b39ecc22d6988, err_msg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215, id=d94952cdffe4ff08-131b39ecc22d6988, job_id=-1, txn_id=11178426, label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e
W0520 15:51:04.718605 2801 stream_load.cpp:142] handle streaming load failed, id=d94952cdffe4ff08-131b39ecc22d6988, errmsg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215

#3 13671653088 replied in 2021-05

From be.INFO, here is the log of one Stream Load that failed with -215:

Could you please take a look and suggest a good fix for this kind of error?

I0520 15:51:04.691452 2801 stream_load.cpp:214] new income streaming load request.id=d94952cdffe4ff08-131b39ecc22d6988, job_id=-1, txn_id=-1, label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e, db=ods_dental, tbl=ods_dental_operationlog

I0520 15:51:04.696164 2801 stream_load_executor.cpp:53] begin to execute job. label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e, txn_id=11178426, query_id=d94952cdffe4ff08-131b39ecc22d6988

I0520 15:51:04.696190 2801 plan_fragment_executor.cpp:76] Prepare(): query_id=d94952cdffe4ff08-131b39ecc22d6988 fragment_instance_id=d94952cdffe4ff08-131b39ecc22d6989 backend_num=0

I0520 15:51:04.696252 2801 plan_fragment_executor.cpp:140] Using query memory limit: 2.00 GB

I0520 15:51:04.696797 2596 plan_fragment_executor.cpp:239] Open(): fragment_instance_id=d94952cdffe4ff08-131b39ecc22d6989

I0520 15:51:04.697333 2759 tablets_channel.cpp:59] open tablets channel: (id=d94952cdffe4ff08-131b39ecc22d6988,index_id=2527663), tablets num: 3, timeout(s): 36000

I0520 15:51:04.699055 2749 tablets_channel.cpp:141] close tablets channel: (id=d94952cdffe4ff08-131b39ecc22d6988,index_id=2527663), sender id: 0

I0520 15:51:04.699074 20369 tablet_sink.cpp:979] all node channels are stopped(maybe finished/offending/cancelled), consumer thread exit.

I0520 15:51:04.699285 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527680.39817555.a14fc33f88e02df4-29c9453fca0a2196, rowsetid: 0200000000dd21cd6a432b7fb01433bddc002baa6e8f138e, version: 0

I0520 15:51:04.699295 2749 delta_writer.cpp:343] close delta writer for tablet: 2527680, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)

I0520 15:51:04.699322 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527668.39817555.754bf6c720eb9f87-282e5c4b1f464694, rowsetid: 0200000000dd21ce6a432b7fb01433bddc002baa6e8f138e, version: 0

I0520 15:51:04.699327 2749 delta_writer.cpp:343] close delta writer for tablet: 2527668, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)

I0520 15:51:04.699347 2749 txn_manager.cpp:250] commit transaction to engine successfully. partition_id: 308886, transaction_id: 11178426, tablet: 2527672.39817555.0f4841f222bb3bb5-8b4b4268a0aeecac, rowsetid: 0200000000dd21cf6a432b7fb01433bddc002baa6e8f138e, version: 0

I0520 15:51:04.699350 2749 delta_writer.cpp:343] close delta writer for tablet: 2527672, stats: (flush time(ms)=0, flush count=1, flush bytes: 4096, flush disk bytes: 0)

I0520 15:51:04.699422 2749 load_channel_mgr.cpp:152] removing load channel d94952cdffe4ff08-131b39ecc22d6988 because it's finished

I0520 15:51:04.699430 2749 load_channel.cpp:38] load channel mem peak usage=4096, info=limit: 2147483648; consumption: 0; label: LoadChannel:d94952cdffe4ff08-131b39ecc22d6988; all tracker size: 3; limit trackers size: 2; parent is null: false; , load_id=d94952cdffe4ff08-131b39ecc22d6988

W0520 15:51:04.699900 2759 tablet_sink.cpp:168] NodeChannel[2527663-10008] add batch req success but status isn't ok, load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426, node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215

W0520 15:51:04.700928 2596 tablet_sink.cpp:733] NodeChannel[2527663-10008]: close channel failed, load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426. error_msg=close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215

I0520 15:51:04.718088 2596 tablet_sink.cpp:749] total mem_exceeded_block_ns=0, total queue_push_lock_ns=0, total actual_consume_ns=298548

I0520 15:51:04.718101 2596 tablet_sink.cpp:780] finished to close olap table sink. load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426, node add batch time(ms)/num: {10006:(17)(1)} {10009:(0)(1)} {10002:(0)(1)} {10007:(0)(1)} {10010:(0)(1)} {10003:(17)(1)} {10005:(0)(1)} {10008:(0)(1)} {10011:(0)(1)} {10004:(0)(1)}

W0520 15:51:04.718353 2596 fragment_mgr.cpp:230] Got error while opening fragment d94952cdffe4ff08-131b39ecc22d6989: Internal error: close wait failed coz rpc error. node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, txn_id=11178426, err=-215

I0520 15:51:04.718505 2596 plan_fragment_executor.cpp:583] Fragment d94952cdffe4ff08-131b39ecc22d6989:(Active: 20.468ms, non-child: 0.00%)

#2 IamStrangers replied in 2021-05

"already stopped, skip waiting for close. cancelled/!eos: : 1/0"

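This error can have many causes, such as a failed RPC or a blocked node. The error reporting in the current version is incomplete at this point, which makes the problem hard to pinpoint.

You may need to use the txnId to find the query id of the corresponding load plan first, and then use the query id to find the related error messages.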

Alternatively, try searching be.INFO for -215, -235, or similar negative error codes; they may point to the reason the load failed.
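The search described above can be sketched in Python as follows. It relies only on the `err=<negative number>` and `txn_id=<number>` patterns visible in the log lines pasted in this thread; the function name is hypothetical and other BE versions may format these lines differently.

```python
import re

ERR_RE = re.compile(r"\berr=(-\d+)")
TXN_RE = re.compile(r"\btxn_id=(\d+)")


def failed_txns(lines):
    """Map each txn_id to the set of negative error codes logged on the same line."""
    found = {}
    for line in lines:
        codes = ERR_RE.findall(line)
        if not codes:
            continue  # line carries no negative error code
        for txn in set(TXN_RE.findall(line)):
            found.setdefault(txn, set()).update(codes)
    return found


sample = [
    "I0520 15:51:04.696164 2801 stream_load_executor.cpp:53] begin to execute job. "
    "label=audit_20210520_155104_6fe859081c2a46918e5a74bf3d71332e, txn_id=11178426, "
    "query_id=d94952cdffe4ff08-131b39ecc22d6988",
    "W0520 15:51:04.699900 2759 tablet_sink.cpp:168] NodeChannel[2527663-10008] add batch req "
    "success but status isn't ok, load_id=d94952cdffe4ff08-131b39ecc22d6988, txn_id=11178426, "
    "node=10.188.3.155:8060, errmsg=tablet writer write failed, tablet_id=2527704, "
    "txn_id=11178426, err=-215",
]
print(failed_txns(sample))  # {'11178426': {'-215'}}
```

To scan a real log, pass an open file object, for example `failed_txns(open("be.INFO"))`, where the path is whatever your BE log directory uses.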

We have improved this part of the error reporting in the next release.
