摘要:由于这个定时任务是 furion 写的,刚好这位学员是VIP客户,找了小僧大佬,大佬需要最小化的问题代码,由于不能本地复现,也就没下文了,毕竟也是触发了 13w 次之后才出现的问题,确实比较难搞,截图如下:
一:背景 1. 讲故事
前段时间有位朋友加我微信,来了就要进我的训练营,并且附带着纠结了他几个月的一个疑难杂症,让我帮忙看下怎么回事,问题描述截图如下:
由于这个定时任务是 furion 写的,刚好这位学员是VIP客户,找了小僧大佬,大佬需要最小化的问题代码,由于不能本地复现,也就没下文了,毕竟也是触发了 13w 次之后才出现的问题,确实比较难搞,截图如下:
像这种带着问题进训练营的朋友还是蛮多的,对这类需求我也是严格,谨慎,认真的对待,毕竟是骡子是马,得要拉出来溜溜。
二:为什么会任务延迟 1. 初步分析
经过和学员的沟通和截图确认,是一个叫M71EnterPortService的服务出现的延迟,这种问题相对来说比较简单,可能任务卡死在某个地方,通过~*e !clrstack观察下各个线程栈上是否有2. 走出迷雾
要想走出迷雾,需要回头看下类的调度方法,源码方法参考如下: [JobDetail("job_M71xxx", Description = "M7-1xxx作业", GroupName = "default", Concurrent = false)][Period(500L, TriggerId = "trigger_M71xxx", Description = "M7-1xxx作业")]
publicclassM71EnterPortService:IJob
{
publicstaticstringprocessCode ="M7-1A";
publicasyncTaskExecuteAsync(JobExecutingContext context, CancellationToken stoppingToken)
{
_ =string.Empty;
try
{
plcProcessPara =awaitxxxService.CheckProcess(processCode);
if(!plcProcessPara.IsProcess)
{
return;
}
...
if(string.IsOrWhiteSpace(rFIDReaderResultModel.RfidUid))
{
PLCxxxDriver.WritePlcxxxData($"PLC/{processCode}/xxxIPC","2");
awaitTask.Delay(2000);
}
awaitTask.CompletedTask;
}
catch(Exception ex)
{
xxxService.xxxWarn(produceRecord, plcProcessPara, ex);
}
}
}
从源码看,这是一个纯异步的写法,看到这个纯异步我就想到了新版的sos提供了一个!dumpasync命令,专门观察状态机链的,输出如下:
0:000> !dumpasync
...
STACK 9
0000025fc3a02f78 00007ff90f3f75e8 (-1) xxx.ComProcessService+d__1 @ 7ff90fdf02b0
0000025fc3a03008 00007ff90f3f82c0 (0) xxx.M71EnterPortService+d__3 @ 7ff90fe15a80
0000025fc3a03090 00007ff90f3f8a60 (0) Furion.Schedule.ScheduleHostedService+c__DisplayClass23_3+d @ 7ff90fdf0000
0000025fc3a030f8 00007ff90f3f8f58 (0) Furion.FriendlyException.Retry+d__1 @ 7ff90fdef840
0000025fc3a03198 00007ff90f3f93d0 (1) Furion.Schedule.ScheduleHostedService+c__DisplayClass23_2+d @ 7ff90fded340
...
我去,真的卡在下的CheckProcess中,接下来拿CheckProcess在所有线程栈上再次搜索,本以为有惊喜,同样毛都没有,我去。。。截图如下。
这就有点无语了,接下来我们观察下状态机地址0x0000025fc3a02f78 中的内部字段,从内部字段的赋值情况观察代码执行流,输出如下:
0:000> !dumpasync --address 0x0000025fc3a02f78 --fields
STACK 1
0000025fc3a02f78 00007ff90f3f75e8 (-1) xxx.ComProcessService+d__1 @ 7ff90fdf02b0
Address MT Type Value Name
0000025fc3a02fe0 00007ff90c3f7408 System.Int32 -1 1__state
0000025fc3a02fe8 00007ff90d872010 ...y.PLCPara.xxx> 0000025fc3a02ff0 t__builder
0000025f80683200 00007ff90c44bf40 System.String"M7-1A"processCode
0000025fc3a028e0 00007ff90d822c00 ...ty.PLCPara.xxx 0000025fc3a028e0 5__2
0000025fc3a02978 00007ff90c4f5b00 System.IDisposable 0000025fc3a02978 7__wrap2
0000025faf0ac600 00007ff90d8228d0 ...tity.PLCPara.xxx 0000025faf0ac600 5__4
0000025fc3a0aaf8 00007ff90d824358 ...ntity.PLCPara.xxx 0000025fc3a0aaf8 5__5
0000025fc3a02ff0 00007ff90d872118 ...tity.PLCPara.xxx> 0000025fc3a02ff8 u__1
0000025fc3a02ff8 00007ff90c98b920 ....CompilerServices.TaskAwaiter 0000025fc3a03000 u__2
0000025fc3a03008 00007ff90f3f82c0 (0) xxx.M71EnterPortService+d__3 @ 7ff90fe15a80
0000025fc3a03058 00007ff90c3f7408 System.Int32 0 1__state
0000025fc3a03060 00007ff90c98b540 ...rvices.AsyncTaskMethodBuilder 0000025fc3a03068 t__builder
000000000000000000007ff90d82ec98 ....DBModel.xxx 5__2
000000000000000000007ff90d822c00 ...ty.PLCPara.xxx 5__3
0000025fc3a03068 00007ff90d878de0 ...y.PLCPara.xxx> 0000025fc3a03070 u__1
0000025fc3a03070 00007ff90d879368 ...DBModel.xxx> 0000025fc3a03078 u__2
0000025fc3a03078 00007ff90d878eb8 ...Entity.xxx> 0000025fc3a03080 u__3
0000025fc3a03080 00007ff90c98b920 ....CompilerServices.TaskAwaiter 0000025fc3a03088 u__4
0000025fc3a03090 00007ff90f3f8a60 (0) Furion.Schedule.ScheduleHostedService+c__DisplayClass23_3+d @ 7ff90fdf0000
0000025fc3a030d8 00007ff90c3f7408 System.Int32 0 1__state
0000025fc3a030e0 00007ff90c98b540 ...rvices.AsyncTaskMethodBuilder 0000025fc3a030e8 t__builder
0000025fc3a02560 00007ff90cbbc378 ...Service+c__DisplayClass23_3 0000025fc3a02560 4__this
0000025fc3a030e8 00007ff90c98b920 ....CompilerServices.TaskAwaiter 0000025fc3a030f0 u__1
0000025fc3a030f8 00007ff90f3f8f58 (0) Furion.FriendlyException.Retry+d__1 @ 7ff90fdef840
0000025fc3a03168 00007ff90c3f7408 System.Int32 0 1__state
0000025fc3a03180 00007ff90c98b540 ...rvices.AsyncTaskMethodBuilder 0000025fc3a03188 t__builder
0000025fc3a02860 00007ff90dd040f8 ... 0000025fc3a02860 action
0000025fc3a0316c 00007ff90c3f7408 System.Int32 0 numRetries
0000025fc3a03178 00007ff90c3f2f78 System.BooleantruefinalThrow
000000000000000000007ff90f237fc8 ... System.Threading.Tasks.Task> fallbackPolicy
000000000000000000007ff90c528d80 System.Type exceptionTypes
000000000000000000007ff90cbd2f28 ...on.Retry+c__DisplayClass1_0 8__1
0000025fc3a028a0 00007ff90f233868 ...n 0000025fc3a028a0 retryAction
0000025fc3a03170 00007ff90c3f7408 System.Int32 1000 retryTimeout
0000025fc3a03174 00007ff90c3f7408 System.Int32 0 5__2
0000025fc3a03188 00007ff90c98b920 ....CompilerServices.TaskAwaiter 0000025fc3a03190 u__1
000000000000000000007ff90c334730 System.Object 7__wrap2
...
要想理解上面的字段,需要大家对状态机内部的有一些了解,比如:
5__xxx表示 await 的返回值。
1__state表示当一个方法中有多个await 时,这个字段会阶段性的记录当前是第几个await。
结合5__xxx赋值情况 和processMark的数据标记情况,推测出是卡死在SavePlcPara中,截图如下:最后就是让朋友重点观察下 TDengine.Driver.Impl.NativeMethods.NativeMethods.QueryWithReqid 方法,可以用排除法观察。
三:总结
这次任务延迟事故在分析过程中还是有相当大的迷惑性,如果你缺乏对状态机的理解以及不知!dumpasync命令的使用,我相信这个问题你很难搞定。来源:蓝胖子科技达人