Analyzing a Hang in a .NET Hotel Management System at a Travel Agency



1. The Story

Early this year a friend reached out to me, saying their management system had stopped responding, and asked me to help figure out what was going on. He had a dump in hand, so let's analyze it.

1. Is the thread-pool queue backed up?

My friend's system is a web application, and for an unresponsive web app the first thing to look at is the thread pool. Run the command below; the reference output is as follows:
0:000> !sos tpq
global work item queue
0x00000004010774C0 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT
0x0000000401077808 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT
....
0x000000030239DD78 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT
0x000000030239E0C0 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT

local per thread work items
0x0000000100A46410 System.Threading.Tasks.Task
...
0x000000010133F8C0 System.Threading.Tasks.Task

2 Work Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext+c.b__123_1
4 Work Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext+c.b__123_0
266 Work Microsoft.AspNetCore.SignalR.HubConnectionContext.AbortConnection
----
272

The output confirms there is indeed a thread-pool backlog. So why is there a backlog? Reflex says it might be a lock, so check with !syncblk:
0:000> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner

Total 468
CCW 0
RCW 0
ComClassFactory 0
Free 120

The output shows this has nothing to do with locks, so the only option left is to dig into the individual worker threads and see why they are so unproductive...

2. What are the threads doing?

To see what each thread is up to, use ~*e !clrstack to dump every thread's call stack. The output is far too long to paste here, but careful inspection shows that many threads are parked in TryGetConnection, screenshot below:

Roughly 143 threads are stuck in TryGetConnection, and it is not obvious why a Connection cannot be obtained. Let's look at the code in question, simplified below:


private bool TryGetConnection(DbConnection owningObject, uint waitForMultipleObjectsTimeout, bool allowCreate, bool onlyOneCheckConnection, DbConnectionOptions userOptions, out DbConnectionInternal connection)
{
    DbConnectionInternal dbConnectionInternal = null;
    Transaction transaction = null;

    if (HasTransactionAffinity)
    {
        dbConnectionInternal = GetFromTransactedPool(out transaction);
    }
    if (dbConnectionInternal == null)
    {
        Interlocked.Increment(ref _waitCount);
        do
        {
            int num = WaitHandle.WaitAny(_waitHandles.GetHandles(allowCreate), (int)waitForMultipleObjectsTimeout);
            // ... (simplified)
        } while (dbConnectionInternal == null);
    }
    // ... (simplified)
}

private DbConnectionInternal GetFromTransactedPool(out Transaction transaction)
{
    transaction = ADP.GetCurrentTransaction();
    DbConnectionInternal dbConnectionInternal = null;

    if (null != transaction && _transactedConnectionPool != null)
    {
        dbConnectionInternal = _transactedConnectionPool.GetTransactedObject(transaction);
        // ....
    }
    return dbConnectionInternal;
}

internal DbConnectionInternal GetTransactedObject(Transaction transaction)
{
    DbConnectionInternal dbConnectionInternal = null;
    TransactedConnectionList value;
    bool flag;

    lock (_transactedCxns)
    {
        flag = _transactedCxns.TryGetValue(transaction, out value);
    }
    if (flag)
    {
        lock (value)
        {
            int num = value.Count - 1;
            if (0 <= num)
            {
                dbConnectionInternal = value[num];
                value.RemoveAt(num);
            }
        }
    }
    return dbConnectionInternal;
}
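
For context, the transacted-pool path above is only consulted when the connection is opened inside an ambient System.Transactions transaction. A minimal sketch, not taken from the dump (the connection string is a placeholder), of the code shape that routes Open() through GetFromTransactedPool:

// Sketch: opening a SqlConnection inside a TransactionScope creates an ambient
// transaction, so TryGetConnection sees HasTransactionAffinity and first asks
// GetFromTransactedPool for a connection already enlisted in that transaction.
using System.Transactions;
using Microsoft.Data.SqlClient;

string connStr = "Server=.;Database=Hotel;Integrated Security=true";  // placeholder

using (var scope = new TransactionScope())
using (var conn = new SqlConnection(connStr))
{
    conn.Open();    // -> TryGetConnection -> GetFromTransactedPool -> GetTransactedObject
    // ... execute commands ...
    scope.Complete();
}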

Judging from the source, the problem is that the _transactedCxns lookup comes back empty (null), which suggests the connections behind _transactedCxns are exhausted. Next, use !dso to dig through the DbConnectionPool fields; output below:
0:298> !dso
OS Thread Id: 0x42c0 (298)
SP/REG Object Name
0005c087d020 0003ffc79528 Microsoft.Data.ProviderBase.DbConnectionPool

0:298> !do 0003ffc79528
Name: Microsoft.Data.ProviderBase.DbConnectionPool
MethodTable: 000007fe8dd714c8
EEClass: 000007fe8dd65438
Tracked Type:false
Size: 176(0xb0) bytes
File: D:\APP\Energydotnet\runtimes\win\lib\net6.0\Microsoft.Data.SQLClient.dll
Fields:
MT Field Offset Type VT Attr Value Name
000007fe8dd71370 40001a7 8c System.Int32 1 instance 1 _state
...
000007fe8cd994b0 40001ad 94 System.Int32 1 instance 143 _waitCount
000007fe8e012c68 40001b4 78 ...tedConnectionPool 0 instance 00000003ffc7a3e8 _transactedConnectionPool
...
000007fe8cd994b0 40001b6 9c System.Int32 1 instance 100 _totalObjects
...

0:298> !DumpObj /d 00000003ffc7a520
Name: System.Collections.Generic.Dictionary`2[[System.Transactions.Transaction, System.Transactions.Local],[Microsoft.Data.ProviderBase.DbConnectionPool+TransactedConnectionList, Microsoft.Data.SqlClient]]
MethodTable: 000007fe8e0146a0
EEClass: 000007fe8cd8af30
Tracked Type:false
Size: 80(0x50) bytes
File: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\6.0.22\System.Private.CoreLib.dll
Fields:

000007fe8cda8480 4002065 8 System.Int32 0 instance 0000000000000000 _buckets
000007fe8e3255d8 4002066 10 ...ivate.CoreLib]] 0 instance 0000000000000000 _entries
000007fe8cd9ca08 4002067 30 System.UInt64 1 instance 0 _fastModMultiplier
000007fe8cd994b0 4002068 38 System.Int32 1 instance 0 _count
000007fe8cd994b0 4002069 3c System.Int32 1 instance 0 _freeList
000007fe8cd994b0 400206a 40 System.Int32 1 instance 0 _freeCount
000007fe8cd994b0 400206b 44 System.Int32 1 instance 0 _version
000007fe8cdb7900 400206c 18 ...Private.CoreLib]] 0 instance 0000000000000000 _comparer
000007fe8da70c90 400206d 20 ...Private.CoreLib]] 0 instance 0000000000000000 _keys
000007fe8d2a1c60 400206e 28 ...Private.CoreLib]] 0 instance 0000000000000000 _values

From the data above we can learn three things:

100 _totalObjects: the connection pool currently holds 100 connections.

0 _count: those 100 connections are all checked out, i.e. the pool is exhausted.

143 _waitCount: 143 threads are currently waiting to acquire a Connection.
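
These numbers line up with SqlClient's default pooling behavior (Max Pool Size defaults to 100). A hypothetical repro sketch, not the customer's code and with a placeholder connection string, that produces the same picture: hold on to 100 connections and every subsequent Open() joins the waiter queue until the connect timeout fires.

// Leak 100 pooled connections; the 101st Open() waits on the pool's wait handles
// and finally throws InvalidOperationException ("Timeout expired ... all pooled
// connections were in use and max pool size was reached").
using System;
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

string connStr = "Server=.;Database=Hotel;Integrated Security=true;Max Pool Size=100";
var held = new List<SqlConnection>();

for (int i = 0; i < 100; i++)
{
    var c = new SqlConnection(connStr);
    c.Open();
    held.Add(c);          // never disposed, so never returned to the pool
}

try
{
    using var extra = new SqlConnection(connStr);
    extra.Open();         // blocks like the 143 waiters in the dump, then times out
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}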

3. Where did the pooled connections go?

To answer that, keep digging through the thread stacks. For example, searching for the TDS transport-layer method Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte turns up exactly 100 hits, and most of them have a xxxx.GetRoomNosInDltParams frame above them, screenshot below:

Digging through the individual thread stacks, they are nearly all running the SQL below, formatted for readability:


SELECT [HotelId],
       [RoomId],
       [SubRoomId],
       [SupplierId],
       [RoomNo],
       [RoomCategory],
       [SellPrice],
       [RoomNoState],
       [CheckInDate]
FROM [xxx]
WHERE ((((((([RoomId] IN (673, 674)) AND ([SellPrice] > @SellPrice1))
    AND ([CheckInDate] >= @CheckInDate2))
    AND ([CheckInDate] <= @CheckInDate3))
    AND ([RoomCategory] <= @constant5))
    AND ([RoomNoState] = @constant7))
    AND NOT ([RoomNo] LIKE @MethodConst8 + '%'))

Decompiling the application code reveals a useCached field that is effectively a no-op, so every call goes straight to the database, screenshot below:
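
A minimal sketch of the fix for that pattern, assuming the method shape (the GetRoomNosInDltParams signature, cache key, and 5-minute TTL here are illustrative, not from the decompiled source): actually honor useCached with an in-memory cache so repeated queries stop renting pooled connections.

using System;
using System.Collections.Generic;
using Microsoft.Extensions.Caching.Memory;

public class RoomNoRepository
{
    private static readonly MemoryCache _cache = new(new MemoryCacheOptions());

    // Illustrative signature: the real method's parameters are unknown.
    public IReadOnlyList<string> GetRoomNosInDltParams(string cacheKey, bool useCached,
                                                       Func<IReadOnlyList<string>> queryDatabase)
    {
        if (useCached && _cache.TryGetValue(cacheKey, out IReadOnlyList<string> cached))
            return cached;                                    // served from memory, no SqlConnection rented

        var rows = queryDatabase();                           // the slow SQL above runs only on a cache miss
        _cache.Set(cacheKey, rows, TimeSpan.FromMinutes(5));  // illustrative TTL
        return rows;
    }
}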

At this point we can pretty much piece it together: the hang comes down to Connection exhaustion. Optimization suggestions:

Add NOLOCK hints to the SQL to avoid blocking on locks (see the sketch after this list).

Make GetRoomNosInDltParams actually use its cache.

Check locks and load on the database side.

Configure a larger database connection pool.
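
For the first suggestion, a hedged illustration of the query with the hint added (table and parameter names are copied from the formatted SQL above; whether dirty reads are acceptable for room availability is a business decision):

// The same query with a WITH (NOLOCK) table hint so readers stop queuing behind
// writers; kept as a C# constant purely for illustration.
const string GetRoomNosSql = @"
SELECT [HotelId], [RoomId], [SubRoomId], [SupplierId], [RoomNo],
       [RoomCategory], [SellPrice], [RoomNoState], [CheckInDate]
FROM   [xxx] WITH (NOLOCK)
WHERE  [RoomId] IN (673, 674)
  AND  [SellPrice]    >  @SellPrice1
  AND  [CheckInDate]  >= @CheckInDate2
  AND  [CheckInDate]  <= @CheckInDate3
  AND  [RoomCategory] <= @constant5
  AND  [RoomNoState]  =  @constant7
  AND  NOT ([RoomNo] LIKE @MethodConst8 + '%')";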

This production hang was caused by a flood of slow database queries exhausting the SDK-side database connection pool (100 connections). With a database-side monitoring tool in place, I suspect the answer would have been obvious at a glance.
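
If raising the pool ceiling is on the table, it is a connection-string setting; the values below are placeholders, and more headroom only mitigates the symptom, it does not fix the slow queries themselves.

// Max Pool Size defaults to 100, exactly the 100 _totalObjects seen in the dump.
using Microsoft.Data.SqlClient;

var builder = new SqlConnectionStringBuilder("Server=.;Database=Hotel;Integrated Security=true")  // placeholder
{
    MaxPoolSize = 300,      // illustrative value
    ConnectTimeout = 15     // seconds a waiter spends in TryGetConnection before timing out
};
using var conn = new SqlConnection(builder.ConnectionString);
conn.Open();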

Source: opendotnet
