2011-12-29 628 views
8

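WinDbg: trapping the exception that causes a .NET service to crash

An x64 .NET service is crashing intermittently on our application servers. It can run for hours, days, or weeks without a problem, and then falls over leaving very little information behind.

The service runs on two servers in a cluster (three service instances per server), and any of the instances on either server can crash. A replicated environment shows the same behaviour, so I have exhausted the idea that this is a configuration problem.

The error originally pulled from the application server's event log was: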

Error message from event log on server XXXX 

Application: MySvc.exe 
Framework Version: v4.0.30319 
Description: The process was terminated due to an internal error in the .NET Runtime 
at IP 000007FEEFD8CD4C (000007FEEFC70000) with exit code 80131506 

This does not give much detail, and the best pointers I could find online amounted to "cross your fingers"...

Application Crashes With "Internal Error In The .NET Runtime"

http://www.jamesewelch.com/2010/09/30/troubleshooting-internal-error-in-the-net-runtime/
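The exit code in the event log entry above is an HRESULT, and splitting it into its fields at least confirms which subsystem reported the failure. The following is a minimal sketch, not part of the original post; the helper name `decode_hresult` and the small lookup table are illustrative assumptions, and the table covers only codes that appear in this thread.

```python
# Split a 32-bit HRESULT into severity, facility, and code fields.
# The lookup table is deliberately tiny and only illustrative.
KNOWN_CODES = {
    0x80131506: "COR_E_EXECUTIONENGINE (internal error in the CLR)",
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def decode_hresult(text):
    """Decode a hex string such as '80131506' into its HRESULT fields."""
    value = int(text, 16) & 0xFFFFFFFF
    return {
        "value": value,
        "failure": bool(value >> 31),        # severity bit
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility field
        "code": value & 0xFFFF,              # low 16 bits
        "name": KNOWN_CODES.get(value, "unknown"),
    }


if __name__ == "__main__":
    info = decode_hresult("80131506")
    print("facility=0x{facility:x} code=0x{code:x} {name}".format(**info))
```

Facility 0x13 is the .NET runtime's facility, so the code says the runtime itself gave up; it does not say what corrupted its state.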

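After a month of running with the ADPlus debugger attached, we finally have a string of failures and a few crash dumps. Now that I have the dumps, I am struggling to get anything useful out of them.

I have investigated a number of 'hang' dumps before with a good deal of success, and I have read a lot of Tess Ferrandez's blog, but 'crash' dumps are proving to be a dead end for me. Most objects, exceptions, and so on are marked for garbage collection and only the main thread is left, though I may be missing something.

I will add the details from !analyze -v along with the dump logs, which do show exceptions.

So the real question is: can someone give me some pointers on where to go from here? The exceptions in the dump logs do not match what I see in the actual dumps.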

DUMP 1 log (of little help so far) is here: http://pastebin.com/Eg5YCqww

DUMP 1 analysis (I have a symbol problem that I cannot resolve):

0:000> !analyze -v 
*** 
FAULTING_IP: 
+112c9440 
00000000`00000000 ??    ??? 

EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff) 
ExceptionAddress: 0000000000000000 
    ExceptionCode: 80000003 (Break instruction exception) 
    ExceptionFlags: 00000000 
NumberParameters: 0 

FAULTING_THREAD: 00000000000011f8 

PROCESS_NAME: MySvc.exe 

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached. 

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid 

MOD_LIST: <ANALYSIS/> 

NTGLOBALFLAG: 0 

APPLICATION_VERIFIER_FLAGS: 0 

MANAGED_STACK: 
(TransitionMU) 
000000000022EBB0 000007FEF40CB1AB System_ServiceProcess_ni!DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)+0x3b 
000000000022EC70 000007FEF40CD20D System_ServiceProcess_ni!System.ServiceProcess.ServiceBase.Run(System.ServiceProcess.ServiceBase[])+0x26d 
000000000022EDA0 000007FF00170227 MySvc!Ax.Remoting.MySvc.Main()+0x107 
(TransitionUM) 

MANAGED_STACK_COMMAND: _EFN_StackTrace 

BUGCHECK_STR: APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff 

PRIMARY_PROBLEM_CLASS: WRONG_SYMBOLS_FILL_PATTERN_ffffffff 

DEFAULT_BUCKET_ID: WRONG_SYMBOLS_FILL_PATTERN_ffffffff 

LAST_CONTROL_TRANSFER: from 000007fefd8810ac to 000000007760f6fa 

STACK_TEXT: 
00000000`0022e818 000007fe`fd8810ac : 00000000`007541f0 000007fe`f40ce089 00000000`0022e9c0 00000000`00000000 : ntdll!ZwWaitForSingleObject+0xa 
00000000`0022e820 000007fe`fe7daffb : 00000000`ffffffff 000007fe`fe7d344c 00000000`00000000 00000000`0000032c : KERNELBASE!WaitForSingleObjectEx+0x79 
00000000`0022e8c0 000007fe`fe7d9d61 : 00000000`01d47ff0 00000000`0000032c 00000000`00000000 00000000`00000000 : sechost!ScSendResponseReceiveControls+0x13b 
00000000`0022e9b0 000007fe`fe7d9c16 : 00000000`0022eb18 00000000`00000000 00000000`00000000 000007fe`00000000 : sechost!ScDispatcherLoop+0x121 
00000000`0022eac0 000007fe`f19017c7 : 00000000`11213890 00000000`01d635c0 00000000`00000000 00000000`00000000 : sechost!StartServiceCtrlDispatcherW+0x14e 
00000000`0022eb10 000007fe`f40cb1ab : 00000000`01d63680 00000000`0022ebe8 000007fe`f40a5b50 0000bf6c`4589127e : clr!DoNDirectCall__PatchGetThreadCall+0x7b 
00000000`0022ebb0 000007fe`f40cd20d : 00000000`01d63680 00000000`00000000 00000000`01d63698 00000000`00000000 : System_ServiceProcess_ni+0x2b1ab 
00000000`0022ec70 000007ff`00170227 : 00000000`10ff1ac8 00000000`10ff1af0 00000000`10ff1af0 00000000`10ff1af0 : System_ServiceProcess_ni+0x2d20d 
00000000`0022eda0 000007fe`f196dc54 : 00000000`0022ee80 000007fe`f1904e65 ffffffff`fffffffe 00000000`0022f3a0 : 0x7ff`00170227 
00000000`0022ee30 000007fe`f196dd69 : 000007ff`000551f8 00000000`00000001 00000000`00000000 00000000`00000000 : clr!CallDescrWorker+0x84 
00000000`0022ee70 000007fe`f196dde5 : 00000000`0022ef88 00000000`00000000 00000000`0022ef90 00000000`0022f168 : clr!CallDescrWorkerWithHandler+0xa9 
00000000`0022eef0 000007fe`f1a214c5 : 00000000`00000000 00000000`0022f178 00000000`00000000 00000000`00000000 : clr!MethodDesc::CallDescr+0x2a1 
00000000`0022f120 000007fe`f1a215fc : 00000000`000ad7c0 00000000`000ad7c0 00000000`00000000 00000000`00000000 : clr!ClassLoader::RunMain+0x228 
00000000`0022f370 000007fe`f1a213b2 : 00000000`0022f970 00000000`00000200 00000000`000b7a80 00000000`00000200 : clr!Assembly::ExecuteMainMethod+0xac 
00000000`0022f620 000007fe`f1ac6d66 : 00000000`00000000 00000000`10fd0000 00000000`00000000 00000000`00000000 : clr!SystemDomain::ExecuteMainMethod+0x452 
00000000`0022fbd0 000007fe`f1ac6c83 : 00000000`10fd0000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!ExecuteEXE+0x43 
00000000`0022fc30 000007fe`f1a2c515 : 00000000`000ad7c0 ffffffff`ffffffff 00000000`00000000 00000000`00000000 : clr!CorExeMainInternal+0xc4 
00000000`0022fca0 000007fe`f8973309 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`0022fc88 : clr!CorExeMain+0x15 
00000000`0022fce0 000007fe`f8a05b21 : 000007fe`f1a2c500 000007fe`f89732c0 00000000`00000000 00000000`00000000 : mscoreei!CorExeMain+0x41 
00000000`0022fd10 00000000`773bf56d : 000007fe`f8970000 00000000`00000000 00000000`00000000 00000000`00000000 : mscoree!CorExeMain_Exported+0x57 
00000000`0022fd40 00000000`775f2cc1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd 
00000000`0022fd70 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d 


STACK_COMMAND: ~0s; .ecxr ; kb 

FOLLOWUP_IP: 
sechost!ScSendResponseReceiveControls+13b 
000007fe`fe7daffb 85c0   test eax,eax 

SYMBOL_STACK_INDEX: 2 

SYMBOL_NAME: sechost!ScSendResponseReceiveControls+13b 

FOLLOWUP_NAME: MachineOwner 

MODULE_NAME: sechost 

IMAGE_NAME: sechost.dll 

DEBUG_FLR_IMAGE_TIMESTAMP: 4a5be05e 

FAILURE_BUCKET_ID: WRONG_SYMBOLS_FILL_PATTERN_ffffffff_80000003_sechost.dll!ScSendResponseReceiveControls 

BUCKET_ID: X64_APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff_sechost!ScSendResponseReceiveControls+13b 

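Update 1 (29 December):

I reconstructed one of the CLR exceptions from the dump log; the call stack follows. It looks like the exception occurs when calling the database (through ODAC).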

clr!RaiseTheExceptionInternalOnly+0x363 
clr!IL_Throw+0x146 
gm.a(System.String, System.String, Int32, System.String, XXBase, Int32, XXDataParameter[]) 
gm.b(XXBase, XXBase, Boolean, Boolean, Boolean, Int32) 
gm.b(XXBase, XXBase) 
od.a(XXGridQueue, TaskStatus, ProcessResult, Int32, Int32, Int32) 
od.b(XXGridQueue) 
he.b(XXBaseCollection) 
he.a(Boolean ByRef) 
XX.MySvc.tmr_Elapsed(System.Object) 
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) 

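Reconstructed call stack for the access violation exception. The error is thrown when the garbage collector runs after a call into the ODAC library.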

(1330.1074): Access violation - code c0000005 (first chance) 
FirstChance_av_AccessViolation 

clr!WKS::gc_heap::plan_phase+0x5ac 
clr!WKS::gc_heap::gc1+0xbb 
clr!WKS::gc_heap::garbage_collect+0x276 
clr!WKS::GCHeap::GarbageCollectGeneration+0x14e 
clr!WKS::gc_heap::try_allocate_more_space+0x25f 
clr!WKS::GCHeap::Alloc+0x7e 
clr!FastAllocatePrimitiveArray+0xc5 
clr!JIT_NewArr1+0x389 
System.Decimal.GetBits(System.Decimal) 
Oracle.DataAccess.Types.DecimalConv.GetDecimal(IntPtr) 
Oracle.DataAccess.Client.OracleDataReader.GetDecimal(Int32) 
Oracle.DataAccess.Client.OracleDataReader.GetValue(Int32) 
Oracle.DataAccess.Client.OracleDataReader.GetValues(System.Object[]) 
jr.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[]) 
ls.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[]) 
ba.a(System.String, System.Data.IDataReader, Boolean, ku, Boolean, System.Type[]) 
... 
XX.MySvc.tmr_Elapsed(System.Object) 

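Possible similar issues (based on the new information): http://markmail.org/message/yy3mvbngula4i3mu#query:+page:1+mid:l546gn5sfxtxxm5i+state:results http://social.msdn.microsoft.com/Forums/en/clr/thread/33920b39-690c-42c8-b04a-0f1f7176835a

Update 2 (23 February):

The ODAC components were upgraded to the correct version for .NET 4.0 (or at least the one listed as compatible on the Oracle site), and the problem still recurs. It keeps reappearing very intermittently, every week or two. The affected services are recycled daily.

There are also some dumps from the most recent crashes, and these still point to heap corruption (access violations), although they are not full dumps. In fact, creating the full dump appears to fail.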

Creating d:\dumps\2xx_Crash_Mode\FULLDUMP_FirstChance_epr_Process_Shut_Down_MySvc.exe__0344.dmp - mini user dump 
WriteFullMemory.Memory.Read(0x262c000, 0x1000) failed 0x8007012b, ABORT. 
Dump creation failed, Win32 error 0n299 
    "Only part of a ReadProcessMemory or WriteProcessMemory request was completed." 

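An additional custom managed (.NET) library is loaded into the application, and it also appears to be throwing an exception, although it is only a "first chance" exception and does not seem to cause the process failure (I guess it could still be a factor, though). It is actually our own library, so I can verify that it is not calling unmanaged code. The error is: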

EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff) 
ExceptionAddress: 000007fefcffaa7d (KERNELBASE!RaiseException+0x0000000000000039) 
ExceptionCode: c0000006 (In-page I/O error) 
ExceptionFlags: 00000000 
NumberParameters: 3 
Parameter[0]: 0000000000000000 
Parameter[1]: 000000006d34aca0 
Parameter[2]: 00000000c00000c4 
Inpage operation failed at 000000006d34aca0, due to I/O error 00000000c00000c4 

PROCESS_NAME: MySvc.exe 

ERROR_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x. 

EXCEPTION_OBJECT: !pe 1a8106a8 
Exception object: 000000001a8106a8 
Exception type: System.Runtime.InteropServices.SEHException 
Message:   External component has thrown an exception. 
InnerException: <none> 
StackTrace (generated): 
SP    IP    Function 
000000002C77B980 0000000000000000 ... 
000000002C77BA50 000007FF01DCBA51 ... 

StackTraceString: <none> 
HResult: 80004005 

MANAGED_OBJECT: !dumpobj 148306f8 
Name:  System.String 
MethodTable: 000007feed9a6870 
EEClass:  000007feed52ed58 
Size:  112(0x70) bytes 
File:  C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll 
String:  External component has thrown an exception. 
Fields: 
       MT Field Offset     Type VT  Attr   Value Name 
0000000000000000 4000103  8   System.Int32 1 instance    43 m_stringLength 
0000000000000000 4000104  c   System.Char 1 instance    45 m_firstChar 
000007feed9a6870 4000105  10  System.String 0 shared   static Empty 
          >> Domain:Value 00000000002a69f0:NotInit 000000000dd738d0:NotInit << 

EXCEPTION_MESSAGE: External component has thrown an exception. 

MANAGED_OBJECT_NAME: System.Runtime.InteropServices.SEHException 

MANAGED_STACK_COMMAND: !pe 1a8106a8 

LAST_CONTROL_TRANSFER: from 000007fef47e8fc1 to 000007fefcffaa7d 

ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD] ; Followup set based on attribute [ip_is_call_value_Arch_si] from Frame:[23] on thread:[162c] 

FAULTING_THREAD: ffffffffffffffff 

BUGCHECK_STR: APPLICATION_FAULT__SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_APPLICATION_FAULT_CALL 

PRIMARY_PROBLEM_CLASS: _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL 

DEFAULT_BUCKET_ID: _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL 

STACK_TEXT: 
00000000`2c77b980 00000000`00000000 ... 
00000000`2c77ba50 00000000`ffffffff ... 
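The three parameters of the in-page error record above follow the documented layout for this exception: an access-type flag, the faulting virtual address, and the underlying NTSTATUS. The sketch below simply puts names on them; the helper name `describe_in_page_error` and the short lookup tables are illustrative assumptions, not something taken from the dump.

```python
# Name the three ExceptionInformation values of an in-page error (0xC0000006).
# Only the status codes seen in this thread are listed.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000006: "STATUS_IN_PAGE_ERROR",
    0xC00000C4: "STATUS_UNEXPECTED_NETWORK_ERROR",
}
ACCESS_KINDS = {0: "read", 1: "write", 8: "execute (DEP)"}


def describe_in_page_error(params):
    """Turn [access, address, ntstatus] into a one-line description."""
    if len(params) != 3:
        raise ValueError("an in-page error carries exactly three parameters")
    access, address, status = params
    return "{kind} of 0x{addr:016x} failed with {name} (0x{status:08x})".format(
        kind=ACCESS_KINDS.get(access, "unknown access"),
        addr=address,
        name=NTSTATUS_NAMES.get(status, "unknown status"),
        status=status,
    )


if __name__ == "__main__":
    # Values copied from Parameter[0..2] in the record above.
    print(describe_in_page_error([0x0, 0x6D34ACA0, 0xC00000C4]))
```

The underlying status is a network I/O error, so it may be worth checking whether any module or mapped file in the process is loaded from a network path. That is only a lead; the dump alone does not prove it.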

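Does anyone have any ideas on how to take this further in a productive way? I am keen to get more full dumps, but of course I need to find the answer sooner than the next failure!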

+0

Tess's blog is the place to look for information. Didn't you get anything from it? – 2011-12-29 13:08:23

+0

Do you use any components or DLLs that contain unmanaged code? – Yahia 2011-12-29 13:10:35

+0

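This is almost certainly going to be unmanaged code messing up your memory space. I'm afraid tracking it down may be a real chore. Have you updated any third-party DLLs, or changed any of your own unmanaged code recently? Does it run happily on a 32-bit system? – 2011-12-29 13:26:46

Answers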

0

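The cause of the crash (a breakpoint hit) suggests that heap corruption occurred in the process. The heap management functions report heap corruption by issuing a debug break.

Judging from the logged error, the .NET runtime is not prepared to handle these (I could be wrong, and there may be a better explanation). The usual way to track down heap corruption is to enable (full) page heap, which helps locate the offending component by making the process fail closer to the point of corruption.

Hunting heap corruption is a real pain, to say the least, but if memory consumption allows it I would go with full page heap, which is best suited to applications with moderate memory requirements.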
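Full page heap is normally switched on per image with the gflags.exe tool from Debugging Tools for Windows. As a rough sketch, assuming the image name MySvc.exe from the question, the command lines can be assembled like this; the helper name `page_heap_command` is hypothetical.

```python
# Build gflags.exe command lines that toggle page heap for one image.
# The settings apply the next time the process starts.
def page_heap_command(image_name, enable=True, full=True):
    """Return the argument list for enabling or disabling page heap."""
    cmd = ["gflags.exe", "/p", "/enable" if enable else "/disable", image_name]
    if enable and full:
        cmd.append("/full")  # full page heap rather than standard
    return cmd


if __name__ == "__main__":
    print(" ".join(page_heap_command("MySvc.exe")))
    print(" ".join(page_heap_command("MySvc.exe", enable=False)))
    # On the server these would be run from an elevated prompt, for example
    # with subprocess.run(cmd, check=True), followed by a service restart.
```

Full page heap raises memory use considerably, so it is usually enabled on one instance at a time and disabled again once a dump has been captured.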

Hope it helps.

+0

Thanks. There is also a good blog post here: http://blogs.msdn.com/b/tess/archive/2006/02/09/net-crash-managed-heap-corruption-calling-unmanaged-code.aspx – glendon 2011-12-29 23:30:23

+0

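Interesting - we are using ODAC 11.1.0.7 / x64 / .NET 4. From the documentation it looks as if it should be the newer 11.2.0.1.2 release, but I could only find one example of it causing a similar error (link above). – glendon 2011-12-30 17:03:11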

0

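There is a bug in the GC of x64 .NET 4.0, and you may be affected by it. MS recommends disabling concurrent GC until they ship a fix. Alternatively, you can enable server GC, which gives you one GC thread per core, provided you have more than one core, which is probably the case.

Otherwise the server GC flag will have no effect.

Here is the link to the KB article.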
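Both GC settings mentioned above live in the `<runtime>` section of the service's .exe.config file, as the `gcConcurrent` and `gcServer` elements. The sketch below patches a config document held in a string; the function name `set_gc_mode` and the sample input are assumptions made for illustration, and the real file would be MySvc.exe.config next to the executable.

```python
# Add or update <gcConcurrent> and <gcServer> under <configuration>/<runtime>.
import xml.etree.ElementTree as ET


def set_gc_mode(config_xml, concurrent, server):
    """Return config_xml with the two GC switches set as requested."""
    root = ET.fromstring(config_xml)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    runtime = root.find("runtime")
    if runtime is None:
        runtime = ET.SubElement(root, "runtime")
    for tag, enabled in (("gcConcurrent", concurrent), ("gcServer", server)):
        node = runtime.find(tag)
        if node is None:
            node = ET.SubElement(runtime, tag)
        node.set("enabled", "true" if enabled else "false")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<configuration><runtime /></configuration>"
    # Disable concurrent GC and switch to server GC, as the answer suggests.
    print(set_gc_mode(sample, concurrent=False, server=True))
```

The runtime reads these settings at process start, so the service has to be restarted before a changed file takes effect.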

0

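A couple of things:
1. Make sure you are running the latest version of the CLR.
2. For native heap corruption, page heap is a good option; for managed heap corruption you could try GCStress (see "How to turn GCStress on in Windows 7?").
3. To verify heap corruption on the managed heap, you can use VerifyHeap, which is part of SOS: https://msdn.microsoft.com/en-us/library/bb190764(v=vs.110).aspx "VerifyHeap checks the garbage collector heap for signs of corruption and displays any errors found. Heap corruptions can be caused by platform invoke calls that are constructed incorrectly."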