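SQL Server Agent job fails because of MS DTC
On SQL Server 2008 R2 I have an Agent job that runs very frequently. The job has a single step, which calls a stored procedure. That stored procedure is very long and calls other stored procedures, some of which are also long.
The stored procedures have to work with several databases on different servers.
The problem is that the Agent job fails intermittently. It runs many times without a problem, then fails once, and the next run succeeds again. Everything is done in a single transaction, so when it fails the data is rolled back. That leads me to believe this is not a syntax or data problem, although I can't be sure.
When I check the Job Activity Monitor and view the history of a failed run, all it says is: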
The job failed. The Job was invoked by Schedule 11 (Sch0). The last step to run was step 1 (Step00).
I enabled logging for step 1 of the job. The error I get from the log is:
The Microsoft Distributed Transaction Coordinator (MS DTC) has cancelled the distributed transaction. [SQLSTATE 42000]
I looked at the MS DTC trace log on the main server (SERVER1), and the following entries were present when it failed:
pid=3416;tid=3036;time=02/29/2016-12:13:11.493 ;seq=88;eventid=TRANSACTION_BEGUN ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction has begun, description :'user_transaction'"
pid=3416;tid=3036;time=02/29/2016-12:13:11.493 ;seq=89;eventid=RM_ENLISTED_IN_TRANSACTION ;tx_guid=<guid>;"TM Identifier='(null)'" ;"resource manager #1001 enlisted as transaction enlistment #1. RM guid = '<guid>'"
pid=3416;tid=3036;time=02/29/2016-12:13:11.509 ;seq=90;eventid=TRANSACTION_PROPOGATED_TO_CHILD_NODE ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction propagated to 'SERVER1' as transaction child node #1"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=91;eventid=TRANSACTION_ABORTING ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction is aborting"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=92;eventid=RM_ISSUED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"abort request issued to resource manager #1001 for transaction enlistment #1"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=93;eventid=CHILD_NODE_ISSUED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"abort request issued to transaction child node #1 'SERVER1'"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=94;eventid=CHILD_NODE_ACKNOWLEDGED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"received acknowledgement of abort request from transaction child node #1 'SERVER1'"
pid=3416;tid=3036;time=02/29/2016-12:13:36.993 ;seq=95;eventid=RM_ACKNOWLEDGED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"received acknowledgement of abort request from the resource manager #1001 for transaction enlistment #1"
pid=3416;tid=3036;time=02/29/2016-12:13:36.993 ;seq=96;eventid=TRANSACTION_ABORTED ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction has been aborted"
So it goes from TRANSACTION_PROPOGATED_TO_CHILD_NODE straight to TRANSACTION_ABORTING, with no indication of why (as far as I can tell).
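To make the timing between trace events easier to see, here is a minimal Python sketch that parses lines in the format quoted above and reports the elapsed time between consecutive events. The function names `parse_trace` and `gaps` are hypothetical helpers, and the regular expression assumes only the `time=`, `seq=` and `eventid=` fields exactly as they appear in these excerpts; it is not an official MS DTC tool.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts above:
#   ...;time=MM/DD/YYYY-HH:MM:SS.mmm ;seq=N;eventid=NAME ;...
LINE_RE = re.compile(
    r"time=(?P<time>[^;\s]+)\s*;seq=(?P<seq>\d+);eventid=(?P<event>\w+)"
)

# Three entries copied from the trace above (seq 90 to 92).
SAMPLE = '''\
pid=3416;tid=3036;time=02/29/2016-12:13:11.509 ;seq=90;eventid=TRANSACTION_PROPOGATED_TO_CHILD_NODE ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction propagated to 'SERVER1' as transaction child node #1"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=91;eventid=TRANSACTION_ABORTING ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction is aborting"
pid=3416;tid=3036;time=02/29/2016-12:13:27.947 ;seq=92;eventid=RM_ISSUED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"abort request issued to resource manager #1001 for transaction enlistment #1"
'''


def parse_trace(lines):
    """Return (seq, timestamp, event_id) tuples, ordered by sequence number."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that is not a trace entry
        stamp = datetime.strptime(match.group("time"), "%m/%d/%Y-%H:%M:%S.%f")
        events.append((int(match.group("seq")), stamp, match.group("event")))
    events.sort(key=lambda item: item[0])
    return events


def gaps(events):
    """Return (previous_event, next_event, seconds_between) for each adjacent pair."""
    return [
        (prev[2], nxt[2], (nxt[1] - prev[1]).total_seconds())
        for prev, nxt in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for prev_event, next_event, seconds in gaps(parse_trace(SAMPLE.splitlines())):
        print(f"{prev_event} -> {next_event}: {seconds:.3f}s")
```

For the entries above, this puts roughly 16.4 seconds between the propagation to the child node and the start of the abort. The gap does not identify the cause by itself, but it narrows the time window to correlate with the other server's trace and with the SQL Server logs.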
I checked the MS DTC trace log on the second server (SERVER2) and saw the following at the time of a failure:
pid=4032;tid=3564;time=02/29/2016-13:26:46.117 ;seq=173977;eventid=TRANSACTION_PROPOGATED_FROM_PARENT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction propagated from parent node 'SERVER2', Description = 'a16ace8fa7f6'"
pid=4032;tid=3564;time=02/29/2016-13:26:46.117 ;seq=173978;eventid=RM_ENLISTED_IN_TRANSACTION ;tx_guid=<guid>;"TM Identifier='(null)'" ;"resource manager #1001 enlisted as transaction enlistment #1. RM guid = '<guid>'"
pid=4032;tid=3564;time=02/29/2016-13:27:02.758 ;seq=173979;eventid=RECEIVED_ABORT_REQUEST_FROM_NON_BEGINNER ;tx_guid=<guid>;"TM Identifier='(null)'" ;"received request to abort the transaction from non beginner"
pid=4032;tid=3564;time=02/29/2016-13:27:02.758 ;seq=173980;eventid=TRANSACTION_ABORTING ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction is aborting"
pid=4032;tid=3564;time=02/29/2016-13:27:02.758 ;seq=173981;eventid=RM_ISSUED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"abort request issued to resource manager #1001 for transaction enlistment #1"
pid=4032;tid=3564;time=02/29/2016-13:27:02.758 ;seq=173982;eventid=RECEIVED_ABORT_FROM_PARENT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"child node received abort request from parent node 'SERVER2'"
pid=4032;tid=3564;time=02/29/2016-13:27:02.758 ;seq=173983;eventid=ACKNOWLEDGING_ABORT_TO_PARENT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"child node achnowledging the delivery of abort request from parent node 'SERVER2'"
pid=4032;tid=3564;time=02/29/2016-13:27:05.773 ;seq=173984;eventid=RM_ACKNOWLEDGED_ABORT ;tx_guid=<guid>;"TM Identifier='(null)'" ;"received acknowledgement of abort request from the resource manager #1001 for transaction enlistment #1"
pid=4032;tid=3564;time=02/29/2016-13:27:05.773 ;seq=173985;eventid=TRANSACTION_ABORTED ;tx_guid=<guid>;"TM Identifier='(null)'" ;"transaction has been aborted"
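This one shows RECEIVED_ABORT_REQUEST_FROM_NON_BEGINNER right after RM_ENLISTED_IN_TRANSACTION, but still gives no indication of why the transaction was aborted.
Does RECEIVED_ABORT_REQUEST_FROM_NON_BEGINNER indicate that the abort came from the main server (SERVER1)? Or does it mean the abort came from something other than SERVER1, since SERVER1 is the beginner?
I also checked the SQL Server ERRORLOG file; it contains nothing about this failure.
The stored procedure uses TRY/CATCH to handle errors, and the Agent job is set up to send an email notification on failure. In this case I receive the email notification, but the CATCH block does not handle the error. I understand that this may be because the severity of the error is high.
What else can I do to find out exactly what is causing this failure?
It looks like the job is sometimes started before the previous run has finished. – PSVSupporter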