numpy einsum/tensordot with shared non-contracted axis - numpy

Suppose I have two arrays:
import numpy as np
a = np.random.randn(32, 6, 6, 20, 64, 3, 3)
b = np.random.randn(20, 128, 64, 3, 3)
and want to sum over the last 3 axes, and keep the shared axis. The output dimension should be (32,6,6,20,128). Notice here the axis with 20 is shared in both a and b. Let's call this axis the "group" axis.
I have two methods for this task:
The first one is just a simple einsum:
def method1(a, b):
    return np.einsum('NHWgihw, goihw -> NHWgo', a, b, optimize=True)  # output shape: (32,6,6,20,128)
In the second method I loop through group dimension and use einsum/tensordot to compute the result for each group dimension, then stack the results:
def method2(a, b):
    result = []
    for g in range(b.shape[0]):  # loop through each group dimension
        # result.append(np.tensordot(a[..., g, :, :, :], b[g, ...], axes=((-3,-2,-1),(-3,-2,-1))))
        result.append(np.einsum('NHWihw, oihw -> NHWo', a[..., g, :, :, :], b[g, ...], optimize=True))  # output shape: (32,6,6,128)
    return np.stack(result, axis=-2)  # output shape: (32,6,6,20,128)
Here's the timing for both methods in my Jupyter notebook:
We can see that the second method, with a loop, is faster than the first method.
My question is:
How come method1 is that much slower? It doesn't compute more things.
Is there a more efficient way without using loops? (I'm a bit reluctant to use loops because they are slow in python)
Thanks for any help!

As pointed out by @Murali in the comments, method1 is not very efficient because it does not manage to use BLAS calls, whereas method2 does. In fact, np.einsum does quite well in method1 given that it computes the result sequentially, while method2 mostly runs in parallel thanks to OpenBLAS (used by Numpy on most machines). That being said, method2 is sub-optimal since it does not fully use the available cores (parts of the computation are done sequentially) and appears not to use the cache efficiently. On my 6-core machine, it barely uses 50% of all the cores.
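On the "without loops" part of the question: one loop-free variant that still reaches BLAS is to reshape both arrays so that the group axis becomes a batch axis and let np.matmul handle the stack of matrix products. This is only a minimal sketch under the shapes given in the question. The name method_matmul is made up for illustration, it is not part of the timings reported below, and no performance claim is made for it.

```python
import numpy as np

def method_matmul(a, b):
    # a: (N, H, W, g, i, h, w), b: (g, o, i, h, w)
    sN, sH, sW, sg, si, sh, sw = a.shape
    so = b.shape[1]
    K = si * sh * sw
    # (g, NHW, K): the group axis becomes the leading batch axis
    ra = np.moveaxis(a.reshape(sN * sH * sW, sg, K), 1, 0)
    # (g, K, o): transpose each per-group matrix for the product
    rb = b.reshape(sg, so, K).transpose(0, 2, 1)
    out = np.matmul(ra, rb)  # (g, NHW, o), one matrix product per group
    # Put the group axis back in fourth position and restore N, H, W
    return np.moveaxis(out, 0, 1).reshape(sN, sH, sW, sg, so)
```

Whether this beats method2 depends on how the installed Numpy dispatches stacked matrix products to BLAS, so it is worth timing on the target machine.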
Faster implementation
One solution to speed up this computation is to write highly-optimized parallel Numba code for it.
First of all, a semi-naive implementation is to use plain for loops to compute the Einstein summation, reshaping the input/output arrays so Numba can better optimize the code (eg. unrolling, use of SIMD instructions). Here is the result:
import numba as nb
import numpy as np

@nb.njit('float64[:,:,:,:,::1](float64[:,:,:,:,:,:,::1], float64[:,:,:,:,::1])')
def compute(a, b):
    sN, sH, sW, sg, si, sh, sw = a.shape
    so = b.shape[1]
    assert b.shape == (sg, so, si, sh, sw)
    # Flatten the batch axes (N, H, W) and the contracted axes (i, h, w)
    ra = a.reshape(sN*sH*sW, sg, si*sh*sw)
    rb = b.reshape(sg, so, si*sh*sw)
    out = np.empty((sN*sH*sW, sg, so), dtype=np.float64)
    for NHW in range(sN*sH*sW):
        for g in range(sg):
            for o in range(so):
                s = 0.0
                # Reduction
                for ihw in range(si*sh*sw):
                    s += ra[NHW, g, ihw] * rb[g, o, ihw]
                out[NHW, g, o] = s
    return out.reshape((sN, sH, sW, sg, so))
Note that the input arrays are assumed to be contiguous. If this is not the case, please consider performing a copy (which is cheap compared to the computation).
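A minimal sketch of such a copy, assuming plain Numpy arrays: np.ascontiguousarray returns a C-contiguous array and only copies when the input is not already contiguous. The small transposed array below is just an illustration, not the data from the question.

```python
import numpy as np

# A transposed view shares memory with its base and is not C-contiguous
view = np.arange(12.0).reshape(3, 4).T

# Copies only if needed; already-contiguous inputs are returned as-is
fixed = np.ascontiguousarray(view)

print(view.flags.c_contiguous, fixed.flags.c_contiguous)
```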
While the compute function above works, it is far from efficient. Here are some improvements that can be made:
run the outermost NHW loop in parallel;
use the Numba flag fastmath=True. This flag is unsafe if the input data contains special values like NaN or +inf/-inf. However, it helps the compiler generate much faster code using SIMD instructions (this is not possible otherwise since IEEE-754 floating-point operations are not associative);
swap the NHW-based loop and the g-based loop, which results in better performance since it improves cache locality (rb is more likely to fit in the last-level cache of mainstream CPUs, whereas it would likely be fetched from RAM otherwise);
make use of register blocking so as to better saturate the SIMD computing units of the processor and reduce the pressure on the memory hierarchy;
make use of tiling by splitting the o-based loop so rb can almost fully be read from lower-level caches (eg. L1 or L2).
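To make the fastmath point concrete, here is a tiny illustration in plain Python (no Numba involved) of why a compiler may not reorder a floating-point reduction by default: grouping the same three additions differently changes the last bit of the result.

```python
# Same three operands, two different groupings
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# The two results differ by one unit in the last place
print(left == right, abs(left - right))
```

A vectorized reduction sums several partial accumulators and combines them at the end, which is exactly such a regrouping. fastmath=True tells the compiler this difference is acceptable.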
All these improvements except the last one are implemented in the following code:
@nb.njit('float64[:,:,:,:,::1](float64[:,:,:,:,:,:,::1], float64[:,:,:,:,::1])', parallel=True, fastmath=True)
def method3(a, b):
    sN, sH, sW, sg, si, sh, sw = a.shape
    so = b.shape[1]
    assert b.shape == (sg, so, si, sh, sw)
    ra = a.reshape(sN*sH*sW, sg, si*sh*sw)
    rb = b.reshape(sg, so, si*sh*sw)
    out = np.zeros((sN*sH*sW, sg, so), dtype=np.float64)
    for g in range(sg):
        for k in nb.prange((sN*sH*sW)//2):
            NHW = k*2
            so_vect_max = (so // 4) * 4
            for o in range(0, so_vect_max, 4):
                s00 = s01 = s02 = s03 = s10 = s11 = s12 = s13 = 0.0
                # Useful since Numba does not optimize well the following loop otherwise
                ra_row0 = ra[NHW+0, g, :]
                ra_row1 = ra[NHW+1, g, :]
                rb_row0 = rb[g, o+0, :]
                rb_row1 = rb[g, o+1, :]
                rb_row2 = rb[g, o+2, :]
                rb_row3 = rb[g, o+3, :]
                # Highly-optimized reduction using register blocking
                for ihw in range(si*sh*sw):
                    ra_0 = ra_row0[ihw]
                    ra_1 = ra_row1[ihw]
                    rb_0 = rb_row0[ihw]
                    rb_1 = rb_row1[ihw]
                    rb_2 = rb_row2[ihw]
                    rb_3 = rb_row3[ihw]
                    s00 += ra_0 * rb_0; s01 += ra_0 * rb_1
                    s02 += ra_0 * rb_2; s03 += ra_0 * rb_3
                    s10 += ra_1 * rb_0; s11 += ra_1 * rb_1
                    s12 += ra_1 * rb_2; s13 += ra_1 * rb_3
                out[NHW+0, g, o+0] = s00; out[NHW+0, g, o+1] = s01
                out[NHW+0, g, o+2] = s02; out[NHW+0, g, o+3] = s03
                out[NHW+1, g, o+0] = s10; out[NHW+1, g, o+1] = s11
                out[NHW+1, g, o+2] = s12; out[NHW+1, g, o+3] = s13
            # Remaining part for `o`
            for o in range(so_vect_max, so):
                for ihw in range(si*sh*sw):
                    out[NHW, g, o] += ra[NHW, g, ihw] * rb[g, o, ihw]
                    out[NHW+1, g, o] += ra[NHW+1, g, ihw] * rb[g, o, ihw]
        # Remaining part for `k`
        if (sN*sH*sW) % 2 == 1:
            k = sN*sH*sW - 1
            for o in range(so):
                for ihw in range(si*sh*sw):
                    out[k, g, o] += ra[k, g, ihw] * rb[g, o, ihw]
    return out.reshape((sN, sH, sW, sg, so))
This code is much more complex and uglier, but also far more efficient. I did not implement the tiling optimization since it would make the code even less readable. However, it should result in significantly faster code on many-core processors (especially the ones with a small L2/L3 cache).
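For readers wondering what the omitted tiling would look like structurally, here is a hedged sketch in plain Numpy. It only shows the loop order (one tile of rb rows is reused for every NHW row before moving to the next tile) and is not a tuned Numba kernel. The function name tiled_reference and the tile size of 32 are made up for illustration.

```python
import numpy as np

def tiled_reference(ra, rb, tile_o=32):
    # ra: (NHW, g, K), rb: (g, o, K)  ->  out: (NHW, g, o)
    n, sg, K = ra.shape
    so = rb.shape[1]
    out = np.empty((n, sg, so), dtype=np.result_type(ra, rb))
    for g in range(sg):
        # Split the o-based loop into tiles
        for o0 in range(0, so, tile_o):
            o1 = min(o0 + tile_o, so)
            tile = rb[g, o0:o1, :]  # small block meant to stay in a low-level cache
            # Every NHW row is combined with this tile before the next tile is loaded
            out[:, g, o0:o1] = ra[:, g, :] @ tile.T
    return out
```

In a real implementation the tile size would be chosen so that tile_o * K doubles fit in L1 or L2, and the body would be the register-blocked reduction of method3 rather than a matrix product.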
Performance results
Here are performance results on my i5-9600KF 6-core processor:
method1: 816 ms
method2: 104 ms
method3: 40 ms
Theoretical optimal: 9 ms (optimistic lower bound)
The code is about 2.6 times faster than method2. There is room for improvement, since the optimal time is about 4 times better than that of method3.
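The answer does not state how the 9 ms lower bound was obtained, so the following is only a back-of-the-envelope reconstruction. It counts the fused multiply-adds required by the shapes in the question and divides by an assumed peak FMA rate. The two FMA units per core and the 4.0 GHz sustained clock are assumptions on my part, not measured values.

```python
# Shapes from the question
N, H, W, g, o, i, h, w = 32, 6, 6, 20, 128, 64, 3, 3

# One multiply-add per (output element, contracted element) pair
fma_count = N * H * W * g * o * i * h * w

cores = 6                # 6-core processor, as stated above
fma_units_per_core = 2   # ASSUMPTION: two 256-bit FMA units per core
doubles_per_vector = 4   # 256-bit AVX register / 64-bit double
clock_hz = 4.0e9         # ASSUMPTION: sustained all-core clock

peak_fma_per_s = cores * fma_units_per_core * doubles_per_vector * clock_hz
lower_bound_ms = fma_count / peak_fma_per_s * 1e3

print(f"{fma_count:,} FMAs, lower bound of about {lower_bound_ms:.1f} ms")
```

Under these assumptions the bound lands close to the 9 ms quoted above. A different clock or FMA width shifts it proportionally.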
The main reason why Numba does not generate fast code is that the underlying JIT fails to vectorize the loop efficiently. Implementing the tiling strategy should slightly improve the execution time, bringing it closer to the optimal one. The tiling strategy is critical for much bigger arrays. This is especially true if so is much bigger.
If you want a faster implementation, you certainly need to write native C/C++ code using SIMD intrinsics directly (which are unfortunately not portable) or a SIMD library (eg. XSIMD).
If you want an even faster implementation, then you need to use faster hardware (with more cores) or more dedicated hardware. Server-based GPUs (ie. not the ones in personal computers) should be able to speed up such a computation a lot, since your input is small and the computation is clearly compute-bound and makes massive use of FMA floating-point operations. A good first step is to try cupy.einsum.
Under the hood: low-level analysis
In order to understand why method1 is not faster, I checked the executed code. Here is the main loop:
1a0:┌─→; Part of the reduction (see below)
│ movapd xmm0,XMMWORD PTR [rdi-0x1000]
│
│ ; Decrement the number of loop cycle
│ sub r9,0x8
│
│ ; Prefetch items so to reduce the impact
│ ; of the latency of reading from the RAM.
│ prefetcht0 BYTE PTR [r8]
│ prefetcht0 BYTE PTR [rdi]
│
│ ; Part of the reduction (see below)
│ mulpd xmm0,XMMWORD PTR [r8-0x1000]
│
│ ; Increment iterator for the two arrays
│ add rdi,0x40
│ add r8,0x40
│
│ ; Main computational part:
│ ; reduction using add+mul SSE2 instructions
│ addpd xmm1,xmm0 <--- Slow
│ movapd xmm0,XMMWORD PTR [rdi-0x1030]
│ mulpd xmm0,XMMWORD PTR [r8-0x1030]
│ addpd xmm1,xmm0 <--- Slow
│ movapd xmm0,XMMWORD PTR [rdi-0x1020]
│ mulpd xmm0,XMMWORD PTR [r8-0x1020]
│ addpd xmm0,xmm1 <--- Slow
│ movapd xmm1,XMMWORD PTR [rdi-0x1010]
│ mulpd xmm1,XMMWORD PTR [r8-0x1010]
│ addpd xmm1,xmm0 <--- Slow
│
│ ; Is the loop over?
│ ; If not, jump to the beginning of the loop.
├──cmp r9,0x7
└──jg 1a0
It turns out that Numpy uses the SSE2 instruction set (which is available on all x86-64 processors). However, my machine, like almost all relatively recent processors, supports the AVX instruction set, which can process twice as many items per instruction. My machine also supports fused multiply-add (FMA) instructions, which are twice as fast in this case. Moreover, the loop is clearly bound by the addpd instructions, which accumulate the result in mostly the same register. The processor cannot execute them efficiently since an addpd has a latency of a few cycles and up to two can be executed at the same time on modern x86-64 processors (which is not possible here since only one instruction can perform the accumulation in xmm1 at a time).
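The dependency-chain problem described above is usually avoided by keeping several independent accumulators. The sketch below shows the idea in plain Python purely to make the structure visible: it will not be faster in the interpreter, and the function names are made up. Compiled code with this shape lets the processor overlap the additions instead of waiting on a single register.

```python
import numpy as np

def dot_single_accumulator(x, y):
    s = 0.0
    for k in range(len(x)):
        s += x[k] * y[k]  # every add depends on the previous one
    return s

def dot_four_accumulators(x, y):
    s0 = s1 = s2 = s3 = 0.0
    n4 = (len(x) // 4) * 4
    for k in range(0, n4, 4):
        # Four independent chains: none of these adds waits for another
        s0 += x[k] * y[k]
        s1 += x[k + 1] * y[k + 1]
        s2 += x[k + 2] * y[k + 2]
        s3 += x[k + 3] * y[k + 3]
    for k in range(n4, len(x)):  # leftover items
        s0 += x[k] * y[k]
    return (s0 + s1) + (s2 + s3)
```

The two functions can differ in the last bits of the result because the additions are grouped differently. This is the reordering the compiler refuses to do on its own without fastmath.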
Here is the executed code of the main computational part of method2 (dgemm call of OpenBLAS):
6a40:┌─→vbroadcastsd ymm0,QWORD PTR [rsi-0x60]
│ vbroadcastsd ymm1,QWORD PTR [rsi-0x58]
│ vbroadcastsd ymm2,QWORD PTR [rsi-0x50]
│ vbroadcastsd ymm3,QWORD PTR [rsi-0x48]
│ vfmadd231pd ymm4,ymm0,YMMWORD PTR [rdi-0x80]
│ vfmadd231pd ymm5,ymm1,YMMWORD PTR [rdi-0x60]
│ vbroadcastsd ymm0,QWORD PTR [rsi-0x40]
│ vbroadcastsd ymm1,QWORD PTR [rsi-0x38]
│ vfmadd231pd ymm6,ymm2,YMMWORD PTR [rdi-0x40]
│ vfmadd231pd ymm7,ymm3,YMMWORD PTR [rdi-0x20]
│ vbroadcastsd ymm2,QWORD PTR [rsi-0x30]
│ vbroadcastsd ymm3,QWORD PTR [rsi-0x28]
│ vfmadd231pd ymm4,ymm0,YMMWORD PTR [rdi]
│ vfmadd231pd ymm5,ymm1,YMMWORD PTR [rdi+0x20]
│ vfmadd231pd ymm6,ymm2,YMMWORD PTR [rdi+0x40]
│ vfmadd231pd ymm7,ymm3,YMMWORD PTR [rdi+0x60]
│ add rsi,0x40
│ add rdi,0x100
├──dec rax
└──jne 6a40
This loop is far better optimized: it makes use of the AVX instruction set as well as FMA (ie. the vfmadd231pd instructions). Furthermore, the loop is better unrolled and there is no latency/dependency issue like in the Numpy code. However, while this loop is highly efficient, the cores are not efficiently used due to some sequential checks done in Numpy and a sequential copy performed in OpenBLAS. Moreover, I am not sure the loop makes efficient use of the cache in this case, since a lot of reads/writes are performed in RAM on my machine. Indeed, the RAM throughput is about 15 GiB/s (out of 35~40 GiB/s) due to many cache misses, while the throughput of method3 is 6 GiB/s (so more work is done in the cache) with a significantly faster execution.
Here is the executed code of the main computational part of method3:
.LBB0_5:
vorpd 2880(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm2
vmovupd %ymm2, 3040(%rsp)
vorpd 2848(%rsp), %ymm8, %ymm1
vpcmpeqd %ymm2, %ymm2, %ymm2
vgatherqpd %ymm2, (%rsi,%ymm1,8), %ymm3
vmovupd %ymm3, 3104(%rsp)
vorpd 2912(%rsp), %ymm8, %ymm2
vpcmpeqd %ymm3, %ymm3, %ymm3
vgatherqpd %ymm3, (%rsi,%ymm2,8), %ymm4
vmovupd %ymm4, 3136(%rsp)
vorpd 2816(%rsp), %ymm8, %ymm3
vpcmpeqd %ymm4, %ymm4, %ymm4
vgatherqpd %ymm4, (%rsi,%ymm3,8), %ymm5
vmovupd %ymm5, 3808(%rsp)
vorpd 2784(%rsp), %ymm8, %ymm9
vpcmpeqd %ymm4, %ymm4, %ymm4
vgatherqpd %ymm4, (%rsi,%ymm9,8), %ymm5
vmovupd %ymm5, 3840(%rsp)
vorpd 2752(%rsp), %ymm8, %ymm10
vpcmpeqd %ymm4, %ymm4, %ymm4
vgatherqpd %ymm4, (%rsi,%ymm10,8), %ymm5
vmovupd %ymm5, 3872(%rsp)
vpaddq 2944(%rsp), %ymm8, %ymm4
vorpd 2720(%rsp), %ymm8, %ymm11
vpcmpeqd %ymm13, %ymm13, %ymm13
vgatherqpd %ymm13, (%rsi,%ymm11,8), %ymm5
vmovupd %ymm5, 3904(%rsp)
vpcmpeqd %ymm13, %ymm13, %ymm13
vgatherqpd %ymm13, (%rdx,%ymm0,8), %ymm5
vmovupd %ymm5, 3552(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm1,8), %ymm5
vmovupd %ymm5, 3616(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm2,8), %ymm1
vmovupd %ymm1, 3648(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm3,8), %ymm1
vmovupd %ymm1, 3680(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm9,8), %ymm1
vmovupd %ymm1, 3712(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm10,8), %ymm1
vmovupd %ymm1, 3744(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm11,8), %ymm1
vmovupd %ymm1, 3776(%rsp)
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rsi,%ymm4,8), %ymm6
vpcmpeqd %ymm0, %ymm0, %ymm0
vgatherqpd %ymm0, (%rdx,%ymm4,8), %ymm3
vpaddq 2688(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm7
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3360(%rsp)
vpaddq 2656(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm13
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3392(%rsp)
vpaddq 2624(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm15
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3424(%rsp)
vpaddq 2592(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm9
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3456(%rsp)
vpaddq 2560(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm14
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3488(%rsp)
vpaddq 2528(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm11
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3520(%rsp)
vpaddq 2496(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm0,8), %ymm10
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3584(%rsp)
vpaddq 2464(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm2
vpaddq 2432(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm12
vpaddq 2400(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3168(%rsp)
vpaddq 2368(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3200(%rsp)
vpaddq 2336(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3232(%rsp)
vpaddq 2304(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3264(%rsp)
vpaddq 2272(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3296(%rsp)
vpaddq 2240(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd %ymm4, 3328(%rsp)
vpaddq 2208(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vpaddq 2176(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm5
vmovupd %ymm5, 2976(%rsp)
vpaddq 2144(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm5
vmovupd %ymm5, 3008(%rsp)
vpaddq 2112(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm5
vmovupd %ymm5, 3072(%rsp)
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rsi,%ymm8,8), %ymm0
vpcmpeqd %ymm5, %ymm5, %ymm5
vgatherqpd %ymm5, (%rdx,%ymm8,8), %ymm1
vmovupd 768(%rsp), %ymm5
vfmadd231pd %ymm0, %ymm1, %ymm5
vmovupd %ymm5, 768(%rsp)
vmovupd 32(%rsp), %ymm5
vfmadd231pd %ymm0, %ymm3, %ymm5
vmovupd %ymm5, 32(%rsp)
vmovupd 1024(%rsp), %ymm5
vfmadd231pd %ymm0, %ymm2, %ymm5
vmovupd %ymm5, 1024(%rsp)
vmovupd 1280(%rsp), %ymm5
vfmadd231pd %ymm0, %ymm4, %ymm5
vmovupd %ymm5, 1280(%rsp)
vmovupd 1344(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm6, %ymm0
vmovupd %ymm0, 1344(%rsp)
vmovupd 480(%rsp), %ymm0
vfmadd231pd %ymm3, %ymm6, %ymm0
vmovupd %ymm0, 480(%rsp)
vmovupd 1600(%rsp), %ymm0
vfmadd231pd %ymm2, %ymm6, %ymm0
vmovupd %ymm0, 1600(%rsp)
vmovupd 1856(%rsp), %ymm0
vfmadd231pd %ymm4, %ymm6, %ymm0
vmovupd %ymm0, 1856(%rsp)
vpaddq 2080(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm2
vpaddq 2048(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm4
vmovupd 800(%rsp), %ymm0
vmovupd 3552(%rsp), %ymm1
vmovupd 3040(%rsp), %ymm3
vfmadd231pd %ymm3, %ymm1, %ymm0
vmovupd %ymm0, 800(%rsp)
vmovupd 64(%rsp), %ymm0
vmovupd 3360(%rsp), %ymm5
vfmadd231pd %ymm3, %ymm5, %ymm0
vmovupd %ymm0, 64(%rsp)
vmovupd 1056(%rsp), %ymm0
vfmadd231pd %ymm3, %ymm12, %ymm0
vmovupd %ymm0, 1056(%rsp)
vmovupd 288(%rsp), %ymm0
vmovupd 2976(%rsp), %ymm6
vfmadd231pd %ymm3, %ymm6, %ymm0
vmovupd %ymm0, 288(%rsp)
vmovupd 1376(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm7, %ymm0
vmovupd %ymm0, 1376(%rsp)
vmovupd 512(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm7, %ymm0
vmovupd %ymm0, 512(%rsp)
vmovupd 1632(%rsp), %ymm0
vfmadd231pd %ymm12, %ymm7, %ymm0
vmovupd %ymm0, 1632(%rsp)
vmovupd 1888(%rsp), %ymm0
vfmadd231pd %ymm6, %ymm7, %ymm0
vmovupd %ymm0, 1888(%rsp)
vmovupd 832(%rsp), %ymm0
vmovupd 3616(%rsp), %ymm1
vmovupd 3104(%rsp), %ymm6
vfmadd231pd %ymm6, %ymm1, %ymm0
vmovupd %ymm0, 832(%rsp)
vmovupd 96(%rsp), %ymm0
vmovupd 3392(%rsp), %ymm3
vfmadd231pd %ymm6, %ymm3, %ymm0
vmovupd %ymm0, 96(%rsp)
vmovupd 1088(%rsp), %ymm0
vmovupd 3168(%rsp), %ymm5
vfmadd231pd %ymm6, %ymm5, %ymm0
vmovupd %ymm0, 1088(%rsp)
vmovupd 320(%rsp), %ymm0
vmovupd 3008(%rsp), %ymm7
vfmadd231pd %ymm6, %ymm7, %ymm0
vmovupd %ymm0, 320(%rsp)
vmovupd 1408(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm13, %ymm0
vmovupd %ymm0, 1408(%rsp)
vmovupd 544(%rsp), %ymm0
vfmadd231pd %ymm3, %ymm13, %ymm0
vmovupd %ymm0, 544(%rsp)
vmovupd 1664(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm13, %ymm0
vmovupd %ymm0, 1664(%rsp)
vmovupd 1920(%rsp), %ymm0
vfmadd231pd %ymm7, %ymm13, %ymm0
vmovupd %ymm0, 1920(%rsp)
vpaddq 2016(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm3
vmovupd 864(%rsp), %ymm0
vmovupd 3648(%rsp), %ymm1
vmovupd 3136(%rsp), %ymm6
vfmadd231pd %ymm6, %ymm1, %ymm0
vmovupd %ymm0, 864(%rsp)
vmovupd 128(%rsp), %ymm0
vmovupd 3424(%rsp), %ymm5
vfmadd231pd %ymm6, %ymm5, %ymm0
vmovupd %ymm0, 128(%rsp)
vmovupd 1120(%rsp), %ymm0
vmovupd 3200(%rsp), %ymm7
vfmadd231pd %ymm6, %ymm7, %ymm0
vmovupd %ymm0, 1120(%rsp)
vmovupd 352(%rsp), %ymm0
vmovupd 3072(%rsp), %ymm12
vfmadd231pd %ymm6, %ymm12, %ymm0
vmovupd %ymm0, 352(%rsp)
vmovupd 1440(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm15, %ymm0
vmovupd %ymm0, 1440(%rsp)
vmovupd 576(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm15, %ymm0
vmovupd %ymm0, 576(%rsp)
vmovupd 1696(%rsp), %ymm0
vfmadd231pd %ymm7, %ymm15, %ymm0
vmovupd %ymm0, 1696(%rsp)
vmovupd 736(%rsp), %ymm0
vfmadd231pd %ymm12, %ymm15, %ymm0
vmovupd %ymm0, 736(%rsp)
vmovupd 896(%rsp), %ymm0
vmovupd 3808(%rsp), %ymm1
vmovupd 3680(%rsp), %ymm5
vfmadd231pd %ymm1, %ymm5, %ymm0
vmovupd %ymm0, 896(%rsp)
vmovupd 160(%rsp), %ymm0
vmovupd 3456(%rsp), %ymm6
vfmadd231pd %ymm1, %ymm6, %ymm0
vmovupd %ymm0, 160(%rsp)
vmovupd 1152(%rsp), %ymm0
vmovupd 3232(%rsp), %ymm7
vfmadd231pd %ymm1, %ymm7, %ymm0
vmovupd %ymm0, 1152(%rsp)
vmovupd 384(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm2, %ymm0
vmovupd %ymm0, 384(%rsp)
vmovupd 1472(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm9, %ymm0
vmovupd %ymm0, 1472(%rsp)
vmovupd 608(%rsp), %ymm0
vfmadd231pd %ymm6, %ymm9, %ymm0
vmovupd %ymm0, 608(%rsp)
vmovupd 1728(%rsp), %ymm0
vfmadd231pd %ymm7, %ymm9, %ymm0
vmovupd %ymm0, 1728(%rsp)
vmovupd -128(%rsp), %ymm0
vfmadd231pd %ymm2, %ymm9, %ymm0
vmovupd %ymm0, -128(%rsp)
vmovupd 928(%rsp), %ymm0
vmovupd 3840(%rsp), %ymm1
vmovupd 3712(%rsp), %ymm2
vfmadd231pd %ymm1, %ymm2, %ymm0
vmovupd %ymm0, 928(%rsp)
vmovupd 192(%rsp), %ymm0
vmovupd 3488(%rsp), %ymm5
vfmadd231pd %ymm1, %ymm5, %ymm0
vmovupd %ymm0, 192(%rsp)
vmovupd 1184(%rsp), %ymm0
vmovupd 3264(%rsp), %ymm6
vfmadd231pd %ymm1, %ymm6, %ymm0
vmovupd %ymm0, 1184(%rsp)
vmovupd 416(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm4, %ymm0
vmovupd %ymm0, 416(%rsp)
vmovupd 1504(%rsp), %ymm0
vfmadd231pd %ymm2, %ymm14, %ymm0
vmovupd %ymm0, 1504(%rsp)
vmovupd 640(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm14, %ymm0
vmovupd %ymm0, 640(%rsp)
vmovupd 1760(%rsp), %ymm0
vfmadd231pd %ymm6, %ymm14, %ymm0
vmovupd %ymm0, 1760(%rsp)
vmovupd -96(%rsp), %ymm0
vfmadd231pd %ymm4, %ymm14, %ymm0
vmovupd %ymm0, -96(%rsp)
vpaddq 1984(%rsp), %ymm8, %ymm0
vpcmpeqd %ymm1, %ymm1, %ymm1
vgatherqpd %ymm1, (%rdx,%ymm0,8), %ymm2
vmovupd 960(%rsp), %ymm0
vmovupd 3872(%rsp), %ymm1
vmovupd 3744(%rsp), %ymm4
vfmadd231pd %ymm1, %ymm4, %ymm0
vmovupd %ymm0, 960(%rsp)
vmovupd 224(%rsp), %ymm0
vmovupd 3520(%rsp), %ymm5
vfmadd231pd %ymm1, %ymm5, %ymm0
vmovupd %ymm0, 224(%rsp)
vmovupd 1216(%rsp), %ymm0
vmovupd 3296(%rsp), %ymm6
vfmadd231pd %ymm1, %ymm6, %ymm0
vmovupd %ymm0, 1216(%rsp)
vmovupd 448(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm3, %ymm0
vmovupd %ymm0, 448(%rsp)
vmovupd 1536(%rsp), %ymm0
vfmadd231pd %ymm4, %ymm11, %ymm0
vmovupd %ymm0, 1536(%rsp)
vmovupd 672(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm11, %ymm0
vmovupd %ymm0, 672(%rsp)
vmovupd 1792(%rsp), %ymm0
vfmadd231pd %ymm6, %ymm11, %ymm0
vmovupd %ymm0, 1792(%rsp)
vmovupd -64(%rsp), %ymm0
vfmadd231pd %ymm3, %ymm11, %ymm0
vmovupd %ymm0, -64(%rsp)
vmovupd 992(%rsp), %ymm0
vmovupd 3904(%rsp), %ymm1
vmovupd 3776(%rsp), %ymm3
vfmadd231pd %ymm1, %ymm3, %ymm0
vmovupd %ymm0, 992(%rsp)
vmovupd 256(%rsp), %ymm0
vmovupd 3584(%rsp), %ymm4
vfmadd231pd %ymm1, %ymm4, %ymm0
vmovupd %ymm0, 256(%rsp)
vmovupd 1248(%rsp), %ymm0
vmovupd 3328(%rsp), %ymm5
vfmadd231pd %ymm1, %ymm5, %ymm0
vmovupd %ymm0, 1248(%rsp)
vmovupd 1312(%rsp), %ymm0
vfmadd231pd %ymm1, %ymm2, %ymm0
vmovupd %ymm0, 1312(%rsp)
vmovupd 1568(%rsp), %ymm0
vfmadd231pd %ymm3, %ymm10, %ymm0
vmovupd %ymm0, 1568(%rsp)
vmovupd 704(%rsp), %ymm0
vfmadd231pd %ymm4, %ymm10, %ymm0
vmovupd %ymm0, 704(%rsp)
vmovupd 1824(%rsp), %ymm0
vfmadd231pd %ymm5, %ymm10, %ymm0
vmovupd %ymm0, 1824(%rsp)
vmovupd -32(%rsp), %ymm0
vfmadd231pd %ymm2, %ymm10, %ymm0
vmovupd %ymm0, -32(%rsp)
vpaddq 1952(%rsp), %ymm8, %ymm8
addq $-4, %rcx
jne .LBB0_5
The loop is huge and is clearly not vectorized properly: there are a lot of completely useless instructions, and loads from memory appear not to be contiguous (see vgatherqpd). Numba does not generate good code since the underlying JIT (LLVM-Lite) fails to vectorize the code efficiently. In fact, I found out that similar C++ code is badly vectorized by Clang 13.0 on a simplified example (GCC and ICC also fail on more complex code), while a hand-written SIMD implementation works much better. It looks like a bug in the optimizer, or at least a missed optimization. This is why the Numba code is much slower than the optimal code. That being said, this implementation makes quite efficient use of the cache and is properly multithreaded.
I also found out that the BLAS code is faster on Linux than on Windows on my machine (with the default packages from PIP and the same Numpy version, 1.20.3). Thus, the gap between method2 and method3 is smaller there, but the latter is still significantly faster.

Related

Why SELECT shows Update lock in deadlock graph

I have two SQL queries (MSSQL server):
SELECT [Value]
FROM [dbo].[BigTable]
ORDER BY [Id] DESC
and
UPDATE [dbo].[BigTable]
SET [Value] = [Value]
where [Id] is the primary clustered key.
When I run them in an infinite loop I get a deadlock, which is expected. What is not obvious to me is why the deadlock graph shows "Owner mode: U" for the SELECT statement.
As far as I know, a SELECT statement can only take shared locks, and here I'm not using any hints or additional transactions that would take an update lock. Any idea why I see it here?
The XML for the deadlock is attached:
<deadlock-list>
<deadlock victim="process1c094ee5468">
<process-list>
<process id="process1c094ee5468" taskpriority="0" logused="0" waitresource="PAGE: 7:1:1502 " waittime="1289" ownerId="901143" transactionname="SELECT" lasttranstarted="2021-05-05T18:04:54.470" XDES="0x1c094329be8" lockMode="S" schedulerid="6" kpid="22644" status="suspended" spid="62" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="2021-05-05T18:04:54.470" lastbatchcompleted="2021-05-05T18:04:54.453" lastattention="1900-01-01T00:00:00.453" clientapp="Core Microsoft SqlClient Data Provider" hostname="ALEXEY-KLIPILIN" hostpid="3132" loginname="sa" isolationlevel="read committed (2)" xactid="901143" currentdb="7" currentdbname="SampleDb" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="adhoc" line="1" stmtend="92" sqlhandle="0x02000000bf49f5138395d042205ae64888add734815151770000000000000000000000000000000000000000">
unknown </frame>
</executionStack>
<inputbuf>
SELECT * FROM [dbo].[BigTable] ORDER BY Id DESC </inputbuf>
</process>
<process id="process1c096e1d088" taskpriority="0" logused="100" waitresource="PAGE: 7:1:1503 " waittime="1289" ownerId="901139" transactionname="UPDATE" lasttranstarted="2021-05-05T18:04:54.470" XDES="0x1c08bc84428" lockMode="X" schedulerid="4" kpid="9160" status="suspended" spid="61" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2021-05-05T18:04:54.470" lastbatchcompleted="2021-05-05T18:04:54.397" lastattention="1900-01-01T00:00:00.397" clientapp="Core Microsoft SqlClient Data Provider" hostname="ALEXEY-KLIPILIN" hostpid="3132" loginname="sa" isolationlevel="read committed (2)" xactid="901139" currentdb="7" currentdbname="SampleDb" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="adhoc" line="1" stmtend="88" sqlhandle="0x0200000018eeb102d311fd032bb670822f260841060b64410000000000000000000000000000000000000000">
unknown </frame>
</executionStack>
<inputbuf>
UPDATE [dbo].[BigTable] SET [Value] = [Value] </inputbuf>
</process>
</process-list>
<resource-list>
<pagelock fileid="1" pageid="1502" dbid="7" subresource="FULL" objectname="SampleDb.dbo.BigTable" id="lock1c0884bdd00" mode="X" associatedObjectId="72057594043760640">
<owner-list>
<owner id="process1c096e1d088" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="process1c094ee5468" mode="S" requestType="wait"/>
</waiter-list>
</pagelock>
<pagelock fileid="1" pageid="1503" dbid="7" subresource="FULL" objectname="SampleDb.dbo.BigTable" id="lock1c0a0a23380" mode="U" associatedObjectId="72057594043760640">
<owner-list>
<owner id="process1c094ee5468" mode="S"/>
</owner-list>
<waiter-list>
<waiter id="process1c096e1d088" mode="X" requestType="convert"/>
</waiter-list>
</pagelock>
</resource-list>
</deadlock>
</deadlock-list>
This just looks like some misrepresentation in the graphical representation.
process1c096e1d088 (the UPDATE) holds a page level X lock on page 1502 and a page level U lock on 1503 and is trying to convert that U lock to an X lock. (requestType="convert")
process1c094ee5468 (the SELECT) holds a page level S lock on 1503 (compatible with the U lock) and is waiting for a page level S lock on 1502.
Because the page lock 1503 is held in both S and U modes it has mode="U" in the deadlock XML and the UI assumes it is held by the blocker in that mode.
Of course if the SELECT transaction was to release its lock on 1503 before requesting the lock on 1502 this deadlock could not arise but I assume there is a good reason for it not doing this (maybe to stop 1502 getting deallocated mid scan and leaving it with no next page to visit).
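The reasoning above can be checked mechanically with a small model. The sketch below is only an illustration in Python, not anything SQL Server exposes: it encodes the compatibility of the S, U and X modes and the lock state as interpreted above. The U lock held by the UPDATE on page 1503 is inferred from mode="U" and requestType="convert", since the XML lists only the SELECT as owner there.

```python
# COMPATIBLE[(requested, held)]: can `requested` be granted while `held` exists?
COMPATIBLE = {
    ("S", "S"): True,  ("S", "U"): True,  ("S", "X"): False,
    ("U", "S"): True,  ("U", "U"): False, ("U", "X"): False,
    ("X", "S"): False, ("X", "U"): False, ("X", "X"): False,
}

# Locks held per page: page -> {process: mode}
held = {
    1502: {"UPDATE": "X"},
    1503: {"UPDATE": "U", "SELECT": "S"},  # S and U coexist on the same page
}

# Pending requests: (process, page, requested mode)
requests = [("SELECT", 1502, "S"), ("UPDATE", 1503, "X")]

def blockers(proc, page, mode):
    # Other processes whose held lock is incompatible with the request
    return {p for p, m in held[page].items()
            if p != proc and not COMPATIBLE[(mode, m)]}

waits_for = {proc: blockers(proc, page, mode) for proc, page, mode in requests}
print(waits_for)
```

Each process ends up waiting on the other, which is the cycle reported as a deadlock, even though the SELECT itself never holds anything stronger than S.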

Deadlock issue in update statement in SQL Server

I am facing a key-lock deadlock in an UPDATE statement in SQL Server. I have a clustered index on the primary key and I am using its columns in the WHERE clause. I can say for sure that the two processes are not updating the same rows, as the Java code is written that way. They are trying to update different rows, but the deadlock still occurs. According to the graph, both processes own an X lock and are requesting a U lock. I can share other details as well.
<deadlock-list>
<deadlock victim="process147029c28">
<process-list>
<process id="process147029c28" taskpriority="0" logused="1300"
waitresource="KEY: 7:72057600641925120 (ef40b100fec4)" waittime="6363"
ownerId="7750344" transactionname="implicit_transaction"
lasttranstarted="2018-10-25T21:52:35.553" XDES="0x10006ad90" lockMode="U"
schedulerid="3" kpid="17860" status="suspended" spid="67" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2018-10-25T21:52:36.567"
lastbatchcompleted="2018-10-25T21:52:36.540" lastattention="1900-01-
01T00:00:00.540" clientapp="Microsoft JDBC Driver for SQL Server"
hostname="USGURBHABISHT9" hostpid="0" loginname="bharat"
isolationlevel="read committed (2)" xactid="7750344" currentdb="7"
lockTimeout="4294967295" clientoption1="671088672" clientoption2="128058">
</executionStack>
<inputbuf>
(@P0 nvarchar(4000),@P1 datetime2,@P2 nvarchar(4000),@P3 varchar(8000),@P4
varchar(8000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 datetime2,@P8
nvarchar(4000),@P9 varchar(8000),@P10 nvarchar(4000),@P11
varchar(8000),@P12 bigint,@P13 varchar(8000),@P14 varchar(8000),@P15
nvarchar(4000),@P16 varchar(8000),@P17 datetime2,@P18 nvarchar(4000),@P19
nvarchar(4000),@P20 nvarchar(4000),@P21 varchar(8000),@P22
nvarchar(4000),@P23 nvarchar(4000),@P24 nvarchar(4000),@P25
nvarchar(4000),@P26 nvarchar(4000),@P27 nvarchar(4000),@P28 bigint,@P29
nvarchar(4000),@P30 nvarchar(4000),@P31 nvarchar(4000),@P32
nvarchar(4000),@P33 nvarchar(4000),@P34 nvarchar(4000))Update
IN_R_AU_MEM_ELIG_DTL set APPROVAL_CODE = @P0, ARCHIVE_DT = @P1,
CASE_NUMBER = @P2, CATEGORY_CODE = @P3, CG_STATUS_CODE = @P4, CLOSE_DATE =
@P5, CLOSURE_CODE = @P6, CREATE_DT = @P7, CREATE_USER_ID = @P8,
DELETE_INDICATOR = @P9, ELIGIBILITY_SEQUENCE_NUMBER = @P10,
ELIG_INCAR_FLAG = @P11, HISTORY_SEQ = @P12, INCARCERATION_CODE = @P13,
INCARCERATION_DISCHARGE_DATE = </inputbuf>
</process>
<process id="process13e98b088" taskpriority="0" logused="1992"
waitresource="KEY: 7:72057600641925120 (f128991fbbbb)" waittime="6349"
ownerId="7751290" transactionname="implicit_transaction"
lasttranstarted="2018-10-25T21:52:35.803" XDES="0x12f960d90" lockMode="U"
schedulerid="3" kpid="6176" status="suspended" spid="66" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2018-10-25T21:52:36.610"
lastbatchcompleted="2018-10-25T21:52:36.603" lastattention="1900-01-
01T00:00:00.603" clientapp="Microsoft JDBC Driver for SQL Server"
hostname="USGURBHABISHT9" hostpid="0" loginname="bharat"
isolationlevel="read committed (2)" xactid="7751290" currentdb="7"
lockTimeout="4294967295" clientoption1="671088672" clientoption2="128058">
<inputbuf>
(@P0 nvarchar(4000),@P1 datetime2,@P2 nvarchar(4000),@P3 varchar(8000),@P4
varchar(8000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 datetime2,@P8
nvarchar(4000),@P9 varchar(8000),@P10 nvarchar(4000),@P11
varchar(8000),@P12 bigint,@P13 varchar(8000),@P14 varchar(8000),@P15
nvarchar(4000),@P16 varchar(8000),@P17 datetime2,@P18 nvarchar(4000),@P19
nvarchar(4000),@P20 nvarchar(4000),@P21 varchar(8000),@P22
nvarchar(4000),@P23 nvarchar(4000),@P24 nvarchar(4000),@P25
nvarchar(4000),@P26 nvarchar(4000),@P27 nvarchar(4000),@P28 bigint,@P29
nvarchar(4000),@P30 nvarchar(4000),@P31 nvarchar(4000),@P32
nvarchar(4000),@P33 nvarchar(4000),@P34 nvarchar(4000))Update
IN_R_AU_MEM_ELIG_DTL set APPROVAL_CODE = @P0, ARCHIVE_DT = @P1,
CASE_NUMBER = @P2, CATEGORY_CODE = @P3, CG_STATUS_CODE = @P4, CLOSE_DATE =
@P5, CLOSURE_CODE = @P6, CREATE_DT = @P7, CREATE_USER_ID = @P8,
DELETE_INDICATOR = @P9, ELIGIBILITY_SEQUENCE_NUMBER = @P10,
ELIG_INCAR_FLAG = @P11, HISTORY_SEQ = @P12, INCARCERATION_CODE = @P13,
INCARCERATION_DISCHARGE_DATE = </inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057600641925120" dbid="7"
objectname="IEWP_EE.dbo.IN_R_AU_MEM_ELIG_DTL"
indexname="IN_R_AU_MEM_ELIG_DTL_PK" id="lock104f34d00" mode="X"
associatedObjectId="72057600641925120">
<owner-list>
<owner id="process13e98b088" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="process147029c28" mode="U" requestType="wait"/>
</waiter-list>
</keylock>
<keylock hobtid="72057600641925120" dbid="7"
objectname="IEWP_EE.dbo.IN_R_AU_MEM_ELIG_DTL"
indexname="IN_R_AU_MEM_ELIG_DTL_PK" id="lock12ac42100" mode="X"
associatedObjectId="72057600641925120">
<owner-list>
<owner id="process147029c28" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="process13e98b088" mode="U" requestType="wait"/>
</waiter-list>
</keylock>
</resource-list>
</deadlock>
</deadlock-list>
Update statement:
Update table_name, setting all the columns (even the primary key columns)
where 1=1
AND PERSON_NUMBER = '7769999750768'
AND TYPE_CASE_CODE = '550'
AND START_DATE = '20180901'
AND INCARCERATION_ADMIT_DATE = '00000000'
AND OSS_AMOUNT = '000'
AND AID_CATEGORY = '50'
The columns used in the where clause make up the composite primary key,
and the two deadlocking update queries use different primary key values.
So how can they block each other on the primary key?
table schema:
CREATE TABLE [dbo].[IN_R_AU_MEM_ELIG_DTL](
[CREATE_USER_ID] [varchar](20) NOT NULL,
[CREATE_DT] [datetime] NOT NULL,
[UNIQUE_TRANS_ID] [bigint] NOT NULL,
[HISTORY_SEQ] [bigint] NOT NULL,
[RECORD_TYPE] [varchar](1) NULL,
[SUB_RECORD_TYPE] [varchar](1) NULL,
[RECORD_SEQUENCE] [varchar](9) NULL,
[PERSON_NUMBER] [varchar](13) NOT NULL,
[PERSON_SEQUENCE] [varchar](8) NULL,
[ELIGIBILITY_SEQUENCE_NUMBER] [varchar](5) NULL,
[CASE_NUMBER] [varchar](13) NULL,
[CATEGORY_CODE] [varchar](2) NULL,
[TYPE_CASE_CODE] [varchar](3) NOT NULL,
[START_DATE] [varchar](8) NOT NULL,
[CLOSE_DATE] [varchar](8) NULL,
[APPROVAL_CODE] [varchar](3) NULL,
[CLOSURE_CODE] [varchar](3) NULL,
[DELETE_INDICATOR] [varchar](1) NULL,
[INCARCERATION_CODE] [varchar](1) NULL,
[INCARCERATION_ADMIT_DATE] [varchar](8) NOT NULL,
[INCARCERATION_DISCHARGE_DATE] [varchar](8) NULL,
[INCARCERATION_ELIG_FLAG] [varchar](1) NULL,
[RENEWAL_DATE] [varchar](8) NULL,
[RENEWAL_CODE] [varchar](2) NULL,
[PRE_RELEASE_DATE] [varchar](8) NULL,
[LOCATION_CODE] [varchar](4) NULL,
[MONEY_CODE] [varchar](1) NULL,
[OSS_AMOUNT] [varchar](5) NOT NULL,
[AID_CATEGORY] [varchar](10) NOT NULL,
[CG_STATUS_CODE] [varchar](5) NULL,
[MMIS_SND_PERSON_SEQ_NUM] [varchar](15) NULL,
[PROCESS_SW] [varchar](1) NULL,
[ELIG_INCAR_FLAG] [varchar](1) NULL,
[ARCHIVE_DT] [datetime] NULL,
[ROWID] [uniqueidentifier] NOT NULL DEFAULT (newid()),
[MMIS_SND_DT] [datetime] NULL,
CONSTRAINT [IN_R_AU_MEM_ELIG_DTL_PK] PRIMARY KEY CLUSTERED
(
[INCARCERATION_ADMIT_DATE] ASC,
[AID_CATEGORY] ASC,
[OSS_AMOUNT] ASC,
[PERSON_NUMBER] ASC,
[START_DATE] ASC,
[TYPE_CASE_CODE] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
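One way to read the resource-list in the deadlock graph above is as a wait-for graph: every waiter on a key lock is blocked by every owner of that lock. The following is a minimal Python sketch of that reading, not a definitive tool. It uses the two keylock entries from the graph, trimmed to the attributes that matter, and the helper name `wait_for_edges` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two <keylock> entries from the deadlock graph, trimmed to the
# attributes needed here (ids, modes and index name are from the graph).
RESOURCE_LIST = """
<resource-list>
  <keylock indexname="IN_R_AU_MEM_ELIG_DTL_PK" id="lock104f34d00" mode="X">
    <owner-list><owner id="process13e98b088" mode="X"/></owner-list>
    <waiter-list><waiter id="process147029c28" mode="U" requestType="wait"/></waiter-list>
  </keylock>
  <keylock indexname="IN_R_AU_MEM_ELIG_DTL_PK" id="lock12ac42100" mode="X">
    <owner-list><owner id="process147029c28" mode="X"/></owner-list>
    <waiter-list><waiter id="process13e98b088" mode="U" requestType="wait"/></waiter-list>
  </keylock>
</resource-list>
"""


def wait_for_edges(xml_text):
    """Return (waiter, wanted_mode, owner, held_mode, lock_id) tuples."""
    edges = []
    for lock in ET.fromstring(xml_text).iter("keylock"):
        for waiter in lock.iter("waiter"):
            for owner in lock.iter("owner"):
                edges.append((waiter.get("id"), waiter.get("mode"),
                              owner.get("id"), owner.get("mode"),
                              lock.get("id")))
    return edges


edges = wait_for_edges(RESOURCE_LIST)
for waiter, wanted, owner, held, lock_id in edges:
    print(f"{waiter} wants {wanted} on {lock_id}, held in {held} by {owner}")
```

Read this way, each process holds an X lock on one key of IN_R_AU_MEM_ELIG_DTL_PK and waits for a U lock on a second, distinct key lock that the other process holds.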

Analysing sql deadlock xml

I need help in analysing the following deadlock xml
<deadlock>
<victim-list>
<victimProcess id="processa9f6f73c28" />
</victim-list>
<process-list>
<process id="processa9f6f73c28" taskpriority="0" logused="0" waitresource="KEY: 5:72057594060013568 (bd1a413b4dd8)" waittime="1759" ownerId="19463226" transactionname="user_transaction" lasttranstarted="2018-05-21T14:43:38.640" XDES="0xa9dec70458" lockMode="X" schedulerid="2" kpid="8068" status="suspended" spid="122" sbid="2" ecid="0" priority="0" trancount="2" lastbatchstarted="2018-05-21T14:43:38.640" lastbatchcompleted="2018-05-21T14:43:38.637" lastattention="1900-01-01T00:00:00.637" clientapp=".Net SqlClient Data Provider" hostname="RD0003FF430FC8" hostpid="12344" loginname="officearchitect" isolationlevel="read committed (2)" xactid="19463226" currentdb="5" currentdbname="OfficeArchitect_Performance_Test" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.RelationshipPair_DeleteByRelationshipIds" queryhash="0x9a6597d902cb7ffa" queryplanhash="0x4f762f1ec930146f" line="7" stmtstart="302" stmtend="566" sqlhandle="0x03000500f540c416e4e82300e7a8000001000000000000000000000000000000000000000000000000000000">DELETE
RP
FROM
[model].RelationshipPair RP
INNER JOIN
#RelationshipIdTable RIT
ON
RP.RelationshipId = RIT.EntityI</frame>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Relationship_Delete" queryhash="0x0000000000000000" queryplanhash="0x0000000000000000" line="20" stmtstart="910" stmtend="1066" sqlhandle="0x030005000d989e702ae82300e7a8000001000000000000000000000000000000000000000000000000000000">EXEC [model].[RelationshipPair_DeleteByRelationshipIds]
#RelationshipIdTabl</frame>
</executionStack>
<inputbuf>Proc [Database Id = 5 Object Id = 1889441805]</inputbuf>
</process>
<process id="processa9f9857088" taskpriority="0" logused="624" waitresource="KEY: 5:72057594060013568 (3f1e49aa6519)" waittime="2779" ownerId="19414353" transactionname="user_transaction" lasttranstarted="2018-05-21T14:43:28.600" XDES="0xaa0a244458" lockMode="RangeS-S" schedulerid="2" kpid="51500" status="suspended" spid="164" sbid="2" ecid="0" priority="0" trancount="2" lastbatchstarted="2018-05-21T14:43:28.603" lastbatchcompleted="2018-05-21T14:43:28.593" lastattention="2018-05-21T14:38:44.820" clientapp=".Net SqlClient Data Provider" hostname="RD0003FF430FC8" hostpid="12344" loginname="officearchitect" isolationlevel="read committed (2)" xactid="19414353" currentdb="5" currentdbname="OfficeArchitect_Performance_Test" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Generic_Create" queryhash="0x21c1a974c29371a5" queryplanhash="0x60900e552e5614c5" line="17" stmtstart="898" stmtend="1402" sqlhandle="0x030005005dcdd742fde62300e7a8000001000000000000000000000000000000000000000000000000000000">INSERT INTO [model].[ModelItem]
(
[MetamodelItemId],
[ModelItemCategoryId]
)
OUTPUT [inserted].[ModelItemId], [inserted].[MetamodelItemId]
INTO #ModelItemIdsByMetamodelId
SELECT EntityId, #ModelItemCategoryId
FROM #MetamodelItemIdTabl</frame>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Relationship_Create" queryhash="0x41bf1ae3ccbfaccc" queryplanhash="0x76a3cb6aa572b737" line="134" stmtstart="9960" stmtend="10500" sqlhandle="0x030005009b4fb66e1be82300e7a8000001000000000000000000000000000000000000000000000000000000">INSERT INTO #tempStorage
EXECUTE [model].[ModelItem_Generic_Create]
#MetamodelItemIdTable = #metamodelIds,
#ModelId = #ModelId,
#ModelItemCategoryId = #ModelItemCategoryId,
#DateLastModified = #DateLastModified,
#LastModifiedBy = #LastModifiedB</frame>
</executionStack>
<inputbuf>Proc [Database Id = 5 Object Id = 1857441691]</inputbuf>
</process>
<process id="processa9fb862108" taskpriority="0" logused="43256" waitresource="KEY: 5:72057594060013568 (bd1a413b4dd8)" waittime="40" ownerId="19385479" transactionname="user_transaction" lasttranstarted="2018-05-21T14:43:27.370" XDES="0xa9da75c458" lockMode="RangeS-S" schedulerid="1" kpid="51692" status="suspended" spid="193" sbid="2" ecid="0" priority="0" trancount="2" lastbatchstarted="2018-05-21T14:43:40.320" lastbatchcompleted="2018-05-21T14:43:40.327" lastattention="1900-01-01T00:00:00.327" clientapp=".Net SqlClient Data Provider" hostname="RD0003FF430FC8" hostpid="12344" loginname="officearchitect" isolationlevel="read committed (2)" xactid="19385479" currentdb="5" currentdbname="OfficeArchitect_Performance_Test" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Generic_Delete" queryhash="0xd6e2f8f770b21179" queryplanhash="0x18df7aa720a890f6" line="80" stmtstart="4110" stmtend="4360" sqlhandle="0x0300050096f1cb4302e72300e7a8000001000000000000000000000000000000000000000000000000000000">DELETE
MI
FROM
[model].ModelItem MI
INNER JOIN
#ModelItemIdTable MIT
ON
MIT.EntityId = MI.ModelItemI</frame>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Object_Delete" queryhash="0x0000000000000000" queryplanhash="0x0000000000000000" line="25" stmtstart="1088" stmtend="1302" sqlhandle="0x0300050061e52c65bce72300e7a8000001000000000000000000000000000000000000000000000000000000">EXEC [model].[ModelItem_Generic_Delete]
#ObjectIdTable,
#MarkAsDeleted,
#DeletedBy,
#DeletedO</frame>
</executionStack>
<inputbuf>Proc [Database Id = 5 Object Id = 1697441121]</inputbuf>
</process>
<process id="processa9e0ddc108" taskpriority="0" logused="2657548" waitresource="KEY: 5:72057594060013568 (3f1e49aa6519)" waittime="2779" ownerId="19456397" transactionname="user_transaction" lasttranstarted="2018-05-21T14:43:30.350" XDES="0xa9dc49c458" lockMode="RangeS-S" schedulerid="2" kpid="55424" status="suspended" spid="85" sbid="2" ecid="0" priority="0" trancount="2" lastbatchstarted="2018-05-21T14:43:30.537" lastbatchcompleted="2018-05-21T14:43:30.530" lastattention="1900-01-01T00:00:00.530" clientapp=".Net SqlClient Data Provider" hostname="RD0003FF430FC8" hostpid="12344" loginname="officearchitect" isolationlevel="read committed (2)" xactid="19456397" currentdb="5" currentdbname="OfficeArchitect_Performance_Test" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Generic_Delete" queryhash="0xd6e2f8f770b21179" queryplanhash="0x18df7aa720a890f6" line="80" stmtstart="4110" stmtend="4360" sqlhandle="0x0300050096f1cb4302e72300e7a8000001000000000000000000000000000000000000000000000000000000">DELETE
MI
FROM
[model].ModelItem MI
INNER JOIN
#ModelItemIdTable MIT
ON
MIT.EntityId = MI.ModelItemI</frame>
<frame procname="d2558974-73ab-4869-acd2-9cce4009286e.model.ModelItem_Object_Delete" queryhash="0x0000000000000000" queryplanhash="0x0000000000000000" line="25" stmtstart="1088" stmtend="1302" sqlhandle="0x0300050061e52c65bce72300e7a8000001000000000000000000000000000000000000000000000000000000">EXEC [model].[ModelItem_Generic_Delete]
#ObjectIdTable,
#MarkAsDeleted,
#DeletedBy,
#DeletedO</frame>
</executionStack>
<inputbuf>Proc [Database Id = 5 Object Id = 1697441121]</inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594060013568" dbid="5" objectname="d2558974-73ab-4869-acd2-9cce4009286e.model.RelationshipPair" indexname="PK_RelationshipPair_RelationshipPairId" id="lockaa19259180" mode="RangeS-U" associatedObjectId="72057594060013568">
<owner-list>
<owner id="processa9f9857088" mode="RangeS-S" />
</owner-list>
<waiter-list>
<waiter id="processa9f6f73c28" mode="X" requestType="convert" />
</waiter-list>
</keylock>
<keylock hobtid="72057594060013568" dbid="5" objectname="d2558974-73ab-4869-acd2-9cce4009286e.model.RelationshipPair" indexname="PK_RelationshipPair_RelationshipPairId" id="lockaa18438980" mode="RangeX-X" associatedObjectId="72057594060013568">
<owner-list>
<owner id="processa9e0ddc108" mode="RangeS-S" requestType="wait" />
</owner-list>
<waiter-list>
<waiter id="processa9f9857088" mode="RangeS-S" requestType="wait" />
</waiter-list>
</keylock>
<keylock hobtid="72057594060013568" dbid="5" objectname="d2558974-73ab-4869-acd2-9cce4009286e.model.RelationshipPair" indexname="PK_RelationshipPair_RelationshipPairId" id="lockaa19259180" mode="RangeS-U" associatedObjectId="72057594060013568">
<owner-list>
<owner id="processa9f6f73c28" mode="U" />
<owner id="processa9f6f73c28" mode="X" requestType="convert" />
</owner-list>
<waiter-list>
<waiter id="processa9fb862108" mode="RangeS-S" requestType="wait" />
</waiter-list>
</keylock>
<keylock hobtid="72057594060013568" dbid="5" objectname="d2558974-73ab-4869-acd2-9cce4009286e.model.RelationshipPair" indexname="PK_RelationshipPair_RelationshipPairId" id="lockaa18438980" mode="RangeX-X" associatedObjectId="72057594060013568">
<owner-list>
<owner id="processa9fb862108" mode="RangeX-X" />
</owner-list>
<waiter-list>
<waiter id="processa9e0ddc108" mode="RangeS-S" requestType="wait" />
</waiter-list>
</keylock>
</resource-list>
</deadlock>
Now from what I can understand (this is a bit new to me), the DELETE RP statement was the one that was "victimized", and this was due to the INSERT INTO [model].[ModelItem] statement.
The issue occurred with locking on the index PK_RelationshipPair_RelationshipPairId.
What I don't fully understand are the RangeA-B locks. I understand that a range of values is locked on the index, but I'm not quite sure why.
I understand that without the actual SQL code it is difficult to see exactly what is going on, but I need some assistance with how to go about diagnosing this.
I've tried to replicate the deadlock by running the DELETE and INSERT INTO statements in two transactions (and not completing or rolling back the insert), but no deadlock so far.
Edit
Transaction scope in C# layer is set as follows
var transactionOptions = new TransactionOptions
{
    IsolationLevel = IsolationLevel.ReadCommitted,
    Timeout = TransactionManager.MaximumTimeout
};
using (var transaction = new TransactionScope(TransactionScopeOption.Required, transactionOptions, TransactionScopeAsyncFlowOption.Enabled))
{
    await action(transaction);
    transaction.Complete();
}
It seems you are using the TransactionScope class, which is forcing the range locks that normally exist only at the serializable isolation level, even though you specify Read Committed in the transaction options, as explained here. Range locks such as lockMode="RangeS-S" are prone to blocking and deadlocks.
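To check which sessions in a deadlock graph are asking for key-range locks, the process list can be scanned for `lockMode` values that start with `Range`. The following is a rough Python sketch, assuming the four process entries from the graph above trimmed to three attributes. The function name `range_lock_requests` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Process entries from the deadlock graph, trimmed to a few attributes.
PROCESS_LIST = """
<process-list>
  <process id="processa9f6f73c28" lockMode="X" isolationlevel="read committed (2)"/>
  <process id="processa9f9857088" lockMode="RangeS-S" isolationlevel="read committed (2)"/>
  <process id="processa9fb862108" lockMode="RangeS-S" isolationlevel="read committed (2)"/>
  <process id="processa9e0ddc108" lockMode="RangeS-S" isolationlevel="read committed (2)"/>
</process-list>
"""


def range_lock_requests(xml_text):
    """List (id, lockMode, isolationlevel) for processes requesting a key-range lock."""
    hits = []
    for proc in ET.fromstring(xml_text).iter("process"):
        mode = proc.get("lockMode", "")
        if mode.startswith("Range"):
            hits.append((proc.get("id"), mode, proc.get("isolationlevel")))
    return hits


flagged = range_lock_requests(PROCESS_LIST)
for pid, mode, level in flagged:
    print(f"{pid}: requests {mode} while reporting isolation level {level}")
```

Three of the four sessions are flagged. Each reports "read committed (2)" while requesting RangeS-S, which is the mismatch this answer is about.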

Setting deadlock_priority does not make LOW priority session the victim

One of our jobs was deadlocking with user routines, so we put the following code in the job step, just before the procedure is called:
DECLARE @deadlock_var NCHAR(3);
SET @deadlock_var = N'LOW';
SET DEADLOCK_PRIORITY @deadlock_var;
--Call procedure
exec Client_myDeliveries_I_S
However, the session running the procedure is still not the one chosen as the deadlock victim; the other session is.
Does the deadlock priority not get inherited through the chain of sub procedures?
I can confirm that at no point does the deadlock victim session set its deadlock_priority.
The complete deadlock XML is below, with some deletions for privacy:
<deadlock>
<victim-list>
<victimProcess id="process28fc21868" />
</victim-list>
<process-list>
<process id="process28fc21868" taskpriority="0" logused="2504" waitresource="KEY: 5:72057594562412544 (40fd182c0dd9)" waittime="5008" ownerId="299034576" transactionname="user_transaction" lasttranstarted="2018-02-01T12:22:55.580" XDES="0x140b2cc70" lockMode="X" schedulerid="1" kpid="3600" status="suspended" spid="87" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2018-02-01T12:22:55.580" lastbatchcompleted="2018-02-01T12:22:55.580" lastattention="2018-02-01T12:22:52.480" clientapp="EUROSTOP e-i Service" hostname="SRVAZBRWSQL01" hostpid="1328" loginname="sa" isolationlevel="read committed (2)" xactid="299034576" currentdb="5" lockTimeout="4294967295" clientoption1="673316896" clientoption2="128056">
<executionStack>
<frame procname="ALL.dbo.trgi_u_Constants" line="705" stmtstart="37030" stmtend="37270" sqlhandle="0x03000500651fce5aad8fd3002aa5000000000000000000000000000000000000000000000000000000000000">
update myconstants
set value = (select last_c_no from inserted)
where batch = 'last_c_no' </frame>
<frame procname="IF_TEST.myschema.SendTestData_Customers_SToALL_Customers" line="87" stmtstart="11420" stmtend="11650" sqlhandle="0x030007007f19ab663ac3e5005ca7000001000000000000000000000000000000000000000000000000000000">
update myschema.ALL_Constants
set last_c_no = #nLastCNumber + #NumberOfInsertedCs </frame>
<frame procname="IF_TEST.myschema.SendTestData_Customers" line="22" stmtstart="962" stmtend="1196" sqlhandle="0x030007009c954b7757b4d00054a7000001000000000000000000000000000000000000000000000000000000">
exec myschema.SendTestData_Customers_SToALL_Customers #MessageCode, #RejectAllOnValidationError </frame>
<frame procname="IF_TEST.myschema.SendSubmittedFData" line="13" stmtstart="876" stmtend="1114" sqlhandle="0x030007009b4fb66e2db4d00054a7000001000000000000000000000000000000000000000000000000000000">
exec [myschema].[SendTestData_Customers] #MessageCode, #RejectAllOnValidationError
-- Customer Orders </frame>
<frame procname="adhoc" line="1" stmtstart="104" sqlhandle="0x010007008547740e50dc4dd90700000000000000000000000000000000000000000000000000000000000000">
Exec myschema.SendSubmittedFData #0, #1, #2, #3 </frame>
<frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown </frame>
</executionStack>
<inputbuf>
(#0 nvarchar(4000),#1 int,#2 nvarchar(4000),#3 int);Exec myschema.SendSubmittedFData #0, #1, #2, #3 </inputbuf>
</process>
<process id="process2b6e94cf8" taskpriority="-6" logused="43652" waitresource="KEY: 5:72057594562412544 (d08358b1108f)" waittime="5009" ownerId="299033786" transactionname="user_transaction" lasttranstarted="2018-02-01T12:22:53.810" XDES="0x28262f130" lockMode="X" schedulerid="2" kpid="13408" status="suspended" spid="122" sbid="0" ecid="0" priority="-5" trancount="3" lastbatchstarted="2018-02-01T12:15:00.580" lastbatchcompleted="2018-02-01T12:15:00.580" lastattention="1900-01-01T00:00:00.580" clientapp="SQLAgent - TSQL JobStep (Job 0x24121E41ABD80643985B522FE6C248A7 : Step 1)" hostname="SRVAZBRWSQL01" hostpid="2252" loginname="SRVAZBRWSQL01\ALLSYSTEM" isolationlevel="read committed (2)" xactid="299033786" currentdb="5" lockTimeout="4294967295" clientoption1="673316896" clientoption2="128056">
<executionStack>
<frame procname="ALL.dbo.trgi_u_Constants" line="726" stmtstart="38142" stmtend="38386" sqlhandle="0x03000500651fce5aad8fd3002aa5000000000000000000000000000000000000000000000000000000000000">
update myconstants
set value = (select last_gs_rec_no from inserted)
where batch = 'last_gs_rec_no' </frame>
<frame procname="ALL.dbo.trgi_i_i_I_deliveries" line="1005" stmtstart="69046" stmtend="69220" sqlhandle="0x030005006cd104389a850d00a1a6000000000000000000000000000000000000000000000000000000000000">
update dbo.Constants set last_gs_rec_no = last_gs_rec_no + #nmyDeliveriesCreatedCount </frame>
<frame procname="C_HELP.dbo.Client_myDeliveries_I_Std" line="86" stmtstart="7306" stmtend="8324" sqlhandle="0x030006000d09a1438efcba0074a8000001000000000000000000000000000000000000000000000000000000">
insert into [ALL].dbo.i_I_deliveries
--[columns]
select --[columns]
from #Client_deliveries_PO_stg stg </frame>
<frame procname="C_HELP.dbo.Client_myDeliveries_I_S" line="446" stmtstart="32832" stmtend="33032" sqlhandle="0x0300060028323b37f7a70101eca7000001000000000000000000000000000000000000000000000000000000">
exec dbo.Client_myDeliveries_I_Std #week_selector, #username, #factory_bin_location, #Parameter </frame>
<frame procname="adhoc" line="6" stmtstart="216" sqlhandle="0x02000000e49e3111676b7e3aec714d06946692f70e3a8a880000000000000000000000000000000000000000">
exec Client_myDeliveries_I_S </frame>
</executionStack>
<inputbuf>
DECLARE #deadlock_var NCHAR(3);
SET #deadlock_var = N'LOW';
SET DEADLOCK_PRIORITY #deadlock_var;
exec Client_myDeliveries_I_S </inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594562412544" dbid="5" objectname="ALL.dbo.myconstants" indexname="pk_myconstants" id="lockd5ad7b00" mode="RangeS-U" associatedObjectId="72057594562412544">
<owner-list>
<owner id="process2b6e94cf8" mode="RangeS-S" />
</owner-list>
<waiter-list>
<waiter id="process28fc21868" mode="X" requestType="convert" />
</waiter-list>
</keylock>
<keylock hobtid="72057594562412544" dbid="5" objectname="ALL.dbo.myconstants" indexname="pk_myconstants" id="lock9466df00" mode="RangeS-U" associatedObjectId="72057594562412544">
<owner-list>
<owner id="process28fc21868" mode="RangeS-S" />
</owner-list>
<waiter-list>
<waiter id="process2b6e94cf8" mode="X" requestType="convert" />
</waiter-list>
</keylock>
</resource-list>
</deadlock>
Additional
Version:
Microsoft SQL Server 2012 (SP3-CU2) (KB3137746) - 11.0.6523.0 (X64)
Mar 2 2016 21:29:16
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)
From everyone's input, it seems that deadlock priority can be ignored by SQL Server under circumstances that have not been made clear by Microsoft (that we know of); the algorithms used to determine the victim are not understood or documented. If anyone can find solid information on this topic, please share.
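To compare the chosen victim with the priorities recorded in a deadlock graph, the `priority` and `logused` attributes of each process can be set next to the victim list. The following is a small Python sketch using the two processes from the graph above, trimmed to those attributes. The helper name `summarize_victim` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Victim list and process attributes copied from the deadlock graph above.
DEADLOCK = """
<deadlock>
  <victim-list><victimProcess id="process28fc21868"/></victim-list>
  <process-list>
    <process id="process28fc21868" priority="0" logused="2504"/>
    <process id="process2b6e94cf8" priority="-5" logused="43652"/>
  </process-list>
</deadlock>
"""


def summarize_victim(xml_text):
    """Return one dict per process with its priority, log usage and victim flag."""
    root = ET.fromstring(xml_text)
    victims = {v.get("id") for v in root.iter("victimProcess")}
    return [
        {
            "id": p.get("id"),
            "priority": int(p.get("priority")),
            "logused": int(p.get("logused")),
            "victim": p.get("id") in victims,
        }
        for p in root.iter("process")
    ]


rows = summarize_victim(DEADLOCK)
lowest = min(rows, key=lambda r: r["priority"])
for row in rows:
    print(row)
print("victim is the lowest-priority session:", lowest["victim"])
```

For this graph the last line prints False. The job session does report priority="-5", the numeric value of LOW, so the setting took effect, yet the priority 0 session with the smaller `logused` was chosen as the victim.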

Deadlock due to keylock involving 3 processes

I'm trying to determine how this deadlock occurred and what fix I need to do to prevent it from happening again.
I've attached the deadlock graph image from SSMS. I apologize for the poor resolution of the image that SSMS produces.
What is going on here is 3 processes are locked in a cycle, all waiting for the lock on the primary key for the table SecurityObject to be released.
The primary key for this table is clustered and is a composite key containing four columns.
The statement that each process is running is shown below. It is a delete command, deleting all records from the table that match on a single column. The column is a GUID identifier that is one of the four columns in the composite clustered primary key.
DELETE FROM SecurityObject WHERE col1 = @val1
Where col1 is one of the four columns in the primary key.
What I'm struggling to understand is how this scenario could have happened. How can there be a deadlock scenario for a primary key lock?
Below is the deadlock xml graph:
<deadlock>
<victim-list>
<victimProcess id="processaeabf84108"/>
</victim-list>
<process-list>
<process id="processaeabf84108" taskpriority="0" logused="0" waitresource="KEY: 14:72057594041925632 (00f78314b62e)" waittime="1754" ownerId="6629325" transactionname="user_transaction" lasttranstarted="2017-08-04T15:16:55.747" XDES="0xaea526f498" lockMode="X" schedulerid="2" kpid="16620" status="suspended" spid="73" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-08-04T15:16:55.747" lastbatchcompleted="2017-08-04T15:16:55.747" lastattention="1900-01-01T00:00:00.747" clientapp=".Net SqlClient Data Provider" hostname="RDXP0165C9JAWIE" hostpid="19084" loginname="REDMOND\RDXP0165C9JAWIE$" isolationlevel="read committed (2)" xactid="6629325" currentdb="14" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="SecurityAuthorization.DB.dbo.spDeleteAllSecurityObjects" line="5" stmtstart="342" stmtend="474" sqlhandle="0x03000e00b56a9938f8fcba00c3a7000001000000000000000000000000000000000000000000000000000000"> DELETE FROM [SecurityObject] WHERE [EnvironmentId] = #EnvironmentI </frame>
</executionStack>
<inputbuf> Proc [Database Id = 14 Object Id = 949578421] </inputbuf>
</process>
<process id="processaea64a9468" taskpriority="0" logused="0" waitresource="KEY: 14:72057594041925632 (e0caa7da41f0)" waittime="3981" ownerId="6629329" transactionname="user_transaction" lasttranstarted="2017-08-04T15:16:55.750" XDES="0xaea9602408" lockMode="X" schedulerid="1" kpid="14152" status="suspended" spid="76" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-08-04T15:16:55.750" lastbatchcompleted="2017-08-04T15:16:55.750" lastattention="1900-01-01T00:00:00.750" clientapp=".Net SqlClient Data Provider" hostname="RDXP0165C9JAWIE" hostpid="19084" loginname="REDMOND\RDXP0165C9JAWIE$" isolationlevel="read committed (2)" xactid="6629329" currentdb="14" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="SecurityAuthorization.DB.dbo.spDeleteAllSecurityObjects" line="5" stmtstart="342" stmtend="474" sqlhandle="0x03000e00b56a9938f8fcba00c3a7000001000000000000000000000000000000000000000000000000000000"> DELETE FROM [SecurityObject] WHERE [EnvironmentId] = #EnvironmentI </frame>
</executionStack>
<inputbuf> Proc [Database Id = 14 Object Id = 949578421] </inputbuf>
</process>
<process id="processaea686fc28" taskpriority="0" logused="884" waitresource="KEY: 14:72057594041925632 (e0caa7da41f0)" waittime="2105" ownerId="6638253" transactionname="user_transaction" lasttranstarted="2017-08-04T15:16:57.627" XDES="0xaea9460e58" lockMode="X" schedulerid="2" kpid="6528" status="suspended" spid="79" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-08-04T15:16:57.627" lastbatchcompleted="2017-08-04T15:16:57.627" lastattention="1900-01-01T00:00:00.627" clientapp=".Net SqlClient Data Provider" hostname="RDXP0165C9JAWIE" hostpid="19084" loginname="REDMOND\RDXP0165C9JAWIE$" isolationlevel="read committed (2)" xactid="6638253" currentdb="14" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="SecurityAuthorization.DB.dbo.spDeleteAllSecurityObjects" line="5" stmtstart="342" stmtend="474" sqlhandle="0x03000e00b56a9938f8fcba00c3a7000001000000000000000000000000000000000000000000000000000000"> DELETE FROM [SecurityObject] WHERE [EnvironmentId] = #EnvironmentI </frame>
</executionStack>
<inputbuf> Proc [Database Id = 14 Object Id = 949578421] </inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594041925632" dbid="14" objectname="SecurityAuthorization.DB.dbo.SecurityObject" indexname="PK__Security__185B78FE57F79F91" id="lockaead1a0680" mode="X" associatedObjectId="72057594041925632">
<owner-list>
<owner id="processaea686fc28" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="processaeabf84108" mode="X" requestType="wait"/>
</waiter-list>
</keylock>
<keylock hobtid="72057594041925632" dbid="14" objectname="SecurityAuthorization.DB.dbo.SecurityObject" indexname="PK__Security__185B78FE57F79F91" id="lockae6d468f80" mode="X" associatedObjectId="72057594041925632">
<owner-list>
<owner id="processaeabf84108" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="processaea64a9468" mode="X" requestType="wait"/>
</waiter-list>
</keylock>
<keylock hobtid="72057594041925632" dbid="14" objectname="SecurityAuthorization.DB.dbo.SecurityObject" indexname="PK__Security__185B78FE57F79F91" id="lockae6d468f80" mode="X" associatedObjectId="72057594041925632">
<owner-list>
<owner id="processaea64a9468" mode="X" requestType="wait"/>
</owner-list>
<waiter-list>
<waiter id="processaea686fc28" mode="X" requestType="wait"/>
</waiter-list>
</keylock>
</resource-list>
</deadlock>
Here is the execution plan of the stored procedure:
After further testing, I've managed to isolate the root cause of the deadlock scenario: concurrent calls to both DeleteAll (which deletes a subset of the records in the table) and Insert (which inserts a record that matches the criteria of the DeleteAll).
The exact sequence of events that leads to the deadlock remains unclear, but the issue is solved by setting the isolation level to serializable.
The performance cost is an acceptable side effect: in my scenario nothing waits on these operations (it is a fire-and-forget process), so their performance does not matter.
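The three-process cycle in the deadlock graph above can be traced by following each waiter to the owner of the lock it wants. The following is a minimal Python sketch, with the waiter-to-owner edges transcribed by hand from the resource-list. The function name `find_cycle` is made up for illustration, and the sketch assumes each process waits on exactly one other process, as in this graph.

```python
# Wait-for edges transcribed from the resource-list above: waiter -> owner.
WAITS_FOR = {
    "processaeabf84108": "processaea686fc28",  # waits on lockaead1a0680
    "processaea64a9468": "processaeabf84108",  # waits on lockae6d468f80
    "processaea686fc28": "processaea64a9468",  # waits on lockae6d468f80
}


def find_cycle(waits_for, start):
    """Follow waiter -> owner links from start; return the cycle through start, or None."""
    path = [start]
    seen = {start}
    current = waits_for.get(start)
    while current is not None:
        if current == start:
            return path
        if current in seen:
            return None  # the chain loops, but not back through start
        path.append(current)
        seen.add(current)
        current = waits_for.get(current)
    return None  # the chain ends at a process that is not waiting


cycle = find_cycle(WAITS_FOR, "processaeabf84108")
if cycle:
    print(" -> ".join(cycle + [cycle[0]]))
```

Starting from the victim, the walk visits all three processes before returning to it, which is the cycle described in the question.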