\item LUTs replaced by smaller BG-specific parameters
\item Inefficient load/store replaced by circular memcpy
\end{itemize}
\item Bug fixes:
\begin{itemize}
\item Fixed bug in function \texttt{llr2CnProcBuf}
\item Corrected input LLR dynamic range in BLER simulations
\end{itemize}
\item Results:
\begin{itemize}
\item Size of LUTs reduced significantly (60MB to 200KB)
\item Siginifcantly enhances execution time (factor 3.5)
\item Improved BLER performance (all simulation results have been updated)
\end{itemize}
\end{itemize}
\newpage
\tableofcontents
\newpage
\section{Introduction}
\label{sec:introduction}
Low Density Parity Check (LDPC) codes have been developed by Gallager in 1963 \cite{gallager1962low}. They are linear error correcting codes that are capacity-achieving for large block length and are completely described by their Parity Check Matrix (PCM) $\H^{M\times N}$. The PCM $\H$ defines $M$ constraints on the codeword $\c$ of length $N$ such that
\begin{equation}
\label{eq:29}
\H\c = \0.
\end{equation}
The number of information bits $B$ that can be encoded with $\H$ is given by $B=N-M$. Hence the code rate $R$ of $\H$ reads
\begin{equation}
\label{eq:37}
R = \frac{B}{N} = 1-\frac{M}{N}.
\end{equation}
\subsection{LDPC in NR}
\label{sec:ldpc-nr}
NR uses quasi-cyclic (QC) Protograph LDPC codes, i.e. a smaller graph, called Base Graph (BG), is defined and utilized to construct the larger PCM. This has the advantage that the large PCM does not have to be stored in memory and allows for a more efficient implementation while maintaining good decoding properties.
Two BGs $\Hbg\in\Nbb^{\Mb\times\Nb}$ are defined in NR:
\begin{enumerate}
\item$\Hbgo\in\Nbb^{46\times68}$
\item$\Hbgt\in\Nbb^{42\times52}$
\end{enumerate}
where $\Nbb$ is the set of integers. For instance the first 3 rows and 13 columns of BG2 are given by
To obtain the PCM $\H$ from the BG $\Hbg$, each element $\Hbg(i,j)$ in the BG is replaced by a lifting matrix of size $\Zc\times\Zc$ according to
\begin{equation}
\label{eq:35}
\Hbg(i,j) =
\begin{cases}
\0&\textrm{if}~ \Hbg(i,j)=\emptyset\\
\I_{P_{ij}}&\textrm{otherwise}
\end{cases}
\end{equation}
where $\I_{P_{ij}}$ is the identity matrix circularly shifted to the right by $P_{ij}=\Hbg(i,j)\mod\Zc$. Hence, the resulting PCM $\H$ will be of size $\Mb\Zc\times\Nb\Zc$.
The lifting size $\Zc$ depends on the number of bits to encode. To limit the complexity, a discrete set $\mathcal{Z}$ of possible values of $\Zc$ has been defined in \cite{3gpp2017_38212} and the optimal value $\Zc$ is calculated according to
The base rate of the two BGs is $1/3$ and $1/5$ for BG1 and BG2, respectively. That is, BG1 encodes $K=22\Zc$ bits and BG2 encodes $K=10\Zc$ bits. Note that the first 2 columns of BG 1 and 2 are always punctured, that is after encoding, the first $2\Zc$ bits are discarded and not transmitted.
For instance, consider $B=500$ information bits to encode using BG2, \eqref{eq:36} yields $\Zc=64$ hence $K=640$. Since $K>B$, $K-B=140$ filler bits are appended to the information bits. The PCM $\Hbgt$ is of size $2688\times3328$ and the $640$ bits $\b$ are encoded according to \eqref{eq:29} at a rate $R \approx0.192$. To achieve the higher base rate of $0.2$, the first $128$ are punctured, i.e. instead of transmitting all $3328$ bits, only $3200$ are transmitted resulting in the desired rate $R=640/3200=0.2$.
\subsection{LDPC Decoding}
\label{sec:ldpc-decoding}
The decoding of codeword $\c$ can be achieved via the classical message passing algorithm. This algorithm can be illustrated best using the Tanner graph of the PCM. The rows of the PCM are called check nodes (CN) since they represent the parity check equations. The parity check equation of each of these check nodes involves various bits in the codeword. Similarly, every column of the PCM corresponds to a bit and each bit is involved in several parity check equations. In the Tanner graph representation, the bits are called bit nodes (BN). Let's go back to the previous example of BG2 and assume $\Zc=2$, hence the first 3 rows and 13 columns of BG2 $\Hbgt$ read
\node[vnode, label=below:$c_2$, left of=c3, node distance=1.5cm] (c2) {};
\node[vnode, label=below:$c_1$, left of=c2, node distance=1.5cm] (c1) {};
\node[vnode, label=below:$c_0$, left of=c1, node distance=1.5cm] (c0) {};
\node[vnode, label=below:$c_4$, right of=c3, node distance=1.5cm] (c4) {};
\node[vnode, label=below:$c_5$, right of=c4, node distance=1.5cm] (c5) {};
\node[vnode, label=below:$c_6$, right of=c5, node distance=1.5cm] (c6) {};
% Draw edges
\draw (c0) edge[connector] (v1);
\draw (c1) edge[connector] (v0);
\draw (c1) edge[connector] (v2);
\draw (c2) edge[connector] (v1);
\draw (c3) edge[connector] (v0);
\draw (c4) edge[connector] (v0);
\draw (c4) edge[connector] (v2);
\draw (c5) edge[connector] (v1);
\draw (c6) edge[connector] (v0);
\draw (c6) edge[connector] (v2);
\end{tikzpicture}
\caption{Tanner graph for first 7 bits nodes and 3 check nodes from \eqref{eq:39}.}
\end{figure}
The message passing algorithm is an iterative algorithm where probabilities of the bits (being either 0 or 1) are exchanged between the BNs and CNs. After sufficient iterations, the probabilities will have either converged to either 0 or 1 and the parity check equations will be satisfied, at this point, the codeword has been decoded correctly.
\newpage
\section{LDPC Decoder Implementation}
\label{sec:ldpc-implementation}
The implementation on a general purpose processor (GPP) has to take advantage of potential instruction extension of the processor architecture. We focus on the Intel x86 instruction set architecture (ISA) and its advanced vector extension (AVX). In particular, we utilize AVX2 with its 256-bit single instruction multiple data (SIMD) format. In order to utilize AVX2 to speed up the processing at the CNs and BNs, the corresponding data has to be ordered/aligned in a specific way. The processing flow of the LDPC decoder is depicted in \ref{fig:ldpc_decoder_flow}.
The functions involved are described in more detail in Table \ref{tab:sum_func}.
\begin{table}[ht]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Function}&\textbf{Description}\\
\midrule
\texttt{llr2llrProcBuf}& Copies input LLRs to LLR processing buffer \\
\texttt{llr2CnProcBuf}& Copies input LLRs to CN processing buffer \\
\texttt{cnProc}& Performs CN signal processing \\
\texttt{cnProcPc}& Performs parity check \\
\texttt{cn2bnProcBuf}& Copies the CN results to the BN processing buffer \\
\texttt{bnProcPc}& Performs BN processing for parity check and/or hard-decision \\
\texttt{bnProc}& Utilizes the results of \texttt{bnProcPc} to compute LLRs for CN processing \\
\texttt{bn2cnProcBuf}& Copies the BN results to the CN processing buffer \\
\texttt{llrRes2llrOut}& Copies the results of \texttt{bnProcPc} to output LLRs \\
\texttt{llr2bit}& Performs hard-decision on the output LLRs \\
\bottomrule
\end{tabular}
\caption{Summary of the LDPC decoder functions.}
\label{tab:sum_func}
\end{table}
The input LLRs are assumed to be 8-bit and aligned on 32 bytes. CN processing is carried out in 8-bit whereas BN processing is done in 16 bit. Subsequently, the processing tasks at the CNs and BNs are explained in more detail.
\newpage
\subsection{Check Node Processing}
\label{sec:check-node-proc}
Denote $q_{ij}$ the value from BN $j$ to CN $i$ and let $\Bcal_i$ be the set of connected BNs to the $i$th CN. Then, using the min-sum approximation, CN $i$ has to carry out the following operation for each connected BN.
where $r_{ji}$ is the value returned to BN $j$ from CN $i$. There are $\Mb=\{46,42\}$ CNs in BG 1 and BG 2, respectively. Each of these CNs is connected to only a small number of BNs. The number of connected BNs to CN $i$ is $|\Bcal_i|$. In BG1 and BG2, $|\Bcal_i|=\{3,4,5,6,7,8,9,10,19\}$ and $|\Bcal_i|=\{3,4,5,6,8,10\}$, respectively. The following tables show the number of CNs $M_{|\Bcal_i|}$ that are connected to the same number of BNs.
It can be observed that each CN is at least connected to 3 BNs and there are 9 groups and 5 groups in BG1 and BG2, respectively. Denote the set of CN groups as $\Gcal$ and $M_k$ the number of CNs in group $k\in\Gcal$, e.g. for BG2 $M_4=20\Zc$. Each CN group will be processed separately. The CN processing buffer $p_C^k$ of group $k$ is defined as
min = _mm256_min_epu8(min, _mm256_abs_epi8(ymm0));
sgn = _mm256_sign_epi8(sgn, ymm0);
// Store result
min = _mm256_min_epu8(min, *p_maxLLR); // 128 in epi8 is -127
*p_cnProcBufResBit = _mm256_sign_epi8(min, sgn);
p_cnProcBufResBit++;
}
}
}
\end{lstlisting}
Once all results of the check node processing $r_{ji}$ have been calculated, they are copied to the bit node processing buffer.
\subsection{Bit Node Processing}
\label{sec:bit-node-processing}
Denote $r_{ji}$ the value from CN $i$ to BN $j$ and let $\Ccal_j$ be the set of connected CNs to the $j$th BN. Each BN $j$ has to carry out the following operation for every connected CN $i\in\Ccal_j$.
There are $\Nb=\{68,52\}$ BNs in BG 1 and BG 2, respectively. Each of these BNs is connected to only a small number of CNs. The number of connected CNs to BN $j$ is $|\Ccal_j|$. In BG1 and BG2, $|\Ccal_j|=\{1,4,7,8,9,10,11,12,28,30\}$ and $|\Ccal_j|=\{1,5,6,7,8,9,10,12,13,14,16,22,23\}$, respectively. The following tables show the number of BNs $K_{|\Ccal_j|}$ that are connected to the same number of CNs.
\caption{Bit node groups for BG1 and BG2 for base rates 1/3 and 1/5, respectively.}
\label{tab:bitNodeGroups}
\end{table}
The BNs that are connected to a single CN do not need to be considered in the BN processing since \eqref{eq:46} yields $q_{ij}=\Lambda_j$. It can be observed that the grouping is less compact, i.e. there are many groups with only a small number of elements.
Denote the set of BN groups as $\Bcal$ and $K_k$ the number of BNs in group $k\in\Bcal$, e.g. for BG2 $K_5=2\Zc$. Each BN group will be processed separately. The BN processing buffer $p_B^k$ of group $k$ is defined as
Depending on the code rate, some parity bits are not being transmitted. For instance, for BG2 with code rate $R =1/3$ the last $20\Zc$ bits are discarded. Therefore, the last 20 columns or the last $20\Zc$ parity check equation are not required for decoding. This means that the BN groups shown in table \ref{tab:bitNodeGroups} are depending on the rate.
\begin{lstlisting}[frame=single,caption={Example of BN processing for group 3 from \texttt{bnProcPc}.},label=code_bnproc] % Start your code-block
// If elements in group move to next address
idxBnGroup++;
// Number of groups of 32 BNs for parallel processing
M = (lut_numBnInBnGroups[2]*Z)>>5;
// Set the offset to each CN within a group in terms of 16 Byte
The sum of the LLRs is carried out in 16 bit for accuracy and is then saturated to 8 bit for CN processing. Saturation after each addition results in significant loss of sensitivity for low code rates.
\subsection{Mapping to the Processing Buffers}
\label{sec:mapp-cn-proc}
For efficient processing with the AVX instructions, the data is required to be aligned in a certain manner. That is the reason why processing buffers have been introduced. The drawback is that the results of the processing need to copied every time to the processing buffer of the next task. However, the speed up in computation with AVX more than makes up for the time wasted in copying data. The copying is implemented as a circular memcpy because every edge in the BG is a circular shift of a $Z\times Z$ identity matrix. Hence, a circular mempcy consists of two regular memcpys each copying a part of the $Z$ values depending on the circular shift in the BG definition. The circular shifts are stored in \texttt{nrLDPC\_lut.h} in arrays \texttt{circShift\_BGX\_ZX\_CNGX}. In the specification there are only 8 sets of cirular shifts defined. However, the applied circular shift depends on $Z$, i.e. modulo $Z$. To avoid inefficient modulo operations in loops, we store the the circular shift values for every $Z$. Moreover, for convinience the arrays are already arranged depending on the CN group (CNG).
\newpage
\section{Performance Results}
\label{sec:performance-results}
In this section, the performance in terms of BLER and decoding latency of the current LDPC decoder implementation is verified.
\subsection{BLER Performance}
\label{sec:bler-performance}
In all simulations, we assume AWGN, QPSK modulation and 8-bit input LLRs, i.e. $-127$ until $+127$. The DLSCH coding procedure in 38.212 is used to encode/decode the TB and an error is declared if the TB CRC check failed. Results are averaged over at least $10\,000$ channel realizations.
The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the current LDPC decoder implementation to the reference implementation developed by Kien. This reference implementation is called \textit{LDPC Ref} and uses the min-sum algorithm with 2 layers and 16 bit for processing. Our current optimized decoder implementation is referred to as \textit{LDPC OAI}. Moreover, reference results provided by Huawei are also shown.
\caption{BLER vs. SNR, BG2, Rate=1/5, \{5,10,20\} Iterations, B=1280.}
\label{fig:bler-bg2-15}
\end{figure}
From Figure \ref{fig:bler-bg2-15} it can be observed that the reference decoder outperforms the current implementation significantly for low to medium number of iterations. The reason is the implementation of 2 layers in the reference decoder, which results in faster convergence for punctured codes and hence requires less iterations to achieve a given BLER target. Note that there is a large performance loss of about 4 dB at BLER $10^{-2}$ between the Huawei reference and the current optimized decoder implementation with 5 iterations.
Moreover, there is a gap of about 1.5 dB between the results provided by Huawei and the current decoder with 20 iterations. The reason is the min-sum approximation algorithm used in both the reference decoder and the current implementation. The gap can be closed by using a tighter approximation like the min-sum with normalization or the lambda-min approach. Moreover, the gap closes for higher code rates which can be observed from Figure \ref{fig:bler-bg2-r23}. The gap is only about 0.6 dB for 50 iterations.
The Matlab results denoted \texttt{MATLAB NMS} are obtained with the function \texttt{nrLDPCDecode} provided by the MATLAB 5G Toolbox R2019b. The following options are provided to the function: \texttt{'Termination','max','Algorithm','Normalized min-sum','ScalingFactor',1}. Furthermore, the 8-bit input LLRs are adapted to fit the dynamic range of \texttt{nrLDPCDecode} which is shown in Listing \ref{ldpc_matlab}.
\begin{lstlisting}[frame=single,caption={Input adaptation for MATLAB LDPC Decoder},label=ldpc_matlab]
maxLLR = max(abs(softbits));
rxLLRs = round((softbits/maxLLR)*127);
// adjust range to fit tanh use in decoder code
softbits = rxLLRs/3.4;
\end{lstlisting}
A scaling factor (SF) of 1 has been chosen to compare the results more easily with the \textit{LDPC OAI} since the resulting check node processing is the same. However, the Matlab normelized min-sum algorithm uses layered processing and floating point operations. Thus, for the same number of iterations, the performance is significantly better than \textit{LDPC OAI}, especially for small a number of iterations.
\caption{BLER vs. SNR, BG2, Rate=2/3, \{5,50\} Iterations, B=1280.}
\label{fig:bler-bg2-r23}
\end{figure}
In Figure \ref{fig:bler-bg2-15-2} we compare the performance of different algorithms using at most 50 iterations with early stopping if the parity check passes. The Matlab layered believe propagation (LBP) is used with unquantized input LLRs and performs the best since no approximation is done in the processing. Both NMS and offset min-sum (OMS) use a scaling factor and offset, respectively, that has been empirically found to perform best in this simulation setting. Theirs performance is very close to the BLP and OMS is slightly better than NMS. The performance of \textit{LDPC OAI} is more than 1 dB worse mainly because of the looser approximation. Moreover, the NMS algorithm with SF=1 performs worst probably because the SF is not optimized for the input LLRs. From the results in Figure \ref{fig:bler-bg2-15-2} we can conclude that the performance of the \textit{LDPC OAI} can be significantly improved by adopting an offset min-sum approximation improving the performance to within 0.3dB of the Huawei reference curve.
\caption{BLER vs. SNR, BG2, Rate=1/5, max iterations = 50, B=1280.}
\label{fig:bler-bg2-15-2}
\end{figure}
Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block size of $B=8448$ and highest code rate $R=8/9$. From Figure \ref{fig:bler-bg1-r89} it can be observed that the performance gap is only about 0.3 dB if 50 iterations are used. However, for 5 iterations there is still a significant performance loss of about 2.3 dB at BLER $10^{-2}$.
\caption{BLER vs. SNR, BG1, Rate=8/9 \{5,50\} Iterations, B=8448.}
\label{fig:bler-bg1-r89}
\end{figure}
\newpage
\subsection{Decoding Latency}
\label{sec:decoding-time}
This section provides results in terms of decoding latency. That is, the time it takes the decoder to to finish decoding for a given number of iterations. To measure the run time of the decoder we use the OAI tool \texttt{time\_meas.h}. The clock frequency is about 2.9 GHZ, decoder is run on a single core and the results are averaged over $10\,000$ blocks.
The results in Table \ref{tab:lat-bg2-r15} show the impact of the number of iterations on the decoding latency. It can be observed that the latency roughly doubles if the number of iterations are doubled.
Table \ref{tab:lat-bg2-i5} shows the impact of the code rate on the latency for a given block size and 5 iterations. It can be observed that the performance gain from code rate 1/3 to 2/3 is about a factor 2.
Table \ref{tab:lat-bg1-i5} shows the results for BG1, larges block size and different code rates. The latency difference betwee code rate 1/3 and code rate 2/3 is less than half because upper left corner of the PCM is more dense than the rest of the PCM.
From the above results it can be observed that the data transfer between CNs and BNs takes up a significant amount of the run time. However, the performance gain due to AVX instructions in both CN and BN processing is significantly larger than the penalty incurred by the data transfers.
\section{Parity Check and Early Stopping Criteria}
It is often unnecessary to carry out the maximum number of iterations. After each iteration a parity check (PC) \eqref{eq:29} can be computed and if a valid code word is found the decoder can stop. This functionality has been implemented and the additional overhead is reasonable. The PC is carried out in the CN processing buffer and the calculation complexity itself is negligible. However, for the processing it is necessary to move the BN results to the CN buffer which takes time, the overall overhead is at most $10\%$ compared to an algorithm without early stopping criteria with the same number of iterations. The PC has to be activated via the define \texttt{NR\_LDPC\_ENABLE\_PARITY\_CHECK}.
\section{Conclusion}
\label{sec:conclusion}
The results in the previous sections show that the current optimized LDPC implementation full-fills the requirements in terms of decoding latency for low to medium number of iterations at the expense of a loss in BLER performance. To improve BLER performance, it is recommended to implement a layered algorithm and a min-sum algorithm with normalization. Further improvements upon the current implementation are detailed in the next section.
\newpage
\section{Future Work}
\label{sec:future-work}
The improvements upon the current LDPC decoder implementation can be divided into two categories:
\begin{enumerate}
\item Improved BLER performance
\item Reduced decoding latency
\end{enumerate}
\subsection{Improved BLER Performance}
\label{sec:impr-bler-perf}
The BLER performance can be improved by using a tighter approximation than the min-sum approximation. For instance, the min-sum algorithm can be improved by adding a correction factor in the CN processing . The min-sum approximation in \eqref{eq:40} is modified as
where the constant $c$ is of order $0.5$ typically.
\subsection{Reduced Decoding Latency}
\label{sec:reduc-decod-latency}
The following improvements will reduce the decoding latency:
\begin{itemize}
\item Adapt to AVX512
\item Optimization of CN processing
\item Implement 2/3-layers for faster convergence
\end{itemize}
\paragraph{AVX512:}
The computations in the CN and BN processing can be further accelerated by using AVX512 instructions. This improvement will speed-up the CN and BN processing by a approximately a factor of 2.
\paragraph{Optimization of CN Processing:}
It can be investigated if CN processing can be improved by computing two minima regardless of the number of BNs. Susequently, the (absolute) value fed back to the BN is one of those minima.
\paragraph{Layered processing:}
The LDPC code in NR always punctures the first 2 columns of the base graph. Hence, the decoder inserts LLRs with value 0 at their place and needs to retrieve those bits during the decoding process. Instead of computing all the parity equations and then passing the results to the BN processing, it is beneficial to first compute parity equations where at most one punctured BN is connected to that CN. If two punctured BNs are connected than according to \eqref{eq:40}, the result will be again 0. Thus in a first sub-iteration those parity equation are computed and the results are send to BN processing which calculates the results using only those rows in the PCM. In the second sub-iteration the remaining check equation are used.
The convergence of this layered approach is much fast since the bit can be retrieved more quickly while the decoding complexity remains the same. Therefore, for a fixed number of iterations the layered algorithm will have a significantly better performance.