\item LUTs replaced by smaller BG-specific parameters
\item Inefficient load/store replaced by circular memcpy
\end{itemize}
\item Bug fixes:
\begin{itemize}
\item Fixed bug in function \texttt{llr2CnProcBuf}
\item Introduced saturation to $-127$ in \texttt{bnProc}
\item Corrected input LLR dynamic range in simulation
\end{itemize}
\item Results:
\begin{itemize}
\item Size of LUTs reduced significantly (60MB to 200KB)
\item Significantly improved execution time (factor 3.5)
\item Improved BLER performance
\end{itemize}
\end{itemize}
\newpage
\tableofcontents
\newpage
...
@@ -327,6 +355,7 @@ The functions involved are described in more detail in table \ref{tab:sum_func}.
\texttt{llr2llrProcBuf}& Copies input LLRs to LLR processing buffer \\
\texttt{llr2CnProcBuf}& Copies input LLRs to CN processing buffer \\
\texttt{cnProc}& Performs CN signal processing \\
\texttt{cnProcPc}& Performs parity check \\
\texttt{cn2bnProcBuf}& Copies the CN results to the BN processing buffer \\
\texttt{bnProcPc}& Performs BN processing for parity check and/or hard-decision \\
\texttt{bnProc}& Utilizes the results of \texttt{bnProcPc} to compute LLRs for CN processing \\
...
@@ -514,24 +543,7 @@ The sum of the LLRs is carried out in 16 bit for accuracy and is then saturated
\subsection{Mapping to the Processing Buffers}
\label{sec:mapp-cn-proc}
For efficient processing with the AVX instructions, the data is required to be aligned in a certain manner. That is the reason why processing buffers have been introduced. The drawback is that the results of the processing need to be copied every time to the processing buffer of the next task. However, the speed-up in computation with AVX more than makes up for the time spent copying data. The copying is implemented using look-up tables (LUTs), which are described in table \ref{tab:sum_lut}.
For efficient processing with the AVX instructions, the data is required to be aligned in a certain manner. That is the reason why processing buffers have been introduced. The drawback is that the results of the processing need to be copied every time to the processing buffer of the next task. However, the speed-up in computation with AVX more than makes up for the time spent copying data. The copying is implemented as a circular memcpy because every edge in the BG is a circular shift of a $Z\times Z$ identity matrix. Hence, a circular memcpy consists of two regular memcpys, each copying a part of the $Z$ values depending on the circular shift in the BG definition. The circular shifts are stored in \texttt{nrLDPC\_lut.h} in arrays \texttt{circShift\_BGX\_ZX\_CNGX}. The specification defines only 8 sets of circular shifts. However, the applied circular shift depends on $Z$, i.e. it is the specified value modulo $Z$. To avoid inefficient modulo operations in loops, we store the circular shift values for every $Z$. Moreover, for convenience the arrays are already arranged by CN group (CNG).
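
The two-part copy described above can be sketched as follows. This is a minimal illustration and not the decoder's actual code: the names \texttt{circ\_memcpy} and \texttt{reduce\_shift} are hypothetical, the buffers are assumed to hold one 8-bit value per lifted position, and the rotation direction (left rotation by the shift) is an assumption that may differ from the convention used in \texttt{nrLDPC\_lut.h}.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reduces a shift value from the specification modulo
 * the lifting size Z. In the text this is done once, offline, so that no
 * modulo operation is needed inside the copy loops. */
uint16_t reduce_shift(uint16_t spec_shift, uint16_t Z)
{
    return (uint16_t)(spec_shift % Z);
}

/* Hypothetical helper: copies Z values from src to dst, rotated by `shift`
 * (0 <= shift < Z), so that dst[i] = src[(i + shift) % Z].
 * Assumption: left rotation; the real decoder may rotate the other way. */
void circ_memcpy(int8_t *dst, const int8_t *src, size_t Z, size_t shift)
{
    size_t tail = Z - shift;

    /* First part: src[shift .. Z-1] goes to dst[0 .. tail-1]. */
    memcpy(dst, src + shift, tail);
    /* Second part: src[0 .. shift-1] goes to dst[tail .. Z-1]. */
    memcpy(dst + tail, src, shift);
}
```

For example, with $Z=5$ and a reduced shift of 2, the input $(0,1,2,3,4)$ is copied as $(2,3,4,0,1)$. Both parts are contiguous, so each maps to a single regular memcpy.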
\begin{table}[ht]
\centering
\begin{tabular}{ll}
\toprule
\textbf{LUT}&\textbf{Description}\\
\midrule
\texttt{lut\_llr2llrProcBuf\_BGX\_ZX\_RX}& Indices for function \texttt{llr2llrProcBuf}\\
\texttt{lut\_llr2CnProcBuf\_BGX\_ZX\_RX}& Indices for function \texttt{llr2CnProcBuf}\\
\texttt{lut\_cn2bnProcBuf\_BGX\_ZX\_RX}& Indices for functions \texttt{cn2bnProcBuf} and \texttt{bn2cnProcBuf}\\
\bottomrule
\end{tabular}
\caption{Summary of the LUTs.}
\label{tab:sum_lut}
\end{table}
These LUTs depend on the BG, the lifting size, and the code rate. Assuming 5 rates for BG2 and 7 rates for BG1, the total number of LUTs is 617.
\newpage
\section{Performance Results}
...
@@ -595,7 +607,10 @@ The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the curren
@@ -709,7 +728,7 @@ Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block si
\label{fig:bler-bg1-r89}
\end{figure}
From Figure \ref{fig:bler-bg1-r89} it can be observed that the performance gap is only about 0.2 dB if 50 iterations are used. However, for 5 iterations there is still a significant performance loss of about 3.4 dB at BLER $10^{-2}$.
From Figure \ref{fig:bler-bg1-r89} it can be observed that the performance gap is only about 0.3 dB if 50 iterations are used. However, for 5 iterations there is still a significant performance loss of about 2.3 dB at BLER $10^{-2}$.
\newpage
\subsection{Decoding Latency}
...
@@ -824,8 +843,8 @@ Table \ref{tab:lat-bg1-i5} shows the results for BG1, larges block size and diff
From the above results it can be observed that the data transfer between CNs and BNs takes up a significant amount of the run time. However, the performance gain due to AVX instructions in both CN and BN processing is significantly larger than the penalty incurred by the data transfers.
\section{Parity Check and Early Stopping Criteria}
It is often unnecessary to carry out the maximum number of iterations. After each iteration a parity check (PC) \eqref{eq:29} can be computed, and if a valid code word is found the decoder can stop. This functionality has been implemented and the additional overhead is reasonable. The PC is carried out in the CN processing buffer and the computational complexity itself is negligible. However, the processing requires moving the BN results to the CN buffer, which takes time; the overall overhead is at most $10\%$ compared to an algorithm without early stopping criteria and the same number of iterations. The PC has to be activated via the define \texttt{NR\_LDPC\_ENABLE\_PARITY\_CHECK}.
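
The early stopping logic can be sketched as follows. This is a simplified illustration under stated assumptions and not the decoder's implementation: it uses a small dense parity-check matrix instead of the lifted BG and the CN processing buffer, it assumes that a negative LLR maps to bit 1, and the names \texttt{parity\_check\_ok}, \texttt{decode\_with\_early\_stop} and \texttt{iter\_fn} are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the hard decisions taken from llr satisfy every parity
 * check of H (rows x cols, row-major, entries 0 or 1).
 * Assumption: llr < 0 is decided as bit 1, otherwise bit 0. */
bool parity_check_ok(const uint8_t *H, size_t rows, size_t cols,
                     const int8_t *llr)
{
    for (size_t r = 0; r < rows; r++) {
        uint8_t syndrome = 0;
        for (size_t c = 0; c < cols; c++)
            syndrome ^= (uint8_t)(H[r * cols + c] & (llr[c] < 0));
        if (syndrome)
            return false; /* at least one check is violated */
    }
    return true;
}

/* Hypothetical callback that performs one decoder iteration in place. */
typedef void (*iter_fn)(int8_t *llr, size_t n);

/* Runs at most max_iter iterations and stops as soon as the parity check
 * passes. Returns the number of iterations actually carried out. */
unsigned decode_with_early_stop(const uint8_t *H, size_t rows, size_t cols,
                                int8_t *llr, unsigned max_iter,
                                iter_fn one_iteration)
{
    unsigned it = 0;
    while (it < max_iter) {
        one_iteration(llr, cols);
        it++;
        if (parity_check_ok(H, rows, cols, llr))
            break; /* valid code word found */
    }
    return it;
}
```

In this sketch the check runs after every iteration, so its cost is paid once per iteration, in line with the per-iteration overhead discussed above.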