
  1. (Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea {kimms213, jhyi, hyuk_jae_lee}@capp.snu.ac.kr )
  2. (Department of Electrical and Information Engineering and Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul, Korea hyunkim@seoultech.ac.kr )



Non-volatile memory, Phase-change memory, Read disturbance errors, On-demand scrubbing

1. Introduction

A modern computer system requires a large amount of main memory owing to its multi-core structure and complex applications. In particular, data-intensive applications such as big data and deep learning require large main memory to hold their data [1-3]. As a result, large-capacity main memory with low power consumption and high reliability has become important [4,5], and studies on the use of phase-change memory (PCM) as main memory have been actively conducted [6-8]. The cell size of PCM is smaller than that of DRAM, so a PCM module can be denser, enabling a larger memory capacity [9]. Furthermore, owing to its non-volatile characteristics, PCM is more advantageous than DRAM in terms of power efficiency and data retention time.

Despite these advantages, PCM suffers from low reliability, which must be addressed before PCM can be used as main memory. One of the main causes of reliability issues in PCM is read disturbance errors (RDEs) [10,11]. An RDE is a phenomenon whereby cells that are repeatedly read are damaged by thermal energy; an RDE occurs when the number of reads exceeds a certain threshold. The conventional solution for RDEs is to scrub the cells in a word before the read count reaches the threshold. Memory scrubbing first reads a word, corrects any bit errors with an error-correcting code (ECC), and writes the corrected word back to the same location. Periodically scrubbing a word therefore prevents RDEs in that word.

However, periodic scrubbing requires read counters, which cause significant resource overhead because the number of reads must be counted in order to trigger scrubbing. In this paper, an on-demand memory scrubbing method that does not require read counters is proposed. Under the given RDE model with ECC, the probability distribution of the number of errors that occur with an additional read is derived. Using this distribution, the proposed solution decides whether to scrub based on the current number of errors. Because the proposed solution requires only the number of errors, it needs no read counters, thereby eliminating nearly 1GB of storage overhead for a 64GB PCM. The contributions of this paper are summarized as follows.

· A probabilistic model for RDEs is mathematically derived, and the optimal on-demand scrubbing policy is derived from the proposed model.

· Monte-Carlo (MC) simulation is conducted to verify the probabilistic model.

· The proposed on-demand scrubbing eliminates the more than 1GB of storage needed for read counters in a 64GB PCM, while fixing more than 99.99% of RDEs.

The remainder of this paper is organized as follows. Section 2 introduces the background, and Section 3 presents the proposed on-demand scrubbing method. In Section 4, experimental results are given. Finally, Section 5 concludes the paper.

2. Background

In this section, the background on error models for RDEs and on RDE mitigation schemes is presented.

2.1 Error Models for RDEs

Typically, a counter-based error model is used for RDEs [9]. Under this model, each cell has an RDE threshold, and an RDE occurs when the number of reads reaches that threshold. The RDE threshold values follow a Gaussian distribution, $N\left(m,\sigma ^{2}\right)$. For later discussions, $m$ = 3,000 and various $\sigma$ values are assumed.

The word size for PCM typically ranges from 64B to 256B [7]; in this paper, 128B words are assumed. For ECC, a 176-21 Reed-Solomon code is assumed, so up to 21 of the 176 symbols in a codeword can be corrected. The 1,408 cells in a word (176 symbols of 8 bits) are assumed to have independent RDE threshold values, each following the Gaussian distribution above.
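A minimal sketch of this setup follows (the model is applied at symbol granularity here, matching the 176-symbol binomial used later; m = 3,000 and σ = 10 are the assumed parameter values):

```python
import random

M, SIGMA = 3000, 10      # assumed RDE threshold distribution N(m, sigma^2)
SYMBOLS = 176            # Reed-Solomon symbols per 128B word
ECC_T = 21               # correctable symbols

def sample_word_thresholds(rng):
    """Draw an independent Gaussian RDE threshold for each symbol of one word."""
    return [rng.gauss(M, SIGMA) for _ in range(SYMBOLS)]

def num_errors(thresholds, reads):
    """Under the counter-based model, a symbol errors once the read count
    reaches its threshold."""
    return sum(1 for t in thresholds if reads >= t)

rng = random.Random(0)
word = sample_word_thresholds(rng)
# errors accumulate with the read count; past ECC_T the word is uncorrectable
print(num_errors(word, 2950), num_errors(word, 3000), num_errors(word, 3050))
```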

2.2 RDE Mitigation Schemes

To mitigate RDE occurrences, memory scrubbing is used [12]. In conventional methods, each word has a read counter. When the counter reaches a certain threshold, the method reads the whole word and checks for errors via ECC; any errors found are corrected and the word is rewritten. This read-and-fix process is called memory scrubbing. Conventional counter-based scrubbing can remove RDEs effectively as long as the scrubbing threshold is well chosen. However, as shown in Fig. 1, it requires a read counter per word, which means an extra 2B of storage per word. Given that the word size in PCM is typically between 64B and 256B, the counters take about 1/32 to 1/128 of the total PCM capacity. Moreover, the read counters are updated frequently, and thus DRAM should be used to hold them. Assuming 512GB of PCM capacity and a 128B word size, nearly 8GB of DRAM is used only for read counters, which is a significantly large overhead.
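The conventional counter-based scheme can be sketched as follows (names and the scrub threshold value are illustrative, not from the paper):

```python
# Sketch of conventional counter-based scrubbing (threshold value is assumed).
SCRUB_THRESHOLD = 2900  # reads allowed before a precautionary scrub

class CounterScrubber:
    def __init__(self, num_words):
        self.counters = [0] * num_words  # one 2B counter per word, held in DRAM

    def on_read(self, word_idx, scrub_fn):
        self.counters[word_idx] += 1
        if self.counters[word_idx] >= SCRUB_THRESHOLD:
            scrub_fn(word_idx)           # read, ECC-correct, write back
            self.counters[word_idx] = 0  # rewrite resets the disturbance count

scrubbed = []
s = CounterScrubber(4)
for _ in range(SCRUB_THRESHOLD):
    s.on_read(0, scrubbed.append)
print(scrubbed)  # [0]: word 0 scrubbed once after SCRUB_THRESHOLD reads
```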

Fig. 1. Diagram of PCM and its read counters.

3. On-demand Memory Scrubbing

In this section, an on-demand memory scrubbing method that effectively eliminates read counters is described. First, the probability distribution under the Gaussian counter-based error model is derived, and then, an efficient on-demand scrubbing policy under the given probability distribution is suggested.

3.1 Probability Distribution for the Number of Errors

Let $L$ denote the number of errors, $K$ the number of reads, and $T$ the RDE threshold. The first probability to derive is $e_{k}$, the probability that an error has occurred when the number of reads is $k$ $\left(K=k\right)$. $e_{k}$ is derived as follows:

(1)
$e_{k}=P\left(k\geq T\right)=\Phi \left(\frac{k-m}{\sigma }\right)$,

where $\Phi$ is the standard normal cumulative distribution function. Because an error occurs when the number of reads reaches the RDE threshold, the first equality in (1) holds. The second equality follows from the Gaussian modeling of $T$, so the value can be read directly from a standard normal table.
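Since $e_k$ is the Gaussian CDF of the threshold evaluated at $k$, it can be computed with the error function alone; a sketch assuming m = 3,000 and σ = 10:

```python
from math import erf, sqrt

M, SIGMA = 3000, 10  # assumed model parameters

def e_k(k, m=M, sigma=SIGMA):
    """P(T <= k) for T ~ N(m, sigma^2): the normal CDF at (k - m) / sigma."""
    return 0.5 * (1.0 + erf((k - m) / (sigma * sqrt(2.0))))

print(e_k(3000))  # 0.5: half the cells have a threshold below the mean
```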

When $K=k$, the probability of $L$ being $l$ follows a binomial distribution, $B\left(l;176,e_{k}\right)$. More specifically, each of the 176 symbols is in error with probability $e_{k}$, independently, which gives:

(2)
$P(l|k)=B\left(l;176,e_{k}\right)=\left(\begin{array}{l} 176\\ l \end{array}\right)\cdot {e_{k}}^{l}\cdot \left(1-e_{k}\right)^{176-l}$.
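The pmf in (2) can be evaluated directly with the standard library; a minimal sketch under the assumed parameters m = 3,000 and σ = 10:

```python
from math import comb, erf, sqrt

SYMBOLS = 176
M, SIGMA = 3000, 10          # assumed model parameters

def e_k(k):
    """P(T <= k) for T ~ N(M, SIGMA^2)."""
    return 0.5 * (1.0 + erf((k - M) / (SIGMA * sqrt(2.0))))

def p_l_given_k(l, k):
    """Eq. (2): binomial pmf of l errored symbols out of SYMBOLS at read count k."""
    p = e_k(k)
    return comb(SYMBOLS, l) * p**l * (1.0 - p)**(SYMBOLS - l)

# the pmf must sum to 1 over l for any fixed read count k
print(sum(p_l_given_k(l, 2980) for l in range(SYMBOLS + 1)))
```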

However, we are interested in $P(k|l)$ rather than $P(l|k)$, because the value that can be observed is $l$, not $k$. By Bayes' rule, $P(k|l)$ is derived as follows:

(3)
$P(k|l)=\frac{P(l|k)\cdot P\left(k\right)}{P\left(l\right)}=\frac{P(l|k)}{\sum _{k}P(l|k)}$.

The second equality assumes a uniform prior $P(k)$, which then cancels. Thus, when the observed number of errors is $l$, the hidden number of reads $k$ has the probability distribution derived in (3).
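The posterior in (3) can be computed by normalizing the binomial likelihood over a range of $k$; a sketch assuming m = 3,000, σ = 10, a uniform prior on $k$, and a truncated summation range (the range is an assumption):

```python
from math import comb, erf, sqrt

SYMBOLS = 176
M, SIGMA = 3000, 10          # assumed model parameters

def e_k(k):
    """P(T <= k) for T ~ N(M, SIGMA^2)."""
    return 0.5 * (1.0 + erf((k - M) / (SIGMA * sqrt(2.0))))

def p_l_given_k(l, k):
    p = e_k(k)
    return comb(SYMBOLS, l) * p**l * (1.0 - p)**(SYMBOLS - l)

def p_k_given_l(l, k_range):
    """Eq. (3): normalize the likelihood over k, assuming a uniform prior."""
    lik = {k: p_l_given_k(l, k) for k in k_range}
    total = sum(lik.values())
    return {k: v / total for k, v in lik.items()}

post = p_k_given_l(7, range(2700, 3300))
print(sum(k * p for k, p in post.items()))  # expected hidden read count given l = 7
```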

Lastly, probability $P(L'=l'|L=l)$ is derived, where $L'$ denotes the number of errors after one additional read operation. Thus, $P(l'|l)$ is the probability that the number of errors grows from $l$ to $l'$ due to an additional read. It is derived as follows:

(4)
$P\left(l'|l\right)=\sum _{k=0}^{\infty }B\left(\left(l'-l\right);176-l,e_{k+1}\right)\cdot P(k|l)$.

$B\left(\left(l'-l\right);176-l,e_{k+1}\right)$ is the probability that exactly $\left(l'-l\right)$ of the remaining $176-l$ symbols fail on one more read. Because the number of reads is now $k+1$, $e_{k+1}$ is used as the error probability. Summing this term over $k$, weighted by $P(k|l)$, yields the desired $P\left(l'|l\right)$.

3.2 On-demand Scrubbing Policy

Probability distribution $P\left(l'|l\right)$ was calculated in (4). This distribution is used to derive the on-demand scrubbing policy. Because ECC can correct up to 21 symbols, scrubbing must be done before the number of errors exceeds 21. This means that probability $P\left(L'>21|L=l\right)$ must be small enough when scrubbing is triggered. It can be calculated as follows:

(5)
$P\left(L'>21|l\right)=1-P\left(L'\leq 21|l\right)=1-\sum _{l'=l}^{21}P\left(l'|l\right)$.

Table 1 shows the value of $P\left(L'>21|L=l\right)$ for various values of $\sigma$. When $\sigma =10$ and $L=6$, the probability is 6.5E-7%; that is, when the current number of errors is 6, the probability that one more read pushes the number of errors above 21 is 6.5E-7%. This probability must clearly be very small to prevent an uncorrectable error (UE), so performing scrubbing at a small value of $L$ is always better in terms of removing UEs. However, a small $L$ value causes scrubbing that is too frequent. Table 2 shows the expected number of reads ($K$) with respect to $L$. $K$ increases with $L$, and the increase is relatively large while $L$ is small (from 0 to 6) but becomes minor once $L$ is relatively large. To sum up, it is best to delay scrubbing while the UE probability is small enough, but once $L$ becomes relatively large, delaying scrubbing further gives little benefit because the gain in $K$ is minor. For example, when the probability goal is 99.999%, meaning the UE probability should be less than 0.001%, the selected $L$ values that trigger scrubbing are 7, 10, and 13 for $\sigma$ of 10, 20, and 50, respectively. The average read counts at the selected points are 2984, 2969, and 2920, respectively.
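The trigger selection reduces to picking the largest error count that still meets the probability goal. A sketch using the σ=10 column of Table 1 (values in percent, taken from the paper):

```python
# sigma = 10 column of Table 1: probability of violation, in percent
violation_pct = {6: 6.5e-7, 7: 2.9e-4, 8: 5.0e-3, 9: 0.03,
                 10: 0.08, 11: 0.31, 12: 0.82}
GOAL_PCT = 0.001  # UE probability must stay below 0.001% (99.999% goal)

# scrub at the largest error count whose violation probability meets the goal
trigger = max(l for l, p in violation_pct.items() if p < GOAL_PCT)
print(trigger)  # 7, matching the selected L value for sigma = 10
```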

The proposed on-demand scrubbing provides a significant hardware overhead reduction because it does not require read counters. As mentioned above, the read counter for a single word takes up 2B. Assuming a word size of 128B, 1/64 of the total PCM capacity must be allocated to read counters; for example, with a PCM capacity of 64GB, a total of 1GB of storage is needed for the counters. Using on-demand scrubbing therefore reduces storage overhead by about 1/64 of the total capacity.
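The overhead arithmetic can be checked directly:

```python
WORD_B, COUNTER_B = 128, 2   # word size and per-word counter size in bytes
GiB = 1 << 30

pcm_capacity = 64 * GiB
counter_storage = pcm_capacity // WORD_B * COUNTER_B
# 64GB of PCM at 128B/word needs 2B * 2^29 words = 1GB of counters (1/64)
print(counter_storage // GiB, counter_storage / pcm_capacity)  # 1 0.015625
```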

Table 1. The probability of violation with respect to L.

  L     σ=10       σ=20       σ=50
  0-6   <6.5E-7%   <4.8E-7%   <2.4E-8%
  7     2.9E-4%    6.1E-6%    1.6E-7%
  8     5.0E-3%    2.0E-5%    1.1E-6%
  9     0.03%      8.4E-5%    8.7E-6%
  10    0.08%      2.8E-4%    2.2E-5%
  11    0.31%      3.4E-3%    9.1E-5%
  12    0.82%      0.02%      3.3E-4%

Table 2. The expected value of K with respect to L.

  L     σ=10        σ=20        σ=50
  0-6   2956-2982   2913-2963   2783-2908
  7     2983        2965        2912
  8     2983        2967        2915
  9     2984        2968        2917
  10    2984        2969        2918
  11    2984        2970        2919
  12    2985        2970        2920

4. Simulation Results

In this section, MC simulation results for the proposed on-demand memory scrubbing are presented. For the MC simulations, the RDE threshold value $T$ for each bit in a word was randomly sampled from the Gaussian distribution $N\left(3000, \sigma ^{2}\right)$. Given the generated $T$ values in a word, the number of errors ($L$) was tracked as the read count ($K$) increased.
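One trial of this MC procedure can be sketched as follows (per-bit thresholds with 8 cells per symbol; σ = 10 and the 4,000-read cap are assumptions):

```python
import random

CELLS, SYMBOLS, ECC_T = 1408, 176, 21
M, SIGMA = 3000, 10                    # assumed model parameters
rng = random.Random(42)

def run_word():
    """One MC trial: track symbol errors for a word as the read count grows,
    and return the error count seen just before the word passes the ECC limit."""
    cell_t = [rng.gauss(M, SIGMA) for _ in range(CELLS)]
    # a symbol is in error once its first (lowest-threshold) cell fails
    symbol_t = [min(cell_t[8 * s:8 * s + 8]) for s in range(SYMBOLS)]
    prev_l = 0
    for k in range(1, 4000):
        l = sum(1 for t in symbol_t if k >= t)
        if l > ECC_T:
            return prev_l              # this read is a "violated read"
        prev_l = l
    return prev_l

print(run_word())  # the L value from which the violation occurred
```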

Table 3 shows the MC simulation results for various standard deviation values; the number of MC trials was 1,000,000. The first column of Table 3 shows the number of errors in the current state. If, from that state, one more read causes a violation (i.e., the number of errors exceeds 21), the case is counted as a violated read. The second, fifth, and eighth columns in Table 3 show the number of violated reads for standard deviations of 10, 20, and 50, respectively. For example, when the standard deviation was 10, the violated read value was 4 for an $L$ of 7, meaning that in those four cases, the error count jumped directly from 7 to a value larger than 21. The columns labeled Probability of VR show the ratio of violated reads to the total number of trials. The Average Read Threshold columns give the average read count among the violations; for example, when the standard deviation was 10, the average read threshold for an $L$ of 7 was 2,973, i.e., the average read count over those four violations was 2,973.

Two observations can be drawn from the MC simulation results in Table 3. First, the distribution of violated reads changed with the standard deviation: as the standard deviation of the underlying Gaussian model increased, violations occurred later. In other words, when the standard deviation was relatively large, more errors could be tolerated before scrubbing. The second observation is that the average read threshold remained almost the same as $L$ increased. This means that even if more errors are accumulated before scrubbing in order to reduce the scrubbing frequency, the actual reduction in frequency is negligible. From this observation, the optimal on-demand scrubbing point for $L$ is the point where a violation first occurs. Therefore, the optimal on-demand scrubbing points for $L$ are 7, 10, and 13 for the three standard deviation values, respectively.

It should be noted that the MC simulation results in Table 3 verify the probability distribution described in Section 3.1. The probabilities of violation derived from the probability distributions (Table 1) are similar to the measured values in Table 3. Likewise, for the expected read count drawn from the probability distribution (Table 2), the results confirm that the value does not increase significantly once $L$ is greater than 6.

In addition to RDEs, PCM has other reliability issues, such as write disturbance errors (WDEs). For example, if the current number of errors is 8, four may be RDEs and four may be WDEs. Under this circumstance, consider a case where the proposed on-demand scrubbing policy initiates scrubbing. The policy assumes that all eight errors were caused by RDEs, whereas only four actually were. The probability that an additional read causes an ECC violation is therefore even lower when other error types are present. In summary, the proposed approach, which considers only RDEs, is conservative, and other errors do not compromise the reliability of the proposed on-demand scrubbing.

Table 3. MC simulation results showing violated read and average read threshold values (VR: number of violated reads; %VR: probability of VR in percent; ART: average read threshold).

        σ=10                      σ=20                      σ=50
  L     VR      %VR      ART      VR      %VR      ART      VR      %VR      ART
  0-6   0       0        0        0       0        0        0       0        0
  7     4       0.0004   2973     0       0        0        0       0        0
  8     37      0.0037   2973     0       0        0        0       0        0
  9     197     0.0197   2973     0       0        0        0       0        0
  10    706     0.0706   2973     3       0.0003   2952     0       0        0
  11    2462    0.2462   2973     25      0.0025   2949     0       0        0
  12    7604    0.7604   2973     160     0.016    2948     0       0        0
  13    18692   1.8692   2973     866     0.0866   2948     3       0.0003   2881
  14    40413   4.0413   2973     3669    0.3669   2949     47      0.0047   2876
  15    73756   7.3756   2973     12916   1.2916   2949     432     0.0432   2875
  16    115931  11.5931  2973     38812   3.8812   2949     2880    0.288    2876
  17    157267  15.7267  2973     95337   9.5337   2949     17173   1.7173   2876
  18    186598  18.6598  2974     187397  18.7397  2949     78581   7.8581   2876
  19    198617  19.8617  2974     293767  29.3767  2949     267358  26.7358  2877
  20    197716  19.7716  2974     367048  36.7048  2950     633526  63.3526  2877

5. Conclusion

To efficiently address RDEs, this paper proposes an on-demand scrubbing policy that does not require read counters. Assuming a PCM capacity of 64GB, read counters take up nearly 1GB of storage, which is a significant resource overhead. The proposed method removes this overhead while fixing more than 99.99% of read disturbance errors. From a mathematically derived probability distribution model of RDE occurrence, the optimal on-demand scrubbing policy is derived for each standard deviation of the RDE threshold values. MC simulation results also verified the derived probability distribution model. It should be noted that the probability of preventing violations can be increased by scrubbing more often; there is thus a trade-off between reliability and performance, and the user can adaptively select the configuration according to the application.

ACKNOWLEDGMENTS

This paper was supported in part by the Technology Innovation Program (10080613, DRAM/PRAM heterogeneous memory architecture and controller IC design technology research and development) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A6A1A03032119.

REFERENCES

[1] Kim B., et al., PCM: Precision-Controlled Memory System for Energy Efficient Deep Neural Network Training, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Apr. 2020.
[2] Nguyen D. T., Hung N. H., Kim H., Lee H.-J., An Approximate Memory Architecture for Energy Saving in Deep Learning Applications, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 67, No. 5, pp. 1588-1601, May 2020.
[3] Lee C., Lee H., Effective Parallelization of a High-Order Graph Matching Algorithm for GPU Execution, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 2, pp. 560-571, Feb. 2019.
[4] Kim M., Choi J., Kim H., Lee H., An Effective DRAM Address Remapping for Mitigating Rowhammer Errors, IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1428-1441, Oct. 2019.
[5] Kim M., Chang I., Lee H., Segmented Tag Cache: A Novel Cache Organization for Reducing Dynamic Read Energy, IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1546-1552, 2019.
[6] Lee H., Kim M., Kim H., Kim H., Lee H., Integration and boost of a read-modify-write module in phase change memory system, IEEE Transactions on Computers, Vol. 68, No. 12, pp. 1772-1784, 2019.
[7] Lee B. C., Ipek E., Mutlu O., Burger D., Architecting phase change memory as a scalable DRAM alternative, Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009.
[8] Qureshi M. K., Srinivasan V., Rivers J. A., Scalable high performance main memory system using phase-change memory technology, Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[9] Wong H.-S. P., Raoux S., Kim S., Liang J., Reifenberg J. P., Rajendran B., Asheghi M., Goodson K. E., Phase change memory, Proceedings of the IEEE, Vol. 98, No. 12, pp. 2201-2227, 2010.
[10] Nair P. J., Chou C., Rajendran B., Qureshi M. K., Reducing read latency of phase change memory via early read and Turbo Read, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015.
[11] Rashidi S., Jalili M., Sarbazi-Azad H., Improving MLC PCM Performance through Relaxed Write and Read for Intermediate Resistance Levels, ACM Trans. Archit. Code Optim., Vol. 15, No. 1, Article 12, Apr. 2018.
[12] Awasthi M., Shevgoor M., Sudan K., Rajendran B., Balasubramonian R., Srinivasan V., Efficient scrub mechanisms for error-prone emerging memories, IEEE International Symposium on High-Performance Computer Architecture (HPCA), New Orleans, LA, 2012.

Author

Moonsoo Kim

Moonsoo Kim received B.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, Korea, in 2014 and 2020, respectively. In 2020, he joined the Inter-University Semiconductor Research Center at Seoul National University, Seoul, Korea, as a post-doctoral researcher. His research interests include SoC design for video/image applications and low-power, reliable design of the memory hierarchy.

Joohan Yi

Joohan Yi received the B.S. degree in electrical engineering from Korea University, Seoul, Korea, in 2018. He is currently working toward integrated M.S. and Ph.D. degrees in electrical and computer engineering at Seoul National University, Seoul, Korea. His research interests include the memory hierarchy, deep neural network processors, and image processing for robot systems.

Hyun Kim

Hyun Kim received B.S., M.S., and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011, and 2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a Research Professor. In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor. His research interests are in the areas of algorithms, computer architecture, and SoC design for low-complexity multimedia applications.

Hyuk-Jae Lee

Hyuk-Jae Lee received B.S. and M.S. degrees in Electronics Engineering from Seoul National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Purdue University at West Lafayette, Indiana, in 1996. From 1998 to 2001, he worked at the Server and Workstation Chipset Division of Intel Corporation in Hillsboro, Oregon as a senior component design engineer. From 1996 to 1998, he was on the faculty of the Department of Computer Science of Louisiana Tech University at Ruston, Louisiana. In 2001, he joined the School of Electrical Engineering and Computer Science at Seoul National University, Korea, where he is currently working as a Professor. He is a founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications. His research interests are in the areas of computer architecture and SoC design for multimedia applications.