
  1. (Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea {kimms213, jhyi, hyuk_jae_lee}@capp.snu.ac.kr )
  2. (Department of Electrical and Information Engineering and Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul, Korea hyunkim@seoultech.ac.kr )



Non-volatile memory, Phase-change memory, Read disturbance errors, On-demand scrubbing

1. Introduction

A modern computer system requires a large amount of main memory owing to its multi-core structure and complex applications. In particular, data-intensive applications such as big data and deep learning require large main memory to hold their data [1-3]. As a result, large-capacity main memory with low power consumption and high reliability has become important [4,5], and studies on the use of phase-change memory (PCM) as main memory have been actively conducted [6-8]. The cell size of PCM is smaller than that of DRAM, so a PCM module can be denser, enabling a larger memory capacity [9]. Furthermore, owing to its non-volatile characteristics, PCM is more advantageous than DRAM in terms of power efficiency and data retention time.

Despite these advantages, PCM suffers from low reliability, which must be addressed before PCM can be used as main memory. One of the main causes of reliability issues in PCM is read disturbance errors (RDEs) [10,11]. An RDE is a phenomenon whereby cells that are repeatedly read are damaged by thermal energy; an RDE occurs when the number of reads exceeds a certain threshold. The conventional solution for RDEs is to scrub the cells in a word before the read count reaches the threshold. Memory scrubbing first reads a word, corrects any bit errors with an error-correcting code (ECC), and writes the corrected word back to the same location. Periodically scrubbing a word therefore prevents RDEs in that word.

However, periodic scrubbing requires read counters, which cause significant resource overhead because the number of reads must be counted in order to trigger scrubbing. In this paper, an on-demand memory scrubbing method that does not require read counters is proposed. Under the given RDE model with ECC, the probability distribution of the number of errors that occur with an additional read is derived. Using this distribution, the proposed solution decides whether to scrub based on the current number of errors. Because the proposed solution requires only the number of errors, it needs no read counters, thereby eliminating nearly 1GB of storage overhead for a 64GB PCM. The contributions of this paper are summarized as follows.

· A probabilistic model for RDEs is mathematically derived, and the optimal on-demand scrubbing policy is derived from the proposed model.

· Monte-Carlo (MC) simulation is conducted to verify the probabilistic model.

· The proposed on-demand scrubbing eliminates the more than 1GB of storage needed for read counters in a 64GB PCM, while fixing more than 99.99% of RDEs.

The remainder of this paper is organized as follows. Section 2 introduces the background, and Section 3 presents the proposed on-demand scrubbing method. In Section 4, experimental results are given. Finally, Section 5 concludes the paper.

2. Background

In this section, the background on error models for RDEs and on RDE mitigation schemes is presented.

2.1 Error Models for RDEs

Typically, a counter-based error model is used for RDEs [9]. Under this model, each cell has an RDE threshold, and an RDE occurs when the number of reads reaches that threshold. The RDE threshold values follow a Gaussian distribution, $N\left(m,\sigma ^{2}\right)$. For later discussions, $m$ = 3,000 and various $\sigma$ values are assumed.

The word size for PCM typically ranges from 64B to 256B [7]; in this paper, 128B words are assumed. For ECC, a 176-21 Reed-Solomon code is assumed, so up to 21 of the 176 symbols in a codeword can be corrected. The 1,408 cells in a word (176 symbols of 8 bits) are assumed to have independent RDE threshold values, each following the Gaussian distribution above.
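A minimal sketch of this setup follows (the model is applied at symbol granularity here, matching the 176-symbol binomial used later; m = 3,000 and σ = 10 are the assumed parameter values):

```python
import random

M, SIGMA = 3000, 10      # assumed RDE threshold distribution N(m, sigma^2)
SYMBOLS = 176            # Reed-Solomon symbols per 128B word
ECC_T = 21               # correctable symbols

def sample_word_thresholds(rng):
    """Draw an independent Gaussian RDE threshold for each symbol of one word."""
    return [rng.gauss(M, SIGMA) for _ in range(SYMBOLS)]

def num_errors(thresholds, reads):
    """Under the counter-based model, a symbol errors once the read count
    reaches its threshold."""
    return sum(1 for t in thresholds if reads >= t)

rng = random.Random(0)
word = sample_word_thresholds(rng)
# errors accumulate with the read count; past ECC_T the word is uncorrectable
print(num_errors(word, 2950), num_errors(word, 3000), num_errors(word, 3050))
```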

2.2 RDE Mitigation Schemes

To mitigate RDE occurrences, memory scrubbing is used [12]. In conventional methods, each word has a read counter. When the counter reaches a certain threshold, the method reads the whole word and checks for errors via ECC; any errors found are corrected and the word is rewritten. This read-and-fix process is called memory scrubbing. Conventional counter-based scrubbing can remove RDEs effectively as long as the scrubbing threshold is well chosen. However, as shown in Fig. 1, it requires a read counter per word, which means an extra 2B of storage per word. Given that the word size in PCM is typically between 64B and 256B, the counters take about 1/32 to 1/128 of the total PCM capacity. Moreover, the read counters are updated frequently, and thus DRAM should be used to hold them. Assuming 512GB of PCM capacity and a 128B word size, nearly 8GB of DRAM is used only for read counters, which is a significantly large overhead.
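The conventional counter-based scheme can be sketched as follows (names and the scrub threshold value are illustrative, not from the paper):

```python
# Sketch of conventional counter-based scrubbing (threshold value is assumed).
SCRUB_THRESHOLD = 2900  # reads allowed before a precautionary scrub

class CounterScrubber:
    def __init__(self, num_words):
        self.counters = [0] * num_words  # one 2B counter per word, held in DRAM

    def on_read(self, word_idx, scrub_fn):
        self.counters[word_idx] += 1
        if self.counters[word_idx] >= SCRUB_THRESHOLD:
            scrub_fn(word_idx)           # read, ECC-correct, write back
            self.counters[word_idx] = 0  # rewrite resets the disturbance count

scrubbed = []
s = CounterScrubber(4)
for _ in range(SCRUB_THRESHOLD):
    s.on_read(0, scrubbed.append)
print(scrubbed)  # [0]: word 0 scrubbed once after SCRUB_THRESHOLD reads
```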

Fig. 1. Diagram of PCM and its read counters.

3. On-demand Memory Scrubbing

In this section, an on-demand memory scrubbing method that effectively eliminates read counters is described. First, the probability distribution under the Gaussian counter-based error model is derived, and then, an efficient on-demand scrubbing policy under the given probability distribution is suggested.

3.1 Probability Distribution for the Number of Errors

Let $L$ denote the number of errors, $K$ the number of reads, and $T$ the RDE threshold. The first probability to derive is $e_{k}$, the probability that an error has occurred when the number of reads is $k$ $\left(K=k\right)$. $e_{k}$ is derived as follows:

(1)
$e_{k}=P\left(k\geq T\right)=\Phi \left(\frac{k-m}{\sigma }\right)$,

where $\Phi$ is the standard normal cumulative distribution function. Because an error occurs when the number of reads reaches the RDE threshold, the first equality in (1) holds. The second equality follows from the Gaussian modeling of $T$, so the value can be read directly from a standard normal table.
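Since $e_k$ is the Gaussian CDF of the threshold evaluated at $k$, it can be computed with the error function alone; a sketch assuming m = 3,000 and σ = 10:

```python
from math import erf, sqrt

M, SIGMA = 3000, 10  # assumed model parameters

def e_k(k, m=M, sigma=SIGMA):
    """P(T <= k) for T ~ N(m, sigma^2): the normal CDF at (k - m) / sigma."""
    return 0.5 * (1.0 + erf((k - m) / (sigma * sqrt(2.0))))

print(e_k(3000))  # 0.5: half the cells have a threshold below the mean
```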

When $K=k$, the probability of $L$ being $l$ follows a binomial distribution, $B\left(l;176,e_{k}\right)$. More specifically, each of the 176 symbols is in error with probability $e_{k}$, independently, which gives:

(2)
$P(l|k)=B\left(l;176,e_{k}\right)=\left(\begin{array}{l} 176\\ l \end{array}\right)\cdot {e_{k}}^{l}\cdot \left(1-e_{k}\right)^{176-l}$.
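The pmf in (2) can be evaluated directly with the standard library; a minimal sketch under the assumed parameters m = 3,000 and σ = 10:

```python
from math import comb, erf, sqrt

SYMBOLS = 176
M, SIGMA = 3000, 10          # assumed model parameters

def e_k(k):
    """P(T <= k) for T ~ N(M, SIGMA^2)."""
    return 0.5 * (1.0 + erf((k - M) / (SIGMA * sqrt(2.0))))

def p_l_given_k(l, k):
    """Eq. (2): binomial pmf of l errored symbols out of SYMBOLS at read count k."""
    p = e_k(k)
    return comb(SYMBOLS, l) * p**l * (1.0 - p)**(SYMBOLS - l)

# the pmf must sum to 1 over l for any fixed read count k
print(sum(p_l_given_k(l, 2980) for l in range(SYMBOLS + 1)))
```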

However, we are interested in $P(k|l)$ rather than $P(l|k)$, because the value that can be observed is $l$, not $k$. By Bayes' rule, $P(k|l)$ is derived as follows:

(3)
$P(k|l)=\frac{P(l|k)\cdot P\left(k\right)}{P\left(l\right)}=\frac{P(l|k)}{\sum _{k}P(l|k)}$.

The second equality assumes a uniform prior $P(k)$, which then cancels. Thus, when the observed number of errors is $l$, the hidden number of reads $k$ has the probability distribution derived in (3).
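The posterior in (3) can be computed by normalizing the binomial likelihood over a range of $k$; a sketch assuming m = 3,000, σ = 10, a uniform prior on $k$, and a truncated summation range (the range is an assumption):

```python
from math import comb, erf, sqrt

SYMBOLS = 176
M, SIGMA = 3000, 10          # assumed model parameters

def e_k(k):
    """P(T <= k) for T ~ N(M, SIGMA^2)."""
    return 0.5 * (1.0 + erf((k - M) / (SIGMA * sqrt(2.0))))

def p_l_given_k(l, k):
    p = e_k(k)
    return comb(SYMBOLS, l) * p**l * (1.0 - p)**(SYMBOLS - l)

def p_k_given_l(l, k_range):
    """Eq. (3): normalize the likelihood over k, assuming a uniform prior."""
    lik = {k: p_l_given_k(l, k) for k in k_range}
    total = sum(lik.values())
    return {k: v / total for k, v in lik.items()}

post = p_k_given_l(7, range(2700, 3300))
print(sum(k * p for k, p in post.items()))  # expected hidden read count given l = 7
```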

Lastly, probability $P(L'=l'|L=l)$ is derived, where $L'$ denotes the number of errors after one additional read operation. Thus, $P(l'|l)$ is the probability that the number of errors grows from $l$ to $l'$ due to an additional read. It is derived as follows:

(4)
$P\left(l'|l\right)=\sum _{k=0}^{\infty }B\left(\left(l'-l\right);176-l,e_{k+1}\right)\cdot P(k|l)$.

$B\left(\left(l'-l\right);176-l,e_{k+1}\right)$ is the probability that exactly $\left(l'-l\right)$ of the remaining $176-l$ symbols fail on one more read. Because the number of reads is now $k+1$, $e_{k+1}$ is used as the error probability. Summing this term over $k$, weighted by $P(k|l)$, yields the desired $P\left(l'|l\right)$.

3.2 On-demand Scrubbing Policy

Probability distribution $P\left(l'|l\right)$ was calculated in (4). This distribution is used to derive the on-demand scrubbing policy. Because ECC can correct up to 21 symbols, scrubbing must be done before the number of errors exceeds 21. This means that probability $P\left(L'>21|L=l\right)$ must be small enough when scrubbing is triggered. It can be calculated as follows:

(5)
$P\left(L'>21|l\right)=1-P\left(L'\leq 21|l\right)=1-\sum _{l'=l}^{21}P\left(l'|l\right)$.

Table 1 shows the value of $P\left(L'>21|L=l\right)$ for various values of $\sigma$. When $\sigma =10$ and $L=6$, the probability is 6.5E-7%; that is, when the current number of errors is 6, the probability that one more read pushes the number of errors above 21 is 6.5E-7%. This probability must clearly be very small to prevent an uncorrectable error (UE), so performing scrubbing at a small value of $L$ is always better in terms of removing UEs. However, a small $L$ value causes scrubbing that is too frequent. Table 2 shows the expected number of reads ($K$) with respect to $L$. $K$ increases with $L$, and the increase is relatively large while $L$ is small (from 0 to 6) but becomes minor once $L$ is relatively large. To sum up, it is best to delay scrubbing while the UE probability is small enough, but once $L$ becomes relatively large, delaying scrubbing further gives little benefit because the gain in $K$ is minor. For example, when the probability goal is 99.999%, meaning the UE probability should be less than 0.001%, the selected $L$ values that trigger scrubbing are 7, 10, and 13 for $\sigma$ of 10, 20, and 50, respectively. The average read counts at the selected points are 2984, 2969, and 2920, respectively.
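The trigger selection reduces to picking the largest error count that still meets the probability goal. A sketch using the σ=10 column of Table 1 (values in percent, taken from the paper):

```python
# sigma = 10 column of Table 1: probability of violation, in percent
violation_pct = {6: 6.5e-7, 7: 2.9e-4, 8: 5.0e-3, 9: 0.03,
                 10: 0.08, 11: 0.31, 12: 0.82}
GOAL_PCT = 0.001  # UE probability must stay below 0.001% (99.999% goal)

# scrub at the largest error count whose violation probability meets the goal
trigger = max(l for l, p in violation_pct.items() if p < GOAL_PCT)
print(trigger)  # 7, matching the selected L value for sigma = 10
```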

The proposed on-demand scrubbing provides a significant hardware overhead reduction because it does not require read counters. As mentioned above, the read counter for a single word takes up 2B. Assuming a word size of 128B, 1/64 of the total PCM capacity must be allocated to read counters; for example, with a PCM capacity of 64GB, a total of 1GB of storage is needed for the counters. Using on-demand scrubbing therefore reduces storage overhead by about 1/64 of the total capacity.
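The overhead arithmetic can be checked directly:

```python
WORD_B, COUNTER_B = 128, 2   # word size and per-word counter size in bytes
GiB = 1 << 30

pcm_capacity = 64 * GiB
counter_storage = pcm_capacity // WORD_B * COUNTER_B
# 64GB of PCM at 128B/word needs 2B * 2^29 words = 1GB of counters (1/64)
print(counter_storage // GiB, counter_storage / pcm_capacity)  # 1 0.015625
```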

Table 1. The probability of violation with respect to L.

  L     σ=10       σ=20       σ=50
  0-6   <6.5E-7%   <4.8E-7%   <2.4E-8%
  7     2.9E-4%    6.1E-6%    1.6E-7%
  8     5.0E-3%    2.0E-5%    1.1E-6%
  9     0.03%      8.4E-5%    8.7E-6%
  10    0.08%      2.8E-4%    2.2E-5%
  11    0.31%      3.4E-3%    9.1E-5%
  12    0.82%      0.02%      3.3E-4%

Table 2. The expected value of K with respect to L.

  L     σ=10        σ=20        σ=50
  0-6   2956-2982   2913-2963   2783-2908
  7     2983        2965        2912
  8     2983        2967        2915
  9     2984        2968        2917
  10    2984        2969        2918
  11    2984        2970        2919
  12    2985        2970        2920

4. Simulation Results

In this section, MC simulation results for the proposed on-demand memory scrubbing are presented. For the MC simulations, the RDE threshold value $T$ for each bit in a word was randomly sampled from the Gaussian distribution $N\left(3000, \sigma ^{2}\right)$. Given the generated $T$ values in a word, the number of errors ($L$) was tracked as the read count ($K$) increased.
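One trial of this MC procedure can be sketched as follows (per-bit thresholds with 8 cells per symbol; σ = 10 and the 4,000-read cap are assumptions):

```python
import random

CELLS, SYMBOLS, ECC_T = 1408, 176, 21
M, SIGMA = 3000, 10                    # assumed model parameters
rng = random.Random(42)

def run_word():
    """One MC trial: track symbol errors for a word as the read count grows,
    and return the error count seen just before the word passes the ECC limit."""
    cell_t = [rng.gauss(M, SIGMA) for _ in range(CELLS)]
    # a symbol is in error once its first (lowest-threshold) cell fails
    symbol_t = [min(cell_t[8 * s:8 * s + 8]) for s in range(SYMBOLS)]
    prev_l = 0
    for k in range(1, 4000):
        l = sum(1 for t in symbol_t if k >= t)
        if l > ECC_T:
            return prev_l              # this read is a "violated read"
        prev_l = l
    return prev_l

print(run_word())  # the L value from which the violation occurred
```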

Table 3 shows the MC simulation results for various standard deviation values; the number of MC trials was 1,000,000. The first column of Table 3 shows the number of errors in the current state. If, from that state, one more read causes a violation (i.e., the number of errors exceeds 21), the case is counted as a violated read. The second, fifth, and eighth columns in Table 3 show the number of violated reads for standard deviations of 10, 20, and 50, respectively. For example, when the standard deviation was 10, the violated read value was 4 for an $L$ of 7, meaning that in those four cases, the error count jumped directly from 7 to a value larger than 21. The columns labeled Probability of VR show the ratio of violated reads to the total number of trials. The Average Read Threshold columns give the average read count among the violations; for example, when the standard deviation was 10, the average read threshold for an $L$ of 7 was 2,973, i.e., the average read count over those four violations was 2,973.

Two observations can be drawn from the MC simulation results in Table 3. First, the distribution of violated reads changed with the standard deviation: as the standard deviation of the underlying Gaussian model increased, violations occurred later. In other words, when the standard deviation was relatively large, more errors could be tolerated before scrubbing. The second observation is that the average read threshold remained almost the same as $L$ increased. This means that even if more errors are accumulated before scrubbing in order to reduce the scrubbing frequency, the actual reduction in frequency is negligible. From this observation, the optimal on-demand scrubbing point for $L$ is the point where a violation first occurs. Therefore, the optimal on-demand scrubbing points for $L$ are 7, 10, and 13 for the three standard deviation values, respectively.

It should be noted that the MC simulation results in Table 3 verify the probability distribution described in Section 3.1. The probabilities of violation derived from the probability distributions (Table 1) are similar to the measured values in Table 3. Likewise, for the expected read count drawn from the probability distribution (Table 2), the results confirm that the value does not increase significantly once $L$ is greater than 6.

In addition to RDEs, PCM has other reliability issues, such as write disturbance errors (WDEs). For example, if the current number of errors is 8, four may be RDEs and four may be WDEs. Under this circumstance, consider a case where the proposed on-demand scrubbing policy initiates scrubbing. The policy assumes that all eight errors were caused by RDEs, whereas only four actually were. The probability that an additional read causes an ECC violation is therefore even lower when other error types are present. In summary, the proposed approach, which considers only RDEs, is conservative, and other errors do not compromise the reliability of the proposed on-demand scrubbing.

Table 3. MC simulation results showing violated read and average read threshold values (VR: number of violated reads; %VR: probability of VR in percent; ART: average read threshold).

        σ=10                      σ=20                      σ=50
  L     VR      %VR      ART      VR      %VR      ART      VR      %VR      ART
  0-6   0       0        0        0       0        0        0       0        0
  7     4       0.0004   2973     0       0        0        0       0        0
  8     37      0.0037   2973     0       0        0        0       0        0
  9     197     0.0197   2973     0       0        0        0       0        0
  10    706     0.0706   2973     3       0.0003   2952     0       0        0
  11    2462    0.2462   2973     25      0.0025   2949     0       0        0
  12    7604    0.7604   2973     160     0.016    2948     0       0        0
  13    18692   1.8692   2973     866     0.0866   2948     3       0.0003   2881
  14    40413   4.0413   2973     3669    0.3669   2949     47      0.0047   2876
  15    73756   7.3756   2973     12916   1.2916   2949     432     0.0432   2875
  16    115931  11.5931  2973     38812   3.8812   2949     2880    0.288    2876
  17    157267  15.7267  2973     95337   9.5337   2949     17173   1.7173   2876
  18    186598  18.6598  2974     187397  18.7397  2949     78581   7.8581   2876
  19    198617  19.8617  2974     293767  29.3767  2949     267358  26.7358  2877
  20    197716  19.7716  2974     367048  36.7048  2950     633526  63.3526  2877

5. Conclusion

To efficiently address RDEs, this paper proposes an on-demand scrubbing policy that does not require read counters. Assuming a PCM capacity of 64GB, read counters take up nearly 1GB of storage, which is a significant resource overhead. The proposed method removes this overhead while fixing more than 99.99% of read disturbance errors. From a mathematically derived probability distribution model of RDE occurrence, the optimal on-demand scrubbing policy is derived for each standard deviation of the RDE threshold values. MC simulation results also verified the derived probability distribution model. It should be noted that the probability of preventing violations can be increased by scrubbing more often; there is thus a trade-off between reliability and performance, and the user can adaptively select the configuration according to the application.

ACKNOWLEDGMENTS

This paper was supported in part by the Technology Innovation Program (10080613, DRAM/PRAM heterogeneous memory architecture and controller IC design technology research and development) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A6A1A03032119.

REFERENCES

[1] Kim B., et al., PCM: Precision-Controlled Memory System for Energy Efficient Deep Neural Network Training, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Apr. 2020.
[2] Nguyen D. T., Hung N. H., Kim H., Lee H.-J., An Approximate Memory Architecture for Energy Saving in Deep Learning Applications, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 67, No. 5, pp. 1588-1601, May 2020.
[3] Lee C., Lee H., Effective Parallelization of a High-Order Graph Matching Algorithm for GPU Execution, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 2, pp. 560-571, Feb. 2019.
[4] Kim M., Choi J., Kim H., Lee H., An Effective DRAM Address Remapping for Mitigating Rowhammer Errors, IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1428-1441, Oct. 2019.
[5] Kim M., Chang I., Lee H., Segmented Tag Cache: A Novel Cache Organization for Reducing Dynamic Read Energy, IEEE Transactions on Computers, Vol. 68, No. 10, pp. 1546-1552, 2019.
[6] Lee H., Kim M., Kim H., Kim H., Lee H., Integration and boost of a read-modify-write module in phase change memory system, IEEE Transactions on Computers, Vol. 68, No. 12, pp. 1772-1784, 2019.
[7] Lee B. C., Ipek E., Mutlu O., Burger D., Architecting phase change memory as a scalable DRAM alternative, Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009.
[8] Qureshi M. K., Srinivasan V., Rivers J. A., Scalable high performance main memory system using phase-change memory technology, Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[9] Wong H.-S. P., Raoux S., Kim S., Liang J., Reifenberg J. P., Rajendran B., Asheghi M., Goodson K. E., Phase change memory, Proceedings of the IEEE, Vol. 98, No. 12, pp. 2201-2227, 2010.
[10] Nair P. J., Chou C., Rajendran B., Qureshi M. K., Reducing read latency of phase change memory via early read and Turbo Read, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015.
[11] Rashidi S., Jalili M., Sarbazi-Azad H., Improving MLC PCM Performance through Relaxed Write and Read for Intermediate Resistance Levels, ACM Trans. Archit. Code Optim., Vol. 15, No. 1, Article 12, Apr. 2018.
[12] Awasthi M., Shevgoor M., Sudan K., Rajendran B., Balasubramonian R., Srinivasan V., Efficient scrub mechanisms for error-prone emerging memories, IEEE International Symposium on High-Performance Computer Architecture (HPCA), New Orleans, LA, 2012.

Author

Moonsoo Kim

Moonsoo Kim received B.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, Korea, in 2014 and 2020, respectively. In 2020, he joined the Inter-University Semiconductor Research Center at Seoul National University, Seoul, Korea, as a post-doctoral researcher. His research interests include SoC design for video/image applications and low-power, reliable design of the memory hierarchy.

Joohan Yi

Joohan Yi received the B.S. degree in electrical engineering from Korea University, Seoul, Korea, in 2018. He is currently working toward integrated M.S. and Ph.D. degrees in electrical and computer engineering at Seoul National University, Seoul, Korea. His research interests include the memory hierarchy, deep neural network processors, and image processing for robot systems.

Hyun Kim

Hyun Kim received B.S., M.S., and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011, and 2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a Research Professor. In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor. His research interests are in the areas of algorithms, computer architecture, and SoC design for low-complexity multimedia applications.

Hyuk-Jae Lee

Hyuk-Jae Lee received B.S. and M.S. degrees in Electronics Engineering from Seoul National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Purdue University at West Lafayette, Indiana, in 1996. From 1998 to 2001, he worked at the Server and Workstation Chipset Division of Intel Corporation in Hillsboro, Oregon as a senior component design engineer. From 1996 to 1998, he was on the faculty of the Department of Computer Science of Louisiana Tech University at Ruston, Louisiana. In 2001, he joined the School of Electrical Engineering and Computer Science at Seoul National University, Korea, where he is currently working as a Professor. He is a founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications. His research interests are in the areas of computer architecture and SoC design for multimedia applications.