Jongho Lee and Hyun Kim
(Department of Electrical and Information Engineering, Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul, Korea, {jhlees, hyunkim}@seoultech.ac.kr)
                        
 
            
            
            Copyright © The Institute of Electronics and Information Engineers(IEIE)
            
            
            
            
            
               
                  
Keywords
               
                Computer vision,  Image classification,  Deep learning,  Discrete cosine transform (DCT),  Vision transformer
             
            
          
         
            
                  1. Introduction
               Recently, owing to developments in deep learning (DL), there have been remarkable
                  performance improvements in the field of computer vision [1-4, 26-29]. Until now, most DL-based computer vision studies have been developed based mainly
                  on model architectures [5,6] and computational methods, such as convolution and self-attention [7,8]. In the 2020s, self-attention-based vision transformers have tended to replace convolutional
                  neural networks (CNNs) [1,3,5,9]. The transformer model, which has been actively studied in the field of natural language
                  processing (NLP), allows one image patch to act as a word in a sentence through patch
                  embedding. This enables self-attention operations in the field of computer vision.
                  
               
However, unlike a word, which carries dense meaning within a sentence, an image is merely a light signal; it therefore contains a large amount of redundant information relative to the importance of words in sentences [4]. Thus, if important information is extracted from the image in advance, it may help improve the accuracy of DL models in computer vision. Frequency-domain transform methods, such as the Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), and Fast Fourier Transform (FFT), have long been used to extract meaningful information from images. These frequency-domain transforms can also serve this purpose in modern DL models [10-12], because computational methods that reconstruct images can achieve good performance with the DCT, DWT, and FFT, as well as with convolution and self-attention [13]. Several attempts to use the DCT in DL models have been reported [14-16]. However, these studies were aimed only at reducing the communication bandwidth and computational costs of CNN or NLP models.
               
This paper proposes a method to improve the accuracy of the vision transformer model using the DCT. Specifically, the input image is transformed by a 2D-DCT [10] in units of an N${\times}$N block size before entering the vision transformer, allowing the DL model to utilize inputs with both spatial and frequency information. The proposed method improves the top-1 accuracy of the vision transformer by approximately 3-5% on the Cifar-10 [17], Cifar-100 [17], and Tiny-ImageNet [18] datasets, while performing the 2D-DCT only once, immediately before patch embedding. In addition, because the proposed method can improve the performance of various vision transformer models [1], including the tiny and small sizes, it has high compatibility and scalability across model sizes and datasets.
               
             
            
                  2. Background
               
                     2.1 Patch Embedding
Before the emergence of the vision transformer [1] by Dosovitskiy et al., CNN models [6,19] were used widely in computer vision. Since then, vision transformer models based solely on self-attention operations, without any convolutional structure, have become dominant [1,3,5,9]. The concept of self-attention in computer vision is similar to that in NLP; the patch embedding process [1], shown in Fig. 1, lets an image patch in a vision transformer play the role that a word plays in a sentence. The concept is simple: the image is cut into non-overlapping 16${\times}$16 patches, each patch is linearly projected, and a class token is added. Through this idea, transformers achieve state-of-the-art performance not only in NLP but also in computer vision.
                  
On the other hand, despite the contribution of patch embedding, most studies on computer vision tasks have focused on improving model architectures and computational methods, such as convolution, multi-layer perceptrons, and self-attention [5-8, 30]. Accordingly, these studies cannot overcome the limitation of using only the spatial information of the image. As shown in Fig. 1, when patch embedding is performed, the patches cut into 16${\times}$16 pixels are converted to a 196${\times}$C-dimensional matrix through a 2D-convolution operation. Owing to the nature of the CNN, the weights of the filters applied to each patch are shared. Therefore, presenting each patch in a uniform format may be more helpful than using the raw pixels of the original image.
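For reference, the patch embedding described above can be written as a single strided convolution. The following is a minimal PyTorch sketch; the class name PatchEmbed and the embedding dimension are illustrative and not the exact Timm implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal patch embedding: a non-overlapping 16 x 16 Conv2d acts as the linear projection."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2              # 196 for 224 / 16
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                             # x: (B, 3, 224, 224)
        x = self.proj(x)                                              # (B, C, 14, 14)
        return x.flatten(2).transpose(1, 2)                           # (B, 196, C)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 192])
```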
                  
                  
Fig. 1. Diagram showing the patch embedding process for a 224${\times}$224 sample image from the ImageNet-1k dataset. The RGB image is cut into 196 (= 14${\times}$14) patches of size 16${\times}$16, which are converted into a 196${\times}$C matrix through a non-overlapping 2D convolution with a 16${\times}$16 filter.
                
               
                     2.2 2D-Discrete Cosine Transform
The 2D-DCT is used widely in signal processing and in various other fields, including image compression standards such as JPEG [20]. One of the advantages of the 2D-DCT is that the image can be viewed from a frequency perspective. When the 2D-DCT (in units of an N${\times}$N block size) is performed on the image, the upper-left side of each block holds low-frequency information, whereas the lower-right side holds high-frequency information. Fig. 2 presents the images obtained by performing the N${\times}$N block 2D-DCT on the 224${\times}$224 RGB image of Fig. 1, so that each block carries frequency information.
                  
An image is a signal that contains spatially redundant information. Therefore, if the 2D-DCT is performed on a block basis, the frequency and spatial information can be expressed in the local and global parts, respectively. The motivation of this study is that, by exploiting these advantages, vision transformer models, which lack an inductive bias, can be trained better when images with the 2D-DCT applied are used as inputs. The 2D-DCT with the N${\times}$N size can be calculated as follows:

$$D(i,j)=\alpha(i)\,\alpha(j)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}I(x,y)\cos\left[\frac{(2x+1)i\pi}{2N}\right]\cos\left[\frac{(2y+1)j\pi}{2N}\right]\qquad(1)$$
where D represents one transformed N${\times}$N-sized block. For example, the original image of 224${\times}$224 resolution may be converted to 3136 4${\times}$4-size blocks, 784 8${\times}$8-size blocks, 196 16${\times}$16-size blocks, or one 224${\times}$224 block. In (1), i and j are the pixel indices of the block converted by the 2D-DCT, and x and y are the pixel indices of the original block I. $\alpha$ is a scale factor that ensures the transform is orthonormal and is given as follows:

$$\alpha(i)=\begin{cases}\sqrt{1/N}, & i=0\\ \sqrt{2/N}, & i\neq 0\end{cases}\qquad(2)$$

with $\alpha(j)$ defined in the same manner.
In this way, the original image can be converted to the frequency domain on a block basis; within each block, the upper-left part concentrates the low-frequency energy, whereas the lower-right part holds the relatively less important high-frequency energy.
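The block-based transform of Eqs. (1) and (2) can be sketched directly in PyTorch as a separable matrix product. The helper names below (dct_matrix, block_dct2d) are illustrative; the torchjpeg library used in Section 4.1 offers similar functionality.

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal N x N DCT-II basis: row i holds alpha(i) * cos((2x+1) i pi / 2N), as in Eqs. (1)-(2)."""
    i = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index (rows)
    x = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # pixel index (columns)
    m = torch.cos((2 * x + 1) * i * math.pi / (2 * n))
    m[0, :] *= math.sqrt(1.0 / n)                           # alpha(0) = sqrt(1/N)
    m[1:, :] *= math.sqrt(2.0 / n)                          # alpha(i != 0) = sqrt(2/N)
    return m

def block_dct2d(img: torch.Tensor, n: int) -> torch.Tensor:
    """Apply the N x N block 2D-DCT to an image tensor of shape (C, H, W); H and W must be multiples of N."""
    c, h, w = img.shape
    d = dct_matrix(n).to(img)
    blocks = img.reshape(c, h // n, n, w // n, n).permute(0, 1, 3, 2, 4)  # (C, H/N, W/N, N, N)
    blocks = d @ blocks @ d.T                                             # separable 2D-DCT per block
    return blocks.permute(0, 1, 3, 2, 4).reshape(c, h, w)                 # put blocks back in place

# Example: 16 x 16 block 2D-DCT of a 224 x 224 RGB image, as used by the proposed method.
out = block_dct2d(torch.rand(3, 224, 224), n=16)
print(out.shape)  # torch.Size([3, 224, 224])
```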
                  
                  
Fig. 2. Results of the 2D-discrete cosine transform (DCT) operation for the 224${\times}$224 input image of Fig. 1 in block units of (a) 4${\times}$4; (b) 8${\times}$8; (c) 16${\times}$16; (d) 224${\times}$224.
 
                
               
                     2.3 Vision Transformer
Fig. 3(a) shows the overall structure of the vision transformer used for image classification. First, the input image, resized to a 224${\times}$224 resolution, is cut into 196 patches of size 16${\times}$16 through the patch embedding process. Subsequently, each patch is supplemented with position information through position embedding, and class information is provided by adding a class token. Because the multi-head attention [21]-based transformer encoder has the same input and output dimensions, it can be stacked into several blocks depending on the size of the model (e.g., tiny and small). Multi-head attention performs self-attention in parallel by splitting the query, key, and value inputs across the heads. Through this method, the vision transformer identifies the relationships between patches and extracts image features. Finally, the image classification task is performed through a multi-layer perceptron head.
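As an illustration only, the pipeline described above (patch embedding, class token, position embedding, transformer encoder, and classification head) can be sketched with standard PyTorch modules. This is a simplified stand-in, not the DeiT implementation; the class name MiniViT and its defaults are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Simplified stand-in for the classification pipeline described above (not the DeiT code)."""
    def __init__(self, embed_dim=192, depth=12, num_heads=3,
                 num_classes=1000, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)      # stacked multi-head attention blocks
        self.head = nn.Linear(embed_dim, num_classes)           # classification head

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, 196, C) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)          # prepend the class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed         # add position embedding
        x = self.encoder(x)                                     # self-attention between patches
        return self.head(x[:, 0])                               # classify from the class token
```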
                  
                
             
            
                  3. Proposed Method 
While the vision transformer model receives an original image in RGB format, this study proposes adding a new 2D-DCT patch embedding method at the input stage of the vision transformer model. In the proposed method, the original image is cut into N${\times}$N blocks, and the 2D-DCT is performed in units of blocks before patch embedding, so that frequency information becomes available locally within each block while the spatial arrangement of the blocks is preserved, as shown in Fig. 3. Location information is more critical in the 2D-DCT block than in the original image because the upper-left part of the 2D-DCT block holds the DC information representing the average pixel value, whereas the bottom-right part holds the high-frequency AC information. The proposed method improves performance by utilizing this additional information and has the advantage that it can be implemented with a minimal computational increase over the baseline (i.e., the vision transformer) without changing the number of parameters. In addition, it can be applied to various vision transformer models with various structures (i.e., high compatibility and scalability).
               
This paper describes the application of the proposed method through an example with a specific block size. When the 2D-DCT block size is 4${\times}$4, the original image of size 224${\times}$224 is divided into 3136 (= (224${\times}$224) / (4${\times}$4)) blocks, and a 4${\times}$4 2D-DCT is then performed in parallel on each block. The 3136 blocks on which the 4${\times}$4 2D-DCT has been performed are then recombined into an image of size 224${\times}$224. In this case, an additional 1.5M floating-point operations (FLOPs) are required for the 2D-DCT, compared to the case where the vision transformer's inference is performed directly on the RGB image with a resolution of 224${\times}$224. However, this is an insignificant increase considering that the computational amounts of DeiT-Tiny and DeiT-Small are 1.3G FLOPs and 4.6G FLOPs [22], respectively. Even if the 2D-DCT block size increases to 8${\times}$8, only approximately 2.7M FLOPs are required (i.e., when performing the same process with the 784 8${\times}$8 blocks). In addition, because there is no dependency between blocks, fast parallel operation is possible through ``CUDA'' [23]. The subsequent processing follows the operation of the existing vision transformer described in Section 2.3. In other words, the proposed method is a straightforward but effective method that performs the N${\times}$N block DCT on the input image without changing the structure of the existing vision transformer model and then feeds the DCT blocks to patch embedding.
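Assuming the block_dct2d helper sketched in Section 2.2, the proposed pre-processing could be attached in front of an unchanged backbone roughly as follows. The wrapper name DCTInput is hypothetical; the backbone would be, for example, a DeiT-Tiny or DeiT-Small model from the timm library.

```python
import torch
import torch.nn as nn

class DCTInput(nn.Module):
    """Hypothetical wrapper: N x N block 2D-DCT on the input, then the unchanged vision transformer."""
    def __init__(self, backbone: nn.Module, block_size: int = 16):
        super().__init__()
        self.backbone = backbone          # e.g., a DeiT-Tiny/Small model built with timm
        self.block_size = block_size      # N of the N x N 2D-DCT

    def forward(self, x):                 # x: (B, 3, 224, 224)
        # The transform adds no learnable parameters, only a small number of FLOPs.
        x = torch.stack([block_dct2d(img, self.block_size) for img in x])
        return self.backbone(x)
```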
               
               
Fig. 3. Overall structure of the vision transformer for image classification: (a) the baseline model; (b) the proposed method, in which the N${\times}$N block 2D-DCT is applied to the input image before patch embedding.
 
             
            
                  4. Experimental Results
               
                     4.1 Experiment Settings
A vision transformer was trained and validated on four Tesla V100 GPUs using the Cifar-10 [17], Cifar-100 [17], and Tiny-ImageNet [18] datasets. The Cifar-10 and Cifar-100 datasets have 50,000 training images and 10,000 validation images with 10 and 100 classes, respectively. The Tiny-ImageNet dataset contains 100,000 training images and approximately 10,000 validation images for 200 classes. Most of the training strategy follows DeiT [3], and the detailed settings are as follows. AdamW [31] was used as the optimizer, with the momentum set to 0.9 and the weight decay to 0.05. All models were trained for 300 epochs using a batch size of 1,024 and a learning rate of 0.0005. All source code is based on the PyTorch image models (timm) library [24], and the torchjpeg library [25] is used for the 2D-DCT operations. It should be noted that the DCT block size in all result tables is marked as ``-'' for vanilla vision transformers that do not use DCT patch embedding.
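A hedged sketch of the reported optimization settings (optimizer of [31], learning rate 0.0005, weight decay 0.05, 300 epochs, batch size 1,024) is given below; the augmentation and scheduling details of the full DeiT recipe are omitted, and the model names reuse the illustrative sketches from earlier sections rather than the actual timm models.

```python
import torch

# Assumes the illustrative DCTInput / MiniViT sketches from Sections 2.3 and 3;
# in the paper the backbones are DeiT-Tiny/Small built from the timm library.
model = DCTInput(MiniViT(), block_size=16)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=5e-4,             # learning rate 0.0005
                              betas=(0.9, 0.999),  # momentum 0.9
                              weight_decay=0.05)   # weight decay 0.05
# Training then runs for 300 epochs with a batch size of 1,024 (loop not shown).
```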
                  
                
               
                     4.2 Accuracy Evaluation
Table 1 lists the results of applying the proposed method to the vision transformer with the tiny and small models on the Cifar-10 dataset. As described above, the model was trained on images discrete-cosine-transformed in units of N${\times}$N blocks before patch embedding. When a 4${\times}$4 block size 2D-DCT was adopted on the Cifar-10 dataset, the top-1 accuracy increased by 4.09% and 0.36% for the tiny and small models, respectively. When a 16${\times}$16 block size 2D-DCT was adopted, the top-1 accuracy improved by 4.57% and 3.7% for the tiny and small models, respectively. When the block size was larger than 16${\times}$16, the performance was inferior to the others because detailed spatial information was lost. Therefore, experiments with block sizes larger than 32${\times}$32 were not performed. In particular, when the 2D-DCT was performed with the full image size of 224${\times}$224, all spatial features of the image were lost because of the characteristics of the 2D-DCT, as shown in Fig. 2(d).
                  
Table 2 shows the same experiment for the Cifar-100 dataset with the tiny and small models. When the 4${\times}$4 block size 2D-DCT was adopted, the top-1 accuracy increased
                     When the 4${\times}$4 block size 2D-DCT was adopted, the top-1 accuracy increased
                     by 3.86% and 3.59% for the tiny and small models, respectively. When the 16${\times}$16
                     block size 2D-DCT was adopted, the top-1 accuracy increased by 5.36% and 8.92% for
                     the tiny and small models, respectively.
                  
Experiments were also conducted on the Tiny-ImageNet dataset, a relatively large dataset, as shown in Table 3. When the 4${\times}$4 block size 2D-DCT was adopted, the top-1 accuracy increased by 2.92% and 5.37% for the tiny and small models, respectively. When the 16${\times}$16 block size 2D-DCT was adopted, it increased by 3.66% and 5.49% for the tiny and small models, respectively.
                  
As a result, the 16${\times}$16 block size 2D-DCT patch embedding yielded the largest performance increase in all cases. This result is attributed to the patches being cut into 16${\times}$16 pixels during the patch embedding process in DeiT. The increase in computational cost and the decrease in speed were negligible. In addition, as the proposed method can be applied directly to most vision transformer models that use patch embedding, its compatibility is excellent, making it an easy and general way to improve performance.
                  
                  
Table 1. Accuracy of the Proposed Method on the CIFAR-10.

| Model | DCT block size | Top-1 Acc. (%) | Top-5 Acc. (%) |
| DeiT-Tiny | - | 79.92 | 98.8 |
| DeiT-Tiny | 2 | 84.42 | 99.18 |
| DeiT-Tiny | 4 | 84.01 | 99.2 |
| DeiT-Tiny | 8 | 84.27 | 99.21 |
| DeiT-Tiny | 16 | 84.49 | 99.26 |
| DeiT-Tiny | 32 | 82.7 | 99.21 |
| DeiT-Small | - | 79.73 | 98.76 |
| DeiT-Small | 2 | 75.96 | 98.42 |
| DeiT-Small | 4 | 80.09 | 98.73 |
| DeiT-Small | 8 | 80.67 | 99.0 |
| DeiT-Small | 16 | 83.43 | 99.07 |
| DeiT-Small | 32 | 53.82 | 93.68 |
                        
                     
                   
                  
Table 2. Accuracy of the Proposed Method on the CIFAR-100.

| Model | DCT block size | Top-1 Acc. (%) | Top-5 Acc. (%) |
| DeiT-Tiny | - | 66.17 | 89.79 |
| DeiT-Tiny | 2 | 68.95 | 91.36 |
| DeiT-Tiny | 4 | 70.03 | 91.55 |
| DeiT-Tiny | 8 | 69.64 | 91.34 |
| DeiT-Tiny | 16 | 71.53 | 92.07 |
| DeiT-Tiny | 32 | 67.15 | 90.35 |
| DeiT-Small | - | 61.13 | 86.2 |
| DeiT-Small | 2 | 61.19 | 85.34 |
| DeiT-Small | 4 | 64.72 | 88.12 |
| DeiT-Small | 8 | 66.31 | 88.59 |
| DeiT-Small | 16 | 70.05 | 90.75 |
| DeiT-Small | 32 | 38.47 | 38.76 |
                        
                     
                   
                  
Table 3. Accuracy of the Proposed Method on the Tiny-ImageNet.

| Model | DCT block size | Top-1 Acc. (%) | Top-5 Acc. (%) |
| DeiT-Tiny | - | 54.22 | 77.84 |
| DeiT-Tiny | 2 | 57.24 | 79.88 |
| DeiT-Tiny | 4 | 57.14 | 79.94 |
| DeiT-Tiny | 8 | 56.81 | 80.17 |
| DeiT-Tiny | 16 | 57.88 | 80.62 |
| DeiT-Tiny | 32 | 52.45 | 76.75 |
| DeiT-Small | - | 48.78 | 73.7 |
| DeiT-Small | 2 | 53.26 | 76.28 |
| DeiT-Small | 4 | 54.15 | 76.94 |
| DeiT-Small | 8 | 52.75 | 76.88 |
| DeiT-Small | 16 | 54.27 | 77.66 |
| DeiT-Small | 32 | 28.42 | 53.66 |
                        
                     
                   
                
             
            
                  5. Conclusion
Thus far, modern DL models have performed well because they can extract and process the information needed from an image on their own. In this study, however, it was shown that because an image is simply a light signal, a DL model can process it better when a traditionally studied frequency-domain transform is applied to the input. The proposed method shows remarkable performance improvements in all experimental cases, even though the computational cost and latency remain essentially unchanged. The proposed method can be applied directly to other transformer-based models and can be extended to tasks such as object detection, instance segmentation, semantic segmentation, and depth estimation.
             
          
         
            
                  ACKNOWLEDGMENTS
               
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-RS-2022-00156295) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition
                           at scale,” arXiv preprint arXiv:2010.11929, 2020.

 
                     
                        
                        Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convolution and attention
                           for all data sizes,” in Proc. Advances in Neural Information Processing Systems, 2021,
                           vol. 34, pp. 3965-3977.

 
                     
                        
                        H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training
                           data-efficient image transformers & distillation through attention,” in Proceedings
                           of the International Conference on Machine Learning, 2021, pp. 10347-10357.

 
                     
                        
                        K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are
                           scalable vision learners,” arXiv preprint arXiv:2111.06377, 2021.

 
                     
                        
                        Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,”
                           in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
                           pp. 10012-10022.

 
                     
                        
                        Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, “Vision Transformer with Deformable
                           Attention,” arXiv preprint arXiv:2201.00520, 2022.

 
                     
                        
                        K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the
                           IEEE international conference on computer vision, 2017, pp. 2961-2969.

 
                     
                        
                        A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile
                           vision applications,” arXiv preprint arXiv:1704.04861, 2017.

 
                     
                        
                        Z. Liu et al., “Swin Transformer V2: Scaling Up Capacity and Resolution,” arXiv preprint
                           arXiv:2111.09883, 2021.

 
                     
                        
N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90-93, 1974.

 
                     
                        
                        J. Shin and H. Kim, “RL-SPIHT: Reinforcement Learning based Adaptive Selection of
                           Compression Ratio for 1-D SPIHT Algorithm,” IEEE Access, vol. 9, pp. 82485-82496,
                           2021.

 
                     
                        
                        H. Kim, A. No, and H.-J. Lee, “SPIHT Algorithm with Adaptive Selection of Compression
                           Ratio Depending on DWT Coefficients,” IEEE Transactions on Multimedia, vol. 20, no.
                           12, pp. 3200-3211, Dec. 2018.

 
                     
                        
                        Y. Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou, “Global filter networks for image classification,”
                           in Proceedings of the Advances in Neural Information Processing Systems, 2021, vol.
                           34.

 
                     
                        
                        K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, “Learning in the Frequency
                           Domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
                           Recognition (CVPR), Jun. 2020, pp. 1740-1749.

 
                     
                        
                        X. Shen et al., “DCT-Mask: Discrete Cosine Transform Mask Representation for Instance
                           Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
                           Recognition (CVPR), Jun. 2021, pp. 8720-8729.

 
                     
                        
                        C. Scribano, G. Franchini, M. Prato, and M. Bertogna, “DCT-Former: Efficient Self-Attention
                           with Discrete Cosine Transform,” arXiv preprint arXiv:2203.01178, 2022.

 
                     
                        
                        A. Krizhevsky, G. Hinton, and others, “Learning multiple layers of features from tiny
                           images,” 2009.

 
                     
                        
                        Y. Le and X. S. Yang, “Tiny ImageNet Visual Recognition Challenge,” 2015.

 
                     
                        
                        J. Choi, D. Chun, H. Kim, and H.-J. Lee, “Gaussian yolov3: An accurate and fast object
                           detector using localization uncertainty for autonomous driving,” in Proc. IEEE/CVF
                           Int. Conf. Computer Vision, 2019, pp. 502-511.

 
                     
                        
                        G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on
                           Consumer Electronics, vol. 38, no. 1, pp. xviii-xxxiv, 1992.

 
                     
                        
                        A. Vaswani et al., “Attention is all you need,” in Proc. Advances in Neural Information
                           Processing Systems, 2017, vol. 30.

 
                     
                        
                        S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers
                           in vision: A survey,” ACM Computing Surveys (CSUR), 2021.

 
                     
                        
                        NVIDIA, P. Vingelmann, and F. H. P. Fitzek, CUDA, release: 10.2.89. 2020. [Online].
                           Available:

 
                     
                        
                        R. Wightman, PyTorch Image Models. GitHub, 2019. doi: 10.5281/zenodo.4414861.

 
                     
                        
                        M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, “Quantization Guided JPEG Artifact
                           Correction,” 2020.

 
                     
                        
                        K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
                           in Proceedings of the IEEE conference on computer vision and pattern recognition,
                           2016, pp. 770-778.

 
                     
                        
                        M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural
                           networks,” in Proceedings of the International conference on machine learning. PMLR,
                           2019, pp. 6105-6114.

 
                     
                        
                        M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in Proceedings
                           of the International Conference on Machine Learning. PMLR, 2021, pp. 10 096-10 106.

 
                     
                        
                        S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
                           with region proposal networks,” vol. 28, 2015.

 
                     
                        
                        J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional
                           networks,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2017, pp. 764-773

 
                     
                        
                        I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint
                           arXiv:1711.05101, 2017.

 
                   
                
             
            Author
            
            
Jongho Lee received his B.S. degree in Electrical and Information Engineering from Seoul National University of Science and Technology, Seoul, Korea, in 2020. Currently, he is a graduate student at Seoul National University of Science and Technology, Seoul, Korea. In 2020, he was a research student at the Korea Institute of Science and Technology (KIST), Seoul, Korea. In 2022, he was a visiting student at the University of Wisconsin-Madison, Wisconsin, USA. His research interests are deep learning and machine learning algorithms for computer vision tasks.
               
               
               		
            
            
            
Hyun Kim received his B.S., M.S. and Ph.D. degrees in Electrical Engineering and
               Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015,
               respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development
               for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor. In 2018,
               he joined the Department of Electrical and Information Engineering, Seoul National
               University of Science and Technology, Seoul, Korea, where he is currently working
               as an Associate Professor. His research interests are the areas of algorithms, computer
               architecture, memory, and SoC design for low-complexity multimedia applications and
               deep neural networks.