Generated Image Classification Model for Deep Learning-based Inpainting Model
Han-gyul Baek, Dong-shin Lim, Hojun Song, Vani Priyanka Gali, and Sang-hyo Park*
(School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea)
Copyright © 2025 The Institute of Electronics and Information Engineers(IEIE)
Keywords
Deep learning, Generated images, Image classification, Inpainting
1. Introduction
The development of computers and networks has led to a proliferation of media content such as images and videos, making such data increasingly important. In deep learning-based computer vision, which relies on this data, a wide range of studies have been conducted and continue to advance. However, most computer vision problems are studied on real data, such as images and videos captured by humans. In contrast, recent very large-scale artificial intelligence models with enormous numbers of parameters, such as DALL-E 2 [1], Imagen [2], and Stable Diffusion [3], are trained on large-scale data using computing infrastructures capable of massive computation and generate data whose attributes differ from those of real data. Given creative text inputs, these models [1-3] can produce unnatural images unlike anything found in real data.
Although the images are artificially generated, they possess a quality that makes
them appear as if they are real. Due to advancements in generating high-quality images
comparable to real-life images, generative images have been used in various fields
of computer vision. Inpainting, one of many computer vision techniques, can restore damaged regions of images and videos or remove regions that the user masks as unnecessary. However, existing inpainting models have been trained on real data, and their technical achievements and implementations have focused on real data; inpainting has yet to be studied for synthetic images generated by DALL-E 2 [1], Imagen [2], and Stable Diffusion [3]. In this paper, we first demonstrate how existing inpainting models perform on generated images. We then introduce a synthetic image classification framework that proactively identifies inputs on which current inpainting models are vulnerable and evaluates whether an image is suitable for inpainting in this new context.
2. Related Work
In this section, we review prior work on inpainting models, generated-image models, and studies on classifying generated and real images that are relevant to the purpose of this research.
2.1. Image Inpainting Models
In computer vision, inpainting has long been studied and remains an active research area. The field can be broadly divided into two parts: image inpainting and video inpainting. Since this article concerns image classification, we focus on image inpainting rather than video inpainting. Deep neural network-based image inpainting learns semantic priors and meaningful hidden representations in an end-to-end manner, and the network uses a convolutional filter to fill the missing content with a fixed value. This approach relies on the initial hole values, which can lead to problems such as a lack of texture in the holes, severe color contrast, and artificial edges. Another limitation is that such methods focus on rectangular holes in the center of the image, which leads to overfitting to that hole configuration.
To address these limitations, Liu et al. [4] propose partial convolution with an automatic mask-update step in place of standard convolution: the convolution is applied only to valid pixels, and an updated mask is generated automatically when moving from the current layer to the next as part of the partial convolution forward pass. The goal is a predictive model that works on irregular hole patterns and blends with the rest of the image without additional post-processing. However, Liu et al. [4] note two limitations. First, validity is decided heuristically for every spatial location, so the mask is set regardless of how many pixels are covered by the filter. Second, as the network progresses through its layers, fewer and fewer pixels remain invalid and all mask values eventually become 1, which increases the probability of assigning a valid mask to an unfilled pixel.
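For illustration, the following is a minimal PyTorch sketch of a mask-updating partial convolution in the spirit of Liu et al. [4]; the single-channel mask handling and the window-area normalization are simplifications of the published formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Minimal partial convolution with automatic mask update (sketch after Liu et al. [4])."""

    def forward(self, x, mask):
        # Count how many valid mask entries fall under each filter window.
        with torch.no_grad():
            ones = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(mask, ones, stride=self.stride,
                             padding=self.padding, dilation=self.dilation)
        # Convolve only the valid (unmasked) pixels.
        out = F.conv2d(x * mask, self.weight, bias=None, stride=self.stride,
                       padding=self.padding, dilation=self.dilation, groups=self.groups)
        # Re-normalize by the number of valid pixels in each window.
        keep = (valid > 0).float()
        scale = self.kernel_size[0] * self.kernel_size[1] / valid.clamp(min=1e-8)
        out = out * scale * keep
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1) * keep
        # Mask update: a location becomes valid once any valid pixel fell in its window.
        return out, keep

# Example: apply one layer to a 3-channel image with a single-channel binary mask.
conv = PartialConv2d(3, 64, kernel_size=3, padding=1)
out, new_mask = conv(torch.randn(1, 3, 256, 256), torch.ones(1, 1, 256, 256))
```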
Addressing a limitation of Liu et al. [4], Yu et al. [5] observe that partial convolution is hard-gated, which can limit flexibility because all channels in each layer share the same mask; they characterize partial convolution [4] as unlearnable, single-channel hard gating of features. Instead of the hard gating mask of partial convolution [4], gated convolution [5] automatically learns a soft mask from the data, paying more attention to features that are more likely to be valid. This channel-wise, spatially dynamic feature-selection mechanism is realized as a new convolution, the gated convolution. In addition, a spectral-normalized Markovian discriminator (SN-PatchGAN) is devised to naturally fill a large number of holes with arbitrary locations and shapes. SN-PatchGAN builds on the deep Markovian model of Li et al. [6], which uses only the statistics of local patches, enabling both free-form and user-guided inpainting. In the overall architecture, the coarse network and refinement network use an encoder-decoder network instead of the U-Net [7] used in partial convolution [4], because the skip connections of the U-Net [7] cannot propagate detailed color or texture information.
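A corresponding minimal sketch of a gated convolution layer in the spirit of Yu et al. [5] is shown below; the layer sizes and activation choice are illustrative.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Minimal gated convolution (sketch after Yu et al. [5]): a soft, learnable,
    per-channel and per-pixel gate replaces the hard binary mask."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.act = nn.ELU()

    def forward(self, x):
        gating = torch.sigmoid(self.gate(x))      # soft gate in [0, 1], learned from data
        return self.act(self.feature(x)) * gating  # gated features

# Example: one gated layer applied to an image concatenated with its binary mask channel.
x = torch.randn(1, 4, 256, 256)                   # RGB image + mask channel
layer = GatedConv2d(4, 32, kernel_size=3, padding=1)
y = layer(x)                                      # shape: (1, 32, 256, 256)
```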
Recent research continues to enhance image inpainting. These methods [27-29] focus on improving the quality of image inpainting and, as a result, produce more natural-looking images.
Chen et al. [27] propose an image inpainting method, the multi-scale feature module with attention module (MFMAM), which combines a multi-scale feature module with an attention module. Existing deep learning-based image inpainting algorithms suffer from information loss while extracting deep-level features. The proposed network has a two-level structure that combines a deep learning encoder-decoder network with a generative adversarial network (GAN). The multi-scale fusion module using dilated convolution (MFDC) reduces information loss during feature extraction, and the joint attention module enhances the restoration of the semantic structure of the network. The approach of Chen et al. [27] uses a conditional semantic attention (CSA) layer and a pixel-wise class activation mapping (CAM) layer to restore deep semantic features of an image and performs pixel-level restoration on shallow features, which effectively restores information in corrupted areas. Using models trained on the Outdoor Scene [30] and CelebA-HQ [31] datasets, they performed an object removal task in real-world scenes and showed that the proposed method predicts the structure of the corrupted image well and produces more realistic and clearer restoration results. The MFMAM network improves image inpainting performance, demonstrating the effectiveness of the attention module across architectures, and Chen et al. [27] show that adding style loss and perceptual loss is particularly effective.
Corneanu et al. [28] propose LatentPaint, a new information propagation mechanism that conditions existing diffusion models to perform image inpainting. Existing inpainting methods mostly propagate information based on patch similarity or use generative adversarial networks (GANs) to generate images, and they often suffer from high computational cost, complex training, and inconsistent result quality. LatentPaint works by training a neural network to predict the original image from a noisy image; an explicit propagation (EP) module propagates information from the conditional pixels to the pixels to be inferred, and this module is combined with a conventional diffusion model (DM) to perform inpainting. Corneanu et al. [28] evaluated LatentPaint on several visual domains, including the CelebA-HQ [31] dataset, and found that it outperforms existing state-of-the-art techniques, especially in producing high-quality images with fast runtimes. LatentPaint opens up new possibilities in image inpainting by delivering high-quality results through an efficient information propagation mechanism.
Xie et al. [29] propose SmartBrush, a diffusion-based model that fills damaged areas with objects using text and shape guidance. They address the problem that existing models such as DALL-E 2 and Stable Diffusion support only text-based inpainting without shape guidance and therefore tend to modify the background texture around the object. SmartBrush performs multi-modal object inpainting by jointly using text describing the attributes of the object the user wants and a mask defining the shape of the region to be inpainted, focusing on improving the quality and controllability of the inpainted objects. The model is based on a diffusion process that produces high-quality images by gradually removing noise; during this process, it conditions on the input text and shape mask and also predicts the foreground object mask so that background pixels around the inpainted object, and thus the original background, are preserved. In addition, SmartBrush adopts a multi-task training strategy that trains on object inpainting and text-to-image generation simultaneously to improve its ability to handle diverse text descriptions and image content. Training data are taken from large datasets such as LAION-Aesthetics v2 [32], and the model's performance was evaluated through user studies on Amazon Mechanical Turk. As a result, SmartBrush produces higher-quality results than traditional inpainting models and more accurately reflects the shapes and attributes desired by users.
2.2. Text-to-image Models
Hyper-scale artificial intelligence technologies are gaining worldwide attention, and text-to-image models such as DALL-E 2 [1], Imagen [2], and Stable Diffusion [3], which generate images from input text, are under continuous improvement and research. Contrastive models such as contrastive language-image pretraining (CLIP) [8] are known to learn rich representations of images. Ramesh et al. [1] propose a two-stage model that leverages these representations for image generation. The two-stage model consists of a prior and a decoder: the prior generates a CLIP image embedding from the text, and the decoder takes the image embedding from the prior as a condition and generates an image with a diffusion model [9].
In addition, using the diffusion model [9] as a decoder trained to invert the CLIP image encoder allows the model to output semantically similar images, much like generative adversarial network inversion (GAN inversion) [10,11], and enables image interpolation through image embeddings. Ramesh et al. therefore propose the unCLIP model [1]: given a text, the model finds the corresponding text embedding, converts it into a CLIP [8] image embedding with the prior model, and then decodes that image embedding with a diffusion model [9] to generate a range of images that match it.
Another text-to-image model, Imagen [2], focuses on improving the correspondence between text and images by using a large transformer language model, T5-XXL [12], a pre-trained language model with frozen weights that is known for its ability to understand and generate human language. To map text to a sequence of embeddings, Imagen combines this large transformer language model encoder with a cascade of diffusion models [9,13-15]: a base diffusion model generates 64×64 images, and super-resolution diffusion models upscale them to 256×256 and 1024×1024. Saharia et al. [2] emphasize that, unlike previous work that uses only image-text data for model training, using this large transformer language model [12] pre-trained on text alone is very effective for image-text matching, and they address the problems caused by classifier-free guidance with dynamic thresholding, a new diffusion sampling technique, to improve image quality.
Stable Diffusion [3] models represent a significant advancement in text-to-image generation. Unlike its
predecessors, Stable Diffusion [3] utilizes diffusion methodologies and latent space representation to generate lifelike
images from textual and visual prompts while substantially reducing computational
requirements. The training process involves encoding text inputs into latent vectors
using pre-trained language models like CLIP. This encoding facilitates the generation
of compressed representations of images, thus alleviating the challenges associated
with high-resolution image generation. The model comprises three main components:
a variational autoencoder (VAE) for transforming images into latent representations,
a U-Net [7] for denoising noisy latents, and a text-encoder for converting input prompts into
embeddings that guide the denoising process. Stable Diffusion [3] enables various creative applications, including text-to-image generation, image
upscaling, and inpainting, while also democratizing high-resolution image synthesis
by reducing the cost of training and inference.
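As an illustration of these components, the sketch below loads a publicly released Stable Diffusion checkpoint with the Hugging Face diffusers library and points to the VAE, U-Net, and text encoder it exposes; the checkpoint name and API calls follow common diffusers usage and are assumptions rather than the exact setup of [3].

```python
# Minimal sketch using the Hugging Face diffusers library (assumed installed);
# "stabilityai/stable-diffusion-2" is one publicly available Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# The three components described above are exposed as pipeline attributes:
print(type(pipe.vae))           # variational autoencoder (image <-> latent)
print(type(pipe.unet))          # U-Net that denoises noisy latents
print(type(pipe.text_encoder))  # text encoder turning prompts into embeddings

# Text-to-image generation: the prompt is encoded, latents are iteratively
# denoised by the U-Net, and the VAE decodes the final latent into an image.
image = pipe("a photograph of a cat sitting on a wooden bench").images[0]
image.save("generated.png")
```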
2.3. Detecting Generated Images
As the classification of generated versus real images becomes increasingly important, much research is being conducted on it. In one such study, Bird et al. [22] used CIFAR-10 [23] as the real dataset and Stable Diffusion 1.4 [3] to generate a synthetic dataset covering the 10 CIFAR-10 [23] classes, and then trained an image classification model on this data. The classifier was a simple CNN; the authors experimented with different hyperparameters, such as the number of layers, filters, and neurons, and selected the best model by measuring precision, recall, and F1-score. The best model achieved an accuracy of 92.98%.
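For context, a minimal PyTorch sketch of such a small CNN binary classifier is shown below; the layer counts and filter sizes are illustrative and not the configuration selected in [22].

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Tiny CNN for real-vs-generated classification of 32x32 images (illustrative)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 1),  # single logit: real (0) vs. generated (1)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(8, 3, 32, 32))          # batch of 32x32 RGB images
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.randint(0, 2, (8,)).float())
```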
Other research on classifying generated and real images combines transfer learning with specific algorithms to improve classification accuracy despite small generated datasets. Mittal et al. [24] use the Real and Fake Face Detection dataset from Yonsei University [25] to train a model that classifies real and generated face images. Because the dataset is small, they train on top of a pre-trained AlexNet [26] and propose a method called improved quantum-inspired evolutionary-based feature selection (IQIEA-FS). IQIEA-FS is a feature selection method built on the basic concepts of the quantum-inspired evolutionary algorithm (QIEA), which leverages ideas from quantum computing to solve optimization problems; IQIEA-FS keeps QIEA's principles but improves feature selection efficiency and accuracy, yielding better feature selection and classification performance. A k-nearest neighbor (KNN) classifier is then used to classify images into real and fake faces. The full IQIEA-FS pipeline thus consists of AlexNet-based feature extraction, feature selection using the IQIEA algorithm, and image classification with the KNN classifier, and it achieves a mean normalized accuracy (MNA) of 58.3%.
3. Limitations of Existing Inpainting Models
Since existing inpainting models are primarily trained on real data, we conducted
a series of experiments to explore and establish the limitations of inpainting models.
As part of our experiments, we applied two distinct inpainting models to both real
and generated data. The real data to which the inpainting was applied is the MS-COCO
dataset [16], and the generated data is from DALL-E 2 [1], Stable Diffusion [3]. The first applied inpainting model is the partial convolution model by Liu et al.
[4] which can be seen in Figs. 1 and 2. The second applied inpainting model is the gated convolution model by Yu et al.
[5] which can be seen in Figs. 3 and 4. Figs. 1-4 show the original image, masked regions, binary mask, and inpainting results for
the masked regions of the real and generated data.
Fig. 1. Comparison after applying Liu et al. [4] partial convolution to real data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 2. Comparison after applying Liu et al. [4] partial convolution to generated
data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 3. Comparison after applying Yu et al. [5] gated convolution to real data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 4. Comparison after applying Yu et al. [5] gated convolution to generated data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
From the results in Figs. 1-4, we can see that the inpainting results on the generated data are noticeably worse than those on the real data. There could be several reasons for this, but it is most likely due to the different nature of real and generated data: since the inpainting models were trained on real data, the generated data can be regarded as unstable input from the perspective of the deep learning model. Because the inpainting models prove fragile on generated data, we propose a binary classification model to identify generated data.
Table 1 summarizes the main characteristics observed after applying each inpainting model [4,5] to the real and generated images. Both inpainting models [4,5] commonly produce visible borders, obvious artifacts, and unnatural results when applied to the generated images.
Table 1. Main characteristics after applying each inpainting model [4,5] to each dataset.
| Models | Real: MS-COCO [16] | Generated: DALLE-2 [1] | Generated: Stable Diffusion [3] |
|---|---|---|---|
| Partial convolution [4] | Borderless, natural, little artifact | Borderline, unnatural, obvious artifacts | Borderline, unnatural, obvious artifacts |
| Gated convolution [5] | Borderless, natural | Borderline, unnatural, obvious artifacts | Borderline, unnatural, obvious artifacts |
4. Proposed Framework
Our trained model framework primarily consists of two processes: data augmentation
and model training, as depicted in Fig. 5. Data augmentation, as referenced in [18], is a technique that effectively reduces training and validation errors. For effective
feature extraction from generated images, we applied random rotation, Gaussian noise,
brightness and contrast adjustments, and color jitter augmentation techniques.
Fig. 5. Overall framework of the proposed binary classification model.
The model training process involves transfer learning and fine-tuning of the ConvNeXt-XL model [17]. Transfer learning uses an existing pre-trained model to learn the target data, which allows good performance on data from a similar domain even with a small amount of data. Training with a small dataset can also avoid relative overfitting, and training only the last layer further avoids overfitting by reducing the number of trainable weights. We adopted ConvNeXt [17], a model designed to modernize a CNN in the spirit of the high-performing Vision Transformer. Liu et al. [17] progressively applied design choices from the Swin Transformer to a ResNet backbone, grouping them into five major categories of changes and keeping each change only if it improved performance before moving on to the next. Because this process achieves high performance with a purely convolutional architecture, we considered it a suitable model for transfer learning.
The ConvNeXt model takes 384 × 384 RGB images as input. We therefore resized our images from 768 × 768 to 384 × 384 during preprocessing to avoid performance degradation or feature distortion from training the model at a size other than the predefined one. To learn the features of the generated data well, we trained the whole model: both the part preceding the fully connected (FC) layer that extracts features from the data and the classification part, while fine-tuning some hyperparameters. We kept the global average pooling (GAP) layer before the FC part as it is, since GAP averages the values of each feature map and feeds them directly to the nodes, which greatly reduces the number of weights and helps avoid overfitting because it has no parameters to optimize. As for the FC layer, since its purpose is binary classification of generated and real data, the activation function is sigmoid, the loss function is binary cross-entropy, and the optimizer is Adam. Only one dense layer with two nodes was added at the end. To effectively learn the generated images, we fine-tuned both the ConvNeXt model [17] and the additional FC layer.
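A minimal PyTorch sketch of this setup is shown below; it assumes the timm library for the pre-trained ConvNeXt-XL backbone (the checkpoint name may differ by timm version) and attaches the two-node dense head, sigmoid activation, binary cross-entropy loss, and Adam optimizer described above.

```python
import torch
import torch.nn as nn
import timm  # assumed available; provides pre-trained ConvNeXt variants

class GeneratedImageClassifier(nn.Module):
    """ConvNeXt-XL backbone (global-average-pooled features) + two-node dense head."""

    def __init__(self, backbone_name="convnext_xlarge.fb_in22k_ft_in1k_384"):
        super().__init__()
        # num_classes=0 keeps the GAP output and drops the original classifier.
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, 2)  # [real, generated]

    def forward(self, x):
        feats = self.backbone(x)                 # (N, num_features) after GAP
        return torch.sigmoid(self.head(feats))   # sigmoid outputs for BCE loss

model = GeneratedImageClassifier()
criterion = nn.BCELoss()                         # binary cross-entropy, as described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is illustrative
```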
5. Experiment and Result
The experiment for classifying natural and generated images was conducted using two NVIDIA RTX 2080 Ti GPUs in parallel. We applied transfer learning to the ConvNeXt-XL model [17], which has shown excellent performance by modernizing a CNN with design choices inspired by transformer architectures. A dense layer for binary classification was added to the end of the ConvNeXt-XL network, and the weights of all layers were trained. The training dataset consisted of natural and generated images in an equal 1:1 ratio, totaling 20,000 images, and the validation dataset was composed at the same ratio with a total of 3,800 images. The image size was set to 384 × 384 to match the input size of ConvNeXt-XL [17]. The training time was 60 minutes per epoch with a batch size of 4. In conclusion, the trained ConvNeXt-XL [17] model achieved a classification accuracy of 99.87% on the test dataset.
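The training step under this setup can be sketched as follows, assuming the model, loss, and optimizer from the previous sketch and a placeholder dataset object; two GPUs are used via nn.DataParallel with a batch size of 4.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# `model`, `criterion`, and `optimizer` as defined in the previous sketch;
# `train_set` is a placeholder Dataset yielding (image, one_hot_label) pairs.
device = torch.device("cuda")
model = nn.DataParallel(model).to(device)        # spread each batch over the two GPUs
loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=4)

model.train()
for images, targets in loader:                   # targets: (N, 2) one-hot [real, generated]
    images, targets = images.to(device), targets.to(device).float()
    optimizer.zero_grad()
    outputs = model(images)                      # sigmoid probabilities, shape (N, 2)
    loss = criterion(outputs, targets)           # binary cross-entropy
    loss.backward()
    optimizer.step()
```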
5.1. Dataset and Augmentation
We used the MS-COCO [16] 2017 dataset to collect real images. For training, 10,000 images were sampled for the training dataset and 1,900 images for the validation dataset, both drawn from the MS-COCO [16] train split; 1,500 test images for evaluation were collected from the MS-COCO [16] validation split. Stable Diffusion [3] has recently been used to generate high-quality images from text descriptions. We generated 768 × 768 images with the Stable Diffusion 2.0 model [3], matching the number and ratio of the natural dataset and excluding its training data, and we used the captions from the MS-COCO [16] annotations as the text descriptions for synthesizing images.
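A minimal sketch of this generation step is shown below; it assumes the diffusers package, a locally downloaded MS-COCO caption annotation file, and placeholder output paths.

```python
import json
import torch
from diffusers import StableDiffusionPipeline

# Load COCO captions (the annotation path is a placeholder).
with open("annotations/captions_train2017.json") as f:
    captions = [ann["caption"] for ann in json.load(f)["annotations"]]

# Stable Diffusion 2.0 checkpoint (public diffusers identifier, assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# Generate one 768x768 image per caption for the synthetic training set.
for i, caption in enumerate(captions[:10000]):
    image = pipe(caption, height=768, width=768).images[0]
    image.save(f"generated/train_{i:05d}.png")
```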
We implemented a data augmentation algorithm for our generated dataset. The pseudo-code in Algorithm 1 shows the augmentation process for generated images: from the 7,000 generated images in the training split, we randomly applied augmentation techniques to produce an additional 3,000 images.
Algorithm 1: Pseudo-code for data augmentation.
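Since the pseudo-code itself is not reproduced here, the following is a minimal sketch of such an augmentation step using torchvision transforms; the parameter ranges and the Gaussian-noise helper are assumptions rather than the exact settings of Algorithm 1.

```python
import random
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.05):
    """Add zero-mean Gaussian noise to a float tensor image in [0, 1] (std is illustrative)."""
    return torch.clamp(img + torch.randn_like(img) * std, 0.0, 1.0)

# Candidate augmentations named in the paper: random rotation, Gaussian noise,
# brightness/contrast adjustment, and color jitter.
candidates = [
    transforms.RandomRotation(degrees=30),
    transforms.Lambda(add_gaussian_noise),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ColorJitter(hue=0.1, saturation=0.3),
]

def augment(img_tensor):
    """Randomly apply one augmentation technique to a (C, H, W) image tensor."""
    return random.choice(candidates)(img_tensor)
```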
5.2. Real and Generated Images: Comparative and Quantitative Analysis
We additionally applied inpainting to each real and generated dataset. This algorithm was chosen for its common use in the field and its ability to provide a baseline for comparison. We then conducted a quantitative evaluation to strengthen the evidence that inpainting performs poorly on the generated images compared with the real images. We applied inpainting to 10 images from each dataset, for a total of 80 inpainted images. In Fig. 6, the real image dataset, MS-COCO [16], shows that the masked objects are well erased, while the generated image datasets, DALLE-2 [1], Imagen [2], and Stable Diffusion [3], show unnatural results where the masked objects are not well erased.
Fig. 6. Examples after applying inpainting to real and generated image datasets.
We performed a quantitative evaluation to provide an objective assessment of the results. The metrics we used are the Fréchet inception distance (FID) and the naturalness image quality evaluator (NIQE). FID measures the quality difference between the original image and the image with objects removed by inpainting, while NIQE quantifies the naturalness of the inpainted image. In Table 2, which shows the quantitative figures, the generated images have higher values on average than the real images, so we can say that the generated images are not inpainted as well as the real images; in particular, the average FID score shows a significant difference. Therefore, we have shown the difference between real and generated images experimentally and confirmed through quantitative evaluation that the generated images are inpainted less successfully.
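As a rough sketch of how such scores can be computed, the snippet below uses torchmetrics for FID and the pyiqa package for NIQE; these package choices are assumptions, as the paper does not specify which implementations were used.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
import pyiqa  # assumed: provides a no-reference 'niqe' metric

def fid_score(real_batch, inpainted_batch):
    """FID between original and inpainted images; inputs are uint8 (N, 3, H, W) tensors."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_batch, real=True)
    fid.update(inpainted_batch, real=False)
    return fid.compute().item()

niqe = pyiqa.create_metric("niqe")  # naturalness score, lower is better

def niqe_score(inpainted_batch):
    """Mean NIQE of a float batch in [0, 1] with shape (N, 3, H, W)."""
    return niqe(inpainted_batch).mean().item()
```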
Table 2. Quantitative assessment of each real and generated image (FID score, NIQE).

| Dataset | MS-COCO [16] (Real) FID ↓ | MS-COCO [16] (Real) NIQE ↓ | DALLE-2 [1] (Generated) FID ↓ | DALLE-2 [1] (Generated) NIQE ↓ | Imagen [2] (Generated) FID ↓ | Imagen [2] (Generated) NIQE ↓ | Stable Diffusion [3] (Generated) FID ↓ | Stable Diffusion [3] (Generated) NIQE ↓ |
|---|---|---|---|---|---|---|---|---|
| Image1 | 27.79 | 4.5903 | 291.04 | 4.3293 | 31.18 | 4.1303 | 27.78 | 2.9703 |
| Image2 | 12.51 | 3.0011 | 338.26 | 4.5903 | 140.43 | 4.3048 | 184.35 | 3.6435 |
| Image3 | 42.61 | 3.7806 | 94.97 | 4.3194 | 79.79 | 3.0726 | 316.2 | 3.746 |
| Image4 | 3.5 | 4.4704 | 52 | 4.6827 | 67.58 | 4.1156 | 123.94 | 3.7288 |
| Image5 | 9.15 | 3.8071 | 93.61 | 5.4865 | 25.44 | 4.3323 | 24.2 | 5.194 |
| Image6 | 33.06 | 3.1545 | 27.9 | 6.0514 | 82.76 | 4.1148 | 337.87 | 3.2744 |
| Image7 | 22.92 | 3.523 | 248.64 | 3.4186 | 138.49 | 4.5378 | 376.25 | 2.8453 |
| Image8 | 42.77 | 4.0398 | 97.51 | 3.4495 | 333.69 | 3.9339 | 114.63 | 5.1683 |
| Image9 | 8.5 | 5.1887 | 135.06 | 3.3526 | 117.56 | 3.7073 | 103.96 | 3.7698 |
| Image10 | 33.6 | 2.956 | 113.17 | 4.0303 | 34.31 | 3.668 | 14.03 | 3.4401 |
| Average | 23.64 | 3.8512 | 149.22 | 4.0261 | 105.12 | 3.9917 | 159.54 | 3.9781 |
5.3. Result
The image size was set to 384 × 384 to match the input size of ConvNeXt-XL [17]. We used a batch size of 4 and 200 epochs, with a training time of 60 minutes per epoch. To minimize unnecessary training time and prevent overfitting, we applied an early stopping technique to identify the best-performing network quickly. In conclusion, the trained proposed model achieved a classification accuracy of 99.87% on the test dataset. Fig. 6 shows a visualization of the predictions made by the trained model on the test dataset, verifying that the model correctly classifies the generated and real images in the test dataset.
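A minimal sketch of validation-loss-based early stopping is given below; the patience value and the caller-supplied training and validation callables are illustrative.

```python
import math
import torch

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=200, patience=10):
    """Train until validation loss stops improving.

    `train_one_epoch()` and `evaluate()` are caller-supplied callables that run one
    training pass and return the current validation loss, respectively.
    """
    best_val_loss, wait = math.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val_loss:             # improvement: checkpoint and reset counter
            best_val_loss, wait = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:                                    # no improvement for `patience` epochs: stop
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
    return best_val_loss
```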
6. Ablation Study
We trained three additional models [19-21] and compared performance by measuring accuracy, F1-score, and area under the ROC curve (AUC) to show that our trained model performs best. The F1-score combines a model's precision and recall to measure its performance, while the AUC is the area under the ROC curve and evaluates the model's classification ability. We trained the three models in the same environment and under the same conditions to ensure a fair comparison. Table 3 presents a performance comparison of the different architectures. The ConvNeXt-XL [17] model, with a high accuracy of 99.87%, shows only a slight advantage over the 97.93% accuracy of the less complex VGG-16 [19], suggesting a minor performance differential between complex and more straightforward models.
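The three metrics can be computed with scikit-learn as sketched below; the arrays are placeholders standing in for each model's test-set labels and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder arrays: ground-truth labels (0 = real, 1 = generated) and the
# model's predicted probability of the "generated" class for each test image.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.02, 0.97, 0.88, 0.10, 0.65, 0.30])
y_pred = (y_prob >= 0.5).astype(int)             # threshold probabilities at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_prob))
```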
Table 3. Results (accuracy, F1-score, AUC).

| Models | Accuracy | F1-score | AUC |
|---|---|---|---|
| VGG-16 [19] | 0.9793 | 0.9794 | 0.98 |
| ResNet-50 [20] | 0.9813 | 0.9814 | 0.98 |
| EfficientNet-B0 [21] | 0.9793 | 0.9793 | 0.98 |
| ConvNeXt-XL [17] | 0.9987 | 0.9987 | 1 |
We further validated our results by applying the trained model, which was trained with Stable Diffusion [3] images, to images generated by DALLE-2 [1] and Imagen [2]. The accuracy of the trained model on the DALLE-2 [1] and Imagen [2] datasets was 4.74% and 91.19%, respectively; the trained model could not classify most of the images in DALLE-2 [1]. The prediction visualizations can be seen in Figs. 7 and 8. As shown in Figs. 7 and 8, the most poorly predicted images in DALLE-2 [1] and Imagen [2] are artwork, illustrations, and drawings. The accuracy for DALLE-2 [1] is so low because most of the DALLE-2 [1] dataset consists of such images, which were not included in the training dataset: since we generated images with Stable Diffusion [3] using the captions of MS-COCO [16], our training dataset contains no artwork, illustrations, or drawings.
Fig. 7. Visualization of FP predictions by the trained model on the DALLE-2 dataset.
Predicted: Natural Image / True: Generated Image
Fig. 8. Visualization of FP predictions by the trained model on the Imagen dataset.
Predicted: Natural Image / True: Generated Image
7. Conclusion
In this paper, we identified the weaknesses of inpainting on real versus generated data and proposed a classification framework to determine which images, real or generated, are more suitable for inpainting. In addition, when the model is trained for this purpose using transfer learning, it can achieve nearly 100% accuracy even with a relatively small amount of data, and our ablation study confirmed that the trained ConvNeXt-XL [17] has the best performance among the compared image classification models.
7.1. Limitation and Future Research
The model trained on the Stable Diffusion [3] dataset demonstrated good performance; however, it struggled to accurately classify artworks, illustrations, photographs, and other content generated by DALLE-2 [1], indicating that more diverse generated images need to be included in the training dataset to improve the classification of generated images. In addition, although we obtained objective performance results through quantitative evaluation, a subjective evaluation would also be needed to capture the subtle differences and context that are difficult to express numerically; due to the time constraints of the research schedule and experimental period, we were unable to recruit enough subjects, which limited the subjective evaluation. The trained model is also limited by high computational cost and processing time, which makes it difficult to apply to real-world applications. In future work, we will focus on making the model lightweight and optimizing it to improve computational cost and inference speed. We also plan to obtain more generated data for training and to recruit enough subjects to conduct a more meaningful subjective evaluation. Finally, we will continue this research by validating the performance of the model in different environments to increase its real-world applicability.
Acknowledgment
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00227431, Development of 3D space digital media standard technology).
References
[1] Ramesh A., Dhariwal P., Nichol A., Chu C., Chen M., 2022, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125.
[2] Saharia C., Chan W., Saxena S., Li L., Whang J., Denton E., 2022, Photorealistic text-to-image diffusion models with deep language understanding, arXiv preprint arXiv:2205.11487.
[3] Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B., 2022, High-resolution image synthesis with latent diffusion models.
[4] Liu G., Reda F. A., Shih K. J., Wang T.-C., Tao A., Catanzaro B., 2018, Image inpainting for irregular holes using partial convolutions, Lecture Notes in Computer Science, pp. 89-105.
[5] Yu J., Lin Z., Yang J., Shen X., Lu X., Huang T., 2019, Free-form image inpainting with gated convolution.
[6] Li C., Wand M., 2016, Precomputed real-time texture synthesis with Markovian generative adversarial networks, Lecture Notes in Computer Science, pp. 702-716.
[7] Ronneberger O., 2017, Invited talk: U-Net convolutional networks for biomedical image segmentation, Bildverarbeitung für die Medizin 2017, pp. 3-3.
[8] Wang S., Duan H., Ding H., Tan Y.-P., Yap K.-H., Yuan J., 2022, Learning transferable human-object interaction detector with natural language supervision.
[9] Nichol A., Dhariwal P., Ramesh A., Shyam P., Mishkin P., McGrew B., 2021, GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, arXiv preprint arXiv:2112.10741.
[10] Zhu J.-Y., Krähenbühl P., Shechtman E., Efros A. A., 2016, Generative visual manipulation on the natural image manifold, Lecture Notes in Computer Science, pp. 597-613.
[11] Xia W., Zhang Y., Yang Y., Xue J.-H., Zhou B., Yang M., 2022, GAN inversion: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 3, pp. 3121-3138.
[12] Raffel C., Shazeer N., Roberts A., Lee K., Narang S., 2020, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research, Vol. 21, No. 140, pp. 1-67.
[13] Ho J., Jain A., Abbeel P., 2020, Denoising diffusion probabilistic models, arXiv preprint arXiv:2006.11239.
[14] Ho J., Saharia C., Chan W., Fleet D. J., Norouzi M., Salimans T., 2021, Cascaded diffusion models for high fidelity image generation, arXiv preprint arXiv:2106.15282.
[15] Dhariwal P., Nichol A., 2021, Diffusion models beat GANs on image synthesis, arXiv preprint arXiv:2105.05233.
[16] Lin T.-Y., Maire M., Belongie S., Bourdev L., Girshick R., 2014, Microsoft COCO: Common objects in context, Lecture Notes in Computer Science, pp. 740-755.
[17] Liu Z., Mao H., Wu C.-Y., Feichtenhofer C., Darrell T., Xie S., 2022, A ConvNet for the 2020s.
[18] Shorten C., Khoshgoftaar T. M., 2019, A survey on image data augmentation for deep learning, Journal of Big Data, Vol. 6, No. 1.
[19] Simonyan K., Zisserman A., 2015, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[20] He K., Zhang X., Ren S., Sun J., 2015, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
[21] Tan M., Le Q. V., 2019, EfficientNet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946.
[22] Bird J. J., Lotfi A., 2024, CIFAKE: Image classification and explainable identification of AI-generated synthetic images, IEEE Access, Vol. 12, pp. 15642-15650.
[23] Krizhevsky A., Hinton G., 2009, Learning multiple layers of features from tiny images.
[24] Mittal H., Saraswat M., Bansal J. C., Nagar A., 2020, Fake-face image classification using improved quantum-inspired evolutionary-based feature selection method, pp. 989-995.
[25] Real and Fake Face Detection, Kaggle, https://www.kaggle.com/ciplab/real-and-fake-face-detection (accessed 01/15/2020).
[26] Krizhevsky A., Sutskever I., Hinton G. E., 2012, ImageNet classification with deep convolutional neural networks, pp. 1097-1105.
[27] Chen Y., Xia R., Yang K., Zou K., 2024, MFMAM: Image inpainting via multi-scale feature module with attention module, Computer Vision and Image Understanding, Vol. 238, pp. 103883.
[28] Corneanu C., Gadde R., Martinez A. M., 2024, LatentPaint: Image inpainting in latent space with diffusion models.
[29] Xie S., Zhang Z., Lin Z., Hinz T., Zhang K., 2023, SmartBrush: Text and shape guided object inpainting with diffusion model.
[30] Wang X., Yu K., Dong C., Loy C. C., 2018, Recovering realistic texture in image super-resolution by deep spatial feature transform.
[31] Karras T., 2017, Progressive growing of GANs for improved quality, stability, and variation.
[32] Schuhmann C., Vencu R., Beaumont R., Kaczmarczyk R., Mullis C., Katta A., Coombes T., Jitsev J., Komatsuzaki A., 2021, LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, arXiv preprint arXiv:2111.02114.
Author
Han-gyul Baek received his M.S. degree in computer science and engineering from Kyungpook National
University in 2024. His research interests include video compression, image and video
processing, deep learning and inpainting in computer vision.
Dong-shin Lim is currently a Ph.D. candidate in computer science and engineering at Kyungpook National
University. He is a researcher in the AI-Big Data Section at the Korea Education and
Research Information Service (KERIS), Daegu, South Korea. His research interests include
video compression, video quality enhancement, and artificial intelligence applications
in multimedia processing.
Hojun Song is currently an M.S. candidate in computer science and engineering at Kyungpook National
University. His research interests include model compression, 3D scene understanding,
and AI applications in multimedia processing.
Vani Priyanka Gali received her M.S. degree in computer science and engineering from Kyungpook National
University in 2024. Her research interests include image translation, text-to-image
generated images, super-resolution, and deep learning in computer vision.
Sang-hyo Park received his Ph.D. degree in computer science from Hanyang University, Seoul, South
Korea, in 2017. From 2017 to 2018, he held a Postdoctoral position with the Intelligent
Image Processing Center, Korea Electronics Technology Institute, and a Research Fellow
with the Barun ICT Research Center, Yonsei University in 2018. From 2019 to 2020,
he held a Postdoctoral position with the Department of Electronic and Electrical Engineering,
Ewha Womans University. In 2020, he joined the Kyungpook National University at Daegu,
where he is now an Associate Professor of Computer Science and Engineering. His research
interests include VVC, encoding complexity, scene description, and model compression.
He served as a Co-Editor of Internet Video Coding (IVC, ISO/IEC 14496-33) for six years.