Generated Image Classification Model for Deep Learning-based Inpainting Model
Han-gyul Baek, Dong-shin Lim, Hojun Song, Vani Priyanka Gali, and Sang-hyo Park*
(School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea)
Copyright © 2025 The Institute of Electronics and Information Engineers(IEIE)
Keywords
Deep learning, Generated images, Image classification, Inpainting
1. Introduction
The development of computers and networks has led to a proliferation of media content such as images and videos, making such data increasingly important. In deep learning-based computer vision, which relies on this data, a wide range of studies have been conducted and continue to advance. However, most computer vision problems are studied on real data, such as images and videos captured by humans. In contrast, recent very large-scale artificial intelligence models with enormous numbers of parameters, such as DALL-E 2 [1], Imagen [2], and Stable Diffusion [3], are trained on large-scale data using computing infrastructures capable of massive computation and generate data whose attributes differ from those of real data. Given creative text inputs, these models [1-3] can produce unnatural images unlike anything found in real data.
Although the images are artificially generated, they possess a quality that makes
them appear as if they are real. Due to advancements in generating high-quality images
comparable to real-life images, generative images have been used in various fields
of computer vision. Inpainting, one of many computer vision techniques, can restore damaged regions of images and videos or remove regions that the user masks as unnecessary. However, existing inpainting models have been trained on real data, and their technical achievements and implementations have focused on real data; inpainting has yet to be studied for synthetic images generated by DALL-E 2 [1], Imagen [2], and Stable Diffusion [3]. In this paper, we first demonstrate how existing inpainting models perform on generated images. We then introduce a synthetic image classification framework that proactively identifies inputs on which current inpainting models are vulnerable and evaluates whether an image is suitable for inpainting in this new context.
2. Related Work
In this section, we review prior work on inpainting models, generated-image models, and studies on classifying generated and real images that are relevant to the purpose of this research.
2.1. Image Inpainting Models
In computer vision, inpainting has long been studied and remains an active research area. The field can be broadly divided into two parts: image inpainting and video inpainting. Since this article concerns image classification, we focus on image inpainting rather than video inpainting. Deep neural network-based image inpainting learns semantic priors and meaningful hidden representations in an end-to-end manner, and the network uses a convolutional filter to fill the missing content with a fixed value. This approach relies on the initial hole values, which can lead to problems such as a lack of texture in the holes, severe color contrast, and artificial edges. Another limitation is that such methods focus on rectangular holes in the center of the image, which leads to overfitting to that hole configuration.
To address these limitations, Liu et al. [4] propose partial convolution with an automatic mask-update step in place of standard convolution: the convolution is applied only to valid pixels, and an updated mask is generated automatically when moving from the current layer to the next as part of the partial convolution forward pass. The goal is a predictive model that works on irregular hole patterns and blends with the rest of the image without additional post-processing. However, Liu et al. [4] note two limitations. First, validity is decided heuristically for every spatial location, so the mask is set regardless of how many pixels are covered by the filter. Second, as the network progresses through its layers, fewer and fewer pixels remain invalid and all mask values eventually become 1, which increases the probability of assigning a valid mask to an unfilled pixel.
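For illustration, the following is a minimal PyTorch sketch of a mask-updating partial convolution in the spirit of Liu et al. [4]; the single-channel mask handling and the window-area normalization are simplifications of the published formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Minimal partial convolution with automatic mask update (sketch after Liu et al. [4])."""

    def forward(self, x, mask):
        # Count how many valid mask entries fall under each filter window.
        with torch.no_grad():
            ones = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(mask, ones, stride=self.stride,
                             padding=self.padding, dilation=self.dilation)
        # Convolve only the valid (unmasked) pixels.
        out = F.conv2d(x * mask, self.weight, bias=None, stride=self.stride,
                       padding=self.padding, dilation=self.dilation, groups=self.groups)
        # Re-normalize by the number of valid pixels in each window.
        keep = (valid > 0).float()
        scale = self.kernel_size[0] * self.kernel_size[1] / valid.clamp(min=1e-8)
        out = out * scale * keep
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1) * keep
        # Mask update: a location becomes valid once any valid pixel fell in its window.
        return out, keep

# Example: apply one layer to a 3-channel image with a single-channel binary mask.
conv = PartialConv2d(3, 64, kernel_size=3, padding=1)
out, new_mask = conv(torch.randn(1, 3, 256, 256), torch.ones(1, 1, 256, 256))
```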
Addressing a limitation of Liu et al. [4], Yu et al. [5] observe that partial convolution is hard-gated, which can limit flexibility because all channels in each layer share the same mask; they characterize partial convolution [4] as unlearnable, single-channel hard gating of features. Instead of the hard gating mask of partial convolution [4], gated convolution [5] automatically learns a soft mask from the data, paying more attention to features that are more likely to be valid. This channel-wise, spatially dynamic feature-selection mechanism is realized as a new convolution, the gated convolution. In addition, a spectral-normalized Markovian discriminator (SN-PatchGAN) is devised to naturally fill a large number of holes with arbitrary locations and shapes. SN-PatchGAN builds on the deep Markovian model of Li et al. [6], which uses only the statistics of local patches, enabling both free-form and user-guided inpainting. In the overall architecture, the coarse network and refinement network use an encoder-decoder network instead of the U-Net [7] used in partial convolution [4], because the skip connections of the U-Net [7] cannot propagate detailed color or texture information.
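A corresponding minimal sketch of a gated convolution layer in the spirit of Yu et al. [5] is shown below; the layer sizes and activation choice are illustrative.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Minimal gated convolution (sketch after Yu et al. [5]): a soft, learnable,
    per-channel and per-pixel gate replaces the hard binary mask."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.act = nn.ELU()

    def forward(self, x):
        gating = torch.sigmoid(self.gate(x))      # soft gate in [0, 1], learned from data
        return self.act(self.feature(x)) * gating  # gated features

# Example: one gated layer applied to an image concatenated with its binary mask channel.
x = torch.randn(1, 4, 256, 256)                   # RGB image + mask channel
layer = GatedConv2d(4, 32, kernel_size=3, padding=1)
y = layer(x)                                      # shape: (1, 32, 256, 256)
```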
Recent research continues to enhance image inpainting. These methods [27-29] focus on improving the quality of image inpainting and, as a result, produce more natural-looking images.
Chen et al. [27] propose an image inpainting method, the multi-scale feature module with attention module (MFMAM), which combines a multi-scale feature module with an attention module. Existing deep learning-based image inpainting algorithms suffer from information loss while extracting deep-level features. The proposed network has a two-level structure that combines a deep learning encoder-decoder network with a generative adversarial network (GAN). The multi-scale fusion module using dilated convolution (MFDC) reduces information loss during feature extraction, and the joint attention module enhances the restoration of the semantic structure of the network. The approach of Chen et al. [27] uses a conditional semantic attention (CSA) layer and a pixel-wise class activation mapping (CAM) layer to restore deep semantic features of an image and performs pixel-level restoration on shallow features, which effectively restores information in corrupted areas. Using models trained on the Outdoor Scene [30] and CelebA-HQ [31] datasets, they performed an object removal task in real-world scenes and showed that the proposed method predicts the structure of the corrupted image well and produces more realistic and clearer restoration results. The MFMAM network improves image inpainting performance, demonstrating the effectiveness of the attention module across architectures, and Chen et al. [27] show that adding style loss and perceptual loss is particularly effective.
Corneanu et al. [28] propose LatentPaint, a new information propagation mechanism that conditions existing diffusion models to perform image inpainting. Existing inpainting methods mostly propagate information based on patch similarity or use generative adversarial networks (GANs) to generate images, and they often suffer from high computational cost, complex training, and inconsistent result quality. LatentPaint works by training a neural network to predict the original image from a noisy image; an explicit propagation (EP) module propagates information from the conditional pixels to the pixels to be inferred, and this module is combined with a conventional diffusion model (DM) to perform inpainting. Corneanu et al. [28] evaluated LatentPaint on several visual domains, including the CelebA-HQ [31] dataset, and found that it outperforms existing state-of-the-art techniques, especially in producing high-quality images with fast runtimes. LatentPaint opens up new possibilities in image inpainting by delivering high-quality results through an efficient information propagation mechanism.
Xie et al. [29] propose SmartBrush, a diffusion-based model that fills damaged areas with objects using text and shape guidance. They address the problem that existing models such as DALL-E 2 and Stable Diffusion support only text-based inpainting without shape guidance and therefore tend to modify the background texture around the object. SmartBrush performs multi-modal object inpainting by jointly using text describing the attributes of the object the user wants and a mask defining the shape of the region to be inpainted, focusing on improving the quality and controllability of the inpainted objects. The model is based on a diffusion process that produces high-quality images by gradually removing noise; during this process, it conditions on the input text and shape mask and also predicts the foreground object mask so that background pixels around the inpainted object, and thus the original background, are preserved. In addition, SmartBrush adopts a multi-task training strategy that trains on object inpainting and text-to-image generation simultaneously to improve its ability to handle diverse text descriptions and image content. Training data are taken from large datasets such as LAION-Aesthetics v2 [32], and the model's performance was evaluated through user studies on Amazon Mechanical Turk. As a result, SmartBrush produces higher-quality results than traditional inpainting models and more accurately reflects the shapes and attributes desired by users.
2.2. Text-to-image Models
Hyper-scale artificial intelligence technologies are gaining worldwide attention, and text-to-image models such as DALL-E 2 [1], Imagen [2], and Stable Diffusion [3], which generate images from input text, are under continuous improvement and research. Contrastive models such as contrastive language-image pretraining (CLIP) [8] are known to learn rich representations of images. Ramesh et al. [1] propose a two-stage model that leverages these representations for image generation. The two-stage model consists of a prior and a decoder: the prior generates a CLIP image embedding from the text, and the decoder takes the image embedding from the prior as a condition and generates an image with a diffusion model [9].
In addition, using the diffusion model [9] as a decoder trained to invert the CLIP image encoder allows the model to output semantically similar images, much like generative adversarial network inversion (GAN inversion) [10,11], and enables image interpolation through image embeddings. Ramesh et al. therefore propose the unCLIP model [1]: given a text, the model finds the corresponding text embedding, converts it into a CLIP [8] image embedding with the prior model, and then decodes that image embedding with a diffusion model [9] to generate a range of images that match it.
Another text-to-image model, Imagen [2], focuses on improving the correspondence between text and images by using a large transformer language model, T5-XXL [12], a pre-trained language model with frozen weights that is known for its ability to understand and generate human language. To map text to a sequence of embeddings, Imagen combines this large transformer language model encoder with a cascade of diffusion models [9,13-15]: a base diffusion model generates 64×64 images, and super-resolution diffusion models upscale them to 256×256 and 1024×1024. Saharia et al. [2] emphasize that, unlike previous work that uses only image-text data for model training, using this large transformer language model [12] pre-trained on text alone is very effective for image-text matching, and they address the problems caused by classifier-free guidance with dynamic thresholding, a new diffusion sampling technique, to improve image quality.
Stable Diffusion [3] models represent a significant advancement in text-to-image generation. Unlike its
predecessors, Stable Diffusion [3] utilizes diffusion methodologies and latent space representation to generate lifelike
images from textual and visual prompts while substantially reducing computational
requirements. The training process involves encoding text inputs into latent vectors
using pre-trained language models like CLIP. This encoding facilitates the generation
of compressed representations of images, thus alleviating the challenges associated
with high-resolution image generation. The model comprises three main components:
a variational autoencoder (VAE) for transforming images into latent representations,
a U-Net [7] for denoising noisy latents, and a text-encoder for converting input prompts into
embeddings that guide the denoising process. Stable Diffusion [3] enables various creative applications, including text-to-image generation, image
upscaling, and inpainting, while also democratizing high-resolution image synthesis
by reducing the cost of training and inference.
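As an illustration of these components, the sketch below loads a publicly released Stable Diffusion checkpoint with the Hugging Face diffusers library and points to the VAE, U-Net, and text encoder it exposes; the checkpoint name and API calls follow common diffusers usage and are assumptions rather than the exact setup of [3].

```python
# Minimal sketch using the Hugging Face diffusers library (assumed installed);
# "stabilityai/stable-diffusion-2" is one publicly available Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# The three components described above are exposed as pipeline attributes:
print(type(pipe.vae))           # variational autoencoder (image <-> latent)
print(type(pipe.unet))          # U-Net that denoises noisy latents
print(type(pipe.text_encoder))  # text encoder turning prompts into embeddings

# Text-to-image generation: the prompt is encoded, latents are iteratively
# denoised by the U-Net, and the VAE decodes the final latent into an image.
image = pipe("a photograph of a cat sitting on a wooden bench").images[0]
image.save("generated.png")
```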
2.3. Detecting Generated Images
As the classification of generated versus real images becomes increasingly important, much research is being conducted on it. In one such study, Bird et al. [22] used CIFAR-10 [23] as the real dataset and Stable Diffusion 1.4 [3] to generate a synthetic dataset covering the 10 CIFAR-10 [23] classes, and then trained an image classification model on this data. The classifier was a simple CNN; the authors experimented with different hyperparameters, such as the number of layers, filters, and neurons, and selected the best model by measuring precision, recall, and F1-score. The best model achieved an accuracy of 92.98%.
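For context, a minimal PyTorch sketch of such a small CNN binary classifier is shown below; the layer counts and filter sizes are illustrative and not the configuration selected in [22].

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Tiny CNN for real-vs-generated classification of 32x32 images (illustrative)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 1),  # single logit: real (0) vs. generated (1)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(8, 3, 32, 32))          # batch of 32x32 RGB images
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.randint(0, 2, (8,)).float())
```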
Other research on classifying generated and real images combines transfer learning with specific algorithms to improve classification accuracy despite small generated datasets. Mittal et al. [24] use the Real and Fake Face Detection dataset from Yonsei University [25] to train a model that classifies real and generated face images. Because the dataset is small, they train on top of a pre-trained AlexNet [26] and propose a method called improved quantum-inspired evolutionary-based feature selection (IQIEA-FS). IQIEA-FS is a feature selection method built on the basic concepts of the quantum-inspired evolutionary algorithm (QIEA), which leverages ideas from quantum computing to solve optimization problems; IQIEA-FS keeps QIEA's principles but improves feature selection efficiency and accuracy, yielding better feature selection and classification performance. A k-nearest neighbor (KNN) classifier is then used to classify images into real and fake faces. The full IQIEA-FS pipeline thus consists of AlexNet-based feature extraction, feature selection using the IQIEA algorithm, and image classification with the KNN classifier, and it achieves a mean normalized accuracy (MNA) of 58.3%.
3. Limitations of Existing Inpainting Models
Since existing inpainting models are primarily trained on real data, we conducted
a series of experiments to explore and establish the limitations of inpainting models.
As part of our experiments, we applied two distinct inpainting models to both real
and generated data. The real data to which the inpainting was applied is the MS-COCO
dataset [16], and the generated data is from DALL-E 2 [1], Stable Diffusion [3]. The first applied inpainting model is the partial convolution model by Liu et al.
[4] which can be seen in Figs. 1 and 2. The second applied inpainting model is the gated convolution model by Yu et al.
[5] which can be seen in Figs. 3 and 4. Figs. 1-4 show the original image, masked regions, binary mask, and inpainting results for
the masked regions of the real and generated data.
Fig. 1. Comparison after applying Liu et al. [4] partial convolution to real data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 2. Comparison after applying Liu et al. [4] partial convolution to generated
data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 3. Comparison after applying Yu et al. [5] gated convolution to real data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
Fig. 4. Comparison after applying Yu et al. [5] gated convolution to generated data.
Original image (top-left), masked regions (top-right), binary mask (bottom-left),
and inpainting results (bottom-right).
From the results in Figs. 1-4, we can see that the inpainting results on the generated data are noticeably worse than those on the real data. There could be several reasons for this, but it is most likely due to the different nature of real and generated data: since the inpainting models were trained on real data, the generated data can be regarded as unstable input from the perspective of the deep learning model. Because the inpainting models prove fragile on generated data, we propose a binary classification model to identify generated data.
Table 1 summarizes the main characteristics observed after applying each inpainting model [4,5] to the real and generated images. Both inpainting models [4,5] commonly produce visible borders, obvious artifacts, and unnatural results when applied to the generated images.
Table 1. Main characteristics after applying each inpainting model [4,5] to each dataset.
| Models | Real: MS-COCO [16] | Generated: DALLE-2 [1] | Generated: Stable Diffusion [3] |
|---|---|---|---|
| Partial convolution [4] | Borderless, natural, little artifact | Borderline, unnatural, obvious artifacts | Borderline, unnatural, obvious artifacts |
| Gated convolution [5] | Borderless, natural | Borderline, unnatural, obvious artifacts | Borderline, unnatural, obvious artifacts |
4. Proposed Framework
Our trained model framework primarily consists of two processes: data augmentation
and model training, as depicted in Fig. 5. Data augmentation, as referenced in [18], is a technique that effectively reduces training and validation errors. For effective
feature extraction from generated images, we applied random rotation, Gaussian noise,
brightness and contrast adjustments, and color jitter augmentation techniques.
Fig. 5. Overall framework of the proposed binary classification model.
The model training process involves transfer learning and fine-tuning of the ConvNeXt-XL model [17]. Transfer learning uses an existing pre-trained model to learn the target data, which allows good performance on data from a similar domain even with a small amount of data. Training with a small dataset can also avoid relative overfitting, and training only the last layer further avoids overfitting by reducing the number of trainable weights. We adopted ConvNeXt [17], a model designed to modernize a CNN in the spirit of the high-performing Vision Transformer. Liu et al. [17] progressively applied design choices from the Swin Transformer to a ResNet backbone, grouping them into five major categories of changes and keeping each change only if it improved performance before moving on to the next. Because this process achieves high performance with a purely convolutional architecture, we considered it a suitable model for transfer learning.
The ConvNeXt model takes 384 × 384 RGB images as input. We therefore resized our images from 768 × 768 to 384 × 384 during preprocessing to avoid performance degradation or feature distortion from training the model at a size other than the predefined one. To learn the features of the generated data well, we trained the whole model: both the part preceding the fully connected (FC) layer that extracts features from the data and the classification part, while fine-tuning some hyperparameters. We kept the global average pooling (GAP) layer before the FC part as it is, since GAP averages the values of each feature map and feeds them directly to the nodes, which greatly reduces the number of weights and helps avoid overfitting because it has no parameters to optimize. As for the FC layer, since its purpose is binary classification of generated and real data, the activation function is sigmoid, the loss function is binary cross-entropy, and the optimizer is Adam. Only one dense layer with two nodes was added at the end. To effectively learn the generated images, we fine-tuned both the ConvNeXt model [17] and the additional FC layer.
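A minimal PyTorch sketch of this setup is shown below; it assumes the timm library for the pre-trained ConvNeXt-XL backbone (the checkpoint name may differ by timm version) and attaches the two-node dense head, sigmoid activation, binary cross-entropy loss, and Adam optimizer described above.

```python
import torch
import torch.nn as nn
import timm  # assumed available; provides pre-trained ConvNeXt variants

class GeneratedImageClassifier(nn.Module):
    """ConvNeXt-XL backbone (global-average-pooled features) + two-node dense head."""

    def __init__(self, backbone_name="convnext_xlarge.fb_in22k_ft_in1k_384"):
        super().__init__()
        # num_classes=0 keeps the GAP output and drops the original classifier.
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, 2)  # [real, generated]

    def forward(self, x):
        feats = self.backbone(x)                 # (N, num_features) after GAP
        return torch.sigmoid(self.head(feats))   # sigmoid outputs for BCE loss

model = GeneratedImageClassifier()
criterion = nn.BCELoss()                         # binary cross-entropy, as described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is illustrative
```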
5. Experiment and Result
The experiment for classifying natural and generated images was conducted using two NVIDIA RTX 2080 Ti GPUs in parallel. We applied transfer learning to the ConvNeXt-XL model [17], which has shown excellent performance by modernizing a CNN with design choices inspired by transformer architectures. A dense layer for binary classification was added to the end of the ConvNeXt-XL network, and the weights of all layers were trained. The training dataset consisted of natural and generated images in an equal 1:1 ratio, totaling 20,000 images, and the validation dataset was composed at the same ratio with a total of 3,800 images. The image size was set to 384 × 384 to match the input size of ConvNeXt-XL [17]. The training time was 60 minutes per epoch with a batch size of 4. In conclusion, the trained ConvNeXt-XL [17] model achieved a classification accuracy of 99.87% on the test dataset.
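The training step under this setup can be sketched as follows, assuming the model, loss, and optimizer from the previous sketch and a placeholder dataset object; two GPUs are used via nn.DataParallel with a batch size of 4.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# `model`, `criterion`, and `optimizer` as defined in the previous sketch;
# `train_set` is a placeholder Dataset yielding (image, one_hot_label) pairs.
device = torch.device("cuda")
model = nn.DataParallel(model).to(device)        # spread each batch over the two GPUs
loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=4)

model.train()
for images, targets in loader:                   # targets: (N, 2) one-hot [real, generated]
    images, targets = images.to(device), targets.to(device).float()
    optimizer.zero_grad()
    outputs = model(images)                      # sigmoid probabilities, shape (N, 2)
    loss = criterion(outputs, targets)           # binary cross-entropy
    loss.backward()
    optimizer.step()
```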
5.1. Dataset and Augmentation
We used the MS-COCO [16] 2017 dataset to collect real images. For training, 10,000 images were sampled for the training dataset and 1,900 images for the validation dataset, both drawn from the MS-COCO [16] train split; 1,500 test images for evaluation were collected from the MS-COCO [16] validation split. Stable Diffusion [3] has recently been used to generate high-quality images from text descriptions. We generated 768 × 768 images with the Stable Diffusion 2.0 model [3], matching the number and ratio of the natural dataset and excluding its training data, and we used the captions from the MS-COCO [16] annotations as the text descriptions for synthesizing images.
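A minimal sketch of this generation step is shown below; it assumes the diffusers package, a locally downloaded MS-COCO caption annotation file, and placeholder output paths.

```python
import json
import torch
from diffusers import StableDiffusionPipeline

# Load COCO captions (the annotation path is a placeholder).
with open("annotations/captions_train2017.json") as f:
    captions = [ann["caption"] for ann in json.load(f)["annotations"]]

# Stable Diffusion 2.0 checkpoint (public diffusers identifier, assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# Generate one 768x768 image per caption for the synthetic training set.
for i, caption in enumerate(captions[:10000]):
    image = pipe(caption, height=768, width=768).images[0]
    image.save(f"generated/train_{i:05d}.png")
```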
We implemented a data augmentation algorithm for our generated dataset. The pseudo-code in Algorithm 1 shows the augmentation process for generated images: from the 7,000 generated images in the training split, we randomly applied augmentation techniques to produce an additional 3,000 images.
Algorithm 1: Pseudo-code for data augmentation.
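Since the pseudo-code itself is not reproduced here, the following is a minimal sketch of such an augmentation step using torchvision transforms; the parameter ranges and the Gaussian-noise helper are assumptions rather than the exact settings of Algorithm 1.

```python
import random
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.05):
    """Add zero-mean Gaussian noise to a float tensor image in [0, 1] (std is illustrative)."""
    return torch.clamp(img + torch.randn_like(img) * std, 0.0, 1.0)

# Candidate augmentations named in the paper: random rotation, Gaussian noise,
# brightness/contrast adjustment, and color jitter.
candidates = [
    transforms.RandomRotation(degrees=30),
    transforms.Lambda(add_gaussian_noise),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ColorJitter(hue=0.1, saturation=0.3),
]

def augment(img_tensor):
    """Randomly apply one augmentation technique to a (C, H, W) image tensor."""
    return random.choice(candidates)(img_tensor)
```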
5.2. Real and Generated Images: Comparative and Quantitative Analysis
We additionally applied inpainting to each real and generated dataset. This algorithm was chosen for its common use in the field and its ability to provide a baseline for comparison. We then conducted a quantitative evaluation to strengthen the evidence that inpainting performs poorly on the generated images compared with the real images. We applied inpainting to 10 images from each dataset, for a total of 80 inpainted images. In Fig. 6, the real image dataset, MS-COCO [16], shows that the masked objects are well erased, while the generated image datasets, DALLE-2 [1], Imagen [2], and Stable Diffusion [3], show unnatural results where the masked objects are not well erased.
Fig. 6. Examples after applying inpainting to real and generated image datasets.
We performed a quantitative evaluation to provide an objective assessment of the results. The metrics we used are the Fréchet inception distance (FID) and the naturalness image quality evaluator (NIQE). FID measures the quality difference between the original image and the image with objects removed by inpainting, while NIQE quantifies the naturalness of the inpainted image. In Table 2, which shows the quantitative figures, the generated images have higher values on average than the real images, so we can say that the generated images are not inpainted as well as the real images; in particular, the average FID score shows a significant difference. Therefore, we have shown the difference between real and generated images experimentally and confirmed through quantitative evaluation that the generated images are inpainted less successfully.
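As a rough sketch of how such scores can be computed, the snippet below uses torchmetrics for FID and the pyiqa package for NIQE; these package choices are assumptions, as the paper does not specify which implementations were used.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
import pyiqa  # assumed: provides a no-reference 'niqe' metric

def fid_score(real_batch, inpainted_batch):
    """FID between original and inpainted images; inputs are uint8 (N, 3, H, W) tensors."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_batch, real=True)
    fid.update(inpainted_batch, real=False)
    return fid.compute().item()

niqe = pyiqa.create_metric("niqe")  # naturalness score, lower is better

def niqe_score(inpainted_batch):
    """Mean NIQE of a float batch in [0, 1] with shape (N, 3, H, W)."""
    return niqe(inpainted_batch).mean().item()
```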
Table 2. Quantitative assessment of each real and generated image (FID score, NIQE).

| Dataset | MS-COCO [16] (Real) FID ↓ | MS-COCO [16] (Real) NIQE ↓ | DALLE-2 [1] (Generated) FID ↓ | DALLE-2 [1] (Generated) NIQE ↓ | Imagen [2] (Generated) FID ↓ | Imagen [2] (Generated) NIQE ↓ | Stable Diffusion [3] (Generated) FID ↓ | Stable Diffusion [3] (Generated) NIQE ↓ |
|---|---|---|---|---|---|---|---|---|
| Image1 | 27.79 | 4.5903 | 291.04 | 4.3293 | 31.18 | 4.1303 | 27.78 | 2.9703 |
| Image2 | 12.51 | 3.0011 | 338.26 | 4.5903 | 140.43 | 4.3048 | 184.35 | 3.6435 |
| Image3 | 42.61 | 3.7806 | 94.97 | 4.3194 | 79.79 | 3.0726 | 316.2 | 3.746 |
| Image4 | 3.5 | 4.4704 | 52 | 4.6827 | 67.58 | 4.1156 | 123.94 | 3.7288 |
| Image5 | 9.15 | 3.8071 | 93.61 | 5.4865 | 25.44 | 4.3323 | 24.2 | 5.194 |
| Image6 | 33.06 | 3.1545 | 27.9 | 6.0514 | 82.76 | 4.1148 | 337.87 | 3.2744 |
| Image7 | 22.92 | 3.523 | 248.64 | 3.4186 | 138.49 | 4.5378 | 376.25 | 2.8453 |
| Image8 | 42.77 | 4.0398 | 97.51 | 3.4495 | 333.69 | 3.9339 | 114.63 | 5.1683 |
| Image9 | 8.5 | 5.1887 | 135.06 | 3.3526 | 117.56 | 3.7073 | 103.96 | 3.7698 |
| Image10 | 33.6 | 2.956 | 113.17 | 4.0303 | 34.31 | 3.668 | 14.03 | 3.4401 |
| Average | 23.64 | 3.8512 | 149.22 | 4.0261 | 105.12 | 3.9917 | 159.54 | 3.9781 |
5.3. Result
The image size was set to 384 × 384 to match the input size of ConvNeXt-XL [17]. We used a batch size of 4 and 200 epochs, with a training time of 60 minutes per epoch. To minimize unnecessary training time and prevent overfitting, we applied an early stopping technique to identify the best-performing network quickly. In conclusion, the trained proposed model achieved a classification accuracy of 99.87% on the test dataset. Fig. 6 shows a visualization of the predictions made by the trained model on the test dataset, verifying that the model correctly classifies the generated and real images in the test dataset.
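A minimal sketch of validation-loss-based early stopping is given below; the patience value and the caller-supplied training and validation callables are illustrative.

```python
import math
import torch

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=200, patience=10):
    """Train until validation loss stops improving.

    `train_one_epoch()` and `evaluate()` are caller-supplied callables that run one
    training pass and return the current validation loss, respectively.
    """
    best_val_loss, wait = math.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val_loss:             # improvement: checkpoint and reset counter
            best_val_loss, wait = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:                                    # no improvement for `patience` epochs: stop
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
    return best_val_loss
```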
6. Ablation Study
We trained three additional models [19-21] and compared performance by measuring accuracy, F1-score, and area under the ROC curve (AUC) to show that our trained model performs best. The F1-score combines a model's precision and recall to measure its performance, while the AUC is the area under the ROC curve and evaluates the model's classification ability. We trained the three models in the same environment and under the same conditions to ensure a fair comparison. Table 3 presents a performance comparison of the different architectures. The ConvNeXt-XL [17] model, with a high accuracy of 99.87%, shows only a slight advantage over the 97.93% accuracy of the less complex VGG-16 [19], suggesting a minor performance differential between complex and more straightforward models.
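The three metrics can be computed with scikit-learn as sketched below; the arrays are placeholders standing in for each model's test-set labels and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder arrays: ground-truth labels (0 = real, 1 = generated) and the
# model's predicted probability of the "generated" class for each test image.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.02, 0.97, 0.88, 0.10, 0.65, 0.30])
y_pred = (y_prob >= 0.5).astype(int)             # threshold probabilities at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_prob))
```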
Table 3. Results (accuracy, F1-score, AUC).

| Models | Accuracy | F1-score | AUC |
|---|---|---|---|
| VGG-16 [19] | 0.9793 | 0.9794 | 0.98 |
| ResNet-50 [20] | 0.9813 | 0.9814 | 0.98 |
| EfficientNet-B0 [21] | 0.9793 | 0.9793 | 0.98 |
| ConvNeXt-XL [17] | 0.9987 | 0.9987 | 1 |
We further validated our results by applying the trained model, which was trained with Stable Diffusion [3] images, to images generated by DALLE-2 [1] and Imagen [2]. The accuracy of the trained model on the DALLE-2 [1] and Imagen [2] datasets was 4.74% and 91.19%, respectively; the trained model could not classify most of the images in DALLE-2 [1]. The prediction visualizations can be seen in Figs. 7 and 8. As shown in Figs. 7 and 8, the most poorly predicted images in DALLE-2 [1] and Imagen [2] are artwork, illustrations, and drawings. The accuracy for DALLE-2 [1] is so low because most of the DALLE-2 [1] dataset consists of such images, which were not included in the training dataset: since we generated images with Stable Diffusion [3] using the captions of MS-COCO [16], our training dataset contains no artwork, illustrations, or drawings.
Fig. 7. Visualization of FP predictions by the trained model on the DALLE-2 dataset.
Predicted: Natural Image / True: Generated Image
Fig. 8. Visualization of FP predictions by the trained model on the Imagen dataset.
Predicted: Natural Image / True: Generated Image
7. Conclusion
In this paper, we identified the weaknesses of inpainting on real versus generated data and proposed a classification framework to determine which images, real or generated, are more suitable for inpainting. In addition, when the model is trained for this purpose using transfer learning, it can achieve nearly 100% accuracy even with a relatively small amount of data, and our ablation study confirmed that the trained ConvNeXt-XL [17] has the best performance among the compared image classification models.
7.1. Limitation and Future Research
The model trained on the Stable Diffusion [3] dataset demonstrated good performance; however, it struggled to accurately classify artworks, illustrations, photographs, and other content generated by DALLE-2 [1], indicating that more diverse generated images need to be included in the training dataset to improve the classification of generated images. In addition, although we obtained objective performance results through quantitative evaluation, a subjective evaluation would also be needed to capture the subtle differences and context that are difficult to express numerically; due to the time constraints of the research schedule and experimental period, we were unable to recruit enough subjects, which limited the subjective evaluation. The trained model is also limited by high computational cost and processing time, which makes it difficult to apply to real-world applications. In future work, we will focus on making the model lightweight and optimizing it to improve computational cost and inference speed. We also plan to obtain more generated data for training and to recruit enough subjects to conduct a more meaningful subjective evaluation. Finally, we will continue this research by validating the performance of the model in different environments to increase its real-world applicability.
Acknowledgment
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00227431, Development of 3D space digital media standard technology).
References
[1] Ramesh A., Dhariwal P., Nichol A., Chu C., Chen M., 2022, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125.
[2] Saharia C., Chan W., Saxena S., Li L., Whang J., Denton E., 2022, Photorealistic text-to-image diffusion models with deep language understanding, arXiv preprint arXiv:2205.11487.
[3] Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B., 2022, High-resolution image synthesis with latent diffusion models.
[4] Liu G., Reda F. A., Shih K. J., Wang T.-C., Tao A., Catanzaro B., 2018, Image inpainting for irregular holes using partial convolutions, Lecture Notes in Computer Science, pp. 89-105.
[5] Yu J., Lin Z., Yang J., Shen X., Lu X., Huang T., 2019, Free-form image inpainting with gated convolution.
[6] Li C., Wand M., 2016, Precomputed real-time texture synthesis with Markovian generative adversarial networks, Lecture Notes in Computer Science, pp. 702-716.
[7] Ronneberger O., 2017, Invited talk: U-Net convolutional networks for biomedical image segmentation, Bildverarbeitung für die Medizin 2017, pp. 3-3.
[8] Wang S., Duan H., Ding H., Tan Y.-P., Yap K.-H., Yuan J., 2022, Learning transferable human-object interaction detector with natural language supervision.
[9] Nichol A., Dhariwal P., Ramesh A., Shyam P., Mishkin P., McGrew B., 2021, GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, arXiv preprint arXiv:2112.10741.
[10] Zhu J.-Y., Krähenbühl P., Shechtman E., Efros A. A., 2016, Generative visual manipulation on the natural image manifold, Lecture Notes in Computer Science, pp. 597-613.
[11] Xia W., Zhang Y., Yang Y., Xue J.-H., Zhou B., Yang M., 2022, GAN inversion: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 3, pp. 3121-3138.
[12] Raffel C., Shazeer N., Roberts A., Lee K., Narang S., 2020, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research, Vol. 21, No. 140, pp. 1-67.
[13] Ho J., Jain A., Abbeel P., 2020, Denoising diffusion probabilistic models, arXiv preprint arXiv:2006.11239.
[14] Ho J., Saharia C., Chan W., Fleet D. J., Norouzi M., Salimans T., 2021, Cascaded diffusion models for high fidelity image generation, arXiv preprint arXiv:2106.15282.
[15] Dhariwal P., Nichol A., 2021, Diffusion models beat GANs on image synthesis, arXiv preprint arXiv:2105.05233.
[16] Lin T.-Y., Maire M., Belongie S., Bourdev L., Girshick R., 2014, Microsoft COCO: Common objects in context, Lecture Notes in Computer Science, pp. 740-755.
[17] Liu Z., Mao H., Wu C.-Y., Feichtenhofer C., Darrell T., Xie S., 2022, A ConvNet for the 2020s.
[18] Shorten C., Khoshgoftaar T. M., 2019, A survey on image data augmentation for deep learning, Journal of Big Data, Vol. 6, No. 1.
[19] Simonyan K., Zisserman A., 2015, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[20] He K., Zhang X., Ren S., Sun J., 2015, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
[21] Tan M., Le Q. V., 2019, EfficientNet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946.
[22] Bird J. J., Lotfi A., 2024, CIFAKE: Image classification and explainable identification of AI-generated synthetic images, IEEE Access, Vol. 12, pp. 15642-15650.
[23] Krizhevsky A., Hinton G., 2009, Learning multiple layers of features from tiny images.
[24] Mittal H., Saraswat M., Bansal J. C., Nagar A., 2020, Fake-face image classification using improved quantum-inspired evolutionary-based feature selection method, pp. 989-995.
[25] Real and Fake Face Detection, Kaggle, https://www.kaggle.com/ciplab/real-and-fake-face-detection (accessed 01/15/2020).
[26] Krizhevsky A., Sutskever I., Hinton G. E., 2012, ImageNet classification with deep convolutional neural networks, pp. 1097-1105.
[27] Chen Y., Xia R., Yang K., Zou K., 2024, MFMAM: Image inpainting via multi-scale feature module with attention module, Computer Vision and Image Understanding, Vol. 238, pp. 103883.
[28] Corneanu C., Gadde R., Martinez A. M., 2024, LatentPaint: Image inpainting in latent space with diffusion models.
[29] Xie S., Zhang Z., Lin Z., Hinz T., Zhang K., 2023, SmartBrush: Text and shape guided object inpainting with diffusion model.
[30] Wang X., Yu K., Dong C., Loy C. C., 2018, Recovering realistic texture in image super-resolution by deep spatial feature transform.
[31] Karras T., 2017, Progressive growing of GANs for improved quality, stability, and variation.
[32] Schuhmann C., Vencu R., Beaumont R., Kaczmarczyk R., Mullis C., Katta A., Coombes T., Jitsev J., Komatsuzaki A., 2021, LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, arXiv preprint arXiv:2111.02114.
Author
Han-gyul Baek received his M.S. degree in computer science and engineering from Kyungpook National
University in 2024. His research interests include video compression, image and video
processing, deep learning and inpainting in computer vision.
Dong-shin Lim is currently a Ph.D. candidate in computer science and engineering at Kyungpook National
University. He is a researcher in the AI-Big Data Section at the Korea Education and
Research Information Service (KERIS), Daegu, South Korea. His research interests include
video compression, video quality enhancement, and artificial intelligence applications
in multimedia processing.
Hojun Song is currently an M.S. candidate in computer science and engineering at Kyungpook National
University. His research interests include model compression, 3D scene understanding,
and AI applications in multimedia processing.
Vani Priyanka Gali received her M.S. degree in computer science and engineering from Kyungpook National
University in 2024. Her research interests include image translation, text-to-image
generated images, super-resolution, and deep learning in computer vision.
Sang-hyo Park received his Ph.D. degree in computer science from Hanyang University, Seoul, South
Korea, in 2017. From 2017 to 2018, he held a Postdoctoral position with the Intelligent
Image Processing Center, Korea Electronics Technology Institute, and a Research Fellow
with the Barun ICT Research Center, Yonsei University in 2018. From 2019 to 2020,
he held a Postdoctoral position with the Department of Electronic and Electrical Engineering,
Ewha Womans University. In 2020, he joined the Kyungpook National University at Daegu,
where he is now an Associate Professor of Computer Science and Engineering. His research
interests include VVC, encoding complexity, scene description, and model compression.
He served as a Co-Editor of Internet Video Coding (IVC, ISO/IEC 14496-33) for six years.