
  1. (School of Electrical Engineering, Korea University / Seoul 02841, Korea {jy364, franky_, jokim}@korea.ac.kr)



Keywords: Leaf instance segmentation, Spatial embedding, Knowledge distillation

1. Introduction

Plants are sensitive to their surrounding environment. For plants to grow healthily, various environmental conditions must be kept constant; otherwise, plants may fail to grow normally due to a lack of nutrients or damage from pests. Therefore, continuous monitoring is essential for plants that are sensitive to environmental changes to grow normally.

In the agricultural field, various sensors and monitoring systems for detecting environmental changes have been introduced to increase plant productivity. However, purchasing such equipment is expensive, and managing it requires considerable time and effort. Introducing deep learning technology into agriculture can address these problems and make plant cultivation more productive. Instead of expensive agricultural equipment, plants can be managed in a contactless manner using relatively inexpensive vision sensors. Remote devices can reduce costs and control the environment automatically without human intervention. Therefore, deep learning-based research is being actively conducted in the agricultural field [1-5].

In this paper, we propose a leaf instance segmentation network that combines a spatial embedding method with knowledge distillation. Plant leaves are small and have many overlapping areas, so segmentation is performed using spatial embedding instead of detection-based segmentation. In conventional knowledge distillation, the ``teacher'' network is large and heavy while the ``student'' network is small and light; in the proposed method, however, the teacher and student networks share the same architecture. Moreover, instead of using knowledge distillation to lighten the model, the proposed network uses it to achieve good performance even with a small dataset. We conduct experiments with various types of distillation, and the proposed network produces good instance segmentation performance. In summary, the contributions of the paper are as follows.

· Considering that leaves are small and have many overlapping parts, leaf instance segmentation is performed with spatial embedding-based segmentation instead of detection-based segmentation.

· The proposed network adopts knowledge distillation to achieve good performance even when trained with a small dataset.

· We conduct experiments with various combinations by placing different types of distillation at different locations in the network.

Fig. 1. Plant leaf instance segmentation using knowledge distillation.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig1.png
Fig. 2. Growing plants in the chamber.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig2.png

2. Related Work

2.1 Instance Segmentation

Recently, the networks that produce good instance segmentation performance have mainly been object detection-based [8, 9, 20]. Many of these networks build a segmentation technique on the object detection of Faster R-CNN [6], which adds a Region Proposal Network (RPN) to R-CNN [7], a method that performs object detection using bounding boxes. In particular, Mask R-CNN [8] shows high performance among the segmentation networks built on Faster R-CNN and has been widely used as a comparison method. Faster R-CNN has a class label branch and a bounding box offset branch, to which Mask R-CNN adds an object mask branch. Segmentation networks that reduce model size or speed up computation using the Mask R-CNN backbone are therefore still being actively studied.

Another family of methods performs segmentation based on an embedding loss function [10, 27]. These networks train feature vectors of pixels belonging to the same instance to be similar to each other, and feature vectors of pixels belonging to different instances to be dissimilar. Fully convolutional networks are not well suited to this embedding approach, so they show worse performance than detection-based networks [11-15]. To solve this problem, methods have been proposed that increase segmentation performance by assigning each pixel an offset pointing to the center of its object [16-19]. In addition, a method that adds clustering to the loss function and optimizes intersection over union has been proposed [27].
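To make the pull/push idea concrete, the following is a minimal PyTorch sketch of a generic embedding loss in this spirit. It is an illustration only, not the exact loss of [10] or [27]; the function name, shapes, and margin value are our own placeholders.

import torch

def pull_push_loss(emb, labels, margin=1.0):
    # emb: (C, H, W) pixel embeddings; labels: (H, W) integer instance ids,
    # with 0 reserved for background (skipped here).
    ids = [k for k in labels.unique().tolist() if k != 0]
    means, pull = [], emb.new_zeros(())
    for k in ids:
        inst = emb[:, labels == k]                        # (C, Nk) pixels of instance k
        mu = inst.mean(dim=1)                             # instance mean embedding
        pull = pull + ((inst - mu[:, None]) ** 2).mean()  # pull pixels toward the mean
        means.append(mu)
    means = torch.stack(means)                            # (K, C)
    dist = torch.cdist(means, means)                      # pairwise distances between means
    push = torch.clamp(margin - dist, min=0).triu(1).sum()  # push means apart (hinge)
    n = len(ids)
    return pull / n + push / max(n * (n - 1) / 2, 1)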

The proposed method follows the same concept as the embedding-based segmentation of [27], with knowledge distillation added. Plant leaves have a small instance size and many overlapping parts, which does not fit detection-based segmentation with its required bounding box information, so we adopted embedding-based segmentation.

2.2 Knowledge Distillation

Knowledge distillation is a method of transferring knowledge from a large, heavy network (the teacher) to a small, light network (the student). In general, the teacher network extracts useful information, such as feature information, from input images and transfers it to the student network. Knowledge distillation has been widely used to improve performance in classification, segmentation, and object detection [21-23].

Our network uses knowledge distillation to achieve good performance even with a small dataset, rather than to lighten the model. The teacher and student networks have the same structure, but each receives different input images: the teacher network is trained with a large plant dataset, and the student network with a small one. In addition, two types of distillation are used: attention distillation [24] and region affinity distillation [21].

3. Dataset for Instance Segmentation

3.1 Dataset Acquisition

We grew butterhead lettuce in a plant-grower chamber to acquire a plant dataset directly. Eleven butterheads were grown for 18 days, and an IP camera installed in the chamber recorded their growth. After cultivation was completed, we cropped the acquired images to generate a total of 198 RGB images: 175 for training and 23 for testing.

3.2 Dataset Processing

Instance labels are required for plant leaf instance segmentation, so we labeled the images to distinguish between leaf instances. Using the annotation tool [25], we generated the labeled ground truth of the Butterhead dataset. Each leaf instance was labeled in a different color, and where leaves overlapped, the rule was that leaves that grew later appear on top.

Fig. 3. Butterheads taken with an IP camera.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig3.png
Fig. 4. RGB Original Image and Labeling Results.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig4.png

4. The Proposed Method

4.1 Leaf Instance Segmentation

We performed spatial embedding-based instance segmentation, considering the properties of small and overlapping plant leaves. The backbone of our network is ERFNet [26], which is often used for segmentation. ERFNet consists of an encoder and a decoder, with factorized convolutions and residual connections applied for efficient segmentation. The decoder is divided into two branches: a seed branch and an instance branch. The seed branch generates a seed map whose value approaches 1 as a pixel gets closer to the center of the instance it belongs to, and approaches 0 as it moves away from the center. The seed map is computed using Gaussian functions.

The instance branch generates a sigma map and a pixel offset vector (x, y). The sigma map is related to the margin of each instance, and the offset vector (x, y) represents the distance from the center in the x and y directions. Each pixel is then assigned to an instance by clustering with the seed map and sigma map. Spatial embedding, which points each pixel to the centroid of its instance, is performed by adding the corresponding offset vector to each pixel coordinate. Using a Gaussian function, we compute the distance between the spatial pixel embedding and the instance center to obtain the probability that the pixel belongs to that instance. This spatial embedding loss is used as the segmentation loss.
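Following the formulation of [27], the probability is a Gaussian of the embedding-to-center distance, $\phi_k(e_i)=\exp(-\|e_i-C_k\|^2/(2\sigma_k^2))$. The following PyTorch sketch illustrates this computation; the tensor shapes and names are illustrative, not the authors' exact code.

import torch

def instance_probability(offset, sigma, center, coords):
    # offset: (2, H, W) predicted offset vectors; sigma: (H, W) bandwidth map;
    # center: (2,) instance center; coords: (2, H, W) normalized pixel grid.
    emb = coords + offset                                # spatial embedding e_i
    dist2 = ((emb - center.view(2, 1, 1)) ** 2).sum(0)   # ||e_i - C_k||^2
    return torch.exp(-dist2 / (2 * sigma ** 2 + 1e-8))   # Gaussian probability

# Example: normalized coordinate grid for a 128x128 input.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 128),
                        torch.linspace(0, 1, 128), indexing="ij")
coords = torch.stack((xs, ys))                           # (2, H, W)

During clustering, pixels whose probability exceeds a threshold (0.5 in [27]) are assigned to the instance.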

4.2 Knowledge Distillation

A large plant dataset was used to train the teacher network, and a small plant dataset was used to train the student network. Unlike common general-purpose datasets, plant datasets are difficult to obtain because they can only be collected by growing plants in facilities with a controlled environment for a certain period of time. To solve this data shortage problem, we used knowledge distillation to achieve good segmentation performance even with a small dataset. Since the purpose is not to make the model lighter, the teacher and student networks have the same structure. Two types of distillation are applied to the encoder and decoder of our ERFNet backbone: attention distillation [24] and region affinity distillation [21]. In the encoder, attention distillation transfers attention containing important information from the teacher network to the student network; we obtain the attention using the squeeze-and-excitation block of [24]. In the decoder, region affinity distillation [21] transfers region contrast information from the teacher network to the student network.
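As an illustration of the encoder-side attention distillation, the following is a minimal PyTorch sketch: a squeeze-and-excitation block [24] produces channel attention for teacher and student features, and the student's attention is matched to the teacher's. The class name, reduction ratio, and L2 matching loss are our assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SEAttention(nn.Module):
    # Squeeze-and-excitation block [24]: global average pool ("squeeze"),
    # then a two-layer bottleneck MLP with sigmoid ("excitation").
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (N, C, H, W) encoder feature map
        return self.fc(x.mean(dim=(2, 3)))    # (N, C) channel attention weights

def attention_distillation_loss(att_teacher, att_student):
    # Match the student's attention to the frozen teacher's attention.
    return F.mse_loss(att_student, att_teacher.detach())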

The region contrast is obtained by calculating the cosine similarity between two class representations: one obtained by multiplying the feature map by the foreground of the plant binary mask, and the other by multiplying the feature map by the background of the mask. This region contrast information of the teacher network is transferred to the student network. The distillation loss thus combines the two methods, attention distillation and region affinity distillation.
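A minimal sketch of this decoder-side region affinity distillation follows, assuming masked average pooling for the two class representations and an L2 loss between the teacher's and student's region contrasts; the exact loss form in [21] may differ.

import torch
import torch.nn.functional as F

def region_contrast(feat, mask, eps=1e-8):
    # feat: (N, C, H, W) decoder feature map; mask: (N, 1, H, W) binary plant mask.
    fg = (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + eps)
    bg = (feat * (1 - mask)).sum(dim=(2, 3)) / ((1 - mask).sum(dim=(2, 3)) + eps)
    return F.cosine_similarity(fg, bg, dim=1)   # contrast between the two regions

def region_affinity_loss(feat_teacher, feat_student, mask):
    # Transfer the teacher's region contrast to the student.
    return F.mse_loss(region_contrast(feat_student, mask),
                      region_contrast(feat_teacher, mask).detach())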

4.3 The Proposed Network Architecture

Fig. 5 shows the overall structure of the proposed network; the top is the student network and the bottom is the teacher network. The encoder consists of 16 layers, and the decoder consists of 8 layers in each of the seed and instance branches, 16 layers in total. The teacher network is trained with a large plant dataset of 750 images.

The sequence of network operations is as follows. First, the teacher network generates a seed map, sigma map, and offset vector via ERFNet from the large dataset. Leaf instance segmentation is then performed on the generated maps using spatial embedding. The teacher network is frozen after training for 1000 epochs. After the training of the teacher network is completed, the student network is trained. The student network works identically to the teacher network, but it performs well with a small plant dataset thanks to the knowledge distilled from the teacher network. Unlike the teacher network, the student network learns with a total of three losses: the two distillation losses as well as the segmentation loss, composed as sketched below. Attention distillation is performed at the encoder, and region affinity distillation at the decoder.
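The student's total objective can be sketched as the weighted sum below. The weights w_att and w_aff are illustrative placeholders; the paper does not report the values used.

def student_loss(seg_loss, attention_loss, affinity_loss,
                 w_att=0.1, w_aff=0.1):
    # Total student objective: spatial embedding segmentation loss (Sec. 4.1)
    # plus the encoder-side attention distillation loss and the decoder-side
    # region affinity distillation loss. Weights are hypothetical.
    return seg_loss + w_att * attention_loss + w_aff * affinity_loss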

Fig. 5. The Proposed Network Architecture.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig5.png

5. Experimental Results

5.1 Experimental Setting

The proposed network is trained using several kinds of plant datasets. The images of all datasets are unified to a size of 128 ${\times}$ 128. The network is implemented using the PyTorch framework [28] on a PC with an NVIDIA RTX 3090 GPU. We adopted the Adam optimizer for loss optimization [29], with an initial learning rate of 0.00001.
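For reference, a minimal sketch of this setup is shown below; the stand-in module and transform usage are our assumptions, with only the input size and optimizer settings taken from the paper.

import torch
from torchvision import transforms

# Setup reflecting Sec. 5.1: inputs unified to 128x128, Adam with lr = 1e-5.
resize = transforms.Resize((128, 128))              # unify all input images
model = torch.nn.Conv2d(3, 16, 3)                   # stand-in for the ERFNet-based network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)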

We used four kinds of plant datasets in the experiments. The Komatsuna dataset [30] is used for the teacher network; it contains a total of 900 images, 720 for training and 180 for testing. For the student network, three datasets are used: CVPPP 2017 A1, A4 [31], and the Butterhead dataset that we created. The CVPPP 2017 A1 dataset has a total of 128 images, 100 for training and 28 for testing. The CVPPP 2017 A4 dataset has a total of 624 images: 459 for training and 165 for testing. Finally, the Butterhead dataset has a total of 198 images, 175 for training and 23 for testing. The datasets used for the student network contain fewer images than the dataset used for the teacher network. We conducted many experiments by placing four types of distillation at different locations in the network: feature distillation [32], attention distillation [24], prediction map distillation [21], and region affinity distillation [21].

5.2 Result on the CVPPP 2017 A1 Dataset

We evaluated the knowledge distillation and instance segmentation performance of the proposed student network on the A1 dataset of CVPPP 2017 [31], a leaf counting challenge dataset. Before training the student network, the teacher network was pre-trained for 1500 epochs on the Komatsuna dataset. To examine the effect of distillation, we conducted four experiments using the four types of distillation, performed on the encoder and decoder layers of the ERFNet. Table 1 reports the mIoU, a quantitative evaluation metric, and Fig. 10 shows the qualitative results. Fig. 10(a) is the ground truth image. Fig. 10(b) is the result without knowledge distillation. Fig. 10(c) is the result with attention distillation in the encoder and region affinity distillation in the decoder, which is the proposed method. Fig. 10(d) is the result with feature distillation in the encoder and region affinity distillation in the decoder. Fig. 10(e) adds prediction map distillation to the setting of Fig. 10(d). Fig. 10(f) uses region affinity distillation in the encoder and attention distillation in the decoder, the opposite of the proposed method.

Table 1. Performance comparison on the CVPPP 2017 A1 dataset [31]. Each method is denoted "encoder distillation - decoder distillation".

Method | mIoU
No knowledge distillation | 0.8521
Feature distillation - Region affinity distillation | 0.8584
Prediction map distillation + Feature distillation - Region affinity distillation | 0.8599
Region affinity distillation - Attention distillation | 0.8636
Attention distillation - Region affinity distillation (proposed) | 0.8683

First, with feature distillation in the encoder and region affinity distillation in the decoder, the mIoU increases compared to no knowledge distillation, but as Fig. 10(d) shows, a small leaf instance in the middle cannot be segmented. When prediction map distillation is added to this setting, the mIoU increases further and the leaf stems become a little clearer, as Fig. 10(e) shows. With region affinity distillation in the encoder and attention distillation in the decoder, the small leaf instance located in the center is segmented thanks to the influence of attention distillation, as shown in Fig. 10(f). The proposed method swaps the positions of the distillations in this last setting: all the leaves are well segmented, and stems and borders are more distinct, as shown in Fig. 10(c).

Fig. 6. Komatsuna dataset [30].
../../Resources/ieie/IEIESPC.2023.12.2.162/fig6.png
Fig. 7. CVPPP 2017 A1 dataset [31].
../../Resources/ieie/IEIESPC.2023.12.2.162/fig7.png
Fig. 8. CVPPP 2017 A4 dataset [31].
../../Resources/ieie/IEIESPC.2023.12.2.162/fig8.png
Fig. 9. Butterhead Dataset.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig9.png

5.3 Result on the CVPPP 2017 A4 Dataset

We also evaluated the knowledge distillation and instance segmentation performance of the proposed student network on the A4 dataset of CVPPP 2017 [31], a leaf counting challenge dataset. The experimental evaluation method is the same as in Section 5.2. Table 2 reports the mIoU, a quantitative evaluation metric, and Fig. 11 shows the qualitative results; the order of the result images is the same as in Section 5.2. The CVPPP 2017 A4 dataset is very similar to the CVPPP 2017 A1 dataset except for the background, and it shows similar results.

Fig. 10. Segmentation result comparison on the CVPPP 2017 A1 dataset [31]: (a) ground truth; (b) without knowledge distillation; (c) attention distillation in the encoder and region affinity distillation in the decoder (proposed); (d) feature distillation in the encoder and region affinity distillation in the decoder; (e) prediction map distillation and feature distillation in the encoder and region affinity distillation in the decoder; (f) region affinity distillation in the encoder and attention distillation in the decoder.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig10.png
Fig. 11. Segmentation result comparison on the CVPPP 2017 A4 dataset [31]: (a)-(f) are as described in Fig. 10.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig11.png
Fig. 12. Segmentation result comparison on the Butterhead dataset: (a)-(f) are as described in Fig. 10.
../../Resources/ieie/IEIESPC.2023.12.2.162/fig12.png
Table 2. Performance comparison on the CVPPP 2017 A4 dataset [31].

Method | mIoU
No knowledge distillation | 0.7812
Feature distillation - Region affinity distillation | 0.7885
Prediction map distillation + Feature distillation - Region affinity distillation | 0.8162
Region affinity distillation - Attention distillation | 0.8155
Attention distillation - Region affinity distillation (proposed) | 0.8208

5.4 Result on the Proposed Butterhead Dataset

Finally, we evaluated the knowledge distillation and instance segmentation performance of the proposed student network on the Butterhead dataset. The experimental evaluation method is the same as in Section 5.2. Table 3 reports the mIoU, a quantitative evaluation metric, and Fig. 12 shows the qualitative results; the order of the result images is the same as in Section 5.2.

Unlike the CVPPP 2017 datasets, the Butterhead dataset was produced and labeled by ourselves, so the experimental results differ slightly. The butterheads have no stems and many overlapping leaves, so performance in the overlapping areas is not good. Even so, the proposed method, with attention distillation in the encoder and region affinity distillation in the decoder, performs best. Our next research goal is to design a network that performs better on overlapping leaves.

Table 3. Performance comparison on the Butterhead dataset.

Method | mIoU
No knowledge distillation | 0.7468
Feature distillation - Region affinity distillation | 0.7570
Prediction map distillation + Feature distillation - Region affinity distillation | 0.7760
Region affinity distillation - Attention distillation | 0.7898
Attention distillation - Region affinity distillation (proposed) | 0.8287

6. Conclusion

We proposed a network that performs instance segmentation using knowledge distillation. For instance segmentation, spatial embedding is performed to point each pixel to the center of its instance, and pixels are then assigned to instances through clustering. In addition, we produced the Butterhead dataset ourselves, from butterhead cultivation through labeling. Experiments were conducted with various datasets and various combinations of knowledge distillation to evaluate the network. The experimental results demonstrate that knowledge distillation improves instance segmentation performance.

ACKNOWLEDGMENTS

This work is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A4A4079705).

REFERENCES

[1] Bell, Jonathan, and Hannah M. Dee. "Leaf segmentation through the classification of edges." arXiv preprint arXiv:1904.03124, 2019.
[2] Dobrescu, Andrei, Mario Valerio Giuffrida, and Sotirios A. Tsaftaris. "Leveraging multiple datasets for deep leaf counting." Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[3] Drees, Lukas, et al. "Temporal prediction and evaluation of Brassica growth in the field using conditional generative adversarial networks." Computers and Electronics in Agriculture 190 (2021): 106415.
[4] Aich, Shubhra, and Ian Stavness. "Leaf counting with deep convolutional and deconvolutional networks." Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[5] Kuznichov, Dmitry, et al. "Data augmentation for leaf segmentation and counting tasks in rosette plants." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[6] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28, 2015.
[7] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[8] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision, 2017.
[9] Bolya, Daniel, et al. "YOLACT: Real-time instance segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[10] Athar, Ali, et al. "STEm-Seg: Spatio-temporal embeddings for instance segmentation in videos." European Conference on Computer Vision. Springer, Cham, 2020.
[11] Liu, Rosanne, et al. "An intriguing failing of convolutional neural networks and the CoordConv solution." Advances in Neural Information Processing Systems 31, 2018.
[12] Stanley, Kenneth O., et al. "Designing neural networks through neuroevolution." Nature Machine Intelligence 1.1 (2019): 24-35.
[13] Tian, Zhi, Chunhua Shen, and Hao Chen. "Conditional convolutions for instance segmentation." European Conference on Computer Vision. Springer, Cham, 2020.
[14] Wang, Xinlong, et al. "SOLO: Segmenting objects by locations." European Conference on Computer Vision. Springer, Cham, 2020.
[15] Novotny, David, et al. "Semi-convolutional operators for instance segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[16] Abdar, Moloud, et al. "A review of uncertainty quantification in deep learning: Techniques, applications and challenges." Information Fusion 76 (2021): 243-297.
[17] Alom, Md Zahangir, et al. "A state-of-the-art survey on deep learning theory and architectures." Electronics 8.3 (2019): 292.
[18] Zheng, Zhedong, and Yi Yang. "Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation." International Journal of Computer Vision 129.4 (2021): 1106-1120.
[19] Ahn, Jiwoon, Sunghyun Cho, and Suha Kwak. "Weakly supervised learning of instance segmentation with inter-pixel relations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[20] Bolya, Daniel, et al. "YOLACT++: Better real-time instance segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[21] Qin, Dian, et al. "Efficient medical image segmentation based on knowledge distillation." IEEE Transactions on Medical Imaging 40.12 (2021): 3820-3831.
[22] Liu, Yifan, et al. "Structured knowledge distillation for semantic segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[23] Hong, Yu, Hang Dai, and Yong Ding. "Cross-modality knowledge distillation network for monocular 3D object detection." European Conference on Computer Vision. Springer, Cham, 2022.
[24] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[25] Uchiyama, Hideaki, et al. "An easy-to-setup 3D phenotyping platform for KOMATSUNA dataset." ICCV Workshop on Computer Vision Problems in Plant Phenotyping, pp. 2038-2045, 2017.
[26] Romera, Eduardo, et al. "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation." IEEE Transactions on Intelligent Transportation Systems 19.1 (2017): 263-272.
[27] Neven, Davy, et al. "Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[28] PyTorch. Accessed: Oct. 2016. [Online]. Available: http://pytorch.org
[29] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980, 2014.
[30] Uchiyama, Hideaki, et al. "An easy-to-setup 3D phenotyping platform for KOMATSUNA dataset." Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[31] Scharr, Hanno, Tony P. Pridmore, and Sotirios A. Tsaftaris. "Computer Vision Problems in Plant Phenotyping, CVPPP 2017: Introduction to the CVPPP 2017 Workshop Papers." Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[32] Suin, Maitreya, Kuldeep Purohit, and A. N. Rajagopalan. "Distillation-guided image inpainting." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

Author

Joo-Yeon Jung
../../Resources/ieie/IEIESPC.2023.12.2.162/au1.png

Joo-Yeon Jung received the B.S. degree from the Department of Computer Science, Sookmyung Women’s University, Seoul, South Korea, in 2021. She is currently pursuing the M.S. degree in electrical engineering from Korea University, Seoul, South Korea. Her current research interests include image processing, image prediction, and deep learning.

Sang-Ho Lee
../../Resources/ieie/IEIESPC.2023.12.2.162/au2.png

Sang-Ho Lee received the B.S. degree from the School of Electrical Engineering, Korea University, Seoul, South Korea, in 2015, where he is currently pursuing the integrated M.S. and Ph.D. degree in electrical engineering. His current research interests include image generation, color constancy, and visible light communication.

Jong-Ok Kim
../../Resources/ieie/IEIESPC.2023.12.2.162/au2.png

Jong-Ok Kim received the B.S. and M.S. degrees in electronic engineering from Korea University, Seoul, South Korea, in 1994 and 2000, respectively, and the Ph.D. degree in information networking from Osaka University, Osaka, Japan, in 2006. From 1995 to 1998, he served as an officer in the Korea Air Force. From 2000 to 2003, he was with SK Telecom Research and Development center and Mcubeworks INC., South Korea, where he was involved in research and development on mobile multimedia systems. From 2006 to 2009, he was a researcher in Advanced Telecommunication Research Institute International (ATR), Kyoto, Japan. He joined Korea University, Seoul, Korea in 2009, where he is currently a professor. His current research interests include image processing, computer vision and intelligent media systems. He was a recipient of Japanese Government Scholarship, from 2003-2006.