Methods of Image Recognition and Processing Using The Vision Transformer

Keywords: vision transformer, multi-label images, long-range dependency modeling, convolutional neural networks, spatial relations, context-aware mechanisms

Abstract

This study uses the pure Vision Transformer (ViT) as the foundational framework for research in image recognition and processing. The Transformer architecture is chosen for its ability to model long-range dependencies, which overcomes a limitation of Convolutional Neural Networks (CNNs): their local receptive fields. Although ViT captures global information effectively, relying on global information alone is suboptimal for multi-label images, which contain multiple objects that differ in category, scale, and spatial relation. The study therefore investigates strategies that augment the ViT model with additional mechanisms for incorporating contextual information relevant to multi-label images. The objective is to improve the model's ability to recognize objects with diverse attributes, sizes, and spatial arrangements. By making the case for context-aware mechanisms that complement ViT, the study aims to contribute to the development of more robust and versatile image recognition and processing methods in computer vision.
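The two ideas the abstract contrasts, global attention over image patches and multi-label output, can be illustrated with a minimal sketch. This is not the method of the paper. It assumes a toy 8x8 single-channel image, identity query/key/value projections in place of learned weights, mean pooling of tokens, and random class weights. All function names are hypothetical and chosen only for illustration.

```python
import math
import random


def image_to_patches(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a real ViT learns these projections.
    """
    d = len(tokens[0])
    weights, outputs = [], []
    for q in tokens:
        # Every token is scored against every other token: a global receptive field.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[i] for w, v in zip(row, tokens)) for i in range(d)])
    return outputs, weights


def multi_label_scores(feature, class_weights):
    """Independent sigmoid per class, so several labels can be active at once."""
    return [1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(feature, ws))))
            for ws in class_weights]


random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
patches = image_to_patches(image, patch=4)        # 4 patches, 16 values each
attended, attn = self_attention(patches)
pooled = [sum(col) / len(attended) for col in zip(*attended)]  # mean-pool tokens
class_weights = [[random.uniform(-1, 1) for _ in range(len(pooled))]
                 for _ in range(3)]               # 3 hypothetical classes
scores = multi_label_scores(pooled, class_weights)
```

Each row of `attn` assigns a non-zero weight to every patch, which is the long-range modeling that a convolution with a small kernel lacks. The per-class sigmoid scores are not forced to sum to one, unlike a softmax classifier, which is why this kind of output suits images containing several objects. The sketch pools all tokens into one global feature, so it also shows the limitation the abstract describes: information about where each object sits is averaged away before classification.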

Published
2024-03-28
How to Cite
Nedashkivskyi, B. (2024). Methods of Image Recognition and Processing Using The Vision Transformer. COMPUTER-INTEGRATED TECHNOLOGIES: EDUCATION, SCIENCE, PRODUCTION, (54), 146-152. https://doi.org/10.36910/6775-2524-0560-2024-54-17
Section
Computer science and computer engineering