Acta Scientiarum Polonorum Technologia Alimentaria

ISSN:1644-0730, e-ISSN:1898-9594

Creative Commons licence CC BY
Original article, Issue 24 (4) 2025, pp. 489-511

Maolan Lin1, Zhenchang Gao2, Wenliang Liao1, Honghao Cai1

1Department of Physics, School of Science, Jimei University, China
2School of Information Science and Technology, ShanghaiTech University, Shanghai, China

A systematic review of transformer-based vision models for object detection in food and agriculture

Abstract

Computer vision has become a cornerstone technology in the food and agriculture industries, driving innovation and enabling automation across a wide range of processes. Within this field, object detection plays a critical role, supporting efficiency, accuracy, and scalability in real-world applications. The Transformer, first introduced in natural language processing, demonstrated outstanding performance thanks to its powerful self-attention mechanism and parallel processing capabilities. More recently, it has been rapidly adopted in object detection and is emerging as a strong alternative to traditional convolutional neural networks. However, much of the related research remains scattered and interdisciplinary. This paper systematically reviews the development of transformer-based models for computer vision, analysing research trends, key topics, and distinctions from other algorithms. It introduces the basic architecture of the Vision Transformer (ViT) and other transformer-based vision models, explains core principles such as self-attention and multi-stage processing, and examines applications in food and agriculture, including food quality analysis, crop monitoring, pest and disease detection, and weed identification. Challenges and future directions of transformer-based models are also discussed, alongside a review of the latest research for reference. By consolidating a large body of literature, this study provides a comprehensive overview of the structure, development, advantages, and limitations of transformer-based vision models, while highlighting their potential to deliver more intelligent, sustainable, and efficient decision-support systems for precision food and farming practices.

Keywords: deep learning, convolutional neural network, computer vision, intelligent agriculture, self-attention mechanism, food quality
Full text available in English in Adobe Acrobat (PDF) format:
https://www.food.actapol.net/volume24/issue4/4_4_2025.pdf

https://doi.org/10.17306/J.AFS.001425

For citation:

MLA: Lin, Maolan, et al. "A systematic review of transformer-based vision models for object detection in food and agriculture." Acta Sci. Pol. Technol. Aliment. 24.4 (2025): 489-511. https://doi.org/10.17306/J.AFS.001425
APA: Lin M., Gao Z., Liao W., Cai H. (2025). A systematic review of transformer-based vision models for object detection in food and agriculture. Acta Sci. Pol. Technol. Aliment. 24 (4), 489-511. https://doi.org/10.17306/J.AFS.001425
ISO 690: LIN, Maolan, et al. A systematic review of transformer-based vision models for object detection in food and agriculture. Acta Sci. Pol. Technol. Aliment., 2025, 24.4: 489-511. https://doi.org/10.17306/J.AFS.001425