A Quantitative and Computational Efficiency Comparison of CNN and Vision Transformer Architectures for Pneumonia Detection from Chest X-rays
Keywords: Accuracy, Convolutional Neural Network, Interpretability, Pneumonia Detection, Vision Transformers, X-ray Imaging

Abstract
Accurate and efficient detection of pneumonia from chest X-ray images remains a critical challenge in medical imaging, especially in resource-constrained healthcare settings. This study presents a systematic comparison between a lightweight convolutional neural network (ResNet18) and a compact Vision Transformer (ViT-tiny/16) for binary classification of pneumonia and normal cases using the publicly available Kaggle Chest X-Ray dataset. The dataset was preprocessed through resizing, normalization, augmentation, and stratified splitting into training (70%), validation (15%), and test (15%) subsets. Both models were fine-tuned from ImageNet pretrained weights and evaluated using accuracy, precision, recall, F1-score, AUROC, training time per epoch, and parameter counts. The results demonstrated that ResNet18 achieved superior recall (94.7%), F1-score (94.2%), and AUROC (0.973)
while also training faster (22.5 s/epoch) with fewer parameters (11.7M). ViT-tiny achieved marginally higher precision (94.1%) but exhibited lower recall (89.2%) and greater computational demand (35.2 s/epoch, 21.7M parameters). Interpretability analyses revealed that CNN heatmaps localized pulmonary opacities consistent with radiological patterns, whereas ViT attention maps distributed focus more broadly, sometimes highlighting non-diagnostic regions. These findings suggest that while Vision Transformers hold promise, CNNs currently offer a more balanced trade-off between accuracy, efficiency, and interpretability in small-to-medium-scale medical imaging tasks. Future research should investigate hybrid CNN–ViT approaches, self-supervised pretraining, and multi-institutional validation to further enhance generalizability and clinical applicability.
