A Quantitative and Computational Efficiency Comparison of CNN and Vision Transformer Architectures for Pneumonia Detection from Chest X-rays
Keywords: Accuracy, Convolutional Neural Network, Interpretability, Pneumonia Detection, Vision Transformers, X-ray Imaging

Abstract
Accurate and efficient detection of pneumonia from chest X-ray images remains a critical challenge in medical imaging, especially in resource-constrained healthcare settings. This study presents a systematic comparison between a lightweight convolutional neural network (ResNet18) and a compact Vision Transformer (ViT-tiny/16) for binary classification of pneumonia and normal cases using the publicly available Kaggle Chest X-Ray dataset. The dataset was preprocessed through resizing, normalization, augmentation, and stratified splitting into training (70%), validation (15%), and test (15%) subsets. Both models were fine-tuned from ImageNet pretrained weights and evaluated using accuracy, precision, recall, F1-score, AUROC, training time per epoch, and parameter counts. The results demonstrated that ResNet18 achieved superior recall (94.7%), F1-score (94.2%), and AUROC (0.973)
while also training faster (22.5 s/epoch) with fewer parameters (11.7M). ViT-tiny achieved marginally higher precision (94.1%) but exhibited lower recall (89.2%) and greater computational demand (35.2 s/epoch, 21.7M parameters). Interpretability analyses revealed that CNN heatmaps localized pulmonary opacities consistent with radiological patterns, whereas ViT attention maps distributed focus more broadly, sometimes highlighting non-diagnostic regions. These findings suggest that while Vision Transformers hold promise, CNNs currently offer a more balanced trade-off between accuracy, efficiency, and interpretability in small-to-medium-scale medical imaging tasks. Future research should investigate hybrid CNN–ViT approaches, self-supervised pretraining, and multi-institutional validation to further enhance generalizability and clinical applicability.
