The increasing global prevalence of monkeypox (mpox) has created a need for accurate, efficient, and interpretable diagnostic models for timely disease identification. Although deep learning has advanced medical image analysis, diagnosing diseases with complex lesions, such as mpox, remains challenging in resource-limited settings: existing methods often rely on large architectures that demand substantial computational resources and lack interpretability. To address these issues, we propose a lightweight, interpretable architecture that truncates MobileNetV2 for efficient feature extraction, incorporates Ghost Modules to reduce feature-map redundancy, and integrates an SA-MobileViT block to capture both local and global dependencies in lesion images. Grad-CAM and saliency-map visualizations improve interpretability and clinical trust by highlighting the lesion regions that drive predictions, and Optuna is used for hyperparameter optimization. Our model achieves an accuracy of 98.72%, precision of 98.76%, recall of 98.72%, and F1-score of 98.72%, outperforming existing approaches, particularly on irregular lesion patterns. The model is lightweight, with only 630,204 parameters, making it 92.3x smaller than ResNet152V2 (58,152,004 parameters) and 28.7x smaller than DenseNet201 (18,100,612 parameters). It requires 321.61M FLOPs, 35.8x fewer than ResNet152V2 (11.51G FLOPs) and 13.4x fewer than DenseNet201 (4.29G FLOPs). At just 72.92 MB, it is 8.14x smaller than ResNet152V2 (594.08 MB) and 4.57x smaller than DenseNet201 (333.47 MB), making it well suited to mobile and edge deployment in resource-constrained environments. © 2025 Elsevier B.V. All rights reserved.
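To make the redundancy-reduction idea concrete, the following is a minimal PyTorch sketch of a Ghost Module in the style of GhostNet, which the abstract names as a building block: a primary convolution produces a small set of intrinsic feature maps, and a cheap depthwise convolution derives the remaining "ghost" maps from them. The channel sizes and kernel choices here are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost Module: half the output channels come from a
    regular convolution, the rest from a cheap depthwise convolution."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio      # "intrinsic" feature maps
        ghost_ch = out_ch - init_ch    # "ghost" maps from cheap ops
        # Primary convolution produces the intrinsic features.
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        # Cheap depthwise convolution derives ghost features from them.
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)

# Example: expand 32 channels to 64 on a 56x56 feature map.
m = GhostModule(32, 64)
y = m(torch.randn(1, 32, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because the ghost half of the output uses a grouped (depthwise) convolution, this layer needs far fewer multiply-accumulates than a full convolution with the same output width, which is how such modules help reach the parameter and FLOP budgets reported above.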