Vision Transformer with Fractal Dimension Transformation: Effects of Resolution and Patch Size
Abstract
Vision Transformer (ViT) achieves strong performance in computer vision but requires substantial computational resources, particularly with high-resolution data. A key challenge lies in the quadratic complexity of self-attention with respect to the number of image patches, which is jointly determined by input size and patch size. Conventional resizing is a common strategy for reducing resolution, and thus the number of patches, but it risks discarding structural details that may be important for prediction. To address this issue, this study investigates how input size, patch size, and dimensionality reduction influence ViT training time and prediction accuracy. Using the NIH Chest X-ray dataset, we compared two preprocessing methods: conventional resizing and a Fractal Dimension (FD)-based transformation. Results show that the FD-based method consistently reduced training time across all settings, demonstrating its effectiveness in lowering computational cost. In terms of accuracy, conventional resizing performed slightly better overall, but the differences were not uniform: smaller patches improved AUROC mainly at higher resolutions and not consistently at lower ones. These findings highlight a trade-off between efficiency and accuracy, positioning FD-based representations as a practical complement to conventional resizing when computational resources are limited.
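The quadratic scaling described in the abstract can be made concrete with a short sketch. The image sizes and patch sizes below are illustrative values, not the study's actual experimental configurations:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches for a square image of side image_size."""
    assert image_size % patch_size == 0, "patch size must divide image size"
    side = image_size // patch_size
    return side * side

def attention_cost(image_size: int, patch_size: int) -> int:
    """Pairwise token interactions per self-attention layer: O(N^2) in the patch count N."""
    n = num_patches(image_size, patch_size)
    return n * n

# Halving the input resolution (whether by conventional resizing or an
# FD-based transform) cuts the patch count by 4x and the attention cost by 16x,
# while halving the patch size does the opposite.
for image_size in (224, 448):
    for patch_size in (16, 32):
        n = num_patches(image_size, patch_size)
        print(f"{image_size}px, patch {patch_size}: N={n}, N^2={attention_cost(image_size, patch_size)}")
```

For example, a 224-pixel image with 16-pixel patches yields N = 196 tokens and 196² = 38,416 pairwise interactions per attention layer, which is why reducing resolution before patch embedding pays off so sharply in training time.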