On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation


@ARTICLE{yoo2023role,
  author={Yoo, Hanju and Dai, Linglong and Kim, Songkuk and Chae, Chan-Byoung},
  journal={IEEE Access},
  title={On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation},
  year={2023},
  volume={11},
  pages={71528--71541},
  month={July},
  doi={10.1109/ACCESS.2023.3291405}
}

Hanju Yoo^†, Linglong Dai^‡, Songkuk Kim^†, and Chan-Byoung Chae^†

^† School of Integrated Technology, Yonsei University, Seoul 03722, Korea
^‡ Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

System architecture for ViT and CNN semantic communications — SemViT combines convolutional and Vision Transformer blocks inside an end-to-end image semantic communication autoencoder.

Abstract

Semantic communications have shown promising advancements by optimizing source and channel coding jointly. However, the dynamics of these systems remain understudied, limiting research and performance gains. Inspired by the robustness of Vision Transformers (ViTs) in handling image nuisances, we propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures, average cosine similarity and Fourier analysis, to analyze the inner workings of semantic communications and optimize the system’s performance. We also validate our approach through a real wireless channel prototype using software-defined radio (SDR). To the best of our knowledge, this is the first investigation of the fundamental workings of a semantic communications system, accompanied by the pioneering hardware implementation. To facilitate reproducibility and encourage further research, we provide open-source code, including neural network implementations and LabVIEW codes for SDR-based wireless transmission systems.

Why This Matters

At the time, many semantic communication papers showed performance gains without explaining the internal behavior of the learned encoder and decoder. This paper tries to open that black box. It asks which parts of the network behave like source coding, which parts behave like denoising, and why transformers may help image semantic communication.

The prototype matters for the same reason. A model that works only in AWGN simulation is not enough evidence for a wireless communication system. The SDR testbed checks whether the observed architecture advantage survives real channel effects such as gain mismatch, reflections, DAC quantization, and I/Q imbalance.

What This Paper Does

The paper starts from a CNN-based DeepJSCC baseline and selectively replaces middle layers with ViT blocks. The best model uses ViT blocks at the semantic bottleneck and early decoder side, matching the intuition that transformers help diversify compressed features and act as strong low-pass filters during reconstruction.

The analysis has two main tools. Average cosine similarity measures how diverse or redundant learned features are, while Fourier analysis shows whether each layer behaves more like a high-pass or low-pass transformation. Together they support a simple interpretation: encoders tend to extract and diversify high-frequency information, while decoders suppress channel noise and reconstruct image structure.
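The cosine-similarity measure can be illustrated with a minimal sketch. It assumes one plausible definition: the mean pairwise cosine similarity between feature vectors, excluding each vector's similarity with itself. The function name `average_cosine_similarity` and the `(n_vectors, dim)` input layout are illustrative choices, and the paper's exact definition may differ.

```python
import numpy as np


def average_cosine_similarity(features: np.ndarray, eps: float = 1e-12) -> float:
    """Mean pairwise cosine similarity between the rows of `features`.

    `features` has shape (n_vectors, dim). Values near 1 indicate redundant
    (nearly parallel) features; values near 0 indicate diverse features.
    This layout is an assumption made for illustration.
    """
    features = np.asarray(features, dtype=np.float64)
    n = features.shape[0]
    if features.ndim != 2 or n < 2:
        raise ValueError("expected a 2-D array with at least two rows")

    # Scale every row to unit length so the dot product equals the cosine.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)

    # Full similarity matrix, then drop the diagonal (self-similarity).
    similarity = unit @ unit.T
    off_diagonal_sum = similarity.sum() - np.trace(similarity)
    return float(off_diagonal_sum / (n * (n - 1)))


if __name__ == "__main__":
    redundant = np.tile([1.0, 2.0, 3.0], (4, 1))  # four identical vectors
    diverse = np.eye(4)                           # four orthogonal vectors
    print(average_cosine_similarity(redundant))
    print(average_cosine_similarity(diverse))
```

Under this definition, a layer whose outputs score lower is producing less redundant features, which is the sense in which the text above speaks of diversifying compressed features.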

+0.5 dB: PSNR gain over CNN variants reported in the paper.

31.8 dB: Representative PSNR for the selected C-C-V-V-C-C architecture.

6.24 GFLOPs: Computation of the selected architecture in the design table.

13.8M: Trainable parameters for SemViT.

Key Results

SemViT outperforms CNN-based DeepJSCC across AWGN, Rayleigh, and measured wireless conditions.
The PSNR gap grows in higher-SNR and higher-bandwidth-ratio regimes, where better source coding becomes more important.
ViT blocks diversify latent representations and show strong low-pass behavior in the decoder.
Real wireless measurements show better quality than CNN baselines, while also exposing an approximately 3 dB simulation-to-prototype gap caused by non-Gaussian hardware/channel errors.
The paper provides open-source neural and SDR code for reproducibility.
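For readers unfamiliar with the simulated conditions named in the results above, the following is a minimal sketch of an AWGN channel with optional slow Rayleigh fading. It assumes unit-power normalization of the transmitted symbols and a single fading coefficient per block; the function name `simulate_channel` and these modeling choices are illustrative and are not taken from the paper's code.

```python
import numpy as np


def simulate_channel(symbols, snr_db, rayleigh=False, rng=None):
    """Pass complex symbols through an AWGN (optionally Rayleigh) channel.

    Returns (x, y, h): the power-normalized input, the received signal,
    and the channel coefficient (1.0 for pure AWGN). Assumed model:
    y = h * x + n, with n ~ CN(0, 10^(-snr_db / 10)).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(symbols, dtype=np.complex128)

    # Normalize to unit average power so the SNR is set by noise power alone.
    power = np.mean(np.abs(x) ** 2)
    if power == 0:
        raise ValueError("cannot normalize an all-zero signal")
    x = x / np.sqrt(power)

    # One complex Gaussian coefficient for the whole block (slow fading).
    if rayleigh:
        h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    else:
        h = 1.0 + 0.0j

    # Complex noise: half of the noise power in each of the I and Q parts.
    noise_power = 10.0 ** (-snr_db / 10.0)
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    )
    return x, h * x + noise, h
```

A model like this captures only Gaussian noise and flat fading. The hardware effects listed earlier (gain mismatch, reflections, DAC quantization, I/Q imbalance) are not represented, which is consistent with the simulation-to-prototype gap noted in the results.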

PSNR comparison for ViT and CNN semantic communication models — SemViT improves reconstructed image quality over CNN-based DeepJSCC across channel SNR and bandwidth ratio settings.
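As a reminder of what the PSNR axis measures, here is a minimal sketch of the standard PSNR formula, 10·log10(MAX² / MSE). The function name `psnr` and the default peak value of 255 (8-bit images) are assumptions for illustration; the paper's evaluation code may normalize images differently.

```python
import numpy as np


def psnr(reference, reconstructed, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("images must have the same shape")

    # Mean squared error over all pixels and channels.
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is logarithmic in the mean squared error, a gain of +0.5 dB at a fixed peak value corresponds to multiplying the MSE by 10^(-0.05) ≈ 0.89, that is, roughly 11% less reconstruction error.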

Cosine similarity analysis for ViT and CNN semantic communication models — Cosine similarity reveals how the encoder learns less redundant features as source-coding pressure increases.

Fourier analysis for ViT and CNN semantic communication models — Fourier analysis supports the interpretation that encoders and decoders play different filtering roles.
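One simple way to quantify whether a layer acts as a low-pass or high-pass filter is to compare the fraction of spectral energy at high spatial frequencies before and after the layer. The sketch below does this for a single 2-D feature map. The function name `high_frequency_energy_ratio`, the radial cutoff, and the energy-ratio metric itself are assumptions for illustration and are not necessarily the measure plotted in the paper.

```python
import numpy as np


def high_frequency_energy_ratio(feature_map, cutoff: float = 0.5) -> float:
    """Fraction of spectral energy above `cutoff` times the Nyquist frequency.

    `feature_map` is a 2-D array (height, width). A layer that lowers this
    ratio behaves like a low-pass filter; one that raises it behaves like
    a high-pass filter.
    """
    fmap = np.asarray(feature_map, dtype=np.float64)
    if fmap.ndim != 2:
        raise ValueError("expected a 2-D feature map")
    h, w = fmap.shape

    # 2-D spectrum with the zero frequency shifted to the center.
    energy = np.abs(np.fft.fftshift(np.fft.fft2(fmap))) ** 2

    # Frequency grids in cycles per sample, shifted to match the spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    # Radial frequency, scaled so that 1.0 is the Nyquist rate along one axis.
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2) / 0.5

    total = energy.sum()
    if total == 0:
        return 0.0
    return float(energy[radius > cutoff].sum() / total)
```

Applied to the input and output of each layer and averaged over channels and images, a metric of this kind gives a per-layer picture of the filtering roles described above: encoder layers that raise the ratio emphasize high-frequency detail, while decoder layers that lower it suppress noise-like components.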