Reconstruction of Partial Facades



Project Timeline & Milestones

Timeframe | Task | Completion
Week 4
  • Understanding the DiffPMAE paper
Week 5
  • Customising the MAE model for 3D input
Week 6
  • Understanding the DiffPMAE model from its GitHub repository
Week 7
  • Setting up the environment for the DiffPMAE model
Week 8
Week 9
Week 10
Week 11
Week 12
Week 13
Week 14

Introduction

Motivation

Venice's facades represent a remarkable heritage of artistic and architectural ingenuity, reflecting centuries of cultural evolution. However, despite advancements in digital documentation, many scanned images of these facades are incomplete or improperly captured, leading to gaps in their visual representation. This limits the potential for accurate digital analysis, visualization, and preservation of these iconic structures.

To address this challenge, this project explores the application of different models for the reconstruction of incomplete facade images. First, we implemented a Masked Autoencoder (MAE). MAEs are powerful tools for self-supervised learning that reconstruct missing portions of data by leveraging patterns learned from complete examples. By training the model on a dataset of complete Venetian facade images, we aim to develop a system capable of accurately filling in the missing regions of improperly scanned images. The second model we tried to implement was an NMF,...


Methodology

This project is inspired by the paper "Masked Autoencoders Are Scalable Vision Learners" by He et al. from Facebook AI Research (FAIR). The core methodology revolves around leveraging MAEs for scalable and efficient learning in visual domains. Below, we detail the methodology used in this project.


Overview of the Approach

The MAE architecture is designed to reconstruct missing parts of an image, enabling effective self-supervised pretraining of Vision Transformers (ViTs). The central idea is to mask a substantial portion of input image patches and train the model to reconstruct the original image using the remaining visible patches.

For this project, two types of MAEs were implemented:

  • Custom MAE: trained from scratch, allowing flexibility in input size, masking strategies, and hyperparameters.
  • Pretrained MAE: a pretrained MAE, fine-tuned for our specific task.

Below, we detail the methodology used for the custom MAE trained from scratch.

Custom MAE Methodology

Data Preprocessing

Images were resized to a fixed resolution (e.g., 256×256) and normalized to pixel values in the range [-1, 1]. Each input image was then divided into patches of size 4×4, resulting in a 64×64 grid of patches per image.
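A minimal preprocessing sketch, assuming a PyTorch/torchvision pipeline, is shown below; the helper names (facade_transform, patchify) and the 0.5 mean/std normalization used to reach [-1, 1] are illustrative choices rather than the project's exact code.

import torch
from torchvision import transforms

IMG_SIZE = 256    # fixed input resolution
PATCH_SIZE = 4    # 4x4 patches -> a 64x64 grid per image

facade_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),                      # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # maps [0, 1] to [-1, 1]
                         std=[0.5, 0.5, 0.5]),
])

def patchify(imgs: torch.Tensor, patch_size: int = PATCH_SIZE) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches of shape (B, N, patch_size*patch_size*C)."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)             # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size * patch_size * C)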

Model Architecture

Encoder: The encoder takes visible (unmasked) patches as input and processes them using a Vision Transformer (ViT)-based architecture. Positional embeddings are added to the patch embeddings to retain spatial information. The encoder produces a latent representation for the visible patches.

Decoder: The decoder takes both the encoded representations of visible patches and learnable masked tokens as input. It reconstructs the image by predicting pixel-level details for the masked patches.
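The skeleton below illustrates this encoder/decoder split. It is a hedged sketch that uses PyTorch's nn.TransformerEncoder as a stand-in for the ViT blocks; the dimensions, layer counts, and the class name MAESketch are illustrative assumptions, not the project's actual implementation.

import torch
import torch.nn as nn

class MAESketch(nn.Module):
    def __init__(self, patch_dim=4 * 4 * 3, num_patches=64 * 64,
                 enc_dim=256, dec_dim=128, enc_layers=6, dec_layers=2, heads=8):
        super().__init__()
        # Encoder: embeds only the visible patches, with positional embeddings.
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.enc_pos = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, heads, batch_first=True), enc_layers)
        # Decoder: sees encoded visible tokens plus a shared learnable mask token.
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, heads, batch_first=True), dec_layers)
        self.head = nn.Linear(dec_dim, patch_dim)  # pixel-level prediction per patch

    def forward(self, patches, visible_idx):
        # patches: (B, N, patch_dim); visible_idx: (B, N_visible) indices of unmasked patches.
        B = patches.size(0)
        vis = self.patch_embed(patches) + self.enc_pos
        vis = torch.gather(vis, 1, visible_idx.unsqueeze(-1).expand(-1, -1, vis.size(-1)))
        latent = self.encoder(vis)                 # latent codes for visible patches only
        # Re-insert mask tokens at the masked positions, add positions, and decode.
        dec_vis = self.enc_to_dec(latent)
        tokens = self.mask_token.expand(B, self.dec_pos.size(1), -1).clone()
        tokens.scatter_(1, visible_idx.unsqueeze(-1).expand(-1, -1, dec_vis.size(-1)), dec_vis)
        tokens = tokens + self.dec_pos
        return self.head(self.decoder(tokens))     # (B, N, patch_dim) reconstruction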

Masking Strategy

Inspired by the FAIR paper, the majority of patches (e.g., 75%) are masked during training. Two masking strategies were explored:

  • Random Masking: randomly selecting patches to mask.
  • Block Masking: masking contiguous blocks of patches to simulate occlusion, which better represents our incomplete facades.

The masking ratio and strategy were crucial hyperparameters, and experiments were conducted to analyze their impact on model performance.
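The sketch below illustrates the two strategies on a 64×64 patch grid; the function names and the single square-block heuristic are assumptions made for illustration, not the exact implementation used in the project.

import torch

def random_mask(num_patches: int = 64 * 64, mask_ratio: float = 0.75) -> torch.Tensor:
    """Boolean mask of shape (num_patches,), True = masked; patches are chosen uniformly at random."""
    num_masked = int(num_patches * mask_ratio)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:num_masked]] = True
    return mask

def block_mask(grid: int = 64, mask_ratio: float = 0.75) -> torch.Tensor:
    """Mask one contiguous square block covering roughly `mask_ratio` of the patch grid."""
    side = int(grid * mask_ratio ** 0.5)          # block side chosen so its area ~ mask_ratio
    top = torch.randint(0, grid - side + 1, (1,)).item()
    left = torch.randint(0, grid - side + 1, (1,)).item()
    mask2d = torch.zeros(grid, grid, dtype=torch.bool)
    mask2d[top:top + side, left:left + side] = True
    return mask2d.flatten()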

Loss Function

The reconstruction task is framed as a pixel-wise regression problem. The primary loss function is the Mean Squared Error (MSE) between the reconstructed and original images. A Structural Similarity Index Measure (SSIM) loss was also incorporated to improve perceptual quality by emphasizing structural similarities in the reconstructions.
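A hedged sketch of the combined objective follows; it assumes the third-party pytorch_msssim package for the SSIM term, and the weighting factor alpha is an illustrative value rather than the project's tuned setting.

import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def reconstruction_loss(pred, target, alpha: float = 0.2):
    """MSE between reconstructed and original images plus an SSIM-based term.

    pred and target are (B, C, H, W) images in [-1, 1]; they are rescaled to
    [0, 1] so that SSIM's data_range assumption holds. alpha is illustrative.
    """
    mse = F.mse_loss(pred, target)
    pred01, target01 = (pred + 1) / 2, (target + 1) / 2
    ssim_val = ssim(pred01, target01, data_range=1.0)
    return mse + alpha * (1.0 - ssim_val)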

Training and Optimization

The model was trained using the AdamW optimizer, with a learning rate scaled based on batch size and a cosine decay scheduler for gradual reduction of learning rates. A warm-up phase was incorporated to stabilize training during the initial epochs. Gradient clipping was applied to ensure numerical stability.
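The snippet below sketches this setup in PyTorch; the base learning rate, the divide-by-256 batch-size scaling rule, and the schedule lengths are illustrative values rather than the project's exact settings.

import math
import torch

def build_optimizer_and_scheduler(model, batch_size=64, epochs=100, warmup_epochs=10):
    base_lr = 1.5e-4
    lr = base_lr * batch_size / 256                  # learning rate scaled with batch size
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                    # linear warm-up over the first epochs
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay afterwards

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step,
# and the scheduler is advanced once per epoch:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); optimizer.zero_grad()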

Evaluation Metrics

Performance was evaluated based on reconstruction quality (MSE loss) and the visual fidelity of the reconstructed images. Qualitative evaluation was conducted by visualizing reconstructed images side by side with the originals.
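As an illustration, an evaluation loop might look like the following; it assumes the model returns full reconstructed images and that a matplotlib grid is used for the side-by-side comparison, which is a simplification of the actual pipeline.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def evaluate(model, loader, device="cpu", num_examples=4):
    """Report average per-pixel MSE and show originals next to reconstructions."""
    model.eval()
    total_sq_err, count, sample = 0.0, 0, None
    for imgs in loader:                              # loader yields (B, C, H, W) images in [-1, 1]
        imgs = imgs.to(device)
        recon = model(imgs)
        total_sq_err += F.mse_loss(recon, imgs, reduction="sum").item()
        count += imgs.numel()
        if sample is None:
            sample = (imgs[:num_examples].cpu(), recon[:num_examples].cpu())
    print(f"validation MSE: {total_sq_err / count:.6f}")

    # Qualitative check: originals on the top row, reconstructions below.
    orig, recon = sample
    n = orig.size(0)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6), squeeze=False)
    for i in range(n):
        axes[0, i].imshow(((orig[i] + 1) / 2).permute(1, 2, 0).numpy())
        axes[1, i].imshow(((recon[i] + 1) / 2).clamp(0, 1).permute(1, 2, 0).numpy())
        axes[0, i].axis("off")
        axes[1, i].axis("off")
    plt.show()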

Results

Conclusion

Appendix

References