Reconstruction of Partial Facades
Project Timeline & Milestones
Timeframe | Task | Completion |
---|---|---|
Week 4 | | |
Week 5 | | |
Week 6 | | |
Week 7 | | |
Week 8 | | |
Week 9 | | |
Week 10 | | |
Week 11 | | |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Introduction
Motivation
Venice's facades represent a remarkable heritage of artistic and architectural ingenuity, reflecting centuries of cultural evolution. However, despite advancements in digital documentation, many scanned images of these facades are incomplete or improperly captured, leading to gaps in their visual representation. This limits the potential for accurate digital analysis, visualization, and preservation of these iconic structures.
To address this challenge, this project explores different models for reconstructing incomplete facade images. Our first approach was a Masked Autoencoder (MAE). MAEs are powerful tools for self-supervised learning: they reconstruct missing portions of data by leveraging patterns learned from complete examples. By training the model on a dataset of complete Venetian facade images, we aim to develop a system capable of accurately filling in the missing regions of improperly scanned images. The second model we explored was a Non-negative Matrix Factorization (NMF),...
Methodology
This project is inspired by the paper "Masked Autoencoders Are Scalable Vision Learners" by He et al. from Facebook AI Research (FAIR). The MAE architecture is designed to reconstruct missing parts of an image, enabling effective self-supervised pretraining of Vision Transformers (ViTs). The central idea is to mask a substantial portion of the input image patches and train the model to reconstruct the original image from the remaining visible patches.
For this project, two types of MAEs were implemented:
1) Custom MAE: Trained from scratch, allowing flexibility in input size, masking strategies, and hyperparameters.
2) Pretrained MAE: An existing pretrained MAE, fine-tuned for our specific task.
Custom MAE
Data Preprocessing
Images were resized to a fixed resolution (e.g., 256×256) and normalized to pixel values in the range [-1, 1]. Each image was then divided into 4×4 patches, yielding a 64×64 grid of patches per image.
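As a concrete illustration, a preprocessing pipeline along these lines could be written in PyTorch; the exact transforms and the `patchify` helper below are a sketch under the stated sizes (256×256 images, 4×4 patches), not the project's verbatim code.

```python
import torch
from torchvision import transforms

# Resize to 256x256 and map pixel values to [-1, 1]
# (Normalize with mean 0.5 / std 0.5 sends [0, 1] to [-1, 1]).
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def patchify(imgs: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened patches
    (B, N, patch_size * patch_size * C), with N = (H/p) * (W/p)."""
    b, c, h, w = imgs.shape
    p = patch_size
    x = imgs.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)          # (B, H/p, W/p, p, p, C)
    return x.reshape(b, (h // p) * (w // p), p * p * c)

# A 256x256 image with 4x4 patches gives a 64x64 grid, i.e. 4096 patches.
```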
Model Architecture
Encoder: The encoder takes visible (unmasked) patches as input and processes them using a Vision Transformer (ViT)-based architecture. Positional embeddings are added to the patch embeddings to retain spatial information. The encoder produces a latent representation for the visible patches.
Decoder: The decoder takes both the encoded representations of visible patches and learnable masked tokens as input. It reconstructs the image by predicting pixel-level details for the masked patches.
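A minimal sketch of this encoder/decoder split is given below; the embedding dimensions, head counts, and layer depths are illustrative assumptions rather than the project's actual configuration.

```python
import torch
import torch.nn as nn

class MAEModel(nn.Module):
    """Minimal MAE-style encoder/decoder (illustrative sizes)."""

    def __init__(self, num_patches=64 * 64, patch_dim=4 * 4 * 3,
                 enc_dim=192, dec_dim=128, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        # Learnable positional embeddings, one per patch slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)

        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        # One shared learnable token stands in for every masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=max(1, depth // 2))
        self.pred_head = nn.Linear(dec_dim, patch_dim)   # pixel-level prediction

    def forward(self, patches, keep_idx):
        # patches: (B, N, patch_dim); keep_idx: (B, N_vis) visible-patch indices.
        b, n, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed
        vis = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(vis)            # encoder sees visible patches only

        # Scatter encoded patches back into place; fill the rest with mask tokens.
        dec_in = self.mask_token.expand(b, n, -1).clone()
        vis_dec = self.enc_to_dec(latent)
        dec_in.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, vis_dec.size(-1)),
                        vis_dec)
        dec_in = dec_in + self.dec_pos_embed
        return self.pred_head(self.decoder(dec_in))      # (B, N, patch_dim)
```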
Masking Strategy
Inspired by the FAIR paper, the majority of patches (e.g., 75%) are masked during training. Two masking strategies were explored.
1) Random Masking: patches to mask are selected uniformly at random across the image.
2) Block Masking: contiguous blocks of patches are masked to simulate occlusion, which better matches our incomplete facades.
The masking ratio and strategy were crucial hyperparameters, and experiments were conducted to analyze their impact on model performance.
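A sketch of both strategies follows. The shuffle-based random masking mirrors the FAIR paper; the single-block variant of block masking is one plausible implementation, not necessarily the exact one used here.

```python
import torch

def random_masking(num_patches: int, mask_ratio: float, batch: int):
    """Per-sample random masking via noise sorting (as in the MAE paper).
    Returns keep_idx (B, N_vis) and a binary mask (B, N) with 1 = masked."""
    n_vis = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)
    shuffle = torch.argsort(noise, dim=1)       # random permutation per sample
    keep_idx = shuffle[:, :n_vis]
    mask = torch.ones(batch, num_patches)
    mask.scatter_(1, keep_idx, 0.0)             # visible patches -> 0
    return keep_idx, mask

def block_masking(grid: int, block: int, batch: int):
    """Mask one contiguous block x block region of the patch grid per sample,
    simulating the occlusions seen in partial facade scans."""
    mask = torch.zeros(batch, grid, grid)
    for b in range(batch):
        top = torch.randint(0, grid - block + 1, (1,)).item()
        left = torch.randint(0, grid - block + 1, (1,)).item()
        mask[b, top:top + block, left:left + block] = 1.0
    mask = mask.flatten(1)                       # (B, N) with N = grid * grid
    keep_idx = (mask == 0).nonzero()[:, 1].reshape(batch, -1)
    return keep_idx, mask
```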
Loss Function
The reconstruction task is framed as a pixel-wise regression problem. The primary loss function is the Mean Squared Error (MSE) between the reconstructed and original images. A Structural Similarity Index Measure (SSIM) loss was later incorporated to improve perceptual quality by rewarding structural similarity in the reconstructions.
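The sketch below combines the two terms, computing MSE only over masked patches as in the FAIR paper. The SSIM term relies on the third-party pytorch_msssim package, and the 0.8/0.2 weighting is an illustrative assumption.

```python
import torch
from pytorch_msssim import ssim  # third-party package, assumed available

def mae_loss(pred, target, mask, pred_img=None, orig_img=None, ssim_weight=0.2):
    """pred/target: (B, N, patch_dim); mask: (B, N) with 1 = masked.
    The MSE term is averaged over masked patches only, as in the FAIR paper."""
    mse = ((pred - target) ** 2).mean(dim=-1)        # per-patch squared error
    loss = (mse * mask).sum() / mask.sum()
    if pred_img is not None and orig_img is not None:
        # pytorch_msssim expects image tensors; shift [-1, 1] inputs to [0, 1].
        s = ssim((pred_img + 1) / 2, (orig_img + 1) / 2, data_range=1.0)
        loss = (1 - ssim_weight) * loss + ssim_weight * (1 - s)
    return loss
```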
Training and Optimization
The model was trained using the AdamW optimizer, with a learning rate scaled based on batch size and a cosine decay scheduler for gradual reduction of learning rates. A warm-up phase was incorporated to stabilize training during the initial epochs. Gradient clipping was applied to ensure numerical stability.
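A minimal sketch of such a setup, assuming illustrative values for the base learning rate, batch size, warm-up length, and clipping threshold:

```python
import math
import torch

model = MAEModel()                       # from the architecture sketch above
base_lr, batch_size = 1.5e-4, 64         # illustrative values
lr = base_lr * batch_size / 256          # linear scaling of lr with batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)

warmup_epochs, total_epochs = 10, 200

def lr_lambda(epoch):
    """Linear warm-up followed by cosine decay to zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); optimizer.zero_grad(); scheduler.step()
```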
Evaluation Metrics
Performance was evaluated on reconstruction quality (MSE loss) and the visual fidelity of the outputs; qualitative evaluation compared reconstructed images side by side with the originals.
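For the qualitative side, a comparison figure could be produced along these lines, reusing the hypothetical `patchify`, `random_masking`, and `MAEModel` helpers sketched above (`unpatchify` is the inverse of `patchify`):

```python
import matplotlib.pyplot as plt
import torch

def unpatchify(x, patch_size=4, grid=64, channels=3):
    """Inverse of patchify: (B, N, p*p*C) -> (B, C, H, W)."""
    b, p = x.shape[0], patch_size
    x = x.reshape(b, grid, grid, p, p, channels)
    x = x.permute(0, 5, 1, 3, 2, 4)                  # (B, C, H/p, p, W/p, p)
    return x.reshape(b, channels, grid * p, grid * p)

@torch.no_grad()
def show_reconstruction(model, img):
    """Plot original vs. reconstruction for one (3, 256, 256) image."""
    patches = patchify(img.unsqueeze(0))             # (1, N, patch_dim)
    keep_idx, mask = random_masking(num_patches=64 * 64, mask_ratio=0.75, batch=1)
    pred = model(patches, keep_idx)
    # Keep original pixels for visible patches, predictions for masked ones.
    merged = torch.where(mask.unsqueeze(-1).bool(), pred, patches)
    recon = unpatchify(merged)

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, im, title in zip(axes, [img, recon[0]], ["Original", "Reconstruction"]):
        ax.imshow(((im.permute(1, 2, 0) + 1) / 2).clamp(0, 1).numpy())
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```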