Pattern Networks in Art History
Revision as of 16:32, 17 December 2025
Project Plan
The foundation for this project is EPFL’s Replica Project (2015-2019), spearheaded primarily by Isabella di Lenardo, Benoit Seguin, and Frederic Kaplan. The Replica Project aimed to create a searchable digital collection of artworks built on a large dataset (the CINI dataset). Using both the CINI dataset and a CNN-based architecture, a system of links between artworks was developed based on their morphology. Trained with corrections from art historians, these morphological links aimed to identify whether there existed unquestionable influence from one artwork to another. For example, Image 1 shows a series of artworks with a “positive” morphological link; in other words, there is no question that one artwork was referencing or was influenced by the other.
However, compared to the capabilities of CNN-based image pipelines, transformer-based architectures have since emerged with vastly greater capabilities for image segmentation and analysis. Therefore, our group’s overarching goal has been to implement a modern image-processing pipeline for identifying morphological links between artworks and to understand its power. This has been explored in three steps:
- Using a segmentation model to extract the different elements of paintings.
- Matching the shape of segmented regions to find plausible correspondences.
- Confirming the correspondences based on the morphology of the masked elements.
Over the course of the semester, our group has explored a range of vision transformer (ViT) models to perform segmentation on the dataset. Primarily, we aimed to exploit the dataset of positive morphological links to test whether the models would accurately mask the shape of the necessary features. The main model our group pivoted towards was the Segment Anything Model (SAM) by Meta. Specifically, we tested the SAM2 model: a transformer-powered vision model for promptable and automatic segmentation. Coincidentally, while we were testing this model, Meta released a new version, SAM3. Our methodological processes and assessment in the sections below describe our observations in greater detail. At a high-level glance, we outline a table below highlighting the milestones and weekly evolution of our project.
Milestones & Planning
| Week | Milestones |
|---|---|
| 1 | Explored Cini dataset, relevant repositories, and research papers to better understand the scope of the project. |
| 2 | Explored image-segmentation models and research papers on segmentation and related work, including SAM2, panoptic segmentation, and transformer-based image processing. |
| 3 | Used segmentation models to test mask generation on pairs of positively linked images to assess whether masks accurately captured morphology. |
| 4 | Identified limitations with automatically generated segmentation masks and explored alternative or more fine-tuned segmentation approaches. |
| 5 | Investigated the release of the SAM 3 model and explored ways to leverage promptable segmentation within the experimental pipeline. |
| 6 | Continued developing the pipeline and identified solutions addressing prompting and contouring limitations in segmentation. |
| 7 | Finalized documentation and deliverables within the codebase and presented findings. |
Motivation & Deliverables
Motivation
As mentioned previously, our group’s goal has been to implement a modern image-processing pipeline for identifying morphological links between artworks and to understand its power. By testing this pipeline, we hope to open the door for future academics to update the Replica Project’s CNN-based architecture with a modern transformer-based version. With these updates, there is a potential to discover hidden morphological links that were previously untraceable.
Deliverables
Methods
Segmentation
The first step of our pipeline to identify morphological links is to segment the images of our paintings into masks corresponding to all the human figures present in them. To do so, we decided to use SAM3, an image segmentation model by Meta based on a (dual) encoder-decoder transformer architecture. SAM3 builds upon its predecessors by introducing Promptable Concept Segmentation (PCS) and Promptable Visual Segmentation (PVS); essentially, SAM3 enables image segmentation using text and images as prompts. While automatically generating masks would be ideal, SAM3 is limited in its ability to handle domain-specific imagery. However, the authors of the paper indicate that the model can quickly adapt to domain gaps through fine-tuning with human-annotated data and/or synthetically generated data.
Examples of auto-masking tests are shown below. The results are very inconsistent and do not fully capture the general morphologies needed for our testing. Fortunately, we discovered that the text-based concept prompts yielded promising results: specifically, using the word “humans” gave us binary masks of the human figures without background elements. While focusing only on these figures may appear short-sighted for discovering hidden morphologies, most positive connections in our datasets manifest as figures of people. Additionally, this prompting opens the door for further fine-tuning and testing with different vocabulary.
Shape Analysis
The most critical element of our project was the analysis and comparison of the segmented masks’ shapes. The first step of this process was calculating the shape complexity of each mask in order to filter out overly simple shapes (in our case, mostly background characters). Filtering out simple shapes is necessary both to reduce the computational cost of our pipeline and to prevent false-positive links between two morphologically unrelated figures (i.e. we do not want two near-perfect circles to be flagged as a positive morphological link when they are likely totally unrelated). While there is no consensus definition of shape complexity, we chose to base our conception of shape complexity on the definition given in [1]. While the method given in the paper involves the use of a Variational Autoencoder (VAE) with three different measures of complexity, we chose to use a combination of only two metrics.
The first metric is compression: we calculate the ratio between the byte length of the mask compressed with the DEFLATE algorithm and the byte length of the uncompressed mask. Since compression can encode large, homogeneous areas compactly but not smaller, more scattered ones, a “simpler” shape compresses to fewer bytes than a more complex one, yielding a lower ratio.
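As a sketch, this metric can be computed with Python’s standard zlib (DEFLATE) on the packed mask bits; the disc and noise masks below are illustrative toy shapes, not masks from our dataset:

```python
import zlib
import numpy as np

def compression_complexity(mask: np.ndarray) -> float:
    """Ratio of DEFLATE-compressed to uncompressed byte length.

    Simpler shapes (large homogeneous regions) compress better,
    giving a smaller ratio; complex, scattered shapes approach 1.
    """
    raw = np.packbits(mask.astype(np.uint8)).tobytes()
    return len(zlib.compress(raw)) / len(raw)

# Toy masks: a solid disc (simple) vs. random noise (complex).
yy, xx = np.mgrid[:128, :128]
disc = ((yy - 64) ** 2 + (xx - 64) ** 2) < 40 ** 2
noise = np.random.default_rng(0).integers(0, 2, (128, 128)).astype(bool)
```

The disc scores far lower than the noise mask, matching the intuition above.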
The second metric is based on a Fast Fourier Transform (FFT) of the mask. Since more complex shapes require more high-frequency components to represent with an FFT, we can measure the mean frequency along each dimension (height and width) of the image and sum them to obtain an estimate of the shape’s complexity.
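A minimal sketch of this metric, assuming the mean frequency is weighted by spectral magnitude (the exact weighting in our implementation may differ):

```python
import numpy as np

def fft_complexity(mask: np.ndarray) -> float:
    """Spectrum-weighted mean frequency along each axis, summed.

    More complex shapes need more high-frequency components, so they
    score higher.  The magnitude weighting is an assumption, not
    necessarily the project's exact formula.
    """
    spec = np.abs(np.fft.fft2(mask.astype(float)))
    fy = np.abs(np.fft.fftfreq(mask.shape[0]))[:, None]  # row frequencies
    fx = np.abs(np.fft.fftfreq(mask.shape[1]))[None, :]  # column frequencies
    return float(((spec * fy).sum() + (spec * fx).sum()) / spec.sum())

# Toy masks: a solid disc (simple) vs. random noise (complex).
yy, xx = np.mgrid[:64, :64]
disc = ((yy - 32) ** 2 + (xx - 32) ** 2) < 20 ** 2
noise = np.random.default_rng(0).integers(0, 2, (64, 64)).astype(bool)
```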
We chose to combine these two measures with handpicked ratios that fit the masks we obtained, but it would be more rigorous to use a method that finds optimal weights (as the authors of [1] did with a VAE).
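For illustration, the combination can be as simple as a weighted sum followed by a threshold check; the weights and threshold below are placeholders, not the ratios we actually handpicked:

```python
def combined_complexity(compression_ratio: float, mean_frequency: float,
                        w_compression: float = 0.5, w_fft: float = 0.5) -> float:
    """Weighted sum of the two complexity metrics.

    The weights are illustrative placeholders; the project used
    handpicked ratios tuned on its own masks."""
    return w_compression * compression_ratio + w_fft * mean_frequency

def passes_complexity_filter(compression_ratio: float, mean_frequency: float,
                             threshold: float = 0.2) -> bool:
    """Keep a mask only if its combined complexity exceeds a threshold
    (the threshold value here is also a placeholder)."""
    return combined_complexity(compression_ratio, mean_frequency) > threshold
```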
With these metrics, we were able to filter out masks whose shape complexity falls below a certain threshold. The following step was to calculate the contour of each mask, which we did using the “findContours” function of the OpenCV Python library. From each contour we then computed elliptic Fourier descriptors (EFD), a Fourier-series representation of a closed contour, for each mask. After that, we normalized each EFD to make it size-, rotation- and position-invariant, and then discarded the higher-frequency coefficients, which tend to represent the more minute details of the mask’s shape. We then defined the “distance” (dissimilarity) between two masks as the Euclidean distance between their normalized, truncated EFD coefficients.
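The comparison step can be sketched as follows, assuming pyefd-style (n_harmonics × 4) coefficient arrays that have already been normalized; the harmonic cutoff is a tunable assumption:

```python
import numpy as np

def efd_distance(coeffs_a: np.ndarray, coeffs_b: np.ndarray,
                 keep_harmonics: int = 10) -> float:
    """Dissimilarity between two masks' elliptic Fourier descriptors.

    Inputs are (n_harmonics, 4) arrays of size-, rotation- and
    position-normalized EFD coefficients (e.g. as produced by the
    pyefd package with normalization enabled).  We truncate to the
    first `keep_harmonics` harmonics, discarding the fine detail
    carried by higher ones, and take the Euclidean distance."""
    a = np.asarray(coeffs_a, dtype=float)[:keep_harmonics].ravel()
    b = np.asarray(coeffs_b, dtype=float)[:keep_harmonics].ravel()
    return float(np.linalg.norm(a - b))

# Two toy descriptor arrays differing only in the second harmonic.
a = np.zeros((12, 4)); a[0] = [1.0, 0.0, 0.0, 0.3]
b = a.copy();          b[1] = [0.1, 0.0, 0.0, 0.05]
```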
Identification of Morphological Links
The original idea for validating the candidate links found with the contours was to use DINOv3 [CITE]. In its repository, an application of the model called “Dense Sparse Matching” is showcased, which matches patches of a masked region across image pairs. (see image [?])
These matches could have been used to define a threshold for assessing whether the content of the masked areas was similar enough, and to demonstrate their similarity visually. However, when testing the model, we found that it would produce matches even between human figures that are semantically quite different (see image [?]), and we could not find a way to quantify and tune those matching points into a single similarity score that generalizes across different pairs.
Therefore, to identify whether there is a morphological link between two images, we use the process described in the Shape Analysis section to compute the distance between every pair of masks identified across the two images. We then declare a positive link if one or more pairs of masks are similar enough, i.e. their distance is below a threshold.
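This decision rule amounts to a nested scan over the per-mask descriptors of the two images; in the sketch below, the threshold value and the toy descriptors are placeholders to be tuned on validated pairs:

```python
import numpy as np

def has_morphological_link(descs_a, descs_b, threshold: float = 0.1) -> bool:
    """Return True if any cross-image pair of mask descriptors lies
    within `threshold` Euclidean distance (a placeholder value)."""
    return any(
        float(np.linalg.norm(np.asarray(da) - np.asarray(db))) < threshold
        for da in descs_a
        for db in descs_b
    )

# Toy descriptors: the first mask of image A nearly matches image B's mask.
image_a = [np.array([0.0, 1.0]), np.array([5.0, 5.0])]
image_b = [np.array([0.05, 1.0])]
```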
