Pattern Networks in Art History


Introduction

The foundation for this project is EPFL’s Replica Project (2015–2019), led primarily by Isabella di Lenardo, Benoit Seguin, and Frédéric Kaplan [1]. Essentially, the Replica Project aimed to create a searchable digital collection of artworks, leveraging a large dataset of artworks (the Cini dataset). Leveraging both the Cini dataset and a CNN-based architecture, a system of links was developed between artworks based on their morphology. Trained with historians’ corrections, these morphological links aimed to identify whether there existed unquestionable influence from one artwork to another. For example, Figure 1 shows a series of artworks with a “positive” morphological link; in other words, there is no question that each artwork was referencing, or was influenced by, an earlier one.

Figure 1: From left to right: Sleeping Venus by Giorgione (c.1510); Venus of Urbino by Titian (1538); Olympia by Édouard Manet (1863). The figures, while not perfectly the same, are practically identical in morphology (the posture, the shape, etc.)

However, transformer-based architectures have since emerged with capabilities for image segmentation and analysis that far exceed those of CNN-based image pipelines. Our group’s overarching goal has therefore been to implement a modern image-processing pipeline for identifying morphological links between artworks and to assess its power. This has been explored in three steps:

  1. Using a segmentation model to extract the different elements of paintings.
  2. Matching the shape of segmented regions to find plausible correspondences.
  3. Confirming the correspondences based on the morphology of the masked elements.

Over the course of the semester, our group explored a range of vision transformer (ViT) models to perform segmentation on the dataset. Primarily, we aimed to exploit the dataset of positive morphological links to test whether the models would accurately mask the shape of the necessary features. The main model our group pivoted towards was the Segment Anything Model (SAM) by Meta. Specifically, we tested SAM 2: a transformer-powered vision model for promptable and automatic segmentation [2]. Coincidentally, while we were testing this model, Meta released a new version (SAM 3), which we incorporated into our workflow [3]. The sections below describe our methodology, observations, and assessment in greater detail.

Motivation

As mentioned previously, our group’s goal has been to implement a modern image-processing pipeline for identifying morphological links between artworks and to assess its power. By testing this pipeline, we hope to open the door for future academics to update the Replica Project’s CNN-based architecture with a modern transformer-based version. With these updates, there is potential to discover hidden morphologies that were previously untraceable. Such morphological connections are important discoveries because they enable historians to better understand the influence of certain artists and artworks during their respective periods. For example, suppose we discover that Painting A portrays the exact same figure shown in Painting B; if Painting B is older, it is reasonable to assume that the painter of A was influenced by Painting B. However, since archival collections hold countless artworks, it is unfeasible for humans to compare every single painting with every other. With the continuous advancement of machine learning models, there is a possibility for users to explore different avenues for discovering morphological links. In our project, we tested a transformer-based image-segmentation model to explore the potential of these evolving models.

Driving Questions

  • Will an updated visual-model pipeline be able to accurately detect morphological links between artworks?
  • What limitations exist if we were to scale our proof-of-concept pipeline to thousands or millions of images?

Milestones & Planning

  1. Week 9: Explored Cini dataset, relevant repositories, and research papers to better understand the scope of the project.
  2. Week 10: Explored image-segmentation models, research papers on segmentation, and various related work: SAM2, panoptic segmentation, transformer-based image-processing.
  3. Week 11: Used segmentation models to test mask generation of pairs of positively linked images to understand whether masks would accurately capture morphology.
  4. Week 12: Discovered limitations with automatically generated segmentation masks; explored alternatives and more fine-tuned ways of implementing segmentation.
  5. Week 12: SAM 3 was released; explored the different ways we could leverage promptable segmentation within our experimental pipeline.
  6. Week 13: Continued developing pipeline and discovered solutions with prompting and contouring limitations of the segmentation.
  7. Week 14: Finalized documents and deliverables within our code and presented findings.

Datasets Used

Cini

The Cini dataset is a collection of 330,000 documents from the photo collection of the Cini Foundation in Venice. The original data is sourced from standardized pieces of cardboard carrying scanned photos of artworks together with their metadata. To fit a tabular format, the images of the artworks appear as hyperlinks hosted through an IIIF server or through the Web Gallery of Art (WGA). Each artwork has a unique ‘UID’ for identification. More details on the dataset are given in Seguin's thesis [4].

Morphograph Data

Based on the Cini dataset from the Replica Project, a dataset of morphological connections between pairs of artworks was compiled (120,000 links). There is also a second dataset containing only the positive morphological links (1,900). We specifically test our pipeline against the positive morphological links, since these datapoints reflect the true nature of morphological connections: the dataset contains, without a doubt, image pairs with an undeniable visual link.

Deliverables

  • A public GitHub repository containing the implementation of the pipeline [5]
  • Python scripts implementing segmentation, shape analysis, and matching
  • Jupyter notebooks documenting experimentation with SAM 3 and DINOv3

General Pipeline

Our GitHub project implements the following pipeline.

  1. A list of image files is provided to the program for comparison.
  2. SAM 3 computes the masks that segment human instances in each image (panoptic → one mask per human). We use the model’s text-prompting feature to obtain these segmentations.
  3. Each segmented mask is passed to our contour-analysis pipeline, which compares the boundaries of the masked regions.
  4. The code returns the pairs of associated images where a link has been found.
  5. (Bonus) We also include a notebook that uses DINOv3 to visually showcase links inside the masked regions of image pairs. We could not find a way to properly integrate it into the pipeline and extract a coherent score from it, but it could serve as a visual validation tool.
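The control flow above can be sketched as follows. The functions `segment_humans`, `shape_descriptor`, and `distance` are hypothetical stand-ins for the SAM 3 call and the contour-analysis steps described in the Methods section; they are injected as parameters so the orchestration can be shown on its own.

```python
import itertools

def find_links(image_paths, segment_humans, shape_descriptor,
               distance, threshold):
    """Return pairs of images that share at least one similar mask.

    segment_humans(path) -> list of masks (one per human figure)
    shape_descriptor(mask) -> contour descriptor for one mask
    distance(d1, d2) -> dissimilarity between two descriptors
    """
    # Segment every image once, then describe each mask's contour.
    descriptors = {
        path: [shape_descriptor(m) for m in segment_humans(path)]
        for path in image_paths
    }
    links = []
    for a, b in itertools.combinations(image_paths, 2):
        # A link is declared if the best mask pair falls below the threshold.
        best = min(
            (distance(da, db)
             for da in descriptors[a] for db in descriptors[b]),
            default=float("inf"),
        )
        if best < threshold:
            links.append((a, b, best))
    return links
```

The segmentation and descriptor steps are the expensive parts; keeping them out of the pairwise loop means each image is processed exactly once.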

All the information needed to run the code is provided in the README, where we explain how to set up and use our notebook in Google Colab. The code can also be run locally, but this is not recommended due to the difficulty of managing the required libraries. The provided Colab setup should work out of the box.

Methods

Segmentation

The first step of our pipeline for identifying morphological links was to segment the images of our paintings into masks corresponding to the human figures present in them. To do so, we decided to use SAM 3, an image-segmentation model by Meta based on a (dual) encoder-decoder transformer architecture. SAM 3 builds upon its predecessors by introducing Promptable Concept Segmentation (PCS) and Promptable Visual Segmentation (PVS) [3]. Essentially, SAM 3 enables image segmentation using text and images as prompts. While automatically generating masks would be ideal, SAM 3 is limited in its ability to handle domain-specific imagery. However, the authors of the paper indicate that the model can quickly adapt to domain gaps through fine-tuning with human-annotated and/or synthetically generated data.

An example of auto-masking is shown in Figure 2. The results are very inconsistent and do not fully capture the general morphologies needed for our testing. Fortunately, we discovered that text-based concept prompts yielded promising results, specifically the word “humans”, which allowed us to obtain binary masks of the human figures without background elements. While focusing only on these figures may appear short-sighted for discovering hidden morphologies, most positive connections in our datasets manifest as figures of people. Additionally, this prompting opens the door for further fine-tuning and testing across different vocabularies.

Figure 2: Baseline attempt of automatically generated masks of an artwork.

Shape Analysis

The most critical element of our project was the analysis and comparison of the segmented masks’ shapes. The first step of this process was calculating the shape complexity of each mask in order to filter out overly simple shapes (in our case, mostly background characters). Filtering out simple shapes is necessary both to reduce the computational cost of our pipeline and to prevent false-positive links between two morphologically unrelated figures (e.g. we do not want two near-perfect circles to be flagged as a positive morphological link when they are likely totally unrelated). While there is no formally agreed definition of shape complexity, we chose to base our conception of it on the definition given in Rothgänger's paper [6]. While the method given in the paper involves a Variational Autoencoder (VAE) with three different measures of complexity, we chose to use a combination of only two metrics.

The first metric is compression: we calculate the ratio between the byte length of the mask compressed with the DEFLATE algorithm and the byte length of the uncompressed mask. Since compression can group large, homogeneous areas together but cannot do so with smaller, more scattered areas, a “simpler” shape will have a smaller byte size when compressed than a more complex one. The second metric is based on a Fast Fourier Transform (FFT) of the mask [7]. Since more complex shapes require more high-frequency components to represent with an FFT, we can measure the mean frequency along each dimension (height/width) of the image and add them together to get an estimate of the shape’s complexity. We chose to combine these two measures with handpicked ratios that fit the masks we obtained, but it would be more rigorous to use some method to find optimal ratios (as the authors of the shape-complexity paper [6] did with a VAE).
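A minimal sketch of these two measures, assuming a binary NumPy mask; the blending weights `w1` and `w2` are illustrative placeholders, not the ratios we actually handpicked.

```python
import zlib
import numpy as np

def compression_complexity(mask: np.ndarray) -> float:
    """Ratio of DEFLATE-compressed size to raw size of a binary mask.

    Homogeneous (simple) shapes compress well, giving a low ratio;
    scattered, intricate shapes compress poorly, giving a high ratio.
    """
    raw = np.ascontiguousarray(mask.astype(np.uint8)).tobytes()
    return len(zlib.compress(raw)) / len(raw)

def fft_complexity(mask: np.ndarray) -> float:
    """Spectrum-weighted mean frequency along each axis, summed.

    Complex boundaries need more high-frequency content to represent,
    so a higher mean frequency indicates a more complex shape.
    """
    spectrum = np.abs(np.fft.fft2(mask.astype(float)))
    h, w = mask.shape
    freq_h = np.abs(np.fft.fftfreq(h))[:, None]  # per-row frequencies
    freq_w = np.abs(np.fft.fftfreq(w))[None, :]  # per-column frequencies
    total = spectrum.sum() + 1e-12
    return float((spectrum * freq_h).sum() / total
                 + (spectrum * freq_w).sum() / total)

def shape_complexity(mask: np.ndarray, w1: float = 0.5, w2: float = 0.5) -> float:
    """Blend the two measures with hand-picked weights (placeholders here)."""
    return w1 * compression_complexity(mask) + w2 * fft_complexity(mask)
```

On a smooth disk versus a noisy speckle mask of the same size, both measures agree that the speckle is the more complex shape.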

With these metrics, we were able to filter out masks whose shape complexity falls below a certain threshold. The following step was to calculate the contour of each mask, which we did using the “findContours” function of the OpenCV Python library [8]. From each contour we computed the elliptic Fourier descriptors (EFD), a Fourier-series representation of a closed contour [7]. After that, we normalized each EFD to make it size-, rotation- and position-invariant, and then filtered out the higher-frequency coefficients, which tend to represent the more minute details of the mask’s shape. We then calculated the “distance” (dissimilarity) between two masks by taking the Euclidean distance between their normalized and filtered EFD coefficients.
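As an illustration of this step, the sketch below uses plain complex Fourier descriptors of the contour as a simplified stand-in for the full EFD computation (in practice we used EFDs of the OpenCV contours). The normalization makes the descriptor translation-, scale- and rotation-invariant, and only the lowest harmonics are kept, discarding minute detail.

```python
import numpy as np

def fourier_descriptor(contour: np.ndarray, n_coeffs: int = 10) -> np.ndarray:
    """Simplified Fourier shape descriptor of a closed contour.

    `contour` is an (N, 2) array of (x, y) boundary points, e.g. one
    entry of cv2.findContours reshaped to (N, 2). This stands in for the
    EFD: the contour is read as a complex signal and the coefficient
    magnitudes are normalized for invariance.
    """
    z = contour[:, 0].astype(float) + 1j * contour[:, 1].astype(float)
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0              # drop DC term -> translation invariant
    mags = np.abs(coeffs)        # drop phase -> rotation/start-point invariant
    mags /= (mags[1] + 1e-12)    # scale by first harmonic -> size invariant
    # Keep only the lowest positive and negative harmonics; the higher
    # ones encode minute boundary detail we want to ignore.
    return np.concatenate([mags[1:1 + n_coeffs], mags[-n_coeffs:]])

def shape_distance(c1: np.ndarray, c2: np.ndarray, n_coeffs: int = 10) -> float:
    """Euclidean distance between two contours' normalized descriptors."""
    return float(np.linalg.norm(fourier_descriptor(c1, n_coeffs)
                                - fourier_descriptor(c2, n_coeffs)))
```

With this normalization, a scaled and shifted copy of a contour sits at distance near zero, while a genuinely different outline does not.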

Figure 3: Example of a shape analysis done between two image pairs.

Identification of Morphological Links

The original idea for validating the candidate links found with the contours was to use DINOv3 [9]. In their repository, the authors showcase an application of the model called “Dense Sparse Matching” that matches patches of masked regions across image pairs (see Figure 4).

Figure 4: Dense Sparse Matching Test.

These matches could have been used to define a threshold assessing whether the content of the masked areas was similar enough, and to visually demonstrate their similitude. However, when testing the model, we found that it would find matches even between humans that are semantically quite different (see Figure 5), and we could not figure out a way to quantify and tune those matching points into a single similarity score that generalizes over different pairs.

Figure 5: Dense Sparse Matching between an artwork and a photograph.

Therefore, to identify whether there is a morphological link between two images, we use the process described in the shape-analysis section to compute the distance between every pair of masks identified across the two images. We then declare a positive link if one or more pairs of masks are similar enough, i.e. their distance is below a threshold.
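Concretely, this decision rule can be written as a small function over the two images' mask descriptors; the descriptor vectors and the threshold in the example are placeholders, not values from our runs.

```python
import numpy as np

def is_linked(masks_a, masks_b, threshold: float) -> bool:
    """Decide whether two images share a morphological link.

    `masks_a` and `masks_b` are lists of normalized descriptor vectors,
    one per retained mask. A single sufficiently similar mask pair is
    enough to declare a positive link.
    """
    for da in masks_a:
        for db in masks_b:
            if np.linalg.norm(np.asarray(da) - np.asarray(db)) < threshold:
                return True
    return False
```

The threshold trades recall for precision: raising it catches more looser correspondences at the cost of false-positive flags.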

Results

Figure 6 shows the 8 image pairs taken from the positively-linked morphograph dataset. Of these 8 pairs, 4 are 'truly' positive: their morphological contours are, without a doubt, near-identical. After processing the images through our pipeline (shown in Figure 7), 3 of the 4 'truly' positive image pairs were correctly matched. No false-positive links were found in our test. For our assessment of the model, requiring this exact similarity is necessary to provide the most solid foundation for future implementations. While other positively linked artworks were found to be morphologically connected, we prioritized matching the images with the most precisely similar contours; otherwise, contours that are merely 'close' in nature may result in false-positive flags.

Figure 6: 8 pairs of artworks with varying degrees of morphological linkage.
Figure 7: 3 pairs of artworks found to have highly matching morphological contours using our pipeline.

Assessment & Limitations

Limitations are the most important consideration regarding the scope of our pipeline. When extracting the morphology of a figure using image segmentation, there is the glaring issue of non-continuous masks. Many artwork pairs contain figures with obvious positive visual links; however, when secondary features (like objects or other people) cover parts of a figure, the resulting mask is ‘broken.’ If the morphologies are not roughly exact, no positive link will be found, which is problematic when masks are cut into pieces by overlapping objects. To remedy this issue, we explored different methods of ‘filling’ the segmentation gaps. While the results appear promising, we question whether there are edge cases where the connected masks yield a greatly unfaithful morphology of the original figure and lead to false positives.
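One simple way to ‘fill’ such gaps, sketched here with SciPy’s morphological operators; the structuring-element size is an illustrative assumption, not a value from our pipeline, and larger occlusions would need a larger element (with a greater risk of distorting the true morphology).

```python
import numpy as np
from scipy import ndimage

def fill_mask_gaps(mask: np.ndarray, size: int = 7) -> np.ndarray:
    """Bridge thin breaks in a binary segmentation mask.

    Morphological closing (dilation then erosion) reconnects regions
    separated by thin occlusions, and hole filling then recovers any
    region fully enclosed by the figure.
    """
    structure = np.ones((size, size), dtype=bool)
    closed = ndimage.binary_closing(mask.astype(bool), structure=structure)
    return ndimage.binary_fill_holes(closed)
```

This reconnects a figure split by a narrow occluding object, at the risk of also merging genuinely separate nearby regions.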

Another glaring limitation is the sheer volume and computational cost of segmenting each artwork. First, we must determine the prompt for extracting the necessary morphological figures. Then, each mask must be compared against those of the 300,000 other segmented artworks in the Cini dataset. Segmenting every work and cross-comparing them across the vast collection would take a large amount of time and computational power; this limitation is the reason we could not explicitly compare all the links we discovered with those discovered in the Replica Project. Moreover, 300,000 may in fact be small compared to other archival collections of artworks: if this pipeline were scaled to databases totaling in the millions, would it remain computationally feasible and efficient? A related limitation is that all the threshold parameters (for filtering out masks by complexity, filtering out frequencies in the EFD, and deciding whether there is a link between two images based on our distance measure) were chosen by hand. While this leads to promising results on a small dataset, it might not generalize well to a larger one. Understanding these issues of scaling is necessary when considering collections of immense volume.

Future Considerations

To improve our results, we would like to explore how other image-segmentation models perform on artworks. Ideally, we would also like to explore ways to fine-tune SAM 3 for our domain-specific use case. As mentioned above, synthetic and human-annotated data could potentially reduce the domain gap with SAM 3, and it would be interesting to test how synthetic data generation/augmentation can improve a morphological detection model. Additionally, we wonder whether auto-generated masks could yield better results with finer parameter tuning. Auto-generating masks would be ideal for finding more potential objects within an artwork, but it comes with its own set of limitations.

Another consideration would be to extract the semantic content of our masks rather than just their contours. By taking only the mask contours, we are limited in capturing finer morphological similarities: fine-grained details within the figures are not represented in our masks. With these details, semantic representations could be collected that might allow us to mine the data, generalize, and feed promptable segmentation masks back into the SAM 3 model (though this remains theoretical).

References

  1. di Lenardo, I., Seguin, B., & Kaplan, F. (2019). Replica Project. EPFL Digital Humanities Laboratory. https://www.epfl.ch/labs/dhlab/projects/replica/
  2. Kirillov, A., et al. (2024). Segment Anything Model 2. arXiv:2408.00714. https://arxiv.org/abs/2408.00714
  3. Kirillov, A., et al. (2025). Segment Anything Model 3. arXiv:2511.16719. https://arxiv.org/abs/2511.16719
  4. Seguin, B. L. A. (2018). Making large art historical photo archives searchable (PhD thesis, École polytechnique fédérale de Lausanne). EPFL Infoscience repository. https://infoscience.epfl.ch/entities/publication/4d0b98f4-c5e9-4cdd-8df2-2218d8012801
  5. Hugentobler, J., et al. PatNet: Morphological Link Detection Pipeline. https://github.com/JeremyHugentobler/PatNet
  6. Rothgänger, M., Melnik, A., & Ritter, H. (2023). Shape complexity estimation using VAE. Intelligent Systems Conference. Cham: Springer Nature Switzerland.
  7. Kuhl, Frank P., and Charles R. Giardina. Elliptic Fourier features of a closed contour. Computer Graphics and Image Processing, Volume 18, Issue 3, 1982, Pages 236–258. https://doi.org/10.1016/0146-664X(82)90034-X
  8. Bradski, G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. https://opencv.org/
  9. Siméoni, O., et al. (2025). DINOv3. arXiv:2508.10104. https://arxiv.org/abs/2508.10104


Credits

Course: Foundation of Digital Humanities (DH-405), EPFL

Professor: Frédéric Kaplan

Supervisor: Alexander Rusnak

Authors: Jeremy Hugentobler, Néhémie Frei, Niccholas Reiz