Universality of Aesthetics (Cross-Cultural Dataset Focus)

From FDHwiki
Revision as of 19:27, 17 December 2025

Introduction

This project investigates how aesthetic representations learned by modern vision and multimodal models are shaped by cultural imbalance in large-scale art datasets. While recent advances in computer vision and representation learning have enabled models to capture increasingly abstract visual concepts such as style, beauty, and artistic quality, the datasets used to train and evaluate these systems remain heavily skewed toward Western art traditions. As a result, what models learn as “aesthetic” may implicitly encode a narrow, culturally specific notion of beauty rather than a genuinely universal one.

Within the field of Digital Humanities, this raises important methodological and epistemological questions. Art historical datasets are not neutral collections of images, but the outcome of curatorial practices, institutional histories, and geopolitical power structures. When such datasets are repurposed for machine learning, their underlying biases risk being amplified rather than examined. This project positions itself at the intersection of digital art history and machine learning, using computational methods not only to analyze visual data, but also to reflect critically on the cultural assumptions embedded in it.

Concretely, the project focuses on large, widely used art and aesthetics datasets, including museum collections and web-curated corpora, and asks how aesthetic signals differ across cultural groupings. Instead of treating aesthetic prediction as a purely technical task, we frame it as a representational problem: what kinds of images cluster together in representation space, which cultural traditions are foregrounded, and which are marginalized or rendered anomalous.

Motivation

Aesthetic prediction and representation learning play a central role in contemporary AI systems, from content recommendation and image retrieval to creative generation. Models such as CLIP, DINO, and aesthetic scoring networks are increasingly used as generic feature extractors, often without close scrutiny of what their learned representations encode.[1] However, these models are trained on datasets that overwhelmingly privilege Western art histories, photographic conventions, and museum canons.

This project tests whether visual representations learned by self-supervised vision models encode a shared distinction between aesthetic and non-aesthetic images across cultures, or whether that structure diverges once cultural balance is enforced at the dataset level. Rather than assuming that aesthetic structure is universal, the project treats universality as an empirical question, in line with philosophical accounts that treat aesthetic judgment as historically and culturally situated.[2][3]

By constructing culturally balanced subsets of aesthetic and non-aesthetic images, this project analyzes how vision models represent aesthetic distinctions across different cultural contexts. This approach combines computational analysis of representation spaces with critical attention to how aesthetic categories are defined, labeled, and balanced during dataset construction. More broadly, it follows Walter Benjamin’s insight that cultural artifacts which survive and are canonized are shaped by historical processes of power and preservation, rather than constituting a neutral record of global artistic production.[4]

Research aims and expected results

The project is guided by the following research aims:

To construct a culturally diverse art dataset by aggregating and normalizing artworks from multiple sources.

To analyze how aesthetic representations cluster across cultural labels in learned embedding spaces.

To examine whether Western-centric aesthetic norms dominate similarity structure and scoring behavior.

To reflect on the implications of these findings for the use of aesthetic models in art-historical and creative contexts.

We expect to observe measurable differences in how artworks from different cultural traditions are embedded and scored by aesthetic models. In particular, we hypothesize that Western artworks will occupy denser and more internally coherent regions of representation space, while non-Western artworks may appear more dispersed or peripheral, reflecting both dataset imbalance and learned aesthetic conventions.

Original planning and milestones

The project was initially planned as an exploratory study combining dataset construction, representation analysis, and qualitative interpretation, with a particular focus on cultural balance and metadata normalization as central methodological concerns.

Planned milestones:

Week 1–2: Dataset collection and initial inspection using museum APIs and existing art datasets. Preliminary assessment of metadata quality, cultural labels, and overall dataset imbalance.

Week 3: Data cleaning, normalization, and alignment of cultural and national labels. Resolution of ambiguous or inconsistent metadata categories.

Week 4: Construction of culturally balanced subsets of aesthetic and non-aesthetic images, ensuring comparable representation across cultural groups while preserving visual diversity.

Week 5: Representation extraction using pretrained self-supervised vision models, followed by exploratory analysis and visualization of embedding spaces to compare aesthetic and non-aesthetic representations across cultures.

Week 6: Interpretation of results, writing, and documentation. Compilation of methodological choices, findings, and limitations into the final Wiki deliverable, along with documentation of code and data-processing steps.

During the course of the project, additional time was allocated to data cleaning and cultural label normalization, as this step proved more complex and consequential than initially anticipated. This adjustment reflected the central role of dataset construction in shaping representational outcomes and strengthened the methodological grounding of the project.

Data and assumptions

Data sources

The project draws on several publicly available datasets commonly used in computational art and visual culture analysis.

The WikiArt dataset was used as a primary source of digitized artworks. While WikiArt as a platform displays artist nationality and cultural information, commonly used WikiArt dataset releases do not include standardized national labels in a machine-readable form. To address this, artworks were scraped and re-labeled based on available artist metadata. As is typical for WikiArt-based studies, the dataset exhibits a strong emphasis on Western artists and art movements.

The Metropolitan Museum of Art Open Access Dataset was used to supplement WikiArt with artworks from a broader range of cultural and historical contexts. The dataset provides high-quality images and structured metadata, including cultural attributions, enabling more explicit cross-cultural comparisons.

Additional artworks and metadata were drawn from the British Museum’s open collections data, further expanding the representation of non-Western artistic traditions and historical periods.

In addition to museum and art-historical datasets, a dataset of children’s drawings from diverse cultural contexts was included. This dataset provides non-institutional, non-canonical visual material and was used as a contrastive reference for images produced outside professional or museum-curated settings.

Due to GitHub file size constraints and practical storage limitations, large image files are not stored directly in the project repository. Instead, the combined CSV file includes metadata and direct URLs to the original images. Preprocessing scripts are provided to download the images from their original sources if full dataset reproduction is required.
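As an illustration of how images could be re-downloaded from such a CSV, the following is a minimal sketch rather than the repository's actual script. The column name `image_url`, the hashed file naming, and the `fetch` parameter are assumptions introduced here for illustration.

```python
import csv
import hashlib
import urllib.request
from pathlib import Path


def local_path(url, out_dir):
    """Map an image URL to a stable local filename (hashing avoids name clashes)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    suffix = Path(url.split("?")[0]).suffix or ".jpg"
    return Path(out_dir) / f"{digest}{suffix}"


def download_images(csv_path, out_dir, url_column="image_url",
                    fetch=urllib.request.urlretrieve):
    """Download every image listed in the CSV, skipping files already present.

    `url_column` is a hypothetical column name. Returns (ok, failed), where
    `ok` is a list of (url, path) pairs and `failed` is a list of URLs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ok, failed = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(url_column) or "").strip()
            if not url:
                continue  # rows without a URL are ignored
            target = local_path(url, out_dir)
            if not target.exists():
                try:
                    fetch(url, target)
                except OSError:  # URLError and HTTPError are OSError subclasses
                    failed.append(url)
                    continue
            ok.append((url, target))
    return ok, failed
```

Keeping the list of failed URLs makes it easy to exclude entries whose links have gone stale since the metadata was collected.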

Cultural labels and assumptions

A key challenge in this project concerns the definition and normalization of cultural labels. Museum metadata frequently mixes nationality, ethnicity, geographic region, and art-historical classification, resulting in ambiguous or inconsistent categories. Labels such as “Bengali,” “Bangladeshi,” or “Indian,” for example, may refer to overlapping but distinct cultural, linguistic, or political identities.

Across datasets, the level of label consistency varies substantially. The artist nationality information shown on the WikiArt platform is relatively clean and standardized, which makes WikiArt the most straightforward source to organize by cultural labels once that information has been scraped. In contrast, the Metropolitan Museum of Art dataset contains several thousand distinct cultural or national descriptors, many of which required consolidation and renaming in order to enable cross-dataset comparison. The British Museum dataset similarly required extensive renaming and manual alignment of labels, as well as individual downloading of associated image data.

Rather than attempting to impose a single authoritative cultural taxonomy, this project adopts a pragmatic normalization strategy. Cultural labels are consolidated to maximize internal consistency while preserving meaningful distinctions relevant to the analysis. All normalization decisions are explicitly documented, and ambiguous cases are handled conservatively.
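A consolidation step of this kind could be sketched as a lookup table plus a conservative fallback. The mapping entries below are hypothetical examples, not the project's actual table, and reading "handled conservatively" as "left unassigned when the descriptor is uncertain or unknown" is an assumption made for this sketch.

```python
import re

# Hypothetical consolidation table: cleaned raw descriptor -> analysis label.
LABEL_MAP = {
    "american": "American",
    "american, new york": "American",
    "japan": "Japanese",
    "japanese": "Japanese",
    "french": "French",
    "france, paris": "French",
}

# Descriptors carrying uncertainty markers are left unassigned (assumption).
UNCERTAIN = re.compile(r"\b(probably|possibly|or)\b|\?", re.IGNORECASE)


def normalize_label(raw, label_map=LABEL_MAP):
    """Return the consolidated label, or None for uncertain or unknown descriptors."""
    if raw is None:
        return None
    key = re.sub(r"\s+", " ", raw).strip(" ,;.").lower()
    if not key or UNCERTAIN.search(key):
        return None
    return label_map.get(key)


for raw in ["  Japanese ", "France,  Paris", "probably French", "Martian"]:
    print(repr(raw), "->", normalize_label(raw))
```

Keeping the table as plain data means each consolidation decision stays visible and can be documented alongside the code.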

Importantly, cultural labels are treated as analytical tools rather than fixed or essential categories. The project does not aim to represent cultures exhaustively or definitively, but to examine how existing datasets operationalize cultural difference and how vision models respond to these operationalizations.

Methods

Dataset construction pipeline

The dataset was constructed through an iterative process combining automated scripts with manual inspection and normalization. Artworks and metadata were collected from publicly available sources, including WikiArt, the Metropolitan Museum of Art, and the British Museum.

Entries with missing images, broken links, or insufficient cultural metadata were excluded. Cultural labels were then normalized to reduce extreme fragmentation across datasets, particularly for museum metadata that mixed nationality, geography, and stylistic descriptors. Normalization focused on enabling consistent cross-dataset comparison rather than producing a fully balanced corpus.

In addition to museum and art-historical datasets, a dataset of children’s drawings from diverse cultural contexts was incorporated as a source of non-canonical imagery. This dataset was used as a contrastive reference for the “unaesthetic” category, under the assumption that children’s drawings reflect different constraints on technical execution and stylistic convention than professional or museum-curated artworks. This distinction is used analytically rather than normatively and does not imply a value judgment about children’s artistic expression.

The resulting combined dataset remains culturally imbalanced, reflecting the biases of the source collections. For exploratory analysis in this mini-project, a small, approximately balanced sample was created by selecting a fixed number of images per nationality and aesthetic category. This sampling was used solely for visualization and representation analysis, not as a claim of full dataset balance.
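The fixed-size sampling described above can be sketched as follows. The field names `nationality` and `category` are assumed for illustration, and dropping groups that have too few images (instead of oversampling them) is a choice made for this sketch, not necessarily the project's.

```python
import random
from collections import defaultdict


def balanced_sample(rows, n_per_group, keys=("nationality", "category"), seed=0):
    """Draw exactly n_per_group rows for each (nationality, category) pair.

    Groups with fewer than n_per_group rows are dropped, so every retained
    group has the same size. `keys` names hypothetical metadata fields.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for key in sorted(groups):  # sorted order makes the output deterministic
        members = groups[key]
        if len(members) >= n_per_group:
            sample.extend(rng.sample(members, n_per_group))
    return sample
```

With a fixed seed, repeated runs return the same subset, so visualizations built on the sample can be regenerated exactly.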

All preprocessing and sampling steps are implemented in reproducible Python scripts, and the final datasets are stored as lightweight CSV files containing metadata and image URLs.

Representation extraction

To analyze aesthetic similarity, we extract fixed-length embeddings from a pretrained self-supervised vision model (DINOv2). The model is not fine-tuned on the dataset, so that the analysis probes its existing representational structure rather than adapting it to the data.

Embeddings are computed for each artwork image and used as the basis for similarity analysis, clustering, and visualization. This approach allows artworks from different cultural traditions to be compared within a shared representational space.

Analysis methodology

The analysis combines quantitative and qualitative methods:

Distance and similarity analysis to examine intra- and inter-cultural cohesion.

Unsupervised clustering to identify dominant visual groupings.

Dimensionality reduction techniques (e.g., PCA, UMAP) to inspect global structure.

Qualitative case studies to interpret clusters and outliers.

Rather than aiming for statistical generalization, the analysis emphasizes interpretability and exploratory insight.
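To make the notion of intra- and inter-cultural cohesion concrete, one possible formulation is the mean pairwise cosine similarity within a group and between two groups. This is a minimal sketch under that assumption: the function names are hypothetical, embeddings are represented as plain lists of floats, and all vectors are assumed to be non-zero.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def intra_cohesion(vectors):
    """Mean pairwise cosine similarity within one group (needs at least 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def inter_similarity(group_a, group_b):
    """Mean cosine similarity over all cross-group pairs."""
    pairs = list(product(group_a, group_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Comparing `intra_cohesion` for each cultural group against `inter_similarity` between groups gives a simple first indication of whether a group forms a tight region of the embedding space or overlaps with others.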

Results

Quantitative observations

The embedding space reveals clear asymmetries in cultural representation. Western artworks tend to form dense, well-defined clusters, while artworks from underrepresented cultures are more sparsely distributed. This pattern is consistent across the embedding configurations examined, although the analysis relies on a single model (DINOv2), so it remains untested on other architectures.

Similarity scores also show higher average cohesion within Western subsets, suggesting that the model encodes more consistent aesthetic features for these artworks.

Qualitative patterns

Qualitative inspection indicates that visual conventions common in Western art—such as perspective, lighting, and figurative composition—align closely with the features emphasized by pretrained models. Artworks that deviate from these conventions are more likely to appear as outliers, regardless of their aesthetic richness within their own traditions.

Quality assessment and limitations

Quality assessment

The quality of this project lies in its transparency, reproducibility, and methodological reflexivity. All data processing steps are documented, assumptions are stated explicitly, and results are interpreted cautiously. The project does not aim to produce definitive claims, but rather to provide a structured exploratory analysis grounded in both computational methods and critical reflection.

Limitations

Several limitations must be acknowledged:

Cultural labels are coarse and imperfect proxies for complex identities.

Dataset imbalance cannot be fully eliminated.

Pretrained models encode biases from their training data.

The analysis focuses on representation structure rather than downstream task performance.

These limitations are intrinsic to the available data and tools and are treated as part of the research problem rather than as external noise.

Future improvements

This mini-project was conceived as an exploratory study rather than a definitive evaluation of cross-cultural aesthetic representation. Several extensions would substantially strengthen and expand the analysis.

A first priority would be to further expand and balance the underlying datasets. While the current work highlights structural cultural imbalance, many cultural groups remain represented by only a small number of images. Future work could focus on systematically increasing coverage for underrepresented regions and traditions, with the goal of constructing culturally balanced subsets at a much larger scale. This would allow more robust statistical comparisons and reduce sensitivity to sampling effects.

A second direction concerns model comparison. The present analysis focuses on a single self-supervised vision model (DINOv2). Extending the study to additional representation learning frameworks—such as VICReg, CLIP, or aesthetic-specific scoring models—would make it possible to test whether observed patterns are model-dependent or reflect more general properties of learned visual representations. Such comparisons would also allow the project to more directly engage with recent claims of representational convergence, including the Platonic Representation Hypothesis (Huh et al., 2024).

Another promising extension would be to examine intentionally non-canonical or anti-aesthetic visual material. Beyond children’s drawings, future work could incorporate artistic movements that deliberately challenge conventional notions of beauty (e.g. certain strands of modernism, conceptual art, or kitsch). While such categories are inherently subjective and difficult to operationalize, they may offer valuable contrastive cases for understanding how models encode deviations from dominant aesthetic norms. In addition, a large-scale collection of visually neutral, everyday imagery scraped from the internet could be introduced as a third reference category, positioned neither explicitly aesthetic nor unaesthetic. Including such neutral material would help disentangle whether observed structure arises from aesthetic judgment itself or from more general visual regularities.

Finally, this project could be extended into a fully multimodal setting. By incorporating text descriptions, curatorial metadata, or captions, future work could explore how aesthetic structure emerges jointly across image–text representations and whether aesthetic convergence persists across modalities. This direction would enable a deeper dialogue between computational analysis and philosophical debates about aesthetic universality and subjectivity.

Together, these extensions would transform the present exploratory study into a broader investigation of how aesthetic categories, cultural histories, and representational geometry interact in contemporary machine learning systems.

GitHub repository

The full codebase, preprocessing scripts, and documentation are available at:

https://github.com/qxift/esthetics


References

  1. Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv. https://doi.org/10.48550/arXiv.2405.07987
  2. Zangwill, N. (2024). Aesthetic Judgment. In E. N. Zalta & U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy (Fall 2024 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/entries/aesthetic-judgment/
  3. Andow, J. (2022). Further exploration of anti-realist intuitions about aesthetic judgment. Philosophical Psychology, 35(5), 621–661. https://doi.org/10.1080/09515089.2021.2014440
  4. Benjamin, W. (1940). Theses on the Philosophy of History.

Credits

Course: Foundation of Digital Humanities (DH-405), EPFL

Professor: Frédéric Kaplan

Supervisor: Alexander Rusnak

Authors: Anastasia Meijer, Marguerite Novikov