Universality of Aesthetics (Cross-Cultural Dataset Focus)
[[File:Neighbor-based coherence in DINOv2 representation space.png|thumb|Line plot showing the proportion of same-class nearest neighbors (aesthetic vs unaesthetic) as a function of k, computed from cosine similarity in DINOv2 feature space.]]
Revision as of 21:07, 17 December 2025
Introduction
This project investigates how aesthetic representations learned by modern vision and multimodal models are shaped by cultural imbalance in large-scale art datasets. While recent advances in computer vision and representation learning have enabled models to capture increasingly abstract visual concepts such as style, beauty, and artistic quality, the datasets used to train and evaluate these systems remain heavily skewed toward Western art traditions. As a result, what models learn as “aesthetic” may implicitly encode a narrow, culturally specific notion of beauty rather than a genuinely universal one.
Within the field of Digital Humanities, this raises important methodological and epistemological questions. Art historical datasets are not neutral collections of images, but the outcome of curatorial practices, institutional histories, and geopolitical power structures. When such datasets are repurposed for machine learning, their underlying biases risk being amplified rather than examined. This project positions itself at the intersection of digital art history and machine learning, using computational methods not only to analyze visual data, but also to reflect critically on the cultural assumptions embedded in it.
Concretely, the project focuses on large, widely used art and aesthetics datasets, including museum collections and web-curated corpora, and asks how aesthetic signals differ across cultural groupings. Instead of treating aesthetic prediction as a purely technical task, we frame it as a representational problem: what kinds of images cluster together in representation space, which cultural traditions are foregrounded, and which are marginalized or rendered anomalous.
Motivation
Aesthetic prediction and representation learning play a central role in contemporary AI systems, from content recommendation and image retrieval to creative generation. Models such as CLIP, DINO, and aesthetic scoring networks are increasingly used as generic feature extractors, often without close scrutiny of what their learned representations encode.[1] However, these models are trained on datasets that overwhelmingly privilege Western art histories, photographic conventions, and museum canons.
This project aims to test whether visual representations learned by self-supervised vision models encode a shared distinction between aesthetic and non-aesthetic images across cultures, or whether aesthetic structure diverges once cultural balance is enforced at the dataset level. Rather than assuming that aesthetic structure is universal, we explicitly test whether representations of “beautiful” and “non-beautiful” images align across cultures once dataset imbalance is controlled, in line with philosophical accounts that treat aesthetic judgment as historically and culturally situated.[2][3]
By constructing culturally balanced subsets of aesthetic and non-aesthetic images, this project analyzes how vision models represent aesthetic distinctions across different cultural contexts. This approach combines computational analysis of representation spaces with critical attention to how aesthetic categories are defined, labeled, and balanced during dataset construction. More broadly, it follows Walter Benjamin’s insight that cultural artifacts which survive and are canonized are shaped by historical processes of power and preservation, rather than constituting a neutral record of global artistic production.[4]
Research aims and expected results
The project is guided by the following research aims:
To construct a culturally diverse art dataset by aggregating and normalizing artworks from multiple sources.
To analyze how aesthetic representations cluster across cultural labels in learned embedding spaces.
To examine whether Western-centric aesthetic norms dominate similarity structure and scoring behavior.
To reflect on the implications of these findings for the use of aesthetic models in art-historical and creative contexts.
We expect to observe measurable differences in how artworks from different cultural traditions are embedded and scored by aesthetic models. In particular, we hypothesize that Western artworks will occupy denser and more internally coherent regions of representation space, while non-Western artworks may appear more dispersed or peripheral, reflecting both dataset imbalance and learned aesthetic conventions.
Original planning and milestones
The project was initially planned as an exploratory study combining dataset construction, representation analysis, and qualitative interpretation, with a particular focus on cultural balance and metadata normalization as central methodological concerns.
Planned milestones:
Week 1–2: Dataset collection and initial inspection using museum APIs and existing art datasets. Preliminary assessment of metadata quality, cultural labels, and overall dataset imbalance.
Week 3: Data cleaning, normalization, and alignment of cultural and national labels. Resolution of ambiguous or inconsistent metadata categories.
Week 4: Construction of culturally balanced subsets of aesthetic and non-aesthetic images, ensuring comparable representation across cultural groups while preserving visual diversity.
Week 5: Representation extraction using pretrained self-supervised vision models, followed by exploratory analysis and visualization of embedding spaces to compare aesthetic and non-aesthetic representations across cultures.
Week 6: Interpretation of results, writing, and documentation. Compilation of methodological choices, findings, and limitations into the final Wiki deliverable, along with documentation of code and data-processing steps.
During the course of the project, additional time was allocated to data cleaning and cultural label normalization, as this step proved more complex and consequential than initially anticipated. This adjustment reflected the central role of dataset construction in shaping representational outcomes and strengthened the methodological grounding of the project.
Data and assumptions
Data sources
The project draws on several publicly available datasets commonly used in computational art and visual culture analysis.
The WikiArt dataset was used as a primary source of digitized artworks. While WikiArt as a platform displays artist nationality and cultural information, commonly used WikiArt dataset releases do not include standardized national labels in a machine-readable form. To address this, artworks were scraped and re-labeled based on available artist metadata. As is typical for WikiArt-based studies, the dataset exhibits a strong emphasis on Western artists and art movements. (WikiArt)
The Metropolitan Museum of Art Open Access Dataset was used to supplement WikiArt with artworks from a broader range of cultural and historical contexts. The dataset provides high-quality images and structured metadata, including cultural attributions, enabling more explicit cross-cultural comparisons. (The Metropolitan Museum of Art Open Access Dataset)
Additional artworks and metadata were drawn from the British Museum’s open collections data, further expanding the representation of non-Western artistic traditions and historical periods. (British Museum Collection Online)
In addition to museum and art-historical datasets, a dataset of children’s drawings from diverse cultural contexts was included. This dataset provides non-institutional, non-canonical visual material and was used as a contrastive reference for images produced outside professional or museum-curated settings.
Due to GitHub file size constraints and practical storage limitations, large image files are not stored directly in the project repository. Instead, the combined CSV file includes metadata and direct URLs to the original images. Preprocessing scripts are provided to download the images from their original sources if full dataset reproduction is required.
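Such a download step can be sketched as follows. This is a minimal illustration, assuming the combined CSV has columns named `filename` and `image_url`; the actual column names in the project's CSV may differ.

```python
import csv
import os
import urllib.request

def download_images(csv_path, out_dir):
    """Download images listed in a metadata CSV into out_dir.

    Assumes columns named 'filename' and 'image_url' (an assumption
    about the project's CSV schema). Existing files are skipped, so
    the script can be re-run after interruptions.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            dest = os.path.join(out_dir, row["filename"])
            if not os.path.exists(dest):  # skip already-downloaded files
                urllib.request.urlretrieve(row["image_url"], dest)
```

Re-running the function is idempotent, which matters when downloading several thousand images from museum servers with occasional timeouts.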
Cultural labels and assumptions
A key challenge in this project concerns the definition and normalization of cultural labels. Museum metadata frequently mixes nationality, ethnicity, geographic region, and art-historical classification, resulting in ambiguous or inconsistent categories. Labels such as “Bengali,” “Bangladeshi,” or “Indian,” for example, may refer to overlapping but distinct cultural, linguistic, or political identities.
Across datasets, the level of label consistency varies substantially. WikiArt provides relatively clean and standardized artist nationality information, making it the most straightforward dataset to organize by cultural labels. In contrast, the Metropolitan Museum of Art dataset contains several thousand distinct cultural or national descriptors, many of which required consolidation and renaming in order to enable cross-dataset comparison. The British Museum dataset similarly required extensive renaming and manual alignment of labels, as well as individual downloading of associated image data.
Rather than attempting to impose a single authoritative cultural taxonomy, this project adopts a pragmatic normalization strategy. Cultural labels are consolidated to maximize internal consistency while preserving meaningful distinctions relevant to the analysis. All normalization decisions are explicitly documented, and ambiguous cases are handled conservatively.
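In practice, such a normalization strategy amounts to a documented lookup table with conservative pass-through for unmapped cases. The sketch below is purely illustrative: the mapping entries are hypothetical examples, not the project's actual (much larger) normalization table.

```python
# Illustrative consolidation table only; the project's real mapping is
# documented alongside the data and is far more extensive.
LABEL_MAP = {
    "Netherlandish": "Dutch",
    "Probably Greek": "Greek",
    "Made in France": "French",
}

def normalize_label(raw_label, label_map=LABEL_MAP):
    """Map a raw cultural descriptor to a consolidated label.

    Unmapped labels pass through unchanged, implementing the
    conservative handling of ambiguous cases described above.
    """
    cleaned = raw_label.strip()
    return label_map.get(cleaned, cleaned)
```

Keeping the table as explicit data (rather than ad hoc string logic) makes every normalization decision inspectable and revisable.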
Importantly, cultural labels are treated as analytical tools rather than fixed or essential categories. The project does not aim to represent cultures exhaustively or definitively, but to examine how existing datasets operationalize cultural difference and how vision models respond to these operationalizations.
Methods
Dataset construction pipeline
The dataset was constructed through an iterative process combining automated scripts with manual inspection and normalization. Artworks and metadata were collected from publicly available sources, including WikiArt, the Metropolitan Museum of Art, and the British Museum.
Entries with missing images, broken links, or insufficient cultural metadata were excluded. Cultural labels were then normalized to reduce extreme fragmentation across datasets, particularly for museum metadata that mixed nationality, geography, and stylistic descriptors. Normalization focused on enabling consistent cross-dataset comparison rather than producing a fully balanced corpus.
In addition to museum and art-historical datasets, a dataset of children’s drawings from diverse cultural contexts was incorporated as a source of non-canonical imagery. This dataset was used as a contrastive reference for the “unaesthetic” category, under the assumption that children’s drawings reflect different constraints on technical execution and stylistic convention than professional or museum-curated artworks. This distinction is used analytically rather than normatively and does not imply a value judgment about children’s artistic expression.
The resulting combined dataset remains culturally imbalanced, reflecting the biases of the source collections. For exploratory analysis in this mini-project, a small, approximately balanced sample was created by selecting a fixed number of images per nationality and aesthetic category. This sampling was used solely for visualization and representation analysis, not as a claim of full dataset balance.
All preprocessing and sampling steps are implemented in reproducible Python scripts, and the final datasets are stored as lightweight CSV files containing metadata and image URLs.
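The per-nationality, per-category sampling described above can be sketched in pandas. The column names `nationality` and `category` are assumptions about the project's CSV schema; groups smaller than the target size are kept whole rather than upsampled.

```python
import pandas as pd

def balanced_sample(df, n_per_group, seed=0):
    """Draw up to n_per_group rows for each (nationality, category) pair.

    Column names 'nationality' and 'category' are assumptions about the
    metadata CSV. A fixed seed makes the sample reproducible.
    """
    return (
        df.groupby(["nationality", "category"], group_keys=False)
          .apply(lambda g: g.sample(n=min(n_per_group, len(g)),
                                    random_state=seed))
          .reset_index(drop=True)
    )
```

Taking `min(n_per_group, len(g))` avoids errors for sparsely represented cultures, at the cost of an only approximately balanced sample, matching the caveat above.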
Representation extraction
To analyze how aesthetic distinctions are encoded in learned visual representations, this project extracts fixed-length image embeddings from pretrained vision models. The goal is not to optimize performance on an aesthetic classification task, but to examine the existing structure of representations learned from large-scale visual data.
The primary feature extractor used throughout the analysis is DINOv2 (ViT-B/14), a self-supervised vision transformer trained on large, diverse image corpora. DINOv2 embeddings serve as the main representational space for all similarity, neighborhood, and clustering analyses. As a point of comparison, OpenCLIP (ViT-B/32) is used to assess cross-model representational alignment, allowing us to test whether aesthetic structure is shared between a purely visual model and a multimodal image–text model.
All images are converted to RGB and resized to 224×224 pixels, followed by standard ImageNet normalization. For each image, the output of the final DINOv2 embedding layer is extracted and flattened into a single vector. No fine-tuning or task-specific adaptation is performed. This ensures that the analysis probes the representational geometry learned during pretraining, rather than structure induced by the current dataset.
To ensure reproducibility and computational efficiency, all embeddings are cached to disk and reused across analysis steps. Cultural labels are inferred directly from filenames, following a consistent naming convention established during dataset construction (for example, albanian_3.jpg). All embeddings are L2-normalized prior to similarity computation, so that dot products correspond to cosine similarity.
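The conventions in this step can be sketched as small helpers: L2 normalization, label recovery from filenames, and disk caching. This is a sketch of the described behavior, not the project's exact code.

```python
import os
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """L2-normalize row vectors so that dot products equal cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

def culture_from_filename(filename):
    """Recover the cultural label from the naming convention,
    e.g. 'albanian_3.jpg' -> 'albanian'."""
    stem = os.path.basename(filename).rsplit(".", 1)[0]
    return stem.rsplit("_", 1)[0]

def cached(path, compute_fn):
    """Load an embedding array from disk if cached, else compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    arr = compute_fn()
    np.save(path, arr)
    return arr
```

With embeddings normalized this way, every later similarity computation reduces to a matrix product.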
In addition to the main pipeline, a secondary extraction script using a smaller DINOv2 variant (ViT-S/14) is employed for sanity checks and exploratory PCA visualizations. This auxiliary pipeline reproduces the main trends observed with the larger model and is used to validate that results are not artefacts of a single architecture.
Analysis methodology
The analysis focuses on the geometry of the DINOv2 representation space and combines local, global, and cross-model diagnostics. Rather than testing a single hypothesis, the methodology is exploratory and comparative, designed to reveal how aesthetic and unaesthetic images are organized relative to one another and across cultural groupings.
Similarity structure and representational alignment
For each model, a full cosine similarity matrix is computed across all images. These matrices encode how each image relates to every other image in representation space. To compare DINOv2 and OpenCLIP directly, a per-image representational alignment measure is used. For each image, the column of cosine similarities to all other images is treated as a similarity profile. Alignment is then computed as the Pearson correlation between the DINOv2 and CLIP similarity profiles for the same image.
This yields a per-image alignment score that reflects how similarly the two models “see” the rest of the dataset from that image’s perspective. Mean alignment values are computed separately for aesthetic and unaesthetic images.
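A minimal NumPy sketch of this per-image alignment measure, assuming both embedding matrices are L2-normalized and cover the same images in the same order:

```python
import numpy as np

def alignment_scores(emb_a, emb_b):
    """Per-image representational alignment between two models.

    emb_a: (n, d_a) and emb_b: (n, d_b) L2-normalized embeddings of the
    same n images. For each image i, its cosine-similarity profile to
    all *other* images is computed in both spaces, and the two profiles
    are compared with a Pearson correlation.
    """
    sim_a = emb_a @ emb_a.T
    sim_b = emb_b @ emb_b.T
    n = sim_a.shape[0]
    scores = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # exclude the trivial self-similarity
        scores[i] = np.corrcoef(sim_a[i, mask], sim_b[i, mask])[0, 1]
    return scores
```

Passing the same embeddings for both arguments recovers the trivial self-alignment baseline (all scores equal to 1), which is a useful sanity check.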
Neighbor-based coherence (kNN purity)
To quantify local coherence in representation space, we compute k-nearest-neighbor (kNN) purity separately for aesthetic and unaesthetic images. For each image, kNN purity measures the proportion of its k nearest neighbors that share the same label.
This metric captures how locally consistent a category is within the embedding space: higher kNN purity indicates that images from the same category tend to cluster together, while lower values indicate greater mixing or dispersion.
Unless otherwise stated, kNN purity is computed with k = 10 using cosine similarity on normalized DINOv2 embeddings.
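The metric can be sketched as follows, assuming L2-normalized embeddings so that dot products give cosine similarities:

```python
import numpy as np

def knn_purity(embeddings, labels, k=10):
    """Mean kNN purity per label on L2-normalized embeddings.

    For each image, the fraction of its k nearest neighbors (cosine
    similarity, self excluded) that share its label; results are then
    averaged within each label.
    """
    labels = np.asarray(labels)
    sim = embeddings @ embeddings.T
    np.fill_diagonal(sim, -np.inf)        # exclude self from neighbors
    nn = np.argsort(-sim, axis=1)[:, :k]  # indices of k nearest neighbors
    same = (labels[nn] == labels[:, None]).mean(axis=1)
    return {lab: float(same[labels == lab].mean()) for lab in np.unique(labels)}
```

For two well-separated clusters, both categories score a purity of 1; mixing between categories pulls the scores toward the category base rates.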
Culture-level neighbor-based coherence
To examine cultural structure, neighbor-based coherence is computed separately for each culture in DINOv2 feature space. For each cultural subset, the proportion of same-class nearest neighbors is calculated for aesthetic and unaesthetic images, providing a culture-specific measure of local representational coherence.
Differences between aesthetic and unaesthetic images within a culture are assessed by comparing their respective neighbor-based coherence values, rather than by computing distances between aggregated centroids.
Dimensionality reduction and visualization
Finally, dimensionality reduction techniques are used to visualize global structure. PCA is applied for linear inspection, while t-SNE is used to visualize non-linear clustering patterns. These plots are intended as qualitative aids rather than as evidence for strict separability.
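A minimal scikit-learn sketch of the two projections; the parameter choices here (components, perplexity cap, PCA initialization for t-SNE) are illustrative defaults, not necessarily those used in the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(embeddings, method="pca", seed=0):
    """Project embeddings to 2D for qualitative inspection."""
    if method == "pca":
        return PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    # t-SNE requires perplexity strictly below the number of samples
    perplexity = min(30, len(embeddings) - 1)
    return TSNE(n_components=2, random_state=seed,
                perplexity=perplexity, init="pca").fit_transform(embeddings)
```

Fixing the random seed keeps the t-SNE layout stable across runs, which matters because the plots are read qualitatively.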
Results
Across all analyses, consistent structural differences emerge between aesthetic and unaesthetic images in DINOv2 representation space.
Local coherence differs strongly between categories
Neighbor-based analysis shows that unaesthetic images exhibit substantially higher kNN purity than aesthetic images across all neighborhood sizes. Even for small k, unaesthetic images are much more likely to have neighbors of the same category. This pattern is stable as k increases (Figure 2).
This indicates that unaesthetic images form compact, locally coherent clusters, whereas aesthetic images are more dispersed and intermingled. Given that the unaesthetic subset is dominated by children’s drawings, this coherence likely reflects shared production constraints (line quality, color usage, compositional simplicity) rather than aesthetic judgment per se.
Cross-model alignment is higher for unaesthetic images
Representational alignment between OpenCLIP and DINOv2 is consistently higher for unaesthetic images than for aesthetic images (Figure 1). While both models trivially align with themselves (DINO→DINO baseline), CLIP’s similarity structure agrees more closely with DINO for unaesthetic images.
This suggests that different pretrained models converge more strongly on visually homogeneous material, whereas agreement decreases for the more diverse aesthetic category. Aesthetic images thus appear more model-dependent in their similarity relations.
Global structure shows partial separation, not binary clustering
Dimensionality reduction reveals partial but incomplete separation between categories (Figure 5). Unaesthetic images tend to occupy denser regions of representation space, while aesthetic images spread across a broader area. However, there is substantial overlap, and no clear decision boundary emerges.
This supports the interpretation that aesthetic distinction is not encoded as a simple binary split, but as a diffuse and heterogeneous region in representation space.
Cultural patterns are heterogeneous
Centroid similarity analyses show that cultural structure varies substantially across groups (Figures 3 and 4). Some cultures exhibit high internal similarity and clear separation between aesthetic and unaesthetic centroids, while others show much weaker differentiation. There is no single dominant cultural axis that explains the aesthetic distinction across all groups.
This variability suggests that cultural label interacts with aesthetic categorization in non-uniform ways, and that aesthetic structure learned by the model reflects both dataset composition and culturally specific visual regularities.
Quality assessment and limitations
Quality assessment
This project produces a clear empirical result: in the constructed dataset, unaesthetic images are more consistently grouped in representation space than aesthetic images. That is, images labeled as unaesthetic tend to be closer to one another, while images labeled as aesthetic are more dispersed.
Taken at face value, this suggests that, for this data and model, what counts as unaesthetic is more uniform than what counts as aesthetic. Rather than pointing toward a universal notion of beauty, the results indicate that aesthetic images span a wider and less coherent range of visual forms, whereas non-aesthetic images cluster more tightly.
This interpretation is deliberately limited to the specific datasets, labels, and representation model used in the analysis. The value of the project lies in making this asymmetry visible and in showing how it emerges from concrete data choices, rather than in advancing a general theory of aesthetics.
Limitations
Several important limitations must be acknowledged.
First, cultural labels are necessarily coarse proxies for complex and fluid identities. National or cultural categories collapse linguistic, historical, and regional variation, and different datasets operationalize these labels in inconsistent ways. While normalization improves internal consistency, it does not resolve the underlying conceptual ambiguity of cultural categorization.
In addition, cultural attribution in museum datasets is often based on provenance rather than authorship. Some objects are labeled according to the location where they were found, collected, or preserved, rather than the cultural identity of their maker. For example, artworks classified as belonging to one culture may have been produced by artists from another region, or discovered in a different historical or political context (e.g. Greek artifacts found in Roman territory). As a result, cultural labels should be interpreted as approximate indicators rather than as definitive statements of origin.
Second, dataset imbalance cannot be fully eliminated. Although this project explicitly aims to surface and partially correct cultural imbalance, many cultural traditions remain represented by only a small number of images. As a result, some observed patterns may be sensitive to sampling effects rather than reflecting stable representational structure.
Third, the definition of aesthetic and unaesthetic material introduces a significant limitation. The aesthetic subset consists largely of artworks that have survived processes of preservation, canonization, and institutional selection. These objects are heterogeneous in medium, style, and function, and their inclusion reflects historical survival rather than explicit aesthetic evaluation.
In contrast, the unaesthetic subset is currently composed primarily of children’s drawings, which introduces a strong formal and stylistic coherence unrelated to aesthetic judgment itself. This asymmetry may partially explain why unaesthetic images appear more locally coherent in representation space than aesthetic images. The observed separation may therefore reflect differences in medium, production constraints, or developmental stage rather than aesthetic value per se.
Importantly, children’s drawings are not inherently unaesthetic. They are treated as such in this project only as a pragmatic proxy for visual material that has not been curated or canonized according to dominant aesthetic norms. This assumption may introduce bias by exaggerating the apparent diversity of aesthetic images and the apparent uniformity of unaesthetic ones.
Supporting this concern, even when restricting analysis to cases where both aesthetic and unaesthetic samples consist of artworks (for example, within specific national subsets), aesthetic images still appear more dispersed than children’s drawings. While this suggests that the effect is not solely driven by medium, it does not rule out form-based or institutional factors as major contributors.
Finally, the analysis focuses on representational geometry rather than downstream task performance. While this allows insight into how models internally organize visual information, it does not directly assess how these representations affect classification, generation, or other applied tasks.
Taken together, these limitations are not treated as external noise but as central components of the research problem. They underscore the difficulty of operationalizing aesthetics computationally and motivate future work aimed at disentangling aesthetic judgment from medium, production context, cultural attribution, and historical survival.
Future improvements
This mini-project was conceived as an exploratory study rather than a definitive evaluation of cross-cultural aesthetic representation. Several extensions would substantially strengthen and expand the analysis.
A first priority would be to further expand and balance the underlying datasets. While the current work highlights structural cultural imbalance, many cultural groups remain represented by only a small number of images. Future work could focus on systematically increasing coverage for underrepresented regions and traditions, with the goal of constructing culturally balanced subsets at a much larger scale. This would allow more robust statistical comparisons and reduce sensitivity to sampling effects.
A second direction concerns model comparison. The present analysis relies primarily on a single self-supervised vision model (DINOv2), with OpenCLIP serving only as an alignment check. Extending the study to additional representation learning frameworks—such as VICReg or aesthetic-specific scoring models—would make it possible to test whether observed patterns are model-dependent or reflect more general properties of learned visual representations. Such comparisons would also allow the project to engage more directly with recent claims of representational convergence, including the Platonic Representation Hypothesis (Huh et al., 2024).
Another promising extension would be to examine intentionally non-canonical or anti-aesthetic visual material. Beyond children’s drawings, future work could incorporate artistic movements that deliberately challenge conventional notions of beauty (e.g. certain strands of modernism, conceptual art, or kitsch). While such categories are inherently subjective and difficult to operationalize, they may offer valuable contrastive cases for understanding how models encode deviations from dominant aesthetic norms. In addition, a large-scale collection of visually neutral, everyday imagery scraped from the internet could be introduced as a third reference category, positioned neither explicitly aesthetic nor unaesthetic. Including such neutral material would help disentangle whether observed structure arises from aesthetic judgment itself or from more general visual regularities.
Finally, this project could be extended into a fully multimodal setting. By incorporating text descriptions, curatorial metadata, or captions, future work could explore how aesthetic structure emerges jointly across image–text representations and whether aesthetic convergence persists across modalities. This direction would enable a deeper dialogue between computational analysis and philosophical debates about aesthetic universality and subjectivity.
Together, these extensions would transform the present exploratory study into a broader investigation of how aesthetic categories, cultural histories, and representational geometry interact in contemporary machine learning systems.
GitHub repository
The full codebase, preprocessing scripts, and documentation are available at:
https://github.com/qxift/esthetics
References
1. Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv. https://doi.org/10.48550/arXiv.2405.07987
2. Zangwill, N. (2024). Aesthetic Judgment. In E. N. Zalta & U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy (Fall 2024 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/entries/aesthetic-judgment/
3. Andow, J. (2022). Further exploration of anti-realist intuitions about aesthetic judgment. Philosophical Psychology, 35(5), 621–661. https://doi.org/10.1080/09515089.2021.2014440
4. Benjamin, W. (1940). Theses on the Philosophy of History.
Credits
Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisor: Alexander Rusnak
Authors: Anastasia Meijer, Marguerite Novikov