Universal Aesthetics (Multimodal Focus)

From FDHwiki

Revision as of 21:42, 27 November 2025

Introduction

Methods

Data

To study the convergence of language models, we need both plain texts and aesthetic texts. For simplicity, we reuse the text-image dataset that is also used in Huh et al.'s paper, and add a separate poem dataset.

Plain Text

Poems

For poems, we use the Poems dataset from Kaggle. We find this dataset well suited to this project for the following reasons:

  • It provides enough poems to yield a substantial amount of data alongside the plain-text dataset, which contains 1,024 entries.
  • It categorizes the poems into 135 types based on their form (haiku, sonnet, etc.), which could facilitate our further studies.

However, this dataset needs to be cleaned before use. We identify two problems with the raw data. First, some poems end with a copyright notice, which introduces noise into subsequent processing. Because the notice is clearly marked with the © symbol, it can be removed easily with rule-based filtering. Second, although most poems are in English, a small portion is not. Since the plain-text dataset contains only English texts, we also remove the non-English poems from this dataset. The following poem from the raw dataset shows a trailing copyright notice:

Afterward is an unknown term in future
Before that we face the present,
Coming at well future depends on present;
Dismissing hazardous future
Endeavor best early at present.
Copyright © Muzahidul Reza | 29 November,2017
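The two cleaning steps described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names (strip_copyright, looks_english, clean_poems), the stopword list, and the 0.15 threshold are all hypothetical. The English check is a crude stopword-ratio heuristic standing in for a proper language-identification step, which a real pipeline would likely prefer.

```python
import re

# Assumption: a small, hand-picked set of common English function words.
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "is", "in", "of", "to", "a", "that",
    "we", "at", "an", "on", "it", "for", "with",
})


def is_copyright_line(line: str) -> bool:
    """Rule: the line contains the © symbol or starts with 'copyright'."""
    return "\u00a9" in line or line.strip().lower().startswith("copyright")


def strip_copyright(poem: str) -> str:
    """Remove blank lines and copyright notices from the end of a poem."""
    lines = poem.rstrip().splitlines()
    while lines and (not lines[-1].strip() or is_copyright_line(lines[-1])):
        lines.pop()
    return "\n".join(lines)


def looks_english(poem: str, min_ratio: float = 0.15) -> bool:
    """Heuristic: enough of the Latin-letter tokens are English stopwords.

    Text with no Latin-letter tokens at all is treated as non-English.
    """
    words = re.findall(r"[a-z']+", poem.lower())
    if not words:
        return False
    hits = sum(word in ENGLISH_STOPWORDS for word in words)
    return hits / len(words) >= min_ratio


def clean_poems(poems: list[str]) -> list[str]:
    """Strip copyright notices, then keep only poems that look English."""
    stripped = (strip_copyright(poem) for poem in poems)
    return [poem for poem in stripped if looks_english(poem)]
```

Applied to the example poem above, strip_copyright drops the final "Copyright © ..." line and leaves the five verse lines untouched. The stopword heuristic is deliberately simple and can misclassify very short poems, so its threshold would need tuning on the actual data.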
Column 1 Column 2
Content A1 Content A2
Content B1 Content B2

References