Chinese Cookbook: Difference between revisions
Line 138: | Line 138: | ||
=== Data Processing === | === Data Processing === | ||
==== Construct dataset ==== | ==== Construct dataset ==== | ||
After obtaining data from the website, it is necessary to clean and organize the data into the following structure: Food_Name, Effect, Ingredients, Steps. | |||
==== Categorize ==== | ==== Categorize ==== | ||
To conduct further data analysis and implement the search function for the website, the initial step involves categorizing our data. This categorization includes 15 kinds of recipe categories, 10 kinds of cooking methods, 13 kinds of ingredient categories, and 9 kinds of effects. The detail categories are as followed, | |||
- Recipe categories: | |||
- Cooking methods: | |||
- Ingredient categories: | |||
- Effect categories: | |||
==== Translation: Ancient Chinese to Modern Chinese ==== | ==== Translation: Ancient Chinese to Modern Chinese ==== |
Revision as of 22:41, 17 December 2023
Introduction
Motivation and description of the deliverables
Project Plan and Milestones
Weekly Project Plan
Date | Data Collection | Data Processing | Data Analysis | Web Construction |
---|---|---|---|---|
Week 3 | Search historical Chinese cookbooks and compare them | |||
Week 4 | Choose one historical Chinese cookbook
|
|||
Week 5 | Get data from the website | First clean and sort the data
|
||
Week 6 | Construct the dataset of ingredients
|
|||
Week 7 | Categorize data | |||
Week 8 | Analyse cooking method, effect, category, and ingredient frequency
|
Start web construction | ||
Week 9 | Analyse ingredient and ingredient category pairing
|
Continue web construction | ||
Week 10 | Analyse effect and ingredient pairing
|
Continue web construction
| ||
Week 11 | Modren Chinese to English translation | Continue web construction
| ||
Week 12 | Modren Chinese to English translation | Continue web construction
| ||
Week 13 | Finalize and improve the website | |||
Week 14 | Prepare the Wikipedia page & final presentation |
Milestone 1
- Prepare a project proposal and the goal and objective of the project
- Get Chinese cooking book data from the website
Milestone 2
- Clean the data and construct the datasets for the Chinese cooking book
- Translate from Ancient Chinese and Modern Chinese
- Categorize the data depending on ingredient, effect, category, and cooking method
Milestone 3
- Data Analysis
- Web construction and recipe filtering and recommendation system
- Prepare final presentation and Wikipedia page
Methods
Data Collection
"Yinshanzhengyao" was published in 1330 during the Yuan dynasty, and all existing editions are derived from the Ming dynasty edition of 1456. Despite the presence of a scanned version of the book on the internet, Optical Character Recognition (OCR) poses a challenge due to the ancient Chinese text and the inclusion of illustrations. Fortunately, the Chinese Text Project (中國哲學書電子書計劃) has undertaken the noble initiative of providing open access to ancient Chinese books for both Chinese and non-Chinese scholars, resulting in the creation of a comprehensive database. Currently, it encompasses over thirty thousand books, making it the largest among historical Chinese literature databases.
"Yinshanzhengyao" is among the books included in this extensive database. Leveraging the well-defined structure of the database, we scrape data from the website. Given the project's specific focus on the recipes within the book, our data extraction is limited to the recipe content, which includes the following chapter:
- Strange Delicacies of Combined Flavours 1, 2, 3
- Various Hot Beverages and Concentrates
- Foods that Cure Various Illnesses
In total, there are 210 recipes, each accompanied by information on its effects, ingredients with quantities, and step-by-step instructions. As a medical text, the "effect" refers to the benefits of the food and precautions to be taken, providing valuable insights into the medicinal properties of the recipes.
Data Processing
Construct dataset
After obtaining data from the website, it is necessary to clean and organize the data into the following structure: Food_Name, Effect, Ingredients, Steps.
Categorize
To conduct further data analysis and implement the search function for the website, the initial step involves categorizing our data. This categorization includes 15 kinds of recipe categories, 10 kinds of cooking methods, 13 kinds of ingredient categories, and 9 kinds of effects. The detail categories are as followed, - Recipe categories: - Cooking methods: - Ingredient categories: - Effect categories:
Translation: Ancient Chinese to Modern Chinese
The text is written in ancient Chinese, but contemporary communication predominantly employs modern Chinese. Consequently, for in-depth data analysis, it is essential to translate the recipes from ancient Chinese to modern Chinese. In our evaluation, we compared the proficiency of an ancient Chinese to modern Chinese translation model (Figure 2) against ChatGPT 3.5 (Figure 3). Our findings indicate that the translations generated by ChatGPT are more fluent and closely aligned with contemporary language usage. Based on this observation, we have chosen to adopt ChatGPT as our primary translation tool.
English Translation
Data Analysis
Website
Quality assessment??
Discussion and limitations
Links
- GitHub: https://github.com/changchuntzu0618/DH405-CookingBook