France: Exploring Historical Cookbooks

Latest revision as of 10:46, 22 December 2022

Introduction

Cuisine has an important place in the cultural heritage of France. In the 21st century, the great classics of French cuisine can be found in starred restaurants of most cities of France and even all around the world. But above all, French cuisine owes its current prestige to the different regional cuisines that were developed over several hundred years, taking advantage of the geographical and cultural specificities of each region.

This is at least the view of Mr. Curnonsky, who travelled the regions of France in the early 20th century in search of the traditional regional recipes that are the pillars of the French cuisine we know today. His book Recettes des Provinces de France, published in 1962 [Figure 1], gathers many traditional recipes that he collected all around France.

At a time when so much knowledge is shared online, it has become easy to obtain information on the history of French cuisine or to find contemporary recipes. However, a significant amount of culinary knowledge and practice is still stored in books that are much more difficult to access. This knowledge would benefit from being digitalized, both to share it with as many people as possible and to take advantage of the latest computational techniques for more in-depth analyses.

This project is hence an exploration of a historical French cookbook. Our main focus is the digitalization of a historical cookbook and its challenges, from the physical book to a clean structured dataset. In addition, we analyse the collected data to better understand the French cuisine of the previous century. We use the cookbook by Mr. Curnonsky mentioned above as an example to answer our research questions.

Research Questions

More specifically, we aim to answer the following research questions:

  • What are the steps and difficulties when digitalizing an old cookbook?
  • What knowledge can be extracted from a cookbook and what information can it provide about the culture and practices of the region at that time?
  • From Mr. Curnonsky's cookbook, what can we say about the regional cuisine of France in the early 20th century?

Project Plan and Milestones

Weekly progress of the project

Date | Data Collection | Data Processing | Data Analysis
Week 4 | Collect and compare multiple historical cookbooks | |
Week 5 | Choose one historical French cookbook & scan the physical book | |
Week 6 | Split the .png files by subregions & perform the OCR of each page | |
Week 7 | Construct the dataset of recipes | |
Week 8 | | |
Week 9 | | |
Week 10 | | |
Week 11 | Adapt the dataset to facilitate the data processing | Create constants for Region2Subregions, Units & Categories of ingredient |
Week 12 | | Data processing of the ingredients | Exploratory Data Analysis of the dataset
Week 13 | Improve the dataset based on the EDA | Improve the processing of the ingredients | Overall analysis & Per region analysis
Week 14 | Prepare the Wikipedia page & final presentation | |

Milestone 1

  • Prepare a project proposal & discuss the objectives of the project.
  • Obtain a digital version of the cookbook.

Milestone 2

  • Construct the dataset of recipes for the selected book.
  • Define the pipeline of processing and analysis to answer the chosen research questions.

Milestone 3

  • Process the list of ingredients for each recipe.
  • Improve the dataset based on an exploratory data analysis.
  • Analyse the dataset of recipes in general, per subregions and per regions.
  • Prepare a final presentation and a Wikipedia page to display the results of the project.

Data

The book used for this project is Recettes des Provinces de France, written by Mr. Curnonsky, a renowned gastronomic critic of the early 20th century, and published in 1962.

The book is divided into 6 large regions and 30 subregions of France, listed in [Table 1]. Each subregion contains a description of the region, a set of recipes, information about specific ingredients (e.g. local wine or cheese) and images. For this project, we focused only on the set of recipes.

[Figure 1] Recettes des Provinces de France by Curnonsky
[Table 1] Regions and subregions of France
Region | Subregions
Paris, Ile-de-France, Val de Loire | Paris, Ile-de-France, Orléans, Touraine
Pays de l’Ouest | Anjou, Bretagne, Poitou Vendée, Charentes
Sud-Ouest & Pyrénées | Bordelais, Gascogne, Pays Basque, Roussillon, Périgord, Languedoc
Sud-Est & Méditerranée | Provence, Nice, Corse, Dauphiné, Savoie, Lyon, Auvergne, Limousin
Bourgogne, Champagne, Bresse, Franche-Comté, Alsace, Lorraine | Bourgogne, Champagne, Bresse, Franche-Comté, Alsace, Lorraine
Nord & Normandie | Nord, Normandie
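
The project plan mentions a Region2Subregions constant. A minimal sketch of what such a constant could look like, transcribed from [Table 1], is given here. The inverse lookup SUBREGION2REGION is a hypothetical helper added for illustration and is not taken from the project code.

```python
# Region -> subregions, transcribed from Table 1.
REGION2SUBREGIONS = {
    "Paris, Ile-de-France, Val de Loire": [
        "Paris", "Ile-de-France", "Orléans", "Touraine",
    ],
    "Pays de l’Ouest": ["Anjou", "Bretagne", "Poitou Vendée", "Charentes"],
    "Sud-Ouest & Pyrénées": [
        "Bordelais", "Gascogne", "Pays Basque", "Roussillon", "Périgord", "Languedoc",
    ],
    "Sud-Est & Méditerranée": [
        "Provence", "Nice", "Corse", "Dauphiné", "Savoie", "Lyon", "Auvergne", "Limousin",
    ],
    "Bourgogne, Champagne, Bresse, Franche-Comté, Alsace, Lorraine": [
        "Bourgogne", "Champagne", "Bresse", "Franche-Comté", "Alsace", "Lorraine",
    ],
    "Nord & Normandie": ["Nord", "Normandie"],
}

# Hypothetical inverse lookup: subregion -> region, useful for per-region analysis.
SUBREGION2REGION = {
    subregion: region
    for region, subregions in REGION2SUBREGIONS.items()
    for subregion in subregions
}
```

With this layout, a recipe tagged with its subregion can be assigned to its region with a single dictionary lookup.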

Methodology

Data digitalization

From a physical book to digital images

The first step of the digitalization process was to scan the book page by page using a professional book scanner. Adjusting settings such as luminosity and contrast produced very good quality images with very little noise.

Using an OCR model to retrieve the textual content

After collecting all the pages containing recipes for each subregion in .png format [Figure 2], we used Tesseract to perform Optical Character Recognition (OCR). This software converts .png images into .txt files by retrieving the text contained in the images. Thanks to recent advances in natural language processing, in particular the emergence of transformers, the latest models perform this task very accurately, even on French text. Hence, most of the textual data could be extracted with the OCR.
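
The exact way Tesseract was invoked is not described here. As a rough sketch, one page could be processed by calling the Tesseract command-line tool from Python. This assumes that the tesseract executable and its French language data (fra) are installed. The helper names tesseract_command and ocr_page are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(png_path, lang="fra"):
    """Build the Tesseract CLI call for one scanned page."""
    png_path = Path(png_path)
    # Tesseract takes an output base name and appends ".txt" itself.
    output_base = png_path.with_suffix("")
    return ["tesseract", str(png_path), str(output_base), "-l", lang]


def ocr_page(png_path, lang="fra"):
    """Run Tesseract on one page and return the recognized text."""
    png_path = Path(png_path)
    subprocess.run(tesseract_command(png_path, lang), check=True)
    return png_path.with_suffix(".txt").read_text(encoding="utf-8")
```

Looping ocr_page over the .png files of a subregion would then yield one .txt file per page.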

However, human intervention was still required for two reasons. First, there is always some noise, coming from the original physical page, from the scanning process or from the recognition step of the model. For example, when digitalizing this book, many occurrences of the letter ‘c’ were recognized as ‘o’, possibly because of the font used in the book. Second, because of the specific layout of the recipes, the output of the OCR model still needs some pre-processing. As can be seen in [Figure 3], the OCR first reads the ingredient lists of all the recipes on the page and only then extracts the recipe bodies, so the output has to be reordered recipe by recipe.
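
The ‘c’ read as ‘o’ confusion can in part be repaired automatically. The sketch below is one possible approach and not the procedure used in this project: when a word is unknown, it tries turning some of its ‘o’ letters back into ‘c’ and keeps the result only if it is a known word. The function name and the small vocabulary are assumptions made for illustration.

```python
from itertools import product


def fix_c_o_confusion(word, vocabulary):
    """Return a known word obtained by turning some 'o' into 'c', else the word unchanged."""
    if word in vocabulary:
        return word
    positions = [i for i, ch in enumerate(word) if ch == "o"]
    # Try every combination of 'o' positions; acceptable for short words.
    for choice in product([False, True], repeat=len(positions)):
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "c"
        candidate = "".join(chars)
        if candidate in vocabulary:
            return candidate
    return word


# Hypothetical vocabulary; in practice it could be built from the ingredient tables.
vocabulary = {"cannelle", "poivre", "champignon"}
corrected = [fix_c_o_confusion(w, vocabulary) for w in ["oannelle", "poivre", "ohampignon"]]
```

Words that cannot be matched are returned unchanged, so they can still be reviewed by hand.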

Data pre-processing to structure the information


Data processing

In our project, we extract the following fields for each ingredient listed in a recipe:

  • quantity: the amount of the ingredient
  • unit: the unit in which the quantity is expressed
  • ingredient: the name of the ingredient as it appears in the recipe
  • category: the major category the ingredient belongs to


Units

Type | Units
Spoons | cuil. à café, cuil. café, cuil. à soupe, cuil. soupe, petite cuil., grande cuil., cuil.
Glasses | petit verre, verre à liqueur, verres à liqueur, verres, verre, tasses, tasse
Bottles | bout., bouteilles, bouteille
Containers | grande boîte, boîtes, boîte, tubes, tube
Spices & Aromatic plants | gousses, gousse, branches, branche, bâtons, bâton, pincée
Meat related | membres, membre, tronçons, tronçon, tranches, tranche
Standard measures | litres, litre, cl, dl, kg, g, l
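
To illustrate how quantity, unit and ingredient can be separated, here is a minimal sketch based on a regular expression. It is not the project's actual parser. The function parse_ingredient is hypothetical, and it assumes ingredient lines of the form "250 g de beurre". Units are tried from longest to shortest so that "cuil. à soupe" is matched before "cuil.", and "litre" before "l".

```python
import re

# Units transcribed from the table above.
UNITS = [
    "cuil. à café", "cuil. café", "cuil. à soupe", "cuil. soupe",
    "petite cuil.", "grande cuil.", "cuil.",
    "petit verre", "verre à liqueur", "verres à liqueur", "verres", "verre",
    "tasses", "tasse", "bout.", "bouteilles", "bouteille",
    "grande boîte", "boîtes", "boîte", "tubes", "tube",
    "gousses", "gousse", "branches", "branche", "bâtons", "bâton", "pincée",
    "membres", "membre", "tronçons", "tronçon", "tranches", "tranche",
    "litres", "litre", "cl", "dl", "kg", "g", "l",
]

# Longest units first, so that the most specific unit wins.
_UNIT_PATTERN = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
_LINE = re.compile(
    r"^(?:(?P<quantity>\d+(?:[.,]\d+)?)\s+)?"    # optional number, e.g. 250 or 1,5
    rf"(?:(?P<unit>{_UNIT_PATTERN})(?=\s)\s*)?"  # optional unit followed by a space
    r"(?:de\s+|d['’]\s*)?"                       # optional "de" or "d'"
    r"(?P<ingredient>.+)$",
    re.IGNORECASE,
)


def parse_ingredient(line):
    """Split one ingredient line into quantity, unit and ingredient."""
    line = line.strip()
    if not line:
        return None
    match = _LINE.match(line)
    quantity = match.group("quantity")
    return {
        "quantity": float(quantity.replace(",", ".")) if quantity else None,
        "unit": match.group("unit"),
        "ingredient": match.group("ingredient"),
    }
```

For example, parse_ingredient("2 gousses d'ail") yields quantity 2.0, unit "gousses" and ingredient "ail". Fractions such as "1/2" are not handled by this sketch.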

Categories

We map each ingredient to one of several major categories:

Category | Ingredients
Viande (Meat) | viande, oie, canard, oiseau, lard, bœuf, veau, poule, poulet, poularde, volaille, porc, caille, caneton, mouton, cochon, coq, chevreuil, lièvre, levraut, lapin, faisan, gibier, jambon, chorizo, cervelas, agneau, escargot, grenouille
Poisson (Fish) | poisson, brochet, carpe, morue, lamproie, lotte, maquereau, omble, rouget, sardine, thon, truite, anchois, anguille, merlan, sole, barbue, turbot, raie, perche, saumon, colin, goujon, loup, congre, rascasse, grondin, merlu, merluza, hareng, alose, brême
Fruit de mer (Seafood) | crevette, langouste, moule, écrevisse, palourde, homard, chiperon, seiche, huître, coquille, poulpe
Alcool (Alcohol) | alcool, bière, vin, cidre, fine, liqueur
Plante aromatique (Aromatic plant) | bouquet garni, ail, anis, aromate, angélique, basilic, persil, sarriette, cerfeuil, ciboule, ciboulette, clou de girofle, clous de girofle, girofle, cive, câpre, estragon, feuille de vigne, fines herbes, laurier, menthe, pissenlit, romarin, thym
Epice (Spice) | cannelle, coriandre, curry, safran, poivre, sel, moutarde, muscade, paprika, piment, sauge, serpolet, épices
Produit laitier (Dairy product) | lait, crème, fromage, gruyère, parmesan
Légume (Vegetable) | artichaut, asperge, aubergine, bette, betterave, cardon, chou, cornichon, courgette, cresson, céleri, fenouil, légume, navet, panais, poireau, pomme de terre, pommes de terre, potiron, rave, salade, tomate, échalote, épinard
Fruit (Fruit) | abricot, banane, cerise, coing, fraise, framboise, groseille, raisin, olive, orange, pomme
Agrume (Citrus) | citron, cédrat, fleur d'oranger, fleurs d'oranger
Céréale (Cereal) | farine, pain, pâte, riz
Légumineuse (Legume) | févette, haricot, pois
Fruit sec (Nut) | amande, noix, noisette
Champignon (Mushroom) | champignon, truffe, cèpe, girolle, morille, levure, oronge, duelle
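
How the keywords are matched against ingredient names is not specified here. The sketch below shows one possible strategy on an excerpt of the table. The constant CATEGORY2KEYWORDS and the function categorize are hypothetical names. Keywords are tried from longest to shortest so that "pomme de terre" is classified as Légume before "pomme" can match as Fruit, and whole-word matching avoids classifying "vinaigre" as "vin".

```python
import re

# Excerpt of the category table; the full table has more keywords per category.
CATEGORY2KEYWORDS = {
    "Viande": ["bœuf", "veau", "poulet", "lard", "jambon"],
    "Poisson": ["brochet", "truite", "sole", "saumon"],
    "Alcool": ["vin", "cidre", "bière"],
    "Plante aromatique": ["bouquet garni", "ail", "persil", "thym", "laurier"],
    "Epice": ["sel", "poivre", "muscade"],
    "Légume": ["pomme de terre", "pommes de terre", "tomate", "chou"],
    "Fruit": ["pomme", "cerise", "raisin"],
}

# (keyword, category) pairs, longest keyword first.
_KEYWORDS = sorted(
    ((kw, cat) for cat, kws in CATEGORY2KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def categorize(ingredient):
    """Return the category of the first (longest) keyword found, or None."""
    text = ingredient.lower()
    for keyword, category in _KEYWORDS:
        # Whole-word match, allowing a plural ending in "s" or "x".
        if re.search(rf"\b{re.escape(keyword)}[sx]?\b", text):
            return category
    return None
```

Ingredients that match no keyword return None and can be listed for manual review, which is one way to extend the table iteratively.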

Data analysis and visualization

Dataset Overview

The dataset contains a total of 352 different recipes from 6 regions and 30 subregions.

Top 10 most used ingredients.

Rank | Ingredient | Number of occurrences | Picture
1) | Beurre | 180 | Beurre.png
2) | Sel | 167 | Sel.jpg
3) | Poivre | 146 | Poivre.jpeg
4) | Œufs | 101 | Œufs.jpeg
5) | Oignons | 95 | Oignons.jpeg
6) | Farine | 89 | Farine.png
7) | Persil | 82 | Persil.png
8) | Ail | 76 | Ail.jpeg
9) | Vin blanc | 69 | Vin balanc.jpeg
10) | Bouquet garni | 46 | Bouquet garni.png
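
A ranking of this kind can be computed with a simple counter. The sketch below assumes that each recipe is a dictionary with an "ingredients" list and that an ingredient is counted once per recipe. Both the data layout and the counting rule are assumptions, not a description of the project code.

```python
from collections import Counter


def top_ingredients(recipes, n=10):
    """Count in how many recipes each ingredient appears and return the n most common."""
    counts = Counter()
    for recipe in recipes:
        # set() so that an ingredient listed twice in one recipe is counted once.
        counts.update(set(recipe["ingredients"]))
    return counts.most_common(n)


# Toy data for illustration only.
toy_recipes = [
    {"ingredients": ["beurre", "sel", "beurre"]},
    {"ingredients": ["beurre", "sel", "farine"]},
    {"ingredients": ["beurre"]},
]
ranking = top_ingredients(toy_recipes, n=2)
```

On the toy data, "beurre" is counted three times and "sel" twice.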



Region Analysis

The heatmap shows that "Plante aromatique" and "Epice" are used frequently in all six major regions, while "Fruit sec" is the least frequently used category.

Heatmap of categories by region.
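
The values behind such a heatmap can be obtained by counting categories per region. The following sketch assumes that each recipe is a dictionary with "region", "subregion" and "categories" keys and that a category is counted once per recipe. The function name category_counts and this data layout are hypothetical.

```python
from collections import Counter, defaultdict


def category_counts(recipes, level="region"):
    """Return {region or subregion: Counter({category: number of recipes using it})}."""
    table = defaultdict(Counter)
    for recipe in recipes:
        # Count each category once per recipe; quantities are ignored.
        table[recipe[level]].update(set(recipe["categories"]))
    return table
```

Calling the function with level="subregion" gives the counts for the subregion heatmap. The resulting table can then be passed to any plotting library.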

Subregion Analysis

At the subregion level, "Plante aromatique" and "Epice" are heavily used in Paris, Bretagne, and Bourgogne. Bourgogne also has more fish recipes than the other subregions. Paris and Bretagne have more recipes using alcohol than the other subregions. Périgord has more recipes with meat and mushrooms.

Heatmap of categories by subregion.

Co-occurrence Analysis

"Plante aromatique" and "Epice" co-occur most often. Each of them also frequently co-occurs with "Viande", "Légume", and "Céréale".

Matrix of co-occurrences by subregion.
Map of France.
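
A co-occurrence count between categories can be sketched as follows, assuming each recipe carries a "categories" list. Two categories co-occur when they appear in the same recipe. The function name and data layout are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def category_cooccurrences(recipes):
    """Count, for each pair of categories, the number of recipes containing both."""
    pairs = Counter()
    for recipe in recipes:
        # Sort so that each pair is always stored in the same order.
        categories = sorted(set(recipe["categories"]))
        pairs.update(combinations(categories, 2))
    return pairs
```

Restricting the input to the recipes of one subregion gives the co-occurrence matrix for that subregion.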

Discussion and limitations

Steps for Digitalizing an Old Cookbook

  1. Scanning: The first step is to scan the cookbook. This can be done with a standard scanner or a digitizing device. Make sure to scan the pages at a high resolution for clarity.
  2. Data Entry: Once the pages have been scanned, the next step is to enter the information into digital format. Data entry should include the recipe names, ingredients, regions, notes from the book, and any other relevant information.
  3. Metadata: Metadata, or “data about data”, should also be added to the project. This includes the book’s title, author, publisher, publication year, and any other associated content.
  4. Check for Consistency: Once the data entry is complete, the project should be checked to make sure that all of the entries are consistent. Pay attention to recipe names, spelling, and formatting.
  5. Compile and Store: Finally, all of the components of the data project should be compiled into a single file and properly stored.

Difficulties

  • Skewed images of the pages may be produced if the scanning device is not correctly calibrated.
  • The dataset curation process can be time-consuming and errors can occur if the text is not carefully proofread.
  • The OCR software may not recognize the layout of the recipes, so manual data entry may be necessary.

Knowledge Extracted from a Cookbook

A cookbook can provide a wealth of knowledge about the culture and practices of a specific region. The ingredients, flavors, and cooking techniques used in a cookbook can provide insight into the culinary heritage of the area. Additionally, notes from the author or other contributors may provide information about regional customs and traditions.

From Mr. Curnonsky’s Cookbook

Mr. Curnonsky’s cookbook provides insight into the regional cuisine of France in the early 20th century. The book contains recipes for classic French dishes such as soufflés, omelets, and tarts. Various sauces and pastries also feature prominently in the book. The emphasis on traditional techniques and ingredients provides a window into the culinary culture of France during this period.

Limitations

Like any research project, this one has its limitations. For example, the data analysis relies on a rough count of categories and does not take ingredient quantities into account.

Future work

  • Build a search engine that would display the recipes and add filters to search them by name, region or ingredients
  • Build a user-friendly interface to visualize the results of the analysis
  • Compare the results with other cookbooks from different periods or different countries

Links

GitHub repository

Scanned book

OCR result