Jerusalem: locating the colonies and neighborhoods

From FDHwiki

Revision as of 20:24, 21 December 2022

Introduction

The goal of this project is to study the construction of neighborhoods in Jerusalem over time. We collect information about Jerusalem neighborhoods from four sources: the book Jerusalem and its Environs, the Wikipedia category Neighbourhoods of Jerusalem, the Wikidata list of places in Jerusalem, and the Wikidata entity Neighborhood of Jerusalem. Each source covers the neighborhoods with a different focus. We merge their content through matching methods and present it on an organized, visual web page whose map interface gives users a clear picture of Jerusalem's neighborhoods. The matching approach can also be applied to other cities that are described by multiple sources of information.

Motivation

Jerusalem

The study of the geography and chronology of neighborhoods in Jerusalem can provide valuable insights into the city's past and present. The location of a neighborhood can often reflect the social, economic, and political forces that shaped it, as well as the cultural traditions and values of its residents.

Examining the founding year of a neighborhood can also provide insight into the city's history and development. Visualizing the location and founding year of neighborhoods in Jerusalem is therefore a useful tool: by mapping and analyzing these data, it is possible to gain a deeper understanding of the cultural, social, and economic dynamics of different neighborhoods and the forces that have shaped them.

A city with a history as rich and varied as Jerusalem's is described in many different accounts. These accounts, drawn from various sources, are an important basis for studying the city, and integrating the information they contain is one of the focuses of our research.

Existing interactive maps of Jerusalem neighborhoods usually include neither neighborhoods that no longer exist nor the dates when neighborhoods were built. Our work aims to fill this gap in the study of the history of Jerusalem's neighborhoods.

Deliverables

  • OCR results for the information on the development of Jerusalem neighborhoods in Jerusalem and its Environs.
  • Crawler results from the Wikipedia category Neighbourhoods of Jerusalem, the Wikidata list of places in Jerusalem, and the Wikidata entity Neighborhood of Jerusalem.
  • An integrated database that combines the information sources through exact (perfect) matching and fuzzy matching.
  • An interactive and user-friendly webpage showing the changes in neighborhoods of Jerusalem with time, which contains:
    • A timeline feature that illustrates the evolution of the construction of neighborhoods in Jerusalem over time.
    • A search function that enables users to search for neighborhoods by name.
    • A dedicated sub-page that contains relevant information for each neighborhood.
    • TODO

Methodology

Data collection

OCR method for paper book

Jerusalem and its Environs

Jerusalem and its Environs is a book written by Ruth Kark in 2001 that provides detailed information about the development of Jerusalem neighborhoods across different time periods. We used OCR to extract the relevant information from the book and then proofread the results manually, because punctuation marks and annotations in some neighborhood names reduce OCR accuracy. The data from the book includes each neighborhood's name, year of foundation, number of inhabitants, and initiating entity, with remarks in some cases.

This source gives us the year of foundation for Jerusalem neighborhoods, a crucial aspect of our study. However, not every neighborhood has a precise year of construction. When the foundation year is given as an interval or is otherwise ambiguous (e.g., 1900s or end of Mandate), we use the first year of the period for further analysis.
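The first-year rule can be illustrated with a minimal sketch. The function name first_year is hypothetical, and the sketch assumes that the year appears as four digits somewhere in the string. Phrases without an explicit year, such as "end of Mandate", return None here and would need a year assigned by hand; the source does not state which year was chosen in those cases.

```python
import re
from typing import Optional


def first_year(raw: str) -> Optional[int]:
    """Return the first four-digit year found in a founding-year string.

    Hypothetical helper: intervals such as "1891-1894" and decades such as
    "1900s" both reduce to their first year. Strings without any four-digit
    year return None so that they can be resolved manually.
    """
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None
```

Returning None instead of guessing keeps the manually resolved cases visible in the data.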

The following is an example of the comparison of raw data, OCR result, and manually checked result:

TODO

Crawler method for Wikipedia and Wikidata information

We use the Wikidata API together with the requests package and the BeautifulSoup parser to implement a crawler that retrieves data from the internet (primarily Wikipedia and Wikidata). For Wikipedia, we mainly collect the coordinates of each neighborhood. For Wikidata, we pay particular attention to neighborhoods with an 'inception' attribute, which serves as another source for the founding year.

It should be noted that there is a significant amount of overlap in the internet information, requiring further data cleaning and matching. We store this data in different dataframes for further processing.
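As an illustration of reading the 'inception' attribute, the following sketch queries the Wikidata wbgetentities action and extracts the year from property P571 (inception). It is not the project's crawler: it uses the standard-library urllib instead of requests so that it has no dependencies, and the function names and the User-Agent string are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

API_URL = "https://www.wikidata.org/w/api.php"


def fetch_entity(qid: str) -> dict:
    """Download the claims of one Wikidata entity (requires network access)."""
    query = urllib.parse.urlencode(
        {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}
    )
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": "fdh-jerusalem-sketch/0.1"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["entities"][qid]


def inception_year(entity: dict) -> Optional[int]:
    """Return the year of the first usable 'inception' (P571) claim, if any."""
    for claim in entity.get("claims", {}).get("P571", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        time_string = value.get("time", "")  # format: "+1892-00-00T00:00:00Z"
        if time_string.startswith("+") and time_string[1:5].isdigit():
            return int(time_string[1:5])
    return None
```

Keeping the download and the parsing in separate functions means the parsing step can be checked against a stored entity without contacting Wikidata.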

The following are examples of data from the Wikipedia category Neighbourhoods of Jerusalem, the Wikipedia list Places of Jerusalem - Neighborhoods, and the Wikidata entity neighborhood of Jerusalem, respectively.

TODO

Data matching

Data matching workflow

Preprocessing data from Wikipedia sources

To deal with the overlap between the two Wikipedia sources, we first identify neighborhoods with identical names and merge them. We then check whether two neighborhoods with different names are in fact the same neighborhood by testing whether their links redirect to each other. Throughout this process, we retain information about the source of each record. The preprocessed neighborhood names are then used for matching with the neighborhoods from the book.
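The merging step described above could look roughly like the sketch below. It assumes each record is a dictionary with a "name" and a "target" field, where "target" is the page title reached after following redirects. These field names, the source labels and the function name are hypothetical, and the redirect resolution itself is assumed to have happened during crawling.

```python
from typing import Dict, List


def merge_sources(category: List[dict], listing: List[dict]) -> List[dict]:
    """Merge records that share a name or resolve to the same page."""
    merged: List[dict] = []
    by_name: Dict[str, dict] = {}
    by_target: Dict[str, dict] = {}
    tagged = [(r, "category") for r in category] + [(r, "list") for r in listing]
    for record, source in tagged:
        name_key = record["name"].strip().lower()
        target = record.get("target")  # page title after following redirects
        entry = by_name.get(name_key) or (by_target.get(target) if target else None)
        if entry is None:
            # First time this neighborhood is seen: start a new merged entry.
            entry = {"name": record["name"], "aliases": set(), "sources": set()}
            merged.append(entry)
        elif record["name"] != entry["name"]:
            # Same neighborhood under another spelling: keep it as an alias.
            entry["aliases"].add(record["name"])
        entry["sources"].add(source)  # retain where each record came from
        by_name[name_key] = entry
        if target:
            by_target[target] = entry
    return merged
```

With illustrative input such as a category entry "Talbiya" and a list entry "Talbiyeh" that both resolve to the same page, the sketch produces one merged entry that records both sources and keeps the second spelling as an alias.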

Matching data from the book and Wiki

Database establishment

In this project, we use MySQL as the database management system. We install MySQL locally and create a new database, then define its structure by creating tables and specifying their columns. The preprocessed data is imported into the tables with the LOAD DATA INFILE statement. The database is accessed from Python through MySQL Connector, using the mysql.connector.connect() function, and SQL queries are used throughout the project to retrieve and manipulate data as required.

Our database is currently not accessible online as we do not have a server to host it. This means that the database can only be accessed and modified on the local system where it is installed.
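The table-creation and query steps can be sketched without a running server. The example below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, which is a substitution made only so that the sketch runs anywhere. The table name, column names and placeholder rows are assumptions and do not reflect the project's actual schema or data.

```python
import sqlite3

# In-memory SQLite stands in for the project's local MySQL server (assumption).
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE neighborhoods (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        founding_year INTEGER,
        latitude REAL,
        longitude REAL
    )
    """
)

# Placeholder rows; the project imports its preprocessed files instead.
rows = [
    ("Example A", 1870, 31.78, 35.22),
    ("Example B", 1925, None, None),
]
connection.executemany(
    "INSERT INTO neighborhoods (name, founding_year, latitude, longitude) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def founded_by(conn: sqlite3.Connection, year: int) -> list:
    """Return the names of neighborhoods founded in or before the given year."""
    cursor = conn.execute(
        "SELECT name FROM neighborhoods WHERE founding_year <= ? "
        "ORDER BY founding_year",
        (year,),
    )
    return [name for (name,) in cursor.fetchall()]
```

With MySQL Connector the same pattern applies, except that the connection comes from mysql.connector.connect(), statements are run through a cursor object, and the parameter placeholder is %s rather than ?.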

Webpage development

Search function

Timeline feature

Result assessment

Limitations and Further Work

Limitations

  • Our database only contains the founding year of each neighborhood, and does not include the termination year. As a result, all the neighborhoods in our database are displayed on the map, leading to a cluttered appearance. If we had access to the termination year of each neighborhood, we could simplify and improve the clarity of our timeline map.
  • Many neighborhoods from the book are not listed on Wikipedia. This may be due to the disappearance of some neighborhoods or the lack of notability for a Wikipedia page. As a result, we are unable to obtain the coordinates of these neighborhoods and include them on the maps. If we had access to additional historical documents, we could better handle our data from the book.

Further Work

Obtaining more resources

  • We could acquire more resources from both the internet and books to expand the scope of our source material. At present, the primary source for the founding years is our reference book, with a small amount of additional data from Wikipedia and Wikidata. However, the data on Wikipedia and Wikidata is scattered, and it is difficult to retrieve all the details with a single crawler, so our data is incomplete. We have included a link to Wikipedia on each detail page, and in the future we can try to retrieve more information from Wikidata to make the dataset more complete.

More precise matching method

  • Currently, we match on the names of the neighborhoods and manually review the top-scoring candidates. In the future, we plan to use additional information, such as redirect links and coordinates, to identify more possible matches.
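The name-based scoring with manual review of the top candidates can be sketched as follows. The source does not say which similarity measure the project uses, so this example substitutes difflib.SequenceMatcher from the Python standard library. The function names and the normalization rule are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(name: str) -> str:
    """Lowercase a name and keep only letters, digits and single spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def top_candidates(
    book_name: str, wiki_names: List[str], limit: int = 3
) -> List[Tuple[str, float]]:
    """Return the best-scoring Wikipedia names for one name from the book."""
    target = normalize(book_name)
    scored = [
        (wiki_name, SequenceMatcher(None, target, normalize(wiki_name)).ratio())
        for wiki_name in wiki_names
    ]
    # Highest similarity first; the reviewer then inspects the short list.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Returning a short ranked list instead of a single best match leaves the final decision to the manual review, as in the current workflow.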

Project Plan and Milestones

Date Task Completion
By Week 3
  • Brainstorm project ideas.
  • Prepare slides for initial project idea presentation.
By Week 4
  • Organize the Jerusalem neighborhood information from the book into CSV files using OCR.
  • Conduct manual review and adjust formats.
By Week 5
  • Get neighborhood information on Wikipedia through crawlers, including names, links, and coordinates.
By Week 6
  • Merge the data obtained from the different Wikipedia pages.
  • Identify the neighborhoods that appear in both the book and Wikipedia.
By Week 7
  • Start working on webpages.
  • Decide to use GitHub Pages and Bootstrap to build and publish the webpage, and learn the basic concepts.
By Week 8
  • Use a fuzzy matching method to link information from the book and Wikipedia.
  • Work on the webpage: use Leaflet to present maps.
By Week 9
  • Transfer data into usable format for HTML.
  • Combine the front-end webpage and back-end data.
  • Create our first demo webpage.
By Week 10
  • Fill in information on wiki.
  • Get prepared for the midterm presentation.
By Week 11
  • Get neighborhood information from Wikidata through crawlers, including names, links, and founding dates.
  • Merge the information from Wikidata into the existing data.
  • Find a way to handle duplicate data and extract results from fuzzy matching.
By Week 12
  • Find neighborhoods with area shapes and find a way to visualize them.
  • Create a search function on the website.
  • Add information and adjust the website.
By Week 13
  • Create another page to list the information on the website.
  • Create GitHub Pages for our webpage.
By Week 14
  • Complete the wiki on motivation, methods, results...
  • Refine the visualization of our webpage.
By Week 15
  • Final presentation

GitHub Repository

https://github.com/WayerLiu/fdh_jerusalem.github.io