City of Water and Ink: Decoding Venice through Multi-Modal Semantic Search


Introduction

The history of Venice has been recorded through two fundamentally different media:

  • Water (form and vision): The physical layout of the city, its canal network, and building textures, as depicted in historical maps.
  • Ink (records and text): Contracts, cadastral registers, and socio-economic documents preserved in archival collections.

For a long time, these two sources have remained disconnected. This project establishes a Multimodal Parallel Retrieval Architecture, integrating 17th–18th century panoramic views of Venice with cadastral archives into a unified high-dimensional vector space. By combining a Perception Encoder with MiniLM, we enable deep retrieval across visual (texture features), textual (cross-lingual semantics), and spatial (unified coordinate system) dimensions.

The project delivers more than a search tool—it provides a sustainable Digital Humanities infrastructure that opens new computational approaches for decoding pre-industrial urban forms.

Motivation

Our work is motivated by the following three core challenges.

Bridging the Semantic Gap

Traditional Geographic Information Systems face inherent limitations when dealing with historical maps as unstructured raster images: content-based search is difficult, and conventional vectorization methods are highly inefficient. Our approach leverages the Perception Encoder (PE) to transform visual textures and spatial forms in maps into vector representations, enabling a semantic shift from raw “pixels” to interpretable visual meanings. Without manual vectorization, the system can effectively “see” and index architectural features, marking a paradigm shift from metadata-driven search to visual content–based retrieval.

Breaking Data Silos

Historical urban research is inherently multimodal: maps (“water”) provide spatial form, while texts (“ink”) convey social and functional information. A central obstacle in traditional research is data fragmentation, which forces scholars to perform labor-intensive manual cross-referencing between maps and cadastral records. Our solution adopts a dual-tower architecture and establishes logical links between map and text data within a unified geographic space. This allows researchers to retrieve urban parcels simultaneously through visual characteristics (“what it looks like”) and textual descriptions (“what it is”), enabling truly multimodal and multi-temporal historical inquiry and forming a spatio-temporally connected digital model of Venice.

Immersive Visualization

Historical data is often abstract and difficult to interpret intuitively. Conventional tools typically present search results as lists, lacking spatial context. We address this by developing an immersive map-based visualization platform that transforms retrieval results into intuitive spatial insights. The platform supports precise geospatial localization on historical maps and further represents abstract semantic similarity through 3D dynamic heatmaps, converting invisible similarity scores into visible patterns of urban density. This visualization approach enables simultaneous observation of micro-level evidence and macro-level trends, providing researchers with a powerful interactive analytical environment.

Project Plan

{| class="wikitable"
! Week !! Visual Pipeline !! Text Pipeline !! System Development (Frontend + Backend + Database) !! QA & Documentation
|-
| Week 3
| (Pending start)
| (Pending start)
| (Pending start)
| Project Scoping
* Preliminary research on data, tools, and models.
|-
| Week 4
| Phase 0: Feasibility Analysis
* Analyze TIFF resolution and visual details of the 1704/1675 maps.
* Validate the feasibility of the “semantic unit” slicing strategy.
* Investigate GDAL support for historical map georeferencing.
| Phase 0: Data Structure Analysis
* In-depth analysis of fields in the 1740/1808 archival records.
* Define cleaning rules and field-mapping logic (e.g., mapping the Rent field to a weight).
| (Pending start)
| Define the project MVP scope.
|-
| Week 5
| Step 1: Map Patching
* Implement sliding-window patching.
* Generate raw image patches.
| Step 1: Data Cleaning
* Write cleaning scripts to process the CSV text data.
| Environment Setup (dev start)
* Initialize the Git repository structure.
* Initialize Next.js and FastAPI scaffolding.
|
|-
| Week 6
| Step 2: Visual Embedding
* Run batch inference using the PE-Core-B16 model.
* Generate 1024-dimensional feature vectors for all map patches.
* Upsert vectors into the Qdrant Map collection.
| Step 2: Linearization
* Develop a semantic template engine.
* Convert structured CSV rows into natural-language sentences.
| Backend Infrastructure
* Initialize FastAPI backend services.
* Design Qdrant collection schemas (Map and Doc collections).
* Integrate the Perception Encoder model loader.
|
|-
| Week 7
| Step 3: Georeferencing
* Implement TPS interpolation to address non-linear perspective distortion in bird’s-eye maps.
* Develop GDAL middleware for coordinate mapping.
| Step 3: Text Embedding
* Run batch inference using all-MiniLM-L6-v2.
* Generate 384-dimensional semantic vectors for archival records.
| Core Dev: I2I / T2I / I2T
* Develop Image-to-Image (I2I) search.
* Develop Text-to-Image (T2I) search.
* Develop Image-to-Text (I2T) search.
|
* Manual testing of I2I, T2I, and I2T search.
* Testing coordinate-transformation accuracy.
|-
| Week 8
|
| Step 4: Vector Upsert
* Insert semantic vectors into the Qdrant Doc collection.
| Core Dev: T2T
* Develop Text-to-Text (T2T) retrieval logic.
* Integrate the MiniLM retrieval pipeline.
* Milestone: full multimodal retrieval capability ready.
| Manual testing of T2T search.
|-
| Week 9–10
|
|
| Frontend Integration
* Build a hybrid map rendering engine.
* Connect the frontend search UI with backend FastAPI endpoints.
* Performance optimization: implement binary streaming.
| Heatmap testing; API latency benchmarking.
|-
| Week 11
|
|
| Algorithm Tuning
* Implement Z-score normalization.
* Debug score-fusion logic to resolve the distribution mismatch between MiniLM (high-score range) and PE (low-score range).
* Fine-tune absolute and relative thresholds.
| System Testing
* Integration testing: verify the full pipeline from query input to heatmap rendering.
* Stress testing: simulate 50+ concurrent requests to evaluate system stability.
|-
| Week 12
|
|
| Final Polish
* UI/UX refinement.
| Wiki & Delivery
* Write Wiki documentation.
* Record feature demonstration videos.
* Final code cleanup and commenting.
|}

Deliverables

This project delivers not only a website but also a reusable digital humanities research infrastructure, including pipelines and data assets.

Data Pipeline

The pipeline (https://github.com/wuu03/Urban-Semantic-Search) is a fully automated, modular ETL workflow that processes raw data (TIFF/CSV) into vector indices, covering slicing, feature extraction, georeferencing, and database ingestion. Its open design allows researchers to integrate new maps or archival records into the system without modifying the core code.

Search Platform

A user-facing, browser-based visualization platform built with Next.js (frontend, https://github.com/SheEagle/Urban-Semantic-Search-Frontend) and FastAPI (backend, https://github.com/SheEagle/Urban-Semantic-Search-Backend). It supports four hybrid search modes (text-to-image, image-to-text, image-to-image, and text-to-text) and provides 3D dynamic heatmaps with historical map overlays, enabling near real-time cross-modal exploration.

Semantic Dataset

A fully processed collection of digital assets, cleaned, registered, and vectorized. It contains thousands of historical map patches and text entries represented as embedding vectors. By precomputing the most time-consuming feature engineering steps, the dataset allows researchers to immediately perform clustering analyses, study urban morphology evolution, or train downstream models.

Methods

Data Source

Historical Maps

To construct a high-precision semantic index of Venice, we selected two historical maps with exceptional cartographic value as the visual foundation of our dataset. A key strategy in this selection was the deliberate choice of bird's-eye views rather than ichnographic plans. In computer vision tasks, buildings on planar maps appear merely as similar rectangular shapes, making them difficult to distinguish. In contrast, bird's-eye views preserve the vertical (Z-axis) features of structures, such as domes, towers, and arches, providing rich visual cues that are crucial for training semantic models to recognize urban forms. An example comparing these two types of maps is shown below.

1. Pieter Mortier's Venice (1704)

The first base map comes from Pieter Mortier's 1704 publication of a panoramic view of Venice. This work is not simply a reprint but a reinterpretation of Vincenzo Coronelli’s 1693 original, reflecting the high cartographic standards of the early eighteenth century. It offers a detailed representation of the city’s overall structure, with clear building outlines and rich urban textures.


2. Venetia (1675)

The second map is the 1675 panoramic Venetia, held by the Bibliothèque nationale de France (BNF). This long-format map (1025 × 415 mm) combines elements of both plan and landscape drawings. Compared with the 1704 map, it provides an earlier temporal snapshot, and its large physical size allows for very high-resolution digitization.

By digitizing, georeferencing, and slicing these two maps, we constructed a historical visual dataset that spans different time periods and cartographic styles. This diversity strengthens the robustness of our algorithms across varied historical contexts.

Historical Archives

To complement the semantic gaps in the visual data, we integrated two sets of heterogeneous cadastral records: the 1740 Catastici (dot-based rental registers from the late Venetian Republic) and the 1808 Sommarioni (standardized geometric cadastres from the Napoleonic period). These textual sources provide socio-economic attributes that cannot be captured visually, such as ownership, rent, and building function.

Notably, these archival records are not raw scans but preprocessed CSV data with basic geocoding. However, they remain at the "symbolic" level—the system knows that a coordinate corresponds to "Lauro" but cannot interpret its socio-economic meaning or link it to visual map features. Our subsequent work focuses not on basic digitization, but on using embedding techniques to transform these records into deep semantic vectors, enabling cross-modal retrieval.

Cross-modal Search Strategy

Building on these data sources, we designed a retrieval architecture in which multiple models work in synergy. Rather than relying on a single model for all tasks, the system dynamically switches between underlying vector spaces according to the differences between visual and textual semantics.

Cross-modal Retrieval System Architecture

The Perception Encoder Cluster

For tasks involving visual features, we consistently employ the Perception Encoder (PE-Core-B16-224) along with its associated text encoder. While traditional CLIP excels at global image classification, it often lacks fine-grained localization capabilities when handling densely detailed information. In contrast, the PE model is optimized for dense prediction tasks, allowing it to capture subtle geometric structures and local textures within an image. This characteristic makes it particularly robust when processing historical maps that contain noise, such as yellowed paper or ink stains.

  • Text-to-Image (T2I): When a user inputs "Shipyard," the system generates a vector using the PE Text Encoder and retrieves visual features from the map database. This enables zero-shot recognition, allowing complex urban forms to be located without manual annotation.
  • Image-to-Image (I2I): When a user uploads a modern image of Piazza San Marco, the system extracts its PE visual vector and searches for similar regions across the full map.
  • Image-to-Text (I2T): The system uses the PE visual vector of the uploaded image to query the document database’s visual-aligned vectors, identifying archival records that describe similar visual features (e.g., "palaces with courtyards").

The Linguistic Cluster

In processing textual archives, we did not rely solely on the text encoder of the Perception Encoder. Instead, we introduced MiniLM (all-MiniLM-L6-v2) as the semantic core.

The primary reason for using MiniLM instead of PE for pure text tasks lies in the difference between their semantic spaces. The text encoder of PE is trained to align with visual features. It excels at understanding concrete, visualizable descriptions (e.g., "red brick wall", "domed church"), but its performance drops significantly when faced with the many abstract concepts and socio-economic terms found in historical archives (e.g., "church jurisdiction", "hereditary rent", "monastery assets"). These concepts do not have direct visual counterparts on maps, making it difficult for PE to generate accurate vector representations.

By contrast, MiniLM is a sentence embedding model pretrained on massive textual corpora. It is highly effective at capturing deep linguistic logic and cross-lingual conceptual alignment (semantic equivalence). For example, it can recognize the semantic relationship between "Church assets" (English query) and "Monastero" (Italian archival record) rather than relying on simple keyword matching.

Therefore, we employ MiniLM for Text-to-Text search.

  • Text-to-Text (T2T): When a user queries "Church assets," the system uses the MiniLM model to retrieve historical records in archaic Italian. MiniLM can understand the semantic equivalence between "Church" (English) and "Ecclesiastici" (Italian), enabling it to handle abstract social concepts.

Data Processing Pipelines

To support the four retrieval modes described above, we established three processing pipelines that transform the raw data into semantic vectors of specific dimensions.

Visual Pipeline

This pipeline is designed to extract individual "semantic units" from each historical map and embed them as semantic vectors, which directly support T2I and I2I search.

Step 1: Map Patching

We implemented a sliding-window strategy based on semantic assumptions, centered on defining “semantic units” within the urban space. We assume that a 224×224 pixel tile represents an independent semantic unit. At the scale of the 1704 map, this roughly corresponds physically to a Venetian insula (city block), a monastery complex, or a section of a canal.

Considering that urban buildings are distributed continuously, we sought to address the issue of traditional grid-based slicing potentially cutting through entire structures. Instead of using non-overlapping tiles, we adopted a stride of 112 pixels, resulting in a 50% overlap between adjacent tiles. This redundant topology ensures that objects located at the edges of a slice—such as a church dome split between two tiles—appear fully and centered in one of the adjacent semantic units. This approach maximizes the robustness and completeness of feature extraction.
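As an illustration, a minimal sliding-window patcher might look like the sketch below. This is a simplified sketch of the strategy described above, not the project's exact script; the production pipeline would also record payload metadata for each tile.

<syntaxhighlight lang="python">
from PIL import Image

TILE, STRIDE = 224, 112  # 224 px semantic units, 112 px stride = 50% overlap

def patch_map(tiff_path):
    """Yield (u, v, tile) for every sliding-window position, where (u, v)
    is the top-left pixel of the tile on the original scan."""
    Image.MAX_IMAGE_PIXELS = None  # large historical scans exceed PIL's default guard
    img = Image.open(tiff_path)
    width, height = img.size
    for v in range(0, height - TILE + 1, STRIDE):
        for u in range(0, width - TILE + 1, STRIDE):
            yield u, v, img.crop((u, v, u + TILE, v + TILE))
</syntaxhighlight>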


Step 2: Feature Extraction/Embedding

Each 224×224 patch is passed in batches through the PE-Core-B16 visual encoder, which outputs a 1024-dimensional feature vector. These vectors are then upserted into the Qdrant Map collection together with their payload metadata (year, source image, coordinates).
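A minimal sketch of this step is shown below. The loader stub and the `encode_image` signature are assumptions standing in for the actual Perception Encoder API; the Qdrant calls use the real qdrant-client interface.

<syntaxhighlight lang="python">
import torch
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def load_pe_core_b16():
    """Placeholder for the actual PE-Core-B16-224 loading code from the
    Perception Encoder release; must return (encoder, preprocess transform)."""
    raise NotImplementedError

model, preprocess = load_pe_core_b16()
client = QdrantClient(url="http://localhost:6333")

@torch.no_grad()
def embed_and_upsert(patches, batch_size=64):
    """patches: list of (patch_id, PIL.Image, payload_dict) tuples."""
    for i in range(0, len(patches), batch_size):
        chunk = patches[i:i + batch_size]
        pixels = torch.stack([preprocess(img) for _, img, _ in chunk])
        vecs = model.encode_image(pixels)                   # assumed (B, 1024) output
        vecs = torch.nn.functional.normalize(vecs, dim=-1)  # unit norm for cosine search
        client.upsert(
            collection_name="map_collection",
            points=[PointStruct(id=pid, vector=v.tolist(), payload=pl)
                    for (pid, _, pl), v in zip(chunk, vecs)],
        )
</syntaxhighlight>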

Text Pipeline

This pipeline is designed to convert discrete text entries into semantic vectors. To simultaneously support T2T and I2T search, we implement a dual-encoding strategy.

Step 1: Semantic Linearization

Preprocessed historical archival records are often stored as discrete key–value pairs, for example:

 district: San Marco | owner: Lauro | function: residential | …

Such context-free data can appear as high-frequency noise to a semantic model. To address this, we introduce a technique called semantic linearization, which uses a template engine to transform these sparse fields into grammatically structured natural-language sentences. An example is shown below.

 This is a residential property located in the district of San Marco, owned by the noble family Lauro… 

This provides MiniLM with the contextual information it needs to interpret the data effectively.
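A minimal template engine might look like the sketch below. The field names are illustrative assumptions and would need to be mapped to the actual 1740/1808 column names.

<syntaxhighlight lang="python">
TEMPLATE = ("This is a {function} property located in the district of "
            "{district}, owned by {owner}, with an annual rent of {rent}.")

def linearize(record: dict) -> str:
    """Render one cleaned CSV row as a grammatically structured sentence."""
    return TEMPLATE.format(
        function=record.get("function", "unspecified"),
        district=record.get("district", "an unrecorded district"),
        owner=record.get("owner", "an unknown owner"),
        rent=record.get("rent", "an unrecorded amount"),
    )

# e.g. linearize({"function": "residential", "district": "San Marco",
#                 "owner": "the noble family Lauro"})
</syntaxhighlight>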

Step 2: Dual Embedding

To accommodate different retrieval tasks, we generate two vectors for the same text entry (see the sketch after this list):

  • Visual-Aligned Vector (1024-dim): Generated using the PE Text Encoder, this vector captures visually descriptive information in the text and is specifically used to support I2T and T2I tasks.
  • Semantic Vector (384-dim): Generated using MiniLM, this vector captures linguistic and abstract socio-economic concepts, and is specifically used to support T2T tasks.
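The dual-encoding step can be sketched as follows. The MiniLM call uses the real sentence-transformers interface; the PE text-encoder wrapper is a placeholder, since its exact API depends on the Perception Encoder release.

<syntaxhighlight lang="python">
from sentence_transformers import SentenceTransformer

minilm = SentenceTransformer("all-MiniLM-L6-v2")

def pe_encode_text(sentence: str):
    """Placeholder for the PE text encoder (1024-dim output)."""
    raise NotImplementedError

def dual_embed(sentence: str):
    """Return (semantic_384, visual_aligned_1024) for one linearized record."""
    semantic = minilm.encode(sentence, normalize_embeddings=True)  # 384-dim numpy array
    visual_aligned = pe_encode_text(sentence)                      # 1024-dim
    return semantic, visual_aligned
</syntaxhighlight>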

Georeferencing Pipeline

This module is responsible for addressing spatiotemporal misalignment by anchoring historical pixels to the modern coordinate system.

Step 1: Non-linear Rectification

Due to our deliberate choice of 3D bird's-eye views, the historical maps exhibit severe non-uniform perspective distortions. Traditional affine transformations cannot resolve this "pseudo-3D to 2D" mapping. Therefore, we first performed manual georeferencing in QGIS. Multiple rigid landmarks, such as church spires and canal junctions, were manually selected as ground control points (GCPs), and the Thin Plate Spline (TPS) algorithm was applied. TPS simulates the physical deformation of a “rubber sheet” and, by minimizing bending energy, applies local non-uniform stretching and compression, thereby establishing a reference transformation model from the 1704 perspective view to the modern OpenStreetMap layer.

Step 2: Coordinate Mapping

To apply the results of manual georeferencing to tens of thousands of map tiles, we developed a GDAL-based coordinate-mapping middleware, sketched below. It loads the control-point file exported from QGIS and reproduces the exact TPS transformation. For each tile, it computes the modern geographic coordinates (lat, lon) corresponding to its center pixel (u, v). This approach ensures that the backend algorithmic outputs remain pixel-level consistent with our manual calibration, effectively eliminating edge drift.
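The following is a minimal sketch of such middleware. The `.points` column layout and the sign convention for pixel rows are assumptions about the QGIS export format; `gdal.GCP`, `SetGCPs`, and `gdal.Transformer` with `METHOD=GCP_TPS` are the standard GDAL bindings, and the file names are illustrative.

<syntaxhighlight lang="python">
import csv
from osgeo import gdal, osr

def load_gcps(points_file):
    """Parse a QGIS georeferencer .points file (assumed columns:
    mapX, mapY, pixelX, pixelY, ...; comment lines start with '#')."""
    with open(points_file) as f:
        rows = csv.DictReader(line for line in f if not line.startswith("#"))
        # Some QGIS versions store pixel rows as negative values; abs() assumes that.
        return [gdal.GCP(float(r["mapX"]), float(r["mapY"]), 0.0,
                         float(r["pixelX"]), abs(float(r["pixelY"]))) for r in rows]

src = gdal.Open("mortier_1704.tif")
vrt = gdal.GetDriverByName("VRT").CreateCopy("", src)  # in-memory copy we can annotate
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)
vrt.SetGCPs(load_gcps("mortier_1704.points"), srs.ExportToWkt())

# METHOD=GCP_TPS reproduces the thin-plate-spline warp defined by the GCPs.
transformer = gdal.Transformer(vrt, None, ["METHOD=GCP_TPS"])

def tile_center_to_latlon(u, v):
    """Map a tile-center pixel (u, v) to modern geographic coordinates."""
    ok, (lon, lat, _) = transformer.TransformPoint(0, float(u), float(v))
    return (lat, lon) if ok else None
</syntaxhighlight>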


Step 3: Hybrid Coordinate Architecture

As a result of the previous steps, every tile carries two coordinate systems in parallel: pixel coordinates (u, v) on the original scan, used to highlight regions on the historical maps, and modern geographic coordinates (lat, lon), which serve as the logical anchor for cross-collection queries. This hybrid design is reflected directly in the database schema described next.

Database Design

Overall Design

Database Dual Collection Logical Schema and Entity Relationship Diagram.

We selected Qdrant as the core vector database. Qdrant is a high-performance, open-source vector retrieval engine developed in Rust, and our choice was based on three main reasons. First, its excellent performance and stability support millisecond-level, high-concurrency searches. Second, it natively supports geo-spatial filtering, allowing us to constrain searches within a geographic range, which is crucial for map-based applications. Third, its flexible payload indexing mechanism perfectly accommodates our need to handle temporal, spatial, and attribute data simultaneously.

We follow the principle of “separating storage and computation while maintaining logical associations.”

  • Physical Separation: Due to the large difference in vector structures between map tiles (1024-dim) and documents (384-dim + 1024-dim), we store them in two separate collections: the Map Collection and the Document Collection.
  • Logical Association: All data entries are required to include year (temporal) and geo_location (spatial) as payloads. This enables the system to bridge the two collections at the logical level, supporting spatiotemporally informed joint queries (see the query sketch below).
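To illustrate, a geo- and time-filtered search might look like the following sketch. It uses the real qdrant-client filtering API; the collection and field names follow the schema described below, and the query vector is a placeholder for a precomputed 1024-dim PE embedding.

<syntaxhighlight lang="python">
from qdrant_client import QdrantClient
from qdrant_client.models import (FieldCondition, Filter, GeoPoint,
                                  GeoRadius, Range)

client = QdrantClient(url="http://localhost:6333")

# Restrict the search to 1700-1750 and a 500 m radius around Piazza San Marco.
spatiotemporal_filter = Filter(must=[
    FieldCondition(key="year", range=Range(gte=1700, lte=1750)),
    FieldCondition(key="geo_location",
                   geo_radius=GeoRadius(center=GeoPoint(lon=12.339, lat=45.434),
                                        radius=500.0)),
])

query_vec = [0.0] * 1024  # placeholder; in production, the PE query embedding

hits = client.search(
    collection_name="map_collection",
    query_vector=query_vec,
    query_filter=spatiotemporal_filter,
    limit=20,
)
</syntaxhighlight>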


Data Schema

Map Collection Schema: Stores a single visual vector per entry.

At the metadata (payload) level, we store the following key information to support frontend interactions:

  • year: Stored as an integer index, used for the frontend timeline component, allowing users to quickly filter data by specific years.
  • source_image: Indicates the data source, ensuring that each retrieval result can be traced back to the original map file.
  • geo_location: Stored as a Geo Point, serving as a logical anchor connecting visual and textual data.
  • pixel_coords: Stored only in the map collection, allowing the frontend to highlight specific regions on the original scanned maps when displaying results.

Document Collection Schema: Stores two vectors per entry, which forms the foundation for a flexible retrieval strategy (see the sketch below).
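A sketch of the two collection definitions using the qdrant-client API. The collection and vector names are illustrative assumptions ("pe_vector" follows the field name used in the image-search section).

<syntaxhighlight lang="python">
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PayloadSchemaType, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Map Collection: a single 1024-dim visual vector per tile.
client.create_collection(
    collection_name="map_collection",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Document Collection: two named vectors per archival record.
client.create_collection(
    collection_name="doc_collection",
    vectors_config={
        "pe_vector": VectorParams(size=1024, distance=Distance.COSINE),
        "minilm_vector": VectorParams(size=384, distance=Distance.COSINE),
    },
)

# Payload indices shared by both collections for spatiotemporal filtering.
for coll in ("map_collection", "doc_collection"):
    client.create_payload_index(coll, field_name="year",
                                field_schema=PayloadSchemaType.INTEGER)
    client.create_payload_index(coll, field_name="geo_location",
                                field_schema=PayloadSchemaType.GEO)
</syntaxhighlight>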

Web Development

The web interface is designed as a map-centric, immersive digital humanities platform. It serves not only as a presentation layer for backend algorithms but also as an interactive bridge connecting historians with complex spatiotemporal data. The core development goal is to balance high-throughput multimodal retrieval with low-latency geovisualization, ensuring a smooth user experience when exploring historical map tiles or cadastral records directly in the browser.

Requirements Analysis & Function Definition

In the early development stage, we conducted a detailed needs analysis for both digital humanities researchers and the general public:

  • Cross-modal Retrieval: Users need to move beyond traditional text-to-text search, enabling text-to-visual queries (e.g., searching “shipyard” to locate docks on the map) and image-to-text retrieval for historical archives.
  • Spatio-temporal Visualization: Search results should not appear as plain lists; they must be visualized on the map through geographic markers (pins) and density patterns (heatmaps).
  • Dynamic Filtering: Users should be able to filter data in real time using both the temporal range (1700–1850) and the current map viewport.
  • Micro-level Traceability: For any search result, users must be able to access the original high-resolution map tile or cadastral record, enabling a smooth transition from macro-level exploration to micro-level evidence.

Architecture Design

The system adopts a decoupled frontend–backend architecture, communicating through RESTful APIs. On the frontend, we build with the Next.js (React) framework and employ a dual-layer map rendering strategy: Leaflet is used at the base layer to display static historical raster tiles and preserve the texture of the original maps, while Deck.gl (WebGL) is layered on top to provide GPU-accelerated rendering for large-scale vector overlays.

On the backend, the services are implemented with FastAPI. The codebase follows a strict separation of routing, business logic, and data access layers, with core retrieval logic—such as model inference and score fusion—encapsulated within the service layer. This design ensures maintainability and future scalability.

Overall System Architecture, detailing the offline ETL pipelines and the online interactive platform powered by Next.js and FastAPI.

Page Design

The interface design closely follows Ben Shneiderman’s classic HCI principle: “overview first, zoom and filter, then details on demand.” To maximize visual immersion, the main page adopts a HUD (Heads-Up Display) layout in which the map canvas always occupies 100% of the screen, while all controls float above it. The unified search bar at the top serves as the primary entry point for interaction and automatically detects whether the user provides text input or an uploaded image. A timeline slider and a mode selector at the bottom allow users to choose a specific year range and switch between map layers (e.g., specific historical maps or the 3D heatmap).

To optimize information flow, we employ a dual-sidebar layout. The left sidebar presents retrieved Top-K results as lightweight cards for quick scanning. When the user clicks an item or a map pin, a detailed panel slides in from the right, showing the full cadastral information and high-resolution map tile. This bifurcated layout effectively separates the cognitive modes of “browsing” and “reading,” ensuring that even during complex retrieval tasks, users remain grounded in the map-centric context.

Dual-Sidebar HUD Layout, illustrating the map-centric context.

Text Search Implementation

For natural-language queries (e.g., “Shipyard” or “Church assets”), the system employs a parallel retrieval architecture that simultaneously handles Text-to-Text (T2T) and Text-to-Image (T2I) tasks.

Parallel Text Search and Z-Score Normalization Pipeline

Backend

When the server receives a query, it launches two inference processes in parallel:

  • MiniLM encoding: Produces a 384-dimensional semantic vector for searching the Document Collection.
  • PE Text Encoder: Produces a 1024-dimensional visual-aligned vector for searching the Map Collection.

Because the cosine-similarity distributions of MiniLM and PE differ substantially, their scores cannot be merged directly. The system computes the mean (μ) and standard deviation (σ) of each candidate set and converts the raw scores into standardized values:

<math>z = \frac{x - \mu}{\sigma}</math>

This places heterogeneous scores into a shared statistical space.

The system then applies two filtering criteria, illustrated in the sketch after this list:

  • Absolute threshold (e.g., <math>Z > 0</math>): removes items below average relevance.
  • Relative threshold (e.g., <math>\text{Score} > 0.8 \times \text{Top1}</math>): retains only high-quality top results.
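A sketch of the fusion logic, with the threshold values taken from the examples above; the hit structures are simplified to (payload, raw_score) pairs rather than the actual Qdrant response objects.

<syntaxhighlight lang="python">
import numpy as np

def zscore(raw):
    s = np.asarray(raw, dtype=np.float64)
    return (s - s.mean()) / (s.std() + 1e-8)  # epsilon guards degenerate candidate sets

def fuse(doc_hits, map_hits, rel=0.8):
    """doc_hits / map_hits: lists of (payload, raw_cosine) from the two searches."""
    merged = []
    for hits in (doc_hits, map_hits):
        if hits:
            z = zscore([score for _, score in hits])
            merged.extend((payload, zi) for (payload, _), zi in zip(hits, z))
    merged = [(p, z) for p, z in merged if z > 0]               # absolute threshold: Z > 0
    if merged:
        top1 = max(z for _, z in merged)
        merged = [(p, z) for p, z in merged if z > rel * top1]  # relative threshold
    return sorted(merged, key=lambda item: item[1], reverse=True)
</syntaxhighlight>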

Frontend interaction

After normalization, results are combined in the left sidebar, while map markers differentiate the two retrieval modes:

T2T results (documents): displayed as markers with a document icon; clicking opens a detailed cadastral record in the right panel.

T2I results (map tiles): displayed as markers with pins. When clicked, the interface uses WebGL to load the corresponding historical overlay, aligning the 1704 or 1675 map tile with the modern basemap to verify the correspondence between textual descriptions and visual features.

Image Search Implementation

For visual queries (user-uploaded images), the system performs both Image-to-Image (I2I) and Image-to-Text (I2T) retrieval.

Backend

The backend uses the Perception Encoder (PE) to extract a 1024-dimensional visual feature vector. This vector is used for:

  • searching texture-similar map tiles in the map collection (I2I)
  • matching archival documents whose pe_vector fields encode comparable visual patterns (I2T)

To fuse the heterogeneous similarity scores, the system applies the same Z-score normalization described above, aligning the score distributions from maps and documents.

Frontend Interaction

3D Heatmap Implementation

To visualize the city-wide density of specific semantic concepts (e.g., “fortifications” or “monasteries”), we developed a high-performance 3D heatmap feature. Calculating global similarities for over 50,000 map tiles generates massive data, and transmitting it as standard JSON would exceed 5 MB, causing significant loading delays. To address this, the backend implements binary serialization using Python’s struct module, compressing each (latitude, longitude, score) tuple into a compact 32-bit floating-point stream and reducing the payload by over 90% to roughly 300 KB. On the frontend, a Float32Array instantly parses the binary stream and feeds it directly into Deck.gl’s HexagonLayer. With GPU acceleration, similarity scores are mapped to hexagon heights and local density to color intensity, enabling the “Show Heatmap” feature to respond in milliseconds and deliver a smooth, interactive data-exploration experience.
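A sketch of the backend side of this mechanism: the endpoint path and the `compute_similarities` helper are hypothetical, while `struct.pack` and FastAPI’s `Response` are the real APIs.

<syntaxhighlight lang="python">
import struct
from fastapi import FastAPI, Response

app = FastAPI()

def compute_similarities(query: str):
    """Hypothetical helper: score every map tile against the query and
    return [(lat, lon, score), ...]; the real version queries Qdrant."""
    return [(45.4341, 12.3388, 0.87)]  # placeholder data

@app.get("/heatmap")
def heatmap(query: str):
    flat = [value for point in compute_similarities(query) for value in point]
    payload = struct.pack(f"<{len(flat)}f", *flat)  # little-endian float32 stream
    return Response(content=payload, media_type="application/octet-stream")
</syntaxhighlight>

On the client, new Float32Array(await response.arrayBuffer()) recovers the stream as consecutive groups of three values per point.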

Evaluation and Final Results

Limitations

Patch Granularity and Redundancy

This project uses a fixed patch size as the basic “semantic unit” for feature extraction. This design introduces several limitations. Large-scale features that span multiple patches may not be observed in their entirety, potentially leading to a loss of semantic context. At the same time, the relatively small patch size combined with a 50% overlap can result in multiple neighboring patches being returned for the same query, introducing redundancy. While such redundancy can improve recall, it may reduce user efficiency when browsing results in the left-side panel.

Lack of a Unified Cross-Modal Evaluation Metric

At present, neither the broader research community nor this project has access to a unified and objective metric for evaluating the quality of multimodal retrieval results. Although Z-score normalization is used to fuse heterogeneous scores from MiniLM (textual semantics) and PE (visual semantics), this approach merely places the scores within the same statistical space and does not guarantee that their semantic importance is truly equivalent. For queries such as “shipyard,” assessing the relevance of a returned map patch (T2I) versus an archival record (T2T) is inherently subjective and cannot be easily quantified using traditional metrics such as precision or recall.

Georeferencing Error

Despite the use of the advanced Thin Plate Spline (TPS) algorithm, georeferencing errors remain unavoidable. TPS relies on manually selected ground control points (GCPs) to perform non-linear warping. Because the 1704 map contains original cartographic inaccuracies and non-standard perspectives, TPS can only approximate the transformation rather than eliminate all errors. At a local scale, particularly in areas far from manually defined anchor points, small spatial drifts persist between historical pixels and modern geographic coordinates. These deviations may lead to inaccuracies at a micro level, such as when locating individual buildings or small bridges.

Future Work

VLM Integration

The next phase of this project will focus on integrating advanced vision–language models (VLMs), such as FastVLM or LLaVA, to upgrade the platform’s core analytical capabilities from simple vector similarity search to full-fledged visual question answering (VQA). The primary objective of VLM integration is to transform the user interaction paradigm. Instead of being limited to searching for specific objects (e.g., querying “bridges”), users will be able to pose complex, comparative, and analytical questions directly about map content. For example, users may ask: “How has the density of bridges in this area changed compared to the 1740 map?” or “Describe all structures facing the Grand Canal that are labeled as ‘church assets’.”

To enable VQA, we plan to adopt a dense captioning strategy. A VLM will be used to generate detailed, high-quality textual descriptions for each 224 × 224 map patch. These descriptions will then be stored in an additional text index, serving as a complementary layer to the existing visual embedding index. When a VQA query is issued, the system can perform parallel retrieval over both the PE-based visual vector index and the dense-caption text index, allowing it to combine visual features with generated textual context to answer complex queries.

Integration of Modern POIs and Satellite Imagery

Future work will involve integrating modern points of interest (POI) data and contemporary satellite imagery to enhance temporal depth and support systematic comparisons between historical and modern urban conditions.

The integration of modern POI data enables direct and automated diachronic analysis of historical urban functions. For example:

  • Functional change tracking: the system can trace whether a “monastery” recorded in 1740 has evolved into a “university library” in the present day, allowing analysis of urban functional transformation across centuries.
  • Validation and correction: modern datasets can be used to verify the accuracy of historical records and to identify biases or errors in model-based recognition.

Through this integration, the platform can evolve into a powerful tool for diachronic urban analysis.

GitHub Repositories

Data pipeline: https://github.com/wuu03/Urban-Semantic-Search

Frontend: https://github.com/SheEagle/Urban-Semantic-Search-Frontend

Backend: https://github.com/SheEagle/Urban-Semantic-Search-Backend

Credits

Course: Foundations of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisor: Alexander Rusnak
Authors: Xiru Wang, Jingru Wang