Urban Semantic Search


Introduction

This project aims to address the long-standing challenges of “data silos” and “semantic gaps” in historical urban studies.

The history of Venice has been recorded through two fundamentally different media:

  • Water (form and vision): The physical layout of the city, its canal network, and building textures, as depicted in historical maps.
  • Ink (records and text): Contracts, cadastral registers, and socio-economic documents preserved in archival collections.

For a long time, these two sources have remained disconnected. This project establishes a Multimodal Parallel Retrieval Architecture, integrating 17th–18th century panoramic views of Venice with cadastral archives into a unified high-dimensional vector space. By combining a Perception Encoder with MiniLM, we enable deep retrieval across visual (texture features), textual (cross-lingual semantics), and spatial (unified coordinate system) dimensions.

The project delivers more than a search tool—it provides a sustainable Digital Humanities infrastructure that opens new computational approaches for decoding pre-industrial urban forms.

Motivation

Project Plan and Milestones

Deliverables

This project delivers not only a website but also a reusable digital humanities research infrastructure, including pipelines and data assets.

Data Pipeline

The pipeline is a fully automated, modular ETL workflow that processes raw scans (TIFF/CSV) into vector indices, covering slicing, feature extraction, spatial registration, and database ingestion. Its open design allows researchers to seamlessly integrate new maps or archival records into the system without modifying the core code.

Search Platform

A user-facing, browser-based visualization platform built with Next.js (frontend) and FastAPI (backend). It supports four hybrid search modes (text-to-image, image-to-text, image-to-image, and text-to-text) and provides 3D dynamic heatmaps with historical map overlays, enabling near real-time cross-modal exploration.

Semantic Dataset

A fully processed collection of digital assets, cleaned, registered, and vectorized. It contains thousands of historical map patches and text entries represented as embedding vectors. By precomputing the most time-consuming feature engineering steps, the dataset allows researchers to immediately perform clustering analyses, study urban morphology evolution, or train downstream models.

Methods

Data Source

Historical Maps

To construct a high-precision semantic index of Venice, we selected two historical maps with exceptional cartographic value as the visual foundation of our dataset. A key strategy in this selection was the deliberate choice of bird's-eye views rather than ichnographic plans. In computer vision tasks, buildings on planar maps appear merely as similar rectangular shapes, making them difficult to distinguish. In contrast, bird's-eye views preserve the vertical (Z-axis) features of structures—such as domes, towers, and arches—providing rich visual cues that are crucial for training semantic models to recognize urban forms.

1. Pieter Mortier's Venice (1704)

The first base map comes from Pieter Mortier's 1704 publication of a panoramic view of Venice. This work is not simply a reprint but a reinterpretation of Vincenzo Coronelli’s 1693 original, reflecting the high cartographic standards of the early eighteenth century. It offers a detailed representation of the city’s overall structure, with clear building outlines and rich urban textures. These features make it an excellent visual reference for training semantic models to recognize the spatial patterns of early modern cities.


2. Venetia (1675)

The second map is the 1675 panoramic Venetia, held by the Bibliothèque nationale de France (BNF). This long-format map (1025 × 415 mm) combines elements of both plan and landscape drawings. Compared with the 1704 map, it provides an earlier temporal snapshot, and its large physical size allows for very high-resolution digitization. This enables our sliding-window pipeline to extract extremely fine-grained features—such as bridge structures and defensive works—thus enriching the dataset with micro-scale details.

By digitizing, georeferencing, and slicing these two maps, we constructed a historical visual dataset that spans different time periods and cartographic styles. This diversity strengthens the robustness of our algorithms across varied historical contexts.

Historical Archives

To complement the semantic gaps in the visual data, we integrated two sets of heterogeneous cadastral records: the 1740 Catastici (dot-based rental registers from the late Venetian Republic) and the 1808 Sommarioni (standardized geometric cadastres from the Napoleonic period). These textual sources provide socio-economic attributes that cannot be captured visually, such as ownership, rent, and building function.

Notably, these archival records are not raw scans but preprocessed CSV data with basic geocoding. However, they remain at the "symbolic" level—the system knows that a coordinate corresponds to "Lauro" but cannot interpret its socio-economic meaning or link it to visual map features. Our subsequent work focuses not on basic digitization, but on using embedding techniques to transform these records into deep semantic vectors, enabling cross-modal retrieval.

Cross-modal Search Strategy

Building on these data sources, we designed a retrieval architecture in which multiple models work in synergy. Rather than relying on a single model for all tasks, the system dynamically switches between underlying vector spaces according to the differences between visual and textual semantics.

The Perception Encoder Cluster

For tasks involving visual features, we consistently employ the Perception Encoder (PE) along with its associated text encoder. The PE model is optimized for dense prediction tasks, is sensitive to geometric structures, and shares a latent space with textual representations.

  • Text-to-Image (T2I): When a user inputs "Shipyard," the system generates a vector using the PE Text Encoder and retrieves visual features from the map database. This enables zero-shot recognition, allowing complex urban forms to be located without manual annotation.
  • Image-to-Image (I2I): When a user uploads a modern image of Piazza San Marco, the system extracts its PE visual vector and searches for similar regions across the full map.
  • Image-to-Text (I2T): The system uses the PE visual vector of the uploaded image to query the visual-aligned vectors in the document database, identifying archival records that describe similar visual features (e.g., "palaces with courtyards"). A minimal query sketch for these modes follows this list.
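All three modes reduce to nearest-neighbour queries in the shared PE latent space. Below is a minimal sketch, assuming a Qdrant deployment accessed via qdrant-client, illustrative collection names ("maps", "documents"), and a hypothetical pe_text_encoder wrapper; none of these identifiers are confirmed by the project code.

<syntaxhighlight lang="python">
# Minimal sketch of zero-shot T2I and I2T queries against Qdrant.
# Collection names and the encoder wrapper are illustrative assumptions.
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def t2i_search(query: str, pe_text_encoder, top_k: int = 20):
    """Encode a textual query with the PE text encoder and retrieve
    visually similar map tiles from the map collection."""
    query_vec = pe_text_encoder.encode(query)        # 1024-dim, PE latent space
    return client.search(
        collection_name="maps",
        query_vector=query_vec,
        limit=top_k,
    )

def i2t_search(image_vec, top_k: int = 20):
    """Use a PE image vector to retrieve archival records whose
    visual-aligned vectors describe similar features."""
    return client.search(
        collection_name="documents",
        query_vector=("pe_vector", image_vec),        # named vector in the document collection
        limit=top_k,
    )
</syntaxhighlight>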

The Linguistic Cluster

For pure textual semantic understanding, the PE model has limitations, as it cannot fully capture abstract concepts. Therefore, we employ MiniLM.

  • Text-to-Text (T2T): When a user queries "Church assets," the system uses the MiniLM model to retrieve historical records in archaic Italian. MiniLM can understand the semantic equivalence between "Church" (English) and "Ecclesiastici" (Italian), enabling it to handle abstract social concepts.

Data Processing Pipelines

To support the four retrieval modes described above, we established three processing pipelines that transform the raw data into semantic vectors of specific dimensions.

Visual Pipeline

This pipeline is designed to extract individual "semantic units" from each historical map and embed them as semantic vectors, which directly support T2I and I2I search.

Step 1: Map Patching

Historical maps contain a very high density of information, and feeding the entire sheet directly into a model would cause fine-grained urban textures to be overwhelmed by large-scale visual noise. To address this, we define a “semantic unit” at the physical scale of the city. The map is divided into small tiles of 224 × 224 pixels, a size that roughly corresponds to a typical Venetian block, a monastery complex, or a short canal segment.

To avoid losing important features at the boundaries, we use a 112-pixel stride, resulting in a 50% overlap between tiles. This redundant topology ensures that any geographical feature smaller than the window size—such as a church dome that happens to fall on a slicing boundary—will appear centered in a neighboring tile. This design preserves the completeness and robustness of feature extraction across the entire map.
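A minimal sketch of this slicing step, assuming Pillow and illustrative file naming; the real pipeline also records the metadata needed downstream, such as the source map and the tile-centre pixel coordinates.

<syntaxhighlight lang="python">
# Sketch of the sliding-window patching: 224 x 224 px tiles with a 112 px
# stride (50% overlap). Paths and tile naming are illustrative.
from PIL import Image

TILE = 224
STRIDE = 112

def patch_map(path: str, out_dir: str):
    Image.MAX_IMAGE_PIXELS = None              # allow very large historical scans
    scan = Image.open(path)
    width, height = scan.size
    tiles = []
    for top in range(0, height - TILE + 1, STRIDE):
        for left in range(0, width - TILE + 1, STRIDE):
            tile = scan.crop((left, top, left + TILE, top + TILE))
            name = f"{out_dir}/tile_{left}_{top}.png"
            tile.save(name)
            # keep the pixel coordinates of the tile centre for georeferencing
            tiles.append({"file": name, "u": left + TILE // 2, "v": top + TILE // 2})
    return tiles
</syntaxhighlight>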


Step 2: Feature Extraction/Embedding

Text Pipeline

This pipeline is designed to convert discrete text entries into semantic vectors. To simultaneously support T2T and I2T search, we implement a unique dual-encoding strategy.

Step 1: Semantic Linearization

Preprocessed historical archival records are often stored as discrete key–value pairs (e.g., {Type: C, Rent: 10}), and such context-free data can appear as high-frequency noise to a semantic model. To address this, we introduce a technique called semantic linearization, which uses a template engine to transform these sparse fields into grammatically structured natural-language sentences. An example is shown below.

 This is a residential property located in the district of San Marco, owned by the noble family Lauro… 

This provides MiniLM with the contextual information it needs to interpret the data effectively.
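A minimal sketch of such a template engine is shown below. The field names, the code book, and the rent unit are hypothetical placeholders; the actual registers use the Catastici and Sommarioni column names.

<syntaxhighlight lang="python">
# Illustrative template engine for semantic linearization.
TYPE_LABELS = {"C": "residential property", "B": "shop"}   # assumed code book

def linearize(record: dict) -> str:
    parts = [
        f"This is a {TYPE_LABELS.get(record.get('Type'), 'property')}",
        f"located in the district of {record['district']}" if record.get("district") else "",
        f"owned by {record['owner']}" if record.get("owner") else "",
        f"with an annual rent of {record['Rent']} ducats" if record.get("Rent") else "",  # currency unit assumed
    ]
    return ", ".join(p for p in parts if p) + "."

# linearize({"Type": "C", "Rent": 10, "district": "San Marco", "owner": "the noble family Lauro"})
# -> "This is a residential property, located in the district of San Marco,
#     owned by the noble family Lauro, with an annual rent of 10 ducats."
</syntaxhighlight>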

Step 2: Dual Embedding

To accommodate different retrieval tasks, we generate two vectors for the same text entry (see the sketch after this list):

  • Visual-Aligned Vector (1024-dim): Generated using the PE Text Encoder, this vector captures visually descriptive information in the text and is specifically used to support I2T and T2I tasks.
  • Semantic Vector (384-dim): Generated using MiniLM, this vector captures linguistic and abstract socio-economic concepts, and is specifically used to support T2T tasks.
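A sketch of the dual-encoding step, using sentence-transformers for the MiniLM side; the PE text encoder is represented by a hypothetical wrapper, since the exact loading code depends on the Perception Encoder release used.

<syntaxhighlight lang="python">
# Sketch of dual embedding: one 384-dim semantic vector and one 1024-dim
# visual-aligned vector per linearized text entry.
from sentence_transformers import SentenceTransformer

minilm = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim semantic space

def dual_embed(sentence: str, pe_text_encoder):
    semantic_vec = minilm.encode(sentence)                  # supports T2T
    visual_aligned_vec = pe_text_encoder.encode(sentence)   # 1024-dim, supports T2I / I2T
    return {"minilm_vector": semantic_vec, "pe_vector": visual_aligned_vec}
</syntaxhighlight>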

Georeferencing Pipeline

This module is responsible for addressing spatiotemporal misalignment by anchoring historical pixels to the modern coordinate system.

Step 1: Non-linear Rectification

Due to our deliberate choice of 3D bird's-eye views, the historical maps exhibit severe non-uniform perspective distortions. Traditional affine transformations cannot resolve this "pseudo-3D to 2D" mapping. Therefore, we first performed manual georeferencing in QGIS. Multiple rigid landmarks, such as church spires and canal junctions, were manually selected as ground control points (GCPs), and the Thin Plate Spline (TPS) algorithm was applied. TPS simulates the physical deformation of a “rubber sheet” and, by minimizing bending energy, applies local non-uniform stretching and compression, thereby establishing a reference transformation model from the 1704 perspective view to the modern OpenStreetMap layer.


Step 2: Coordinate Mapping

To apply the results of manual georeferencing to tens of thousands of map tiles, we developed a GDAL-based coordinate mapping middleware. This middleware loads the control point file exported from QGIS and reproduces the same TPS transformation. For each tile, it computes the modern geographic coordinates (lat, lon) corresponding to its center pixel (u, v). This approach ensures that the backend algorithmic outputs remain pixel-level consistent with our manual calibration, effectively eliminating edge drift.
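A sketch of such middleware is given below, assuming the QGIS control points have already been attached to the raster as GCPs (for example with gdal_translate -gcp); the file name is illustrative.

<syntaxhighlight lang="python">
# Sketch of the TPS-based pixel-to-coordinate mapping with GDAL.
from osgeo import gdal

def make_tps_transformer(gcp_raster_path: str):
    ds = gdal.Open(gcp_raster_path)
    # Thin Plate Spline transform built from the embedded ground control points
    return gdal.Transformer(ds, None, ["METHOD=GCP_TPS"])

def pixel_to_lonlat(transformer, u: float, v: float):
    """Map a tile-centre pixel (u, v) to georeferenced coordinates."""
    ok, (x, y, _z) = transformer.TransformPoint(0, u, v)
    if not ok:
        raise ValueError(f"TPS transform failed for pixel ({u}, {v})")
    return x, y   # lon, lat if the target CRS is EPSG:4326
</syntaxhighlight>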


Step 3: Hybrid Coordinate Architecture

Database Design

Overall Design

We selected Qdrant as the core vector database. Qdrant is a high-performance, open-source vector retrieval engine developed in Rust, and our choice was based on three main reasons. First, its excellent performance and stability support millisecond-level, high-concurrency searches. Second, it natively supports geo-spatial filtering, allowing us to constrain searches within a geographic range, which is crucial for map-based applications. Third, its flexible payload indexing mechanism perfectly accommodates our need to handle temporal, spatial, and attribute data simultaneously.

We follow the principle of “separating storage and computation while maintaining logical associations.”

  • Physical Separation: Due to the large difference in vector structures between map tiles (1024-dim) and documents (384-dim + 1024-dim), we store them in two separate collections: the Map Collection and the Document Collection.
  • Logical Association: All data entries are required to include year (temporal) and geo_location (spatial) as payloads. This enables the system to bridge the two collections at the logical level, supporting spatiotemporally informed joint queries.
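A minimal setup sketch for this two-collection layout, assuming the qdrant-client API and illustrative collection and field names:

<syntaxhighlight lang="python">
# Sketch of the physically separated, logically associated collections.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PayloadSchemaType

client = QdrantClient(host="localhost", port=6333)

# Map collection: one 1024-dim visual vector per tile
client.create_collection(
    collection_name="maps",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Document collection: two named vectors per record
client.create_collection(
    collection_name="documents",
    vectors_config={
        "pe_vector": VectorParams(size=1024, distance=Distance.COSINE),
        "minilm_vector": VectorParams(size=384, distance=Distance.COSINE),
    },
)

# Payload indices shared by both collections: temporal and spatial anchors
for coll in ("maps", "documents"):
    client.create_payload_index(coll, "year", field_schema=PayloadSchemaType.INTEGER)
    client.create_payload_index(coll, "geo_location", field_schema=PayloadSchemaType.GEO)
</syntaxhighlight>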

Data Schema

Map Collection Schema: Stores a single visual vector per entry.

At the metadata (payload) level, we store the following key information to support frontend interactions:

  • year: Stored as an integer index, used for the frontend timeline component, allowing users to quickly filter data by specific years.
  • source_image: Indicates the data source, ensuring that each retrieval result can be traced back to the original map file.
  • geo_location: Stored as a Geo Point, serving as a logical anchor connecting visual and textual data.
  • pixel_coords: Stored only in the map collection, allowing the frontend to highlight specific regions on the original scanned maps when displaying results.

Document Collection Schema: Stores dual vectors per entry, which forms the foundation for a flexible retrieval strategy. A sample point for each schema is sketched below.
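Sample upserts matching the two schemas; identifiers, vectors, and payload values are invented for illustration.

<syntaxhighlight lang="python">
# Illustrative points following the map and document schemas described above.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)

# Map collection: single visual vector plus traceability payload
client.upsert("maps", points=[PointStruct(
    id=1,
    vector=[0.0] * 1024,                        # PE visual embedding of the tile
    payload={
        "year": 1704,
        "source_image": "mortier_1704.tif",
        "geo_location": {"lat": 45.4340, "lon": 12.3388},
        "pixel_coords": {"u": 4480, "v": 2240},
    },
)])

# Document collection: dual named vectors plus shared temporal/spatial anchors
client.upsert("documents", points=[PointStruct(
    id=1,
    vector={
        "pe_vector": [0.0] * 1024,              # visual-aligned embedding
        "minilm_vector": [0.0] * 384,           # semantic embedding
    },
    payload={"year": 1740, "geo_location": {"lat": 45.4371, "lon": 12.3326}},
)])
</syntaxhighlight>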


Web Development

The web interface is designed as a map-centric, immersive digital humanities platform. It serves not only as a presentation layer for backend algorithms but also as an interactive bridge connecting historians with complex spatiotemporal data. The core development goal is to balance high-throughput multimodal retrieval with low-latency geovisualization, ensuring a smooth user experience when exploring historical map tiles or cadastral records directly in the browser.

Requirements Analysis & Function Definition

In the early development stage, we conducted a detailed needs analysis for both digital humanities researchers and the general public:

  • Cross-modal Retrieval: Users need to move beyond traditional text-to-text search, enabling text-to-visual queries (e.g., searching “shipyard” to locate docks on the map) and image-to-text retrieval for historical archives.
  • Spatio-temporal Visualization: Search results should not appear as plain lists; they must be visualized on the map through geographic markers (pins) and density patterns (heatmaps).
  • Dynamic Filtering: Users should be able to filter data in real time using both the temporal range (1700–1850) and the current map viewport.
  • Micro-level Traceability: For any search result, users must be able to access the original high-resolution map tile or cadastral record, enabling a smooth transition from macro-level exploration to micro-level evidence.

Architecture Design

The system adopts a decoupled frontend–backend architecture, communicating through RESTful APIs. On the frontend, we build with the Next.js (React) framework and employ a dual-layer map rendering strategy: Leaflet is used at the base layer to display static historical raster tiles and preserve the texture of the original maps, while Deck.gl (WebGL) is layered on top to provide GPU-accelerated rendering for large-scale vector overlays.

On the backend, the services are implemented with FastAPI. The codebase follows a strict separation of routing, business logic, and data access layers, with core retrieval logic—such as model inference and score fusion—encapsulated within the service layer. This design ensures maintainability and future scalability.

Page Design

The interface design closely follows Ben Shneiderman’s classic HCI principle—“overview first, zoom and filter, then details on demand.” To maximize visual immersion, the main page adopts a HUD (Heads-Up Display) layout in which the map canvas always occupies 100% of the screen, while all controls float above it. The unified search bar at the top serves as the primary entry point for interaction and automatically detects whether the user provides text input or an uploaded image. A spatiotemporal slider at the bottom allows users to select a specific year range (e.g., 1740–1808), triggering real-time data filtering.

To optimize information flow, we employ a dual-sidebar layout. The left sidebar presents retrieved Top-K results as lightweight cards for quick scanning. When the user clicks an item or a map pin, a detailed panel slides in from the right, showing the full cadastral information and high-resolution map tile. This bifurcated layout effectively separates the cognitive modes of “browsing” and “reading,” ensuring that even during complex retrieval tasks, users remain grounded in the map-centric context.


Text Search Implementation

For natural-language queries (e.g., “Shipyard” or “Church assets”), the system employs a parallel retrieval architecture that simultaneously handles Text-to-Text (T2T) and Text-to-Image (T2I) tasks.

Backend

When the server receives a query, it launches two inference processes in parallel:

  • MiniLM encoding: Produces a 384-dimensional semantic vector for searching the Document Collection.
  • PE Text Encoder: Produces a 1024-dimensional visual-aligned vector for searching the Map Collection.

Because the cosine-similarity distributions of MiniLM and PE differ substantially, their scores cannot be merged directly. The system computes the mean (μ) and standard deviation (σ) of each candidate set and converts the raw scores into standardized values:

<math>z = \frac{x - \mu}{\sigma}</math>

This places heterogeneous scores into a shared statistical space.

The system applies two filtering criteria:

  • Absolute threshold (e.g., <math>z > 0</math>): removes items below average relevance.
  • Relative threshold (e.g., <math>\text{Score} > 0.8 \times \text{Top1}</math>): retains only high-quality top results.
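A minimal sketch of this fusion step, applying both thresholds to the normalized scores of each branch before merging; the exact point at which the relative threshold is applied is an assumption.

<syntaxhighlight lang="python">
# Per-branch z-score normalization and filtering, then merging on a shared scale.
import statistics

def zscore_filter(hits, z_min=0.0, rel_min=0.8):
    """hits: list of (item, raw_cosine_score) pairs from one retrieval branch."""
    scores = [s for _, s in hits]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0       # guard against zero variance
    normed = [(item, (s - mu) / sigma) for item, s in hits]
    top1 = max(z for _, z in normed)
    return [(item, z) for item, z in normed
            if z > z_min and z > rel_min * top1]   # absolute and relative thresholds

def fuse(t2t_hits, t2i_hits):
    merged = zscore_filter(t2t_hits) + zscore_filter(t2i_hits)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
</syntaxhighlight>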

Frontend interaction

After normalization, results are combined in the left sidebar, while map markers differentiate the two retrieval modes:

T2T results (documents): displayed as markers with a document icon; clicking opens a detailed cadastral record in the right panel.

T2I results (map tiles): displayed as markers with pins. When clicked, the interface uses WebGL to load the corresponding historical overlay, aligning the 1704 or 1675 map tile with the modern basemap to verify the correspondence between textual descriptions and visual features.


Image Search Implementation

For visual queries (user-uploaded images), the system performs both Image-to-Image (I2I) and Image-to-Text (I2T) retrieval.

Backend

The backend uses the Perception Encoder (PE) to extract a 1024-dimensional visual feature vector. This vector is used for:

  • searching texture-similar map tiles in the map collection (I2I)
  • matching archival documents whose pe_vector fields encode comparable visual patterns (I2T)

To fuse heterogeneous similarity scores, the system applies:

  • Z-Score normalization to align score distributions from maps and documents
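A sketch of the corresponding endpoint flow, assuming FastAPI, Pillow, and a hypothetical PE image encoder loaded at startup; the route name and identifiers are illustrative, and in the full system the two result lists are merged with the z-score fusion described above.

<syntaxhighlight lang="python">
# Sketch of an image-search endpoint performing I2I and I2T retrieval.
import io

from fastapi import FastAPI, UploadFile
from PIL import Image
from qdrant_client import QdrantClient

app = FastAPI()
client = QdrantClient(host="localhost", port=6333)
pe_image_encoder = None  # placeholder: hypothetical PE image encoder, loaded at startup

@app.post("/search/image")
async def search_by_image(file: UploadFile, top_k: int = 20):
    img = Image.open(io.BytesIO(await file.read())).convert("RGB")
    vec = pe_image_encoder.encode(img)                       # 1024-dim PE feature

    # I2I: texture-similar map tiles; I2T: documents via their visual-aligned vector
    i2i = client.search("maps", query_vector=vec, limit=top_k)
    i2t = client.search("documents", query_vector=("pe_vector", vec), limit=top_k)

    # Returned separately here for clarity; the real system z-score normalizes
    # and merges the two lists as in the text-search fusion.
    return {
        "i2i": [{"score": h.score, "payload": h.payload} for h in i2i],
        "i2t": [{"score": h.score, "payload": h.payload} for h in i2t],
    }
</syntaxhighlight>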

Frontend Interaction

After the image is uploaded:

  • the thumbnail appears in the query panel
  • the left sidebar lists a merged set of I2I and I2T results
  • selecting an item opens the detail panel on the right
  • the corresponding location is highlighted on the main map canvas to maintain spatial continuity

3D Heatmap Implementation

To visualize the city-wide density of specific semantic concepts (e.g., “fortifications” or “monasteries”), we developed a high-performance 3D heatmap feature. Calculating global similarities for over 50,000 map tiles generates massive data, and transmitting it in standard JSON would exceed 5 MB, causing significant loading delays. To address this, the backend implements binary serialization using Python’s struct module, compressing each (latitude, longitude, score) tuple into a compact 32-bit floating-point stream, reducing the payload by over 90% to roughly 300 KB. On the frontend, a Float32Array instantly parses the binary stream and feeds it directly into Deck.gl’s HexagonLayer. With GPU acceleration, similarity scores are mapped to hexagon heights and local density to color intensity, enabling the “Show Heatmap” feature to respond in milliseconds and deliver a smooth, interactive data exploration experience.
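A minimal sketch of the packing step on the backend; the triple layout (latitude, longitude, score as consecutive little-endian 32-bit floats) is an assumption that must match how the frontend indexes its Float32Array.

<syntaxhighlight lang="python">
# Sketch of the binary serialization: each (lat, lon, score) tuple becomes
# three consecutive 32-bit floats, far smaller than its JSON text form.
import struct

def pack_heatmap(points):
    """points: iterable of (lat, lon, score) tuples -> compact bytes payload."""
    buf = bytearray()
    for lat, lon, score in points:
        buf += struct.pack("<3f", lat, lon, score)
    return bytes(buf)

# The frontend reads the response body into a Float32Array and iterates it in
# strides of three, feeding the triples directly into the HexagonLayer.
</syntaxhighlight>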

Evaluation and Final Results

Limitations

Future Work

Github Repo

Data pipeline: https://github.com/wuu03/Urban-Semantic-Search

Frontend: https://github.com/SheEagle/Urban-Semantic-Search-Frontend

Backend: https://github.com/SheEagle/Urban-Semantic-Search-Backend