City of Water and Ink: Decoding Venice through Multi-Modal Semantic Search


Introduction

The history of Venice has been recorded through two fundamentally different media:

🌊 Water (form and vision)
   The physical layout of the city, its canal network, and building textures, as depicted in historical maps.
đŸ–‹ïž Ink (records and text)
   Contracts, cadastral registers, and socio-economic documents preserved in archival collections.

For a long time, these two sources have remained disconnected. This project establishes a Multimodal Parallel Retrieval Architecture, integrating 17th–18th century panoramic views of Venice with cadastral archives into a unified high-dimensional vector space. By combining a Perception Encoder with MiniLM, we enable deep retrieval across visual (texture features), textual (cross-lingual semantics), and spatial (unified coordinate system) dimensions.

The project delivers more than a search tool—it provides a sustainable Digital Humanities infrastructure that opens new computational approaches for decoding pre-industrial urban forms.

Motivation

Our work is motivated by three core challenges.

Bridging the Semantic Gap

Traditional Geographic Information Systems face inherent limitations when dealing with historical maps as unstructured raster images. Content-based search is difficult, and conventional vectorization methods are highly inefficient. Our approach leverages computer vision techniques (the Perception Encoder, PE) to transform visual textures and spatial forms in maps into vector representations, enabling a semantic shift from raw "pixels" to interpretable visual meanings. Without manual vectorization, the system can effectively "see" and index architectural features, marking a paradigm shift from metadata-driven search to visual content-based retrieval.

Breaking Data Silos

Historical urban research is inherently multimodal: maps ("water") provide spatial form, while texts ("ink") convey social and functional information. A central obstacle in traditional research is data fragmentation, which forces scholars to perform labor-intensive manual cross-referencing between maps and cadastral records. Our solution establishes logical links between map and text data within a unified geographic space. This allows researchers to retrieve urban parcels simultaneously through visual characteristics ("what it looks like") and textual descriptions ("what it is"), enabling truly multimodal and multi-temporal historical inquiry and forming a spatio-temporally connected digital model of Venice.

Immersive Visualization

Historical data is often abstract and difficult to interpret intuitively. Conventional tools typically present search results as lists, lacking spatial context. We address this by developing an immersive map-based visualization platform that transforms retrieval results into intuitive spatial insights. The platform supports precise geospatial localization on historical maps and further represents abstract semantic similarity through 3D dynamic heatmaps, converting invisible similarity scores into visible patterns of urban density. This visualization approach enables simultaneous observation of micro-level evidence and macro-level trends, providing researchers with a powerful interactive analytical environment.

Project Plan and Milestones

The project execution adhered to a 10-week development lifecycle, designed to synchronize the processing of heterogeneous historical data with the iterative construction of the software infrastructure. To maximize efficiency, the workflow was orchestrated across three concurrent streams:

● Visual Pipeline: Segmentation, feature extraction, and georeferencing of historical maps.
● Text Pipeline: Cleaning, linearization, and semantic embedding of structured archival records.
● System Development: frontend, backend, and database.
Week 4
  • All streams: project scoping; preliminary research on data, tools, and models.

Week 5
  • Visual Pipeline (Phase 0: Feasibility Analysis): validate the semantic-unit strategy and PE encoding.
  • Text Pipeline (Phase 0: Data Structure Analysis): analyze the 1740/1808 records; define cleaning/mapping logic.
  • QA & Documentation: define the project MVP scope.
  • Milestone M1: Technical Feasibility Confirmed.

Week 6
  • Visual Pipeline (Step 1: Map Patching): implement sliding-window patching; generate raw image patches.
  • System Development (Environment Setup): initialize the Git repository, Next.js, and FastAPI.
  • Milestone M2: Infrastructure & Data Slicing Ready.

Week 7
  • Break.

Week 8
  • Visual Pipeline (Step 2: Visual Embedding): run PE-Core-B16 batch inference; generate 1024-d vectors; upsert to Qdrant.
  • Text Pipeline (Step 1: Linearization): develop the semantic template engine; convert TSV/JSON records into sentences.
  • System Development (Backend Infrastructure): design Qdrant schemas; integrate the PE model loader.
  • Milestone M3: Visual Vector Space Established.

Weeks 9-10
  • Visual Pipeline (Step 3: Georeferencing): implement TPS interpolation; develop the GDAL middleware.
  • Text Pipeline (Step 2: Text Embedding): run all-MiniLM-L6-v2 inference; generate 384-d vectors.
  • System Development (Core Dev: Multi-modal Search): develop I2I search; develop T2I/I2T search.
  • QA & Documentation: manual testing of search modes; testing coordinate accuracy.

Week 11
  • Text Pipeline (Step 3: Vector Upsert): insert semantic vectors into the Qdrant Document collection.
  • System Development (Core Dev: T2T): develop Text-to-Text retrieval; integrate the MiniLM pipeline.
  • QA & Documentation: manual testing of T2T search.
  • Milestone M4: Multi-modal Search Capability & Backend Feature Completion.

Week 12
  • System Development (Frontend Integration): build the hybrid map rendering engine; connect the UI with FastAPI; implement binary streaming.
  • QA & Documentation: heatmap testing; API latency benchmarking.
  • Milestone M5: Full-Stack Integration.

Week 13
  • System Development (Algorithm Tuning): implement Z-score normalization; debug score-fusion logic; fine-tune search thresholds.
  • QA & Documentation (System Testing): full pipeline integration testing.
  • Milestone M6: System Optimization.

Week 14
  • System Development (Final Polish): UI/UX refinement.
  • QA & Documentation (Wiki & Delivery): write the wiki documentation; record demonstration videos; final code cleanup.
  • Milestone M7: Final Delivery.

Deliverables

This project delivers not only a website but also a reusable digital humanities research infrastructure, including pipelines and data assets.

Data Pipeline

Function:
The data pipeline implements a fully automated and modular ETL workflow that transforms raw historical maps and archival texts into semantic vectors. It automatically slices high-resolution historical maps and uses the Perception Encoder to generate image embeddings, while also aligning historical pixel coordinates with modern geographic coordinate systems. The pipeline also processes textual data, using MiniLM and the Perception Encoder to generate vector representations for archival records.

Presentation:
The pipeline is delivered as a GitHub repository (https://github.com/wuu03/Urban-Semantic-Search) containing modular scripts, configuration files, and example notebooks.

Significance:
By automating feature extraction and georeferencing, the pipeline reduces the computational effort required to process historical data. Its modular design allows the workflow to be reused across different cities and datasets.

Search Platform

Function:
The search platform provides multimodal semantic search capabilities, supporting text-to-image, image-to-text, image-to-image, and text-to-text queries. It also allows users to overlay historical maps and explore search results through 3D heatmap visualizations.

Presentation:
The platform is implemented as a responsive web application, with a Next.js frontend (https://github.com/SheEagle/Urban-Semantic-Search-Frontend) and a FastAPI backend (https://github.com/SheEagle/Urban-Semantic-Search-Backend).

Significance:
The platform turns abstract vector search results into spatially interpretable visualizations, lowering technical barriers for users. It enables humanities researchers to explore data and test ideas without requiring a background in machine learning.

Semantic Dataset

Function:
The semantic dataset contains the final outputs of the data pipeline, including historical map patches and standardized archival texts. Each record includes precomputed high-dimensional vector embeddings and associated geographic coordinates. The dataset is directly exported from the Qdrant vector database and is consistent with the data used by the search platform.

Presentation:
The dataset is distributed as a Qdrant snapshot.

Significance:
By providing pre-vectorized data, the dataset avoids repeated feature extraction and heavy computational costs. This allows researchers to directly perform clustering, spatial analysis, or comparative studies using the data.

Methods

Data Source

Historical Maps

To construct a high-precision semantic index of Venice, we selected two historical maps with exceptional cartographic value as the visual foundation of our dataset. A key strategy in this selection was the deliberate choice of bird's-eye views rather than ichnographic plans. In computer vision tasks, buildings on planar maps appear merely as similar rectangular shapes, making them difficult to distinguish. In contrast, bird's-eye views preserve the vertical (Z-axis) features of structures—such as domes, towers, and arches—providing rich visual cues that are crucial for training semantic models to recognize urban forms. An example comparing these two types of maps is shown below.

1. Pieter Mortier's Venice (1704)

The first base map comes from Pieter Mortier's 1704 publication of a panoramic view of Venice. This work is not simply a reprint but a reinterpretation of Vincenzo Coronelli’s 1693 original, reflecting the high cartographic standards of the early eighteenth century. It offers a detailed representation of the city’s overall structure, with clear building outlines and rich urban textures.

2. Venetia (1675)

The second map is the 1675 panoramic Venetia, held by the BibliothĂšque nationale de France (BNF). This long-format map (1025 × 415 mm) is executed in Cavalier projection (Plan cavalier), offering a pseudo-3D perspective that highlights architectural details. Compared with the 1704 map, it provides an earlier temporal snapshot, and its large physical size allows for very high-resolution digitization.

By digitizing, georeferencing, and slicing these two maps, we constructed a historical visual dataset that spans different time periods and cartographic styles. This diversity strengthens the robustness of our algorithms across varied historical contexts.

Historical Archives

To complement the semantic gaps in the visual data, we integrated two sets of heterogeneous cadastral records: the 1740 Catastici (dot-based rental registers from the late Venetian Republic) and the 1808 Sommarioni (standardized geometric cadastres from the Napoleonic period). These textual sources provide socio-economic attributes that cannot be captured visually, such as ownership, rent, and building function.

Notably, these archival records are not raw scans but preprocessed TSV/JSON data with basic geocoding. However, they remain at the "symbolic" level: the system knows that a record corresponds to "Lauro" but cannot interpret its socio-economic meaning or link it to visual map features. Our subsequent work therefore focuses not on basic digitization but on using embedding techniques to transform these records into deep semantic vectors, enabling cross-modal retrieval.

Cross-modal Search Strategy

Based on these data sources, we designed a retrieval architecture in which multiple models work in synergy. Rather than relying on a single model for all tasks, the system dynamically switches between underlying vector spaces according to the differences between visual and textual semantics.

Cross-modal Retrieval System Architecture


The Perception Encoder Cluster

For tasks involving visual features, we consistently employ the Perception Encoder (PE-Core-B16-224) along with its associated text encoder. While traditional CLIP excels at global image classification, it often lacks fine-grained localization capabilities when handling densely detailed information. In contrast, the PE model is optimized for dense prediction tasks, allowing it to capture subtle geometric structures and local textures within an image. This characteristic makes it particularly robust when processing historical maps that contain noise, such as yellowed paper or ink stains.

  • Text-to-Image (T2I): When a user inputs "Shipyard," the system generates a vector using the PE Text Encoder and retrieves visual features from the map database. This enables zero-shot recognition, allowing complex urban forms to be located without manual annotation.
  • Image-to-Image (I2I): When a user uploads a modern image of Piazza San Marco, the system extracts its PE visual vector and searches for similar regions across the full map.
  • Image-to-Text (I2T): The system uses the PE visual vector of the uploaded image to query the visual-aligned vectors in the document database, identifying archival records that describe similar visual features.

The Linguistic Cluster

In processing textual archives, we did not rely solely on the text encoder of the Perception Encoder. Instead, we introduced MiniLM (all-MiniLM-L6-v2) as the semantic core.

The primary reason for using MiniLM instead of PE for pure text tasks lies in the difference between their semantic spaces. The text encoder of PE is trained to align with visual features. It excels at understanding concrete, visualizable descriptions (e.g., "red brick wall", "domed church"), but its performance drops significantly when faced with the many abstract concepts and socio-economic terms found in historical archives (e.g., "church jurisdiction", "hereditary rent", "monastery assets"). These concepts do not have direct visual counterparts on maps, making it difficult for PE to generate accurate vector representations.  

By contrast, MiniLM is a sentence embedding model pretrained on massive textual corpora. It is highly effective at capturing deep linguistic logic and cross-lingual conceptual alignment (semantic equivalence). For example, it can recognize the semantic relationship between "Church assets" (English query) and "Monastero" (Italian archival record) rather than relying on simple keyword matching.  

Therefore, we employ MiniLM for Text-to-Text search.

  • Text-to-Text (T2T): When a user queries "Church assets," the system uses the MiniLM model to retrieve historical records in archaic Italian. MiniLM can understand the semantic equivalence between "Church" (English) and "Ecclesiastici" (Italian), enabling it to handle abstract social concepts.

Data Processing Pipelines

To support the four retrieval modes described above, we established three processing pipelines that transform the raw data into semantic vectors of specific dimensions.

Visual Pipeline

This pipeline extracts individual "semantic units" from each historical map and embeds them as semantic vectors, which directly support T2I and I2I search.

Step 1: Map Patching

We implemented a sliding-window strategy based on semantic assumptions, centered on defining “semantic units” within the urban space. We assume that a 224×224 pixel tile represents an independent semantic unit. At the scale of the 1704 map, this roughly corresponds physically to a Venetian insula (city block), a monastery complex, or a section of a canal.

Considering that urban buildings are distributed continuously, we sought to address the issue of traditional grid-based slicing potentially cutting through entire structures. Instead of using non-overlapping tiles, we adopted a stride of 112 pixels, resulting in a 50% overlap between adjacent tiles. This redundant topology ensures that objects located at the edges of a slice—such as a church dome split between two tiles—appear fully and centered in one of the adjacent semantic units. This approach maximizes the robustness and completeness of feature extraction.
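A minimal sketch of this sliding-window patching, assuming Pillow is available; the function name patch_map and the file handling are illustrative, not the project's verbatim code:

from PIL import Image

TILE, STRIDE = 224, 112   # one "semantic unit" and the 50%-overlap stride

def patch_map(path):
    """Yield (x, y, 224x224 tile) tuples covering the whole map with 50% overlap."""
    Image.MAX_IMAGE_PIXELS = None             # historical scans exceed Pillow's default cap
    img = Image.open(path)
    w, h = img.size
    for y in range(0, h - TILE + 1, STRIDE):
        for x in range(0, w - TILE + 1, STRIDE):
            yield x, y, img.crop((x, y, x + TILE, y + TILE))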

Step 2: Embedding

We use the Perception Encoder (PE-Core-B16-224) to project map patches into a high-dimensional vector space. After preprocessing and normalization to match the model's input distribution, the patches are processed through the visual encoder in parallel batches to optimize computational efficiency. This transforms visual features into standardized vectors within a shared semantic space, directly supporting similarity comparison and cross-modal alignment during retrieval.
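A hedged sketch of the batch-encoding step: model and preprocess stand in for the PE vision tower and its transform, obtained from whichever loader the Perception Encoder release provides; the CLIP-style encode_image call, batching, and L2 normalization are the generic pattern, not the project's exact code.

import torch

@torch.no_grad()
def embed_patches(patches, model, preprocess, batch_size=64, device="cuda"):
    """patches: list of PIL tiles; returns one row per patch, unit-normalized."""
    chunks = []
    for i in range(0, len(patches), batch_size):
        batch = torch.stack([preprocess(p) for p in patches[i:i + batch_size]]).to(device)
        feats = model.encode_image(batch)                 # assumed CLIP-style API, (B, 1024)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit length for cosine search
        chunks.append(feats.cpu())
    return torch.cat(chunks)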

Text Pipeline

This pipeline is designed to convert discrete heterogeneous textual entries into unified semantic vectors. To simultaneously support T2T and I2T search, we implement a unique dual-encoding strategy.

Step 1: Spatial Normalization & Aggregation
  • CRS Transformation (1740 dataset): The original 1740 Catastici data uses EPSG:32633 (UTM Zone 33N) as its projected coordinate system. To ensure compatibility with frontend web-mapping tools such as Leaflet and OpenStreetMap, we used GeoPandas during preprocessing to reproject all geometries into the EPSG:4326 (WGS 84) standard.
  • Geometry Aggregation (1808 dataset): In the 1808 Sommarioni dataset, a single property entry may correspond to multiple disjoint spatial components (e.g., a "main building" plus a "courtyard"), and storing them separately would lead to semantic fragmentation. To resolve this, we adopted a MultiPolygon strategy, using the shapely.ops.unary_union function to merge all polygons belonging to the same entry into a single geometric object. We also applied semantic enhancement by automatically generating structural descriptions within the text, such as "The property comprises 2 parts: 1 building and 1 courtyard", so the model can comprehend the property's complete physical form. A sketch of both steps follows.
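A minimal sketch of both spatial steps, assuming GeoPandas and Shapely; the file names and the uid column are illustrative assumptions:

import geopandas as gpd
from shapely.ops import unary_union

# 1740 Catastici: reproject from UTM Zone 33N to WGS 84 for Leaflet/OSM.
catastici = gpd.read_file("catastici_1740.geojson")   # source declares EPSG:32633
catastici = catastici.to_crs(epsg=4326)

# 1808 Sommarioni: merge disjoint parts of one entry into a single (Multi)Polygon.
sommarioni = gpd.read_file("sommarioni_1808.geojson")
merged_geoms = sommarioni.groupby("uid")["geometry"].apply(
    lambda parts: unary_union(list(parts))            # one Polygon or MultiPolygon per entry
)
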
Step 2: Text Serialization & Semantic Chunking

Historical archival records are often stored as discrete key-value pairs. An example is shown below.

{ "type": "Feature", "properties": { "uid": "AGN-0001", "author": "Davide Drago", ... "function": "casa in soler rovinosa", ... "place": "Calle de Franchi", ... } }

Such context-free metadata appears as high-frequency noise to semantic models and cannot be directly understood. To address this, we introduce a technique called semantic serialization, which uses a template engine to transform these discrete fields into grammatically structured natural-language sentences: 

  1. Field Cleaning: First, we handle NaN values, empty lists/arrays, and redundant whitespace to ensure the purity of the input data.
  2. Text Segmentation: Next, we build a cohesive narrative from the structured metadata while respecting model limits. BERT-like models (such as MiniLM) typically impose an input limit of 512 tokens; anything beyond that threshold is truncated outright, losing critical historical details such as later repair records or transfer information. Excessively long texts also dilute the semantics. We therefore segment the generated long texts at a predefined maximum character limit, slicing records into semantically self-contained segments to improve retrieval accuracy for specific details.
  3. Semantic Anchor / Header: Chunking can strip intermediate fragments (e.g., "repair roof") of their subject and location context, leaving them semantically incomplete. For each record, we therefore extract the core identity information (function, owner, location) into a mandatory summary that is prepended to every chunk produced by segmentation, ensuring that even fragments describing minor details retain basic context.
  4. Full Metadata Injection: To maximize searchability, we transform almost all non-empty fields possessing semantic features from the original dataset (including tenants, professions, property types, etc.) into natural language descriptions. We utilize robust concatenation logic, safeguarded by the field cleaning function, to handle non-standardized data. This approach preserves the precision of structured data while endowing it with natural language context.

An example of a text representation generated from the template and processed by the chunking strategy is shown below.

This is a CASA owned by Paolina FRANCO located in Dorsoduro (Sant'Agnese). Details: District: Dorsoduro. Parish: Sant'Agnese. Place: Calle de Franchi. Function: casa in soler rovinosa; Class: CASA > CASA; Features: SOLER. Owner: Original Name: Paolina Franco | Standardised Name: Paolina FRANCO. Owner Details: Type: Private, Notes: PERSON.

This provides MiniLM with the contextual information it needs to interpret the data effectively. To facilitate frontend display, we also retain the metadata used to construct these text representations within the data structure. A condensed sketch of the serialization and chunking logic follows.
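The following is illustrative only: field names follow the example record above, while MAX_CHARS and the helper names are our assumptions, not the project's exact template engine.

MAX_CHARS = 800   # assumed per-chunk budget, kept well under MiniLM's 512-token cap

def clean(value):
    """Field cleaning: drop NaN/empty values, collapse redundant whitespace."""
    if value is None or value == [] or (isinstance(value, float) and value != value):
        return None
    text = " ".join(str(value).split())
    return text or None

def serialize(record: dict):
    """Return (anchor_header, full_text) for one archival record."""
    p = {k: clean(v) for k, v in record["properties"].items()}
    header = (f"This is a {p.get('function') or 'property'} owned by "
              f"{p.get('owner') or 'an unknown owner'} located in {p.get('place') or 'Venice'}.")
    details = " ".join(f"{k.capitalize()}: {v}." for k, v in p.items() if v)
    return header, f"{header} Details: {details}"

def chunk(header: str, full_text: str):
    """Slice long text and prepend the semantic anchor to every chunk."""
    body = full_text[len(header):].strip()
    step = max(MAX_CHARS - len(header), 1)
    return [f"{header} {body[i:i + step]}" for i in range(0, len(body), step)] or [header]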

Step 3: Dual Embedding

To accommodate different retrieval tasks, we generate two vectors for the same text entry:

  • Visual-Aligned Vector (1024-dim): Generated using the PE Text Encoder, this vector captures visually descriptive information in the text and is specifically used to support I2T and T2I tasks.
  • Semantic Vector (384-dim): Generated using paraphrase-multilingual-MiniLM-L12-v2, which excels in multilingual contexts and supports multilingual search, delivering an exceptionally high cost-performance ratio in semantic similarity tasks. This vector captures complete linguistic and abstract socio-economic concepts, and is specifically used to support T2T tasks.

Georeferencing Pipeline

This module is responsible for addressing spatiotemporal misalignment by anchoring historical pixels to the modern coordinate system.

Step 1: Non-linear Rectification

To ensure high registration precision, we first georeferenced the hand-drawn maps manually in QGIS, using OpenStreetMap (OSM) as the ground truth and EPSG:3857 (Pseudo-Mercator) as the reference coordinate system.

Because we deliberately chose 3D bird's-eye views for their concrete visual features, our historical maps suffer from severe non-linear perspective distortions and inherent hand-drawn inaccuracies. Traditional linear transformations (such as the affine transformation) cannot resolve this "pseudo-3D to 2D" mapping. We therefore adopted the Thin Plate Spline (TPS) algorithm as a "rubber sheeting" mechanism, correcting local perspective compression through non-uniform stretching while using Nearest Neighbour resampling to preserve the sharpness of original details.

Our Ground Control Point (GCP) strategy focused on rigid landmarks such as bridges and canal bifurcations, following a "density gradient" approach that significantly increased point density in the highly compressed background areas to counteract severe geometric distortions.

Finally, we clipped the georeferenced raster by generating an alpha channel and using the GeoJSON contour of the modern main island of Venice as a mask, eliminating the invalid "black hole" artifacts caused by topological folding at the map edges.

Step 2: Coordinate Mapping

During the semantic search phase, the model outputs pixel coordinates for image patches derived from the original, untransformed map, ensuring that visual recognition is not compromised by distortion. To locate these search results precisely on the web map, we developed a GDAL-based coordinate mapping middleware. This middleware directly invokes GDAL's TPS transformation engine and loads the .points file (GCP data) exported from QGIS to reproduce the exact TPS transformation. For each tile, the system applies the TPS mapping to calculate the geographic coordinates of its center point and four corners, providing both EPSG:3857 and EPSG:4326 (WGS 84) coordinates for frontend visualization. This keeps the backend's algorithmic outputs pixel-level consistent with our manual calibration, effectively eliminating edge drift.
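A hedged sketch of such a middleware, assuming GDAL's Python bindings and pyproj; parsing the QGIS .points file into gdal.GCP objects is omitted here.

from osgeo import gdal
from pyproj import Transformer

def build_tps_transformer(raster_path: str, gcps: list):
    """gcps: [gdal.GCP(geo_x, geo_y, 0, pixel_col, pixel_row), ...] in EPSG:3857."""
    # Attach the GCPs to an in-memory VRT copy, then ask GDAL for a TPS transform.
    vrt = gdal.Translate("/vsimem/georef.vrt", gdal.Open(raster_path),
                         format="VRT", GCPs=gcps, outputSRS="EPSG:3857")
    return gdal.Transformer(vrt, None, ["METHOD=GCP_TPS"])

def pixel_to_geo(tr, px: float, py: float) -> dict:
    ok, (x, y, _) = tr.TransformPoint(0, px, py)      # 0 = pixel -> geo direction
    lon, lat = Transformer.from_crs(3857, 4326, always_xy=True).transform(x, y)
    return {"epsg_3857": (x, y), "epsg_4326": (lon, lat)}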

Step 3: Hybrid Coordinate Architecture

To balance computational efficiency with frontend compatibility, our system adopts a hybrid dual-coordinate system architecture:

  • Raster Layer: Both georeferencing and historical map tile storage are based on EPSG:3857, the native projection for web mapping (e.g., OSM, Google Maps). This ensures seamless alignment between the georeferenced TIF map and the base map grid, maximizes tiling efficiency, and prevents image quality degradation caused by secondary reprojection.
  • Vector Layer: The geospatial metadata returned by semantic search is additionally provided in EPSG:4326 (WGS 84). This adheres to the GeoJSON standard and aligns with the interface specifications of frontend libraries such as Leaflet and Mapbox, facilitating direct display in web applications.

Database Design

Overall Design

Database Dual Collection Logical Schema and Entity Relationship Diagram.

We chose Qdrant as the main vector database. Qdrant is a high-performance, open-source vector search engine written in Rust. We chose it for three main reasons. First, it is fast and stable, which allows millisecond-level searches even under high concurrency. Second, it supports geo-spatial filtering, so we can limit searches to a specific geographic area, which is very useful for map-based applications. Third, its flexible payload indexing lets us store and query temporal, spatial, and attribute data at the same time.

We follow the principle of "separating storage and computation while keeping logical connections."

Physical Separation

Due to the large difference in vector structures between map tiles (1024-dim) and documents (384-dim + 1024-dim), we store them in two separate collections: the Map Collection and the Document Collection.

Logical Association

All data entries are required to include year (temporal) and geo_location (spatial) as payloads. This enables the system to bridge the two collections at the logical level, supporting spatiotemporally informed joint queries.

Data Schema

Map Collection Schema

Stores a single visual vector per entry (pe_vector).

At the metadata (payload) level, we store the following key information to support frontend interactions:

  • year: Stored as an integer index, used for the frontend timeline component, allowing users to quickly filter data by specific years.
  • source_image: Indicates the data source, ensuring that each retrieval result can be traced back to the original map file.
  • geo_location: Stored as a Geo Point, serving as a logical anchor connecting visual and textual data.
  • pixel_coords: Stored only in the map collection, allowing the frontend to highlight specific regions on the original scanned maps when displaying results.

Document Collection Schema

Stores dual vectors (pe_vector, text_vector) to implement a flexible retrieval strategy. The metadata payload mirrors that of the Map Collection.
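A minimal sketch of this dual-collection setup using the qdrant-client package; the collection names and local URL are assumptions:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")   # assumed local deployment

# Map Collection: one named visual vector per tile.
client.create_collection(
    collection_name="map_tiles",
    vectors_config={"pe_vector": VectorParams(size=1024, distance=Distance.COSINE)},
)

# Document Collection: dual vectors for the two retrieval strategies.
client.create_collection(
    collection_name="documents",
    vectors_config={
        "pe_vector": VectorParams(size=1024, distance=Distance.COSINE),
        "text_vector": VectorParams(size=384, distance=Distance.COSINE),
    },
)

# Payload indexes serving as the logical bridge between the two collections.
for coll in ("map_tiles", "documents"):
    client.create_payload_index(coll, field_name="year", field_schema="integer")
    client.create_payload_index(coll, field_name="geo_location", field_schema="geo")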

Web Development

The web interface is built as a map-focused, immersive platform for digital humanities. It works not just as a way to show results from the backend but also as an interactive tool that connects historians with complex spatiotemporal data. The main goal is to balance fast multimodal searches with smooth map visualizations, so users can explore historical map tiles or cadastral records directly in the browser without delays.

Requirements Analysis & Function Definition

In the early development stage, we conducted a detailed needs analysis:

Cross-modal Retrieval

Users need to move beyond traditional text-to-text search to text-to-visual queries (e.g., searching "shipyard" to locate docks on the map) and image-to-text retrieval over the historical archives.

Spatio-temporal Visualization

Search results should not appear as plain lists; they must be visualized on the map through geographic markers (pins) and density patterns (heatmaps).

Dynamic Filtering

Users should be able to filter data in real time using both the temporal range (1700–1850) and the current map viewport.

Micro-level Traceability

For any search result, users must be able to access the original high-resolution map tile or cadastral record, enabling a smooth transition from macro-level exploration to micro-level evidence.

Architecture Design

The system uses a decoupled client–server architecture connected through RESTful APIs. This design helps keep the system modular and makes it easier to handle high-dimensional multimodal data efficiently.

Client Side
Built with the Next.js framework. A two-layer setup is used for map rendering: Leaflet displays static historical map tiles for fine details, while Deck.gl serves as a WebGL-based overlay to render large vector datasets and 3D heatmaps. This setup allows GPU acceleration and keeps the interface responsive.

Backend Side
Implemented with FastAPI, following a layered architecture (Controllers, Services, Data Access). The controller layer handles validation using Pydantic. The service layer orchestrates parallel tasks like model inference. The data access layer uses the repository pattern to interact with the Qdrant vector database.
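A minimal sketch of this layering with FastAPI and Pydantic; the route, fields, and SearchService stub are illustrative rather than the project's exact API.

from fastapi import FastAPI
from pydantic import BaseModel, Field

class SearchRequest(BaseModel):            # controller layer: Pydantic validation
    query: str = Field(min_length=1)
    year_min: int = 1700                   # timeline filter bounds from the UI
    year_max: int = 1850

class SearchService:                       # service layer (stubbed)
    def run(self, query: str, years: tuple[int, int]) -> list[dict]:
        return []                          # real version: encode, parallel Qdrant search, fuse

app = FastAPI()
service = SearchService()

@app.post("/api/search")
def search(req: SearchRequest):
    return {"results": service.run(req.query, (req.year_min, req.year_max))}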

Overall System Architecture, detailing the offline ETL pipelines and the online interactive platform powered by Next.js and FastAPI.

Page Design

The interface design follows Ben Shneiderman’s HCI principle of “overview first, zoom and filter, then details on demand.” To make the experience more immersive, the main page uses a HUD (Heads-Up Display) layout where the map canvas always takes up the full screen and all controls float on top. The search bar at the top is the main way to interact with the system and can automatically detect whether the user enters text or uploads an image. A timeline slider and a mode selector at the bottom let users pick a year range and switch between different map layers, such as specific historical maps or a 3D heatmap.

We also use a dual-sidebar layout to organize information. The left sidebar shows the Top-K search results as small cards, making it easy to scan quickly. When a user clicks on a card or a map pin, a detailed panel slides in from the right, showing full cadastral information and high-resolution map tiles. This layout separates “browsing” from “reading,” so users can stay focused on the map even when exploring lots of search results.

Dual-Sidebar HUD Layout, illustrating the map-centric context.

Text Search Implementation

For natural-language queries (e.g., “Shipyard” or “Church assets”), the system employs a parallel dual-stream retrieval architecture. This logic is encapsulated in the SearchService class, which orchestrates the simultaneous retrieval of semantic text matches and visual-aligned map features using distinct embedding models.

Parallel Search Pipeline: Encoding, Retrieval, and Z-Score Normalization

Backend

When the server receives a query, it executes a four-stage pipeline to reconcile the heterogeneous vector spaces:

1. Dual-Model Encoding

The system generates two distinct query vectors: a semantic vector via the Text Model (e.g., MiniLM) for documents, and a visual-aligned vector via the Perception Encoder for map tiles.

2. Parallel Retrieval

ThreadPoolExecutor is used to query the Qdrant database in parallel threads. To reduce noise before normalization, distinct raw similarity thresholds are applied:

  • Documents: Cosine Similarity > 0.50
  • Map: Cosine Similarity > 0.21

3. Statistical Normalization (Z-Score)

Since the raw score distributions of the two models differ substantially, they cannot be merged directly. The system calculates the Mean (ÎŒ) and Standard Deviation (σ) for each result set independently. Raw scores are standardized using the formula:

z = (x - ÎŒ) / σ

This step transforms absolute similarity into "statistical significance," placing unrelated modalities onto a shared standard normal distribution.

4. Merging & Filtering

The normalized results are combined into a single ranked list. A final quality filter is applied (e.g., Z > 0.75), retaining only items that are significantly more relevant than the average retrieval set.
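A condensed sketch of the four stages; the encode_* and search_* helpers are placeholders for the MiniLM/PE encoders and Qdrant queries, while the thresholds and the Z > 0.75 cutoff follow the values above.

from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev

DOC_MIN, MAP_MIN, Z_MIN = 0.50, 0.21, 0.75    # thresholds from the text above

def encode_minilm(query): ...                 # placeholder: MiniLM text encoder
def encode_pe_text(query): ...                # placeholder: PE text encoder
def search_docs(vec, score_min): return []    # placeholder: Qdrant query on documents
def search_maps(vec, score_min): return []    # placeholder: Qdrant query on map tiles

def zscore(hits):
    """Standardize raw cosine scores within one result set."""
    if not hits:
        return []
    scores = [h["score"] for h in hits]
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [{**h, "z": (h["score"] - mu) / sigma} for h in hits]

def text_search(query: str):
    with ThreadPoolExecutor(max_workers=2) as pool:
        doc_f = pool.submit(search_docs, encode_minilm(query), DOC_MIN)   # semantic stream
        map_f = pool.submit(search_maps, encode_pe_text(query), MAP_MIN)  # visual stream
        merged = zscore(doc_f.result()) + zscore(map_f.result())
    return sorted((h for h in merged if h["z"] > Z_MIN), key=lambda h: -h["z"])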

Frontend

The unified result objects drive the UI behavior based on their type:

  • T2T Results (Documents): Displayed with a document icon. Clicking opens the full archival record.
  • T2I Results (Map Tiles): Displayed with a pin icon. When clicked, the interface triggers the WebGL engine to load the specific historical map fragment, overlaying it on the modern basemap to verify the visual correspondence.

Image Search Implementation

For visual queries (user-uploaded images), the system supports both Image-to-Image (I2I) and Image-to-Text (I2T) retrieval. Compared to text search, the overall logic remains the same, but the pipeline is simplified by relying on a single visual embedding.

Backend

When an image is uploaded, the backend encodes it once using the Perception Encoder (PE), generating a single 1024-dimensional pe_vector. This vector is reused to query both the Map Collection (I2I) and the Document Collection (I2T), where documents store a precomputed `pe_vector` for visual alignment. As in text search, retrieval is executed in parallel, and Z-Score normalization is applied to align similarity scores across collections before merging the results.
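A short sketch of this reuse, continuing the placeholder helpers from the text-search sketch above; reusing the same thresholds here is our assumption.

def encode_pe_image(image): ...               # placeholder: PE visual encoder

def image_search(image):
    vec = encode_pe_image(image)              # one 1024-d pe_vector, encoded once
    i2i = zscore(search_maps(vec, MAP_MIN))   # Image-to-Image on the Map Collection
    i2t = zscore(search_docs(vec, DOC_MIN))   # Image-to-Text via documents' pe_vector
    return sorted(i2i + i2t, key=lambda h: -h["z"])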

Frontend

The frontend renders results using the same interaction patterns as text-based search.

3D Heatmap Implementation

To visualize the city-wide density of specific semantic concepts (e.g., “fortifications” or “monasteries”), we developed a high-performance 3D heatmap feature.

Calculating global similarities for over 50,000 map tiles generates a large volume of data; transmitted as standard JSON, it would exceed 5 MB and cause significant loading delays. To address this, the backend implements binary serialization using Python's struct module, packing each (latitude, longitude, score) tuple into three compact 32-bit floats and reducing the payload by over 90%, to roughly 300 KB.

On the frontend, a custom loader instantly parses the binary stream and feeds it directly into Deck.gl’s HexagonLayer. With GPU acceleration, similarity scores are mapped to hexagon heights and local density to color intensity, enabling the “Show Heatmap” feature to respond in milliseconds and deliver a smooth, interactive data exploration experience.
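A minimal sketch of the packing scheme with Python's struct module; the browser client would typically decode the stream with a Float32Array, so the Python unpacker below is for illustration only.

import struct

def pack_heatmap(points):
    """points: iterable of (lat, lon, score) -> little-endian float32 byte stream."""
    buf = bytearray()
    for lat, lon, score in points:
        buf += struct.pack("<3f", lat, lon, score)   # 12 bytes per point
    return bytes(buf)

def unpack_heatmap(data: bytes):
    """Inverse parser: one (lat, lon, score) tuple per 12-byte stride."""
    return [struct.unpack_from("<3f", data, i) for i in range(0, len(data), 12)]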

3D heatmap for query "church"

Final Results & Evaluation

This chapter presents a comprehensive evaluation of the developed digital humanities infrastructure. Given the complex nature of historical cross-modal retrieval and the system's Adaptive Thresholding mechanism (where the result list length N varies dynamically from 0 to 20), we employed a task-specific evaluation strategy. This framework combines Quantitative Metrics adjusted for variable result sizes, Qualitative Case Studies with limitation analysis, and rigorous System Performance benchmarking.

Retrieval Accuracy: Quantitative Analysis

To rigorously assess the system's viability, we constructed a Benchmark Query List consisting of 15 representative queries (see Appendix). We applied distinct metrics tailored to the nature of each query group.

Methodology

  • For Unique Landmarks (Group A):
    We calculated the Hit Rate (HR). Since the target is a specific unique entity, the query is considered successful ("Hit") if the target appears anywhere in the returned list.
  • For Typologies & Functions (Group B & C):
    We calculated Mean Precision (MP). Since these queries target generic categories (e.g., "Bridges" or "Shops"), we measure the proportion of relevant items within the dynamic result set provided to the user.
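A minimal computation sketch of both metrics, assuming per-query boolean relevance judgments; the helper names are ours.

def hit_rate(hits: list[bool]) -> float:
    """hits[i] is True if query i's unique target appeared anywhere in its result list."""
    return sum(hits) / len(hits)

def mean_precision(result_sets: list[list[bool]]) -> float:
    """Each inner list flags every returned item of one query as relevant or not."""
    return sum(sum(r) / len(r) for r in result_sets) / len(result_sets)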

Quantitative Results

The evaluation reveals how the system leverages different modalities to achieve high accuracy:

Group A: Iconic Landmarks (e.g., "Ponte di Rialto", "Piazza San Marco")
  • Primary Metric: Hit Rate. Score: 100%.
  • Interpretation: Perfect retrieval. The specific target was successfully retrieved for every landmark query. While the list occasionally included visually similar "distractors" (e.g., other squares), the correct location was always present, ensuring user success.

Group B: Visual Typologies (e.g., "Formal Garden", "Bridge")
  • Primary Metric: Mean Precision. Score: 0.84.
  • Interpretation: Visual-dominant accuracy. The system excels at identifying morphological features. The adaptive threshold correctly expanded the result list to provide diverse examples of these common urban structures.

Group C: Hidden Functions (e.g., "San Cancian", "San Polo")
  • Primary Metric: Mean Precision. Score: 0.70.
  • Interpretation: Text-dominant accuracy. For objects that are visually indistinguishable from the generic urban fabric, the system achieved respectable precision by pivoting to the textual stream, relying on archival metadata rather than ambiguous visual shapes.

Appendix: Query Test Set

The test set is rigorously divided into three categories to demonstrate how the system handles different granularities of historical information.

Query Test Set

Group A: Iconic Landmarks (specific named entities). Primary retrieval output: dual-modal (balanced), returning both specific map tiles (the "where") and archival texts (the "what").
  • "Palazzo Ducale" - Courtyard structure: large rectangular complex with a central courtyard.
  • "Piazza San Marco" - Geometry: distinct "L-shaped" open space with pavement patterns.
  • "Ponte di Rialto" - Structure: massive single-arch structure spanning the Grand Canal.
  • "Santa Maria della Salute" - Shape: massive octagonal footprint with a central dome.
  • "San Giorgio Maggiore" - Topology: isolated island complex with church and tower.

Group B: Visual Typologies (morphological features). Primary retrieval output: visual-dominant (map focused); the query describes a shape or texture, which the Perception Encoder retrieves directly as map tiles.
  • "Garden" - Texture: geometric plots with green vegetation patterns.
  • "Arched Bridge Over Water" - Shape: linear white structure connecting two land masses.
  • "Bell Tower" - Shadow: small square footprint casting a long shadow.
  • "Round Dome" - Geometry: prominent circular outline atop a building.
  • "Colonnade" - Pattern: repetitive series of dots/lines along a building edge.

Group C: Hidden Functions (visually indistinguishable from the generic urban fabric). Primary retrieval output: text-dominant (doc focused); since these objects lack unique visual signatures, the map stream returns generic results and the text stream identifies each location by name.
  • "San Cancian", "Magazzino", "Santa Croce", "San Polo", "San Salvador".

Qualitative Evaluation: Case Studies & Limitations

We also analyzed one representative case from each query category, explicitly noting both the retrieval success and the technical limitations observed.

Case 1: The "Visual Homonym" Challenge (Group A)

This case demonstrates the challenge of retrieving unique landmarks that share fundamental morphological features with generic urban elements.

Query
"Ponte di Rialto" (Rialto Bridge)
Result
The system successfully retrieved the correct target with high confidence. However, the subsequent results included smaller, generic stone bridges elsewhere in the city.
  • Analysis:
    The Perception Encoder correctly identified the core visual signature: a "linear arched structure connecting landmasses over water." It successfully found the target based on this fundamental morphology, ensuring a 100% Hit Rate.
  • Limitation Observed (Scale & Detail Ambiguity):
    The Precision was lowered by these "Visual Homonyms." The current visual embedding model captures the general shape effectively but struggles to differentiate the specific scale and architectural uniqueness of the Rialto from hundreds of other smaller bridges. This confirms that while the AI is excellent for morphological recall, a "Human-in-the-Loop" is necessary for final verification of specific identity.

Case 2: Morphological Recognition (Group B)

This case tests the system's ability to identify repeating architectural patterns without textual labels.

Query
"Colonnade" (Loggia / Portico)
Result
The system successfully retrieved major architectural colonnades with high confidence. However, it also included visually similar structures made of different materials.
  • Analysis:
    The Perception Encoder excels at detecting high-frequency geometric patterns (the "rhythm" of columns). It correctly clustered visually similar linear structures, effectively solving the "Image-to-Image" search task for unlabelled architecture.
  • Limitation Observed (Material & Texture Ambiguity):
    While the *geometric* pattern was a match (repetitive vertical/grid lines), the model struggled with *material* distinction. It conflated the structural rhythm of a stone loggia with the visual rhythm of a garden trellis (pergola). This indicates that the visual embedding prioritizes high-level geometric frequency over fine-grained texture (stone vs. vegetation) when analyzing low-resolution historical map tiles.

Case 3: Text-in-Image Interference (Group C)

This case highlights a specific limitation where the visual model conflates "semantic text" with "visual texture."

Query
"San Polo" (District Name)
Result
The Text Stream successfully retrieved the correct district map via metadata. However, the Visual Stream returned several unrelated map tiles containing dense handwriting.
  • Analysis:
    The system correctly pivoted to metadata for the primary result. However, the visual model introduced noise.
  • Limitation Observed (OCR/Texture Confusion):
    The Perception Encoder identified the "visual texture of handwriting" (ink labels) on the map tiles. Because the query was a specific textual name ("San Polo"), the visual model attempted to match the shape of the letters, acting as a crude and inaccurate OCR. It retrieved tiles simply because they "contained text," not because they contained the *correct* text. This highlights a need for better separation between "Visual Texture" (shapes) and "Semantic Labels" (text) in future iterations.

System Performance Evaluation

To rigorously evaluate the system's responsiveness, we utilized the query "arched bridge over water" as a representative test case. This specific query triggers the complete text-to-dual-modal pipeline, requiring the system to generate embeddings for the input text and simultaneously retrieve matching results from both the Map Index and the Textual Archive Index. We utilized Postman's automated testing suite to measure both the End-to-End Latency (user perception) and the Database Retrieval Speed (backend efficiency).

Test Environment

The benchmarking was conducted on a local environment.

  • Hardware: NVIDIA GeForce RTX 4060 GPU.
  • Backend: Qdrant Vector Database with HNSW (Hierarchical Navigable Small World) Indexing.
  • Task Scope: The test measures the full round-trip time including Query Embedding, Parallel Search in Map & Archive Indexes, and Result Aggregation.

End-to-End Latency

We simulated the typical workflow of a researcher. Using Postman's Functional Runner, we executed a sequence of 100 consecutive queries for "arched bridge over water" against the search pipeline.


Functional Benchmark Results. The system achieved a remarkable average response time of 288 ms over 100 requests.
  • Metric: Average Response Time.
  • Result: 288 ms.
  • Conclusion: The system comfortably meets the 500 ms Service Level Objective (SLO). The low latency confirms that the backend efficiently orchestrates the cross-modal retrieval, making the search experience feel instantaneous.

Vector Database Efficiency

To understand the scalability of the data layer, we also analyzed the retrieval logs of the Qdrant engine.

Benchmark logs demonstrate the raw retrieval speed:

  • Regular Search Time: 7-15 ms. Extremely fast approximate nearest neighbor (ANN) retrieval using HNSW.
  • Exact Search Time: 8-15 ms. Brute-force verification remains highly performant due to optimized vector quantization.
  • Precision@10: 1.0 ± 0. The system maintained perfect retrieval stability for this morphological query.

Performance Analysis & Scalability

The combination of end-to-end and database-level benchmarks confirms the efficiency and scalability of the architecture. Since the Qdrant engine retrieves vectors in merely 7-15 ms (as shown in the logs), the database layer contributes only a small fraction of the total latency. The HNSW index handles the high-dimensional vector space effectively, so even as the historical dataset expands significantly, retrieval time should remain negligible, preserving the real-time "speed of thought" user experience.

Limitations

Apart from the limitations discussed in the previous chapter (Evaluation), the project has several broader limitations.

Patch Granularity and Redundancy

This project uses a fixed patch size as the basic “semantic unit” for feature extraction. This design introduces several limitations. Large-scale features that span multiple patches may not be observed in their entirety, potentially leading to a loss of semantic context. At the same time, the relatively small patch size combined with a 50% overlap can result in multiple neighboring patches being returned for the same query, introducing redundancy. While such redundancy can improve recall, it may reduce user efficiency when browsing results in the left-side panel.

Lack of a Unified Cross-Modal Evaluation Metric

At present, the project lacks a unified, objective metric for evaluating the quality of multimodal retrieval results. Although Z-score normalization is used to fuse heterogeneous scores from MiniLM (textual semantics) and PE (visual semantics), this approach merely places the scores within the same statistical space and does not guarantee that their semantic importance is truly equivalent. For queries such as "shipyard," assessing the relevance of a returned map patch (T2I) versus an archival record (T2T) is inherently subjective.

Georeferencing Error

Despite the use of the advanced Thin Plate Spline (TPS) algorithm, georeferencing errors remain unavoidable. TPS relies on manually selected ground control points (GCPs) to perform non-linear warping. Since our historical maps contain inherent cartographic inaccuracies and non-standard perspectives, TPS can only approximate the transformation rather than eliminate all errors. At a local scale, particularly in areas far from manually defined anchor points, small spatial drifts persist between historical pixels and modern geographic coordinates. These deviations may cause inaccuracies at the micro level, such as when locating individual buildings or small bridges.

Future Work

Integration of VLM

The next phase of this project will focus on integrating advanced vision-language models (VLMs), such as FastVLM or LLaVA, to upgrade the platform's core analytical capability from simple vector similarity search to full-fledged Visual Question Answering (VQA). The primary objective of VLM integration is to transform the user interaction paradigm. Instead of being limited to searching for specific objects (e.g., querying "bridges"), users will be able to pose complex, comparative, and analytical questions directly about map content. For example: "How has the density of bridges in this area changed compared to the 1740 map?" or "Describe all structures facing the Grand Canal that are labeled as 'church assets'."

To enable VQA, we plan to adopt a dense captioning strategy. A VLM will be used to generate detailed, high-quality textual descriptions for each 224 × 224 map patch. These descriptions will then be stored in an additional text index, serving as a complementary layer to the existing visual embedding index. When a VQA query is issued, the system can perform parallel retrieval over both the PE-based visual vector index and the dense-caption text index, allowing it to combine visual features with generated textual context to answer complex queries.

Integration of Modern POIs and Satellite Image

Future work will involve integrating modern points of interest (POI) data and contemporary satellite imagery to enhance temporal depth and support systematic comparisons between historical and modern urban conditions.

The integration of modern POI data enables direct and automated diachronic analysis of historical urban functions. For example:

  • Functional change tracking: the system can trace whether a “monastery” recorded in 1740 has evolved into a “university library” in the present day, allowing analysis of urban functional transformation across centuries.
  • Validation and correction: modern datasets can be used to verify the accuracy of historical records and to identify biases or errors in model-based recognition.

Through this integration, the platform can evolve into a powerful tool for diachronic urban analysis.

GitHub Repositories

Data pipeline: https://github.com/wuu03/Urban-Semantic-Search

Frontend: https://github.com/SheEagle/Urban-Semantic-Search-Frontend

Backend: https://github.com/SheEagle/Urban-Semantic-Search-Backend

Credits

Course: Foundation of Digital Humanities (DH-405), EPFL
Professor: Frédéric Kaplan
Supervisor: Alexander Rusnak
Authors: Xiru Wang, Jingru Wang
