Paintings / Photos geolocalisation

Introduction

The goal of this project is to locate a given painting or photo of Venice on the map. We use two methods: the first uses SIFT to find matching keypoints between images, and the second uses a deep learning model. On the final website, a user can upload an image and the predicted location of the image is shown on the map.

Motivation

Travelling is a near-universal hobby nowadays. Platforms such as Instagram and Flickr let people post their travel photos and share them with strangers. When bloggers attach geographic information to their posts, other users can search for pictures of a specific location and decide whether they want to travel there. But what about an amazing picture with no location indicated? Our project addresses this problem. We aim for a solution that locates an image on the map, so that someone who finds a gorgeous picture without geographic information can use our method to find the place and plan a trip there.

Our method should also work on realistic paintings, since the features in such paintings should be similar to those in photos. Art lovers can therefore use our method to locate a painting and stand in the painted scene themselves.

The scope of our project is restricted to Venice.

Project Plan and Milestones

Date	Task	Completion
By Week 3	Brainstorm project ideas, come up with at least one feasible innovative idea. Prepare slides for initial project idea presentation.	✓
By Week 8	Study related work on geolocalisation. Determine the methods to be used. Obtain geo-tagged images from Flickr as the training dataset. Prepare slides for the midterm presentation.	✓
By Week 9	Implement the SIFT method and try to locate an image based on its feature points. Speed up the SIFT method with multiprocessing. Obtain results with the SIFT method.	✓
By Week 10	Evaluate the results from SIFT. Construct our first regression model based on ResNet101 to obtain preliminary results.	✓
By Week 11	Try to fine-tune the first model. Try other possible deep learning models. Finalize the deep learning results.	✓
By Week 12	Evaluate the results from deep learning. Implement the website using the Streamlit Python package; the website uses the deep learning method.	✓
By Week 13	Clean up the code and push it to the GitHub repository. Write the project report. Prepare slides for the final presentation.	--
By Week 14	Finish presentation slides and report writing. Presentation rehearsal and final presentation.

Methodology

Data collection

We use the Python package flickrapi to crawl photos with geo-coordinates inside Venice from Flickr. To exclude photos of events and portraits taken in Venice, we set the keywords to "Venice, building". Since the keyword "Venice" may also appear on photos taken elsewhere, we define a latitude and longitude range for Venice and discard returned photos whose geo-coordinates fall outside it. After this step, we generate a text file containing the geo-coordinates of the photos and the URLs of the photos.
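The bounding-box check described above can be sketched as follows. This is a minimal illustration that works on records already returned by the crawler; the `Photo` record, the function names and the box limits in `VENICE_BBOX` are assumptions made for the example, not the values used in the project.

```python
from typing import NamedTuple


class Photo(NamedTuple):
    """Hypothetical record for one crawled photo."""
    url: str
    lat: float
    lon: float


# Illustrative limits only (min_lat, max_lat, min_lon, max_lon);
# the range actually used in the project is not stated.
VENICE_BBOX = (45.40, 45.46, 12.29, 12.38)


def in_bbox(photo, bbox=VENICE_BBOX):
    """Return True if the photo's coordinates lie inside the box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= photo.lat <= max_lat and min_lon <= photo.lon <= max_lon


def filter_photos(photos, bbox=VENICE_BBOX):
    """Keep only the photos whose geo-coordinates fall inside the box."""
    return [p for p in photos if in_bbox(p, bbox)]
```

A photo tagged "Venice" but taken in Venice, California would be dropped by this check, because its coordinates lie far outside the box.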

We repeat the first step several times and find that the number of returned images differs each time. We therefore merge the resulting text files and delete the duplicated images. Then we use the requests package to download the collected images from the URLs obtained in the previous step. At the same time, we generate a label text file with the geo-coordinates and the corresponding image file names.

In the end, we obtain 2387 images of Venice buildings with geo-coordinates.
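A minimal sketch of the merge-and-deduplicate step, assuming each line of a crawl result has the hypothetical form `url<TAB>latitude<TAB>longitude` and that a photo is identified by its URL. The real file layout is not given in this report, so the parsing would need to be adapted.

```python
def merge_crawls(crawls):
    """Merge several crawl results, keeping the first record seen per URL.

    `crawls` is an iterable of line lists, one list per crawl run.
    Returns a dict mapping url -> (latitude, longitude).
    """
    merged = {}
    for lines in crawls:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url, lat, lon = line.split("\t")
            # setdefault leaves an already-seen URL untouched
            merged.setdefault(url, (float(lat), float(lon)))
    return merged
```

Since Python dictionaries preserve insertion order, the merged result lists the photos in the order they were first seen.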

Figure 1: Images Distribution

SIFT

SIFT (scale-invariant feature transform) is a feature detection algorithm that detects and describes local features in images. We use it to detect and describe keypoints both in the image to be geolocalised and in the images with known geo-coordinates. With these keypoints, we find the most similar geo-tagged image and assign its location to the image to be geolocalised.

Dataset splitting

To check the feasibility of our method, we test it on images with known geo-coordinates. We therefore split our dataset into a testing dataset (whose geo-coordinates are to be found) and a matching dataset (whose geo-coordinates are known). Because our dataset is not large and matching without parallelisation is time-consuming, we randomly choose 2% of the dataset as the testing dataset.
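The random split can be sketched as below. The function name, the fixed seed and the rounding rule are assumptions for illustration; the report does not say how the 2% sample was drawn.

```python
import random


def split_dataset(items, test_fraction=0.02, seed=0):
    """Randomly split items into (testing, matching) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    # At least one test item; the rounding rule is an assumption.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:n_test], shuffled[n_test:]
```

With 2387 images and this rounding rule, the sketch holds out 48 images for testing and keeps 2339 for matching.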

Scale-invariant feature detection and description

We first transform the image into a collection of feature vectors. Keypoints are defined as the local maxima and minima of the difference-of-Gaussians function in scale space, and each keypoint comes with a descriptor as well as its location, scale and orientation. This step is straightforward with the Python library OpenCV (cv2).

Keypoints matching

To find the most similar image with geo-coordinates, we match the keypoints of each test image against those of every image in the matching dataset. For each candidate image, we sum the distances of the 50 best-matched keypoint pairs. We choose the image with the smallest sum as the most similar image and assign its geo-coordinates to the test image.
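The selection rule can be sketched independently of the matcher. The example assumes the keypoint-pair distances for each candidate image have already been computed (for instance by a descriptor matcher) and are passed in as plain lists. Skipping candidates with fewer than 50 matches is an assumption added here so that a short list cannot win through a small sum alone; the report does not say how such cases were handled.

```python
def match_score(distances, top_k=50):
    """Sum of the top_k smallest pair distances (lower means more similar)."""
    return sum(sorted(distances)[:top_k])


def best_match(candidates, top_k=50):
    """Return the name of the candidate image with the smallest score.

    `candidates` maps image name -> list of keypoint-pair distances
    between that image and the test image.
    """
    scores = {
        name: match_score(distances, top_k)
        for name, distances in candidates.items()
        if len(distances) >= top_k  # assumption: ignore images with too few matches
    }
    if not scores:
        raise ValueError("no candidate has at least top_k matched pairs")
    return min(scores, key=scores.get)
```

The geo-coordinates stored for the returned image name are then assigned to the test image.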

Error analysis

For each matched pair, we calculate the MSE (mean squared error) of the latitude and longitude. To assess our result, we visualise the distribution of the MSE and give a 95% CI for its median by bootstrapping.
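As a small illustration, the per-pair error can be computed as below. The sketch assumes the MSE is the mean of the squared latitude error and the squared longitude error, both in degrees; the exact definition used in the project is not spelled out.

```python
def coordinate_mse(predicted, actual):
    """Mean of the squared latitude and longitude errors, in squared degrees.

    Both arguments are (latitude, longitude) tuples in degrees.
    """
    d_lat = predicted[0] - actual[0]
    d_lon = predicted[1] - actual[1]
    return (d_lat ** 2 + d_lon ** 2) / 2
```

For example, a prediction that is off by 0.01 degrees in both latitude and longitude has an MSE of about 1e-04 squared degrees.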

Deep Learning

The idea of using a deep learning model to find the geo-coordinates of an image is inspired by Wolfram. However, instead of a classification method, we use a regression model that predicts the latitude and longitude of an image directly. We implement the model with the Keras module of TensorFlow in Python, which provides the architectures of several kinds of CNN models together with pre-trained weights. This is essential, since we are not sure that the data we collected is enough to train a model from scratch. We take a model pre-trained on ImageNet, freeze the weights of its main layers, and modify the input and output layers to make the network suitable for our purpose. During training, only the weights of the layers we modified are updated.
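A minimal sketch of this transfer-learning setup with a frozen ResNet101 backbone and a two-unit regression head. The input size, the global average pooling, the Adam optimizer, the MSE loss and the names `build_regressor` and `lat_lon` are assumptions chosen for illustration, not the project's actual configuration.

```python
import tensorflow as tf


def build_regressor(input_shape=(224, 224, 3), weights="imagenet"):
    """Frozen ResNet101 backbone with a 2-unit head for (latitude, longitude)."""
    base = tf.keras.applications.ResNet101(
        include_top=False,   # drop the ImageNet classification layer
        weights=weights,     # "imagenet" loads the pre-trained weights
        input_shape=input_shape,
        pooling="avg",       # assumption: global average pooling
    )
    base.trainable = False   # freeze the pre-trained layers

    inputs = tf.keras.Input(shape=input_shape)
    features = base(inputs, training=False)
    outputs = tf.keras.layers.Dense(2, name="lat_lon")(features)

    model = tf.keras.Model(inputs, outputs)
    # Assumption: plain MSE on the coordinates with the Adam optimizer.
    model.compile(optimizer="adam", loss="mse")
    return model
```

Calling `build_regressor()` downloads the ImageNet weights on first use. Only the kernel and bias of the final dense layer are trainable in this sketch, and input images would still need the preprocessing that matches the chosen backbone.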

Model Selection

In the Wolfram project, ResNet101 was trained on the YFCC100M geo-tagged data and was shown to give satisfactory predictions. In another project, carried out at the University of Cambridge, a modified GoogLeNet was used to predict the camera position of a given image. Therefore, when selecting the model for our project, we mainly try Inception and ResNet models.

Model	Learning Curve	Predict Results
ResNet50
ResNet50V2
InceptionResNetV2
InceptionV3
ResNet101V2
ResNet101

Fine-Tune Model

Web Implementation

Assessment

SIFT Results

Sample of matching results

Distribution of MSE

We visualize the kernel density estimate and the histogram of the MSE to see how it is distributed.

Figure 2: MSE kdeplot

Figure 3: MSE distplot

Bootstrapping

Because we do not have a large enough dataset, we use bootstrapping to estimate the distribution of the median MSE. With 10000 bootstrap resamples, the 95% CI of the median is [1.199e-05, 9.385e-05] and the 50% CI is [3.817e-05, 6.038e-05]. Combined with our visualization results, we estimate that the MSE of SIFT geolocalisation is on the order of 1e-05 squared degrees. This corresponds to a root-mean-square error of a few thousandths of a degree, and since one degree of latitude is about 111 km, the distance error is roughly on the kilometre scale.
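A percentile bootstrap for the median can be sketched with the standard library alone. This is one common way to build such an interval, and it is an assumption that the project used the percentile variant; the function name and the seed are also illustrative.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper
```

Calling the function with `confidence=0.95` and `confidence=0.5` on the list of per-image MSE values gives the two kinds of interval reported in this section.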
