Paintings / Photos geolocalisation
Introduction
The goal of this project is to locate a given painting or photo of Venice on a map. We use two methods to achieve this: the first uses SIFT to find matching key points between images, and the second uses a deep learning model. On the final website, a user can upload an image and the predicted location of the image is shown on the map.
Motivation
Travelling is a near-universal hobby today. Platforms such as Instagram and Flickr let people post their travel photos and share them with strangers. When posters attach geographic information, other users can search for pictures taken at a specific location and decide whether they want to travel there. But what about an amazing picture with no location indicated? We address this problem in our project. We aim for a solution that locates an image on a map, so that someone who finds a gorgeous picture without geographic information can use our method to find the place and plan a trip there.
Our method should also work on realistic paintings, since the features in those paintings should be similar to the features in photos. Art lovers can therefore use our method to locate a painting and visit the scene it depicts.
The scope of our project is restricted to Venice.
Project Plan and Milestones
Date | Task | Completion |
---|---|---|
By Week 3 | | ✓ |
By Week 8 | | ✓ |
By Week 9 | | ✓ |
By Week 10 | | ✓ |
By Week 11 | | ✓ |
By Week 12 | | ✓ |
By Week 13 | | -- |
By Week 14 | | |
Methodology
Data collection
We use the Python package flickrapi to crawl photos with geo-coordinates inside Venice from Flickr. To exclude photos of events and portraits taken in Venice, we set the keywords to "Venice, building". Because the keyword "Venice" can also appear on photos taken elsewhere, we define a latitude and longitude range for Venice and discard returned photos whose geo-coordinates fall outside it. This step produces a text file containing the geo-coordinates of the photos and the URLs of the photos.
We repeated this first step several times and found that the number of returned images differs from run to run. We therefore merged the text files and removed the duplicated images. We then use the requests package to download the collected images from the URLs obtained in the previous step and, at the same time, generate a label text file with the geo-coordinates and the corresponding image file names.
In total, we obtain 2387 images of Venice buildings with geo-coordinates.
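The filtering and merging steps described above can be sketched as follows. This is a minimal sketch, not the project's actual script: it assumes each crawl yields `(url, lat, lon)` records, the function names `in_bbox` and `merge_crawls` are hypothetical, and the bounding-box values are illustrative placeholders rather than the range used in the project.

```python
# Illustrative bounding box roughly around Venice (assumed values, not the
# range used in the project).
VENICE_BBOX = {"lat_min": 45.40, "lat_max": 45.46, "lon_min": 12.29, "lon_max": 12.38}


def in_bbox(lat, lon, bbox=VENICE_BBOX):
    """Return True if the coordinate lies inside the bounding box."""
    return (bbox["lat_min"] <= lat <= bbox["lat_max"]
            and bbox["lon_min"] <= lon <= bbox["lon_max"])


def merge_crawls(crawls, bbox=VENICE_BBOX):
    """Merge several crawl results into one list of unique, in-range photos.

    Each crawl is a list of (url, lat, lon) tuples. The URL serves as the
    duplicate key, and the first occurrence of each URL is kept.
    """
    merged = {}
    for crawl in crawls:
        for url, lat, lon in crawl:
            if url not in merged and in_bbox(lat, lon, bbox):
                merged[url] = (lat, lon)
    return [(url, lat, lon) for url, (lat, lon) in merged.items()]
```

Using the URL as the duplicate key is a design choice for this sketch; a photo ID would work equally well if the crawl output contains one.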
SIFT
SIFT (scale-invariant feature transform) is an algorithm that detects and describes local features in images. We use it to detect and describe key points both in the image to be geolocalised and in the images with known geo-coordinates. With these key points, we find the most similar image and use it to geolocalise the query image.
- Dataset splitting
To check the feasibility of our method, we test it on images whose geo-coordinates are known. We therefore split our dataset into a test dataset (images to be located) and a matching dataset (images with geo-coordinates). Because our dataset is not large enough and matching without parallelisation is time-consuming, we randomly choose 2% of the dataset as the test dataset.
- Scale-invariant feature detection and description
We first represent each image as a collection of feature vectors. Keypoints are defined as the local maxima and minima of the difference-of-Gaussians function in scale space, and each keypoint has a descriptor together with its location, scale and orientation. This step can be carried out with the Python library cv2 (OpenCV).
- Keypoints matching
To find the most similar image with geo-coordinates, we match the keypoints of each test image against those of every image in the matching dataset. For each candidate image, we then sum the distances of the 50 best-matched pairs. We choose the image with the smallest summed distance as the most similar image and assign its geo-coordinates to the test image.
- Error analysis
For each matched pair, we calculate the MSE (mean squared error) of latitude and longitude. To assess our results, we visualise the distribution of the MSE and give a 95% CI for its median value by bootstrapping.
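The keypoint matching step above can be illustrated with a small sketch. It is written under stated assumptions and is not the project's code: plain tuples of floats stand in for the SIFT descriptors that a library such as OpenCV would produce, matching is a brute-force nearest-neighbour search with Euclidean distance, and the names `match_score` and `locate` are hypothetical.

```python
import math


def nearest_distances(query_desc, candidate_desc):
    """For each query descriptor, distance to its closest candidate descriptor."""
    return [min(math.dist(q, c) for c in candidate_desc) for q in query_desc]


def match_score(query_desc, candidate_desc, top_k=50):
    """Sum of the top_k smallest match distances (lower means more similar)."""
    distances = sorted(nearest_distances(query_desc, candidate_desc))
    return sum(distances[:top_k])


def locate(query_desc, database, top_k=50):
    """Return the coordinates of the most similar database image.

    `database` is a list of (descriptors, (lat, lon)) pairs; the image with
    the smallest summed distance wins and its coordinates are returned.
    """
    _, best_coords = min(
        database, key=lambda item: match_score(query_desc, item[0], top_k)
    )
    return best_coords
```

Note that when an image yields fewer than `top_k` matches, the sum covers fewer terms, which biases the score towards such images; a real pipeline would need to handle that case explicitly.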
Deep Learning
The idea of using a deep learning model to find the geo-coordinates of an image is inspired by Wolfram. However, instead of a classification method, we use a regression model that predicts the latitude and longitude of an image directly. We implement the model with the TensorFlow Keras module in Python, which provides the architectures of several CNN models together with pre-trained weights. This is essential because we are not sure that the data we collected is enough to train a model from scratch. We take a model pre-trained on ImageNet, freeze the weights of its main layers, and modify the input and output layers to make the neural network suitable for our purpose. During training, only the weights of the layers we modified are updated.
- Model Selection
In the Wolfram project, ResNet101 was trained on YFCC100m geo-tagged data and was shown to give quite satisfactory predictions. In another project, carried out at Cambridge University, a modified GoogLeNet is used to predict the camera position of a given image. When selecting the models for our project, we therefore mainly tried Inception and ResNet models.
Model | Learning Curve | Predicted Results | Model | Learning Curve | Predicted Results |
---|---|---|---|---|---|
ResNet50 | | Average distance: 727 m, min distance: 37 m, max distance: 2869 m | ResNet50V2 | | Average distance: 1777 m, min distance: 134 m, max distance: 11804 m |
ResNet101 | | Average distance: 695 m, min distance: 33 m, max distance: 3490 m | ResNet101V2 | | Average distance: 1333 m, min distance: 49 m, max distance: 8218 m |
InceptionResNetV2 | | Average distance: 1633 m, min distance: 126 m, max distance: 7461 m | InceptionV3 | | Average distance: 1799 m, min distance: 28 m, max distance: 5492 m |
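The distances reported in the table compare predicted and true coordinates. The text does not state how they were computed; one common choice is the haversine great-circle distance, sketched below as an assumption. The function names are hypothetical and the Earth radius is the usual mean value.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_summary(true_coords, pred_coords):
    """Average, minimum and maximum error distance over paired coordinates."""
    distances = [haversine_m(*t, *p) for t, p in zip(true_coords, pred_coords)]
    return {
        "average": sum(distances) / len(distances),
        "min": min(distances),
        "max": max(distances),
    }
```

Over an area as small as Venice, a flat-earth approximation would give nearly the same numbers; the haversine form is used here only because it needs no projection.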
- Fine-Tuning
Web Implementation
Assessment
SIFT Results
- Sample of matching results
- Distribution of MSE
We visualize the kernel density estimate and the histogram of the MSE to see how it is distributed.
- Bootstrapping
Because our dataset is not large enough, we use bootstrapping to estimate the distribution of the median MSE. With 10000 bootstrap iterations, the 95% CI of the median is [1.199e-05, 9.385e-05] and the 50% CI is [3.817e-05, 6.038e-05]. Combined with our visualization results, we estimate that the typical MSE of SIFT geolocalisation is of the order of 1e-05 (in squared degrees). Taking the square root gives an error of a few thousandths of a degree, which means the distance error is roughly of kilometre order of magnitude.
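The error metric and the bootstrap procedure can be sketched as follows. This is a minimal sketch of a percentile bootstrap for the median, assuming coordinates in degrees; the function names, the fixed seed, and the percentile method itself are assumptions, since the text does not say which bootstrap variant was used.

```python
import random
import statistics


def coord_mse(true_coord, pred_coord):
    """Mean squared error over latitude and longitude for one image pair."""
    (lat_t, lon_t), (lat_p, lon_p) = true_coord, pred_coord
    return ((lat_t - lat_p) ** 2 + (lon_t - lon_p) ** 2) / 2


def bootstrap_median_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return lower, upper
```

Calling `bootstrap_median_ci(errors, level=0.5)` on the same list of per-image errors gives the narrower 50% interval.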
Deep Learning Results
Future Work
Multiprocessing SIFT
Our SIFT matching method is time-consuming. We tested 47 images, and finding the most similar image took about 3 minutes per test image. With a larger matching dataset, this method may become impractical because of the long processing time. We tried simple CPU multiprocessing, but it did not improve the run time dramatically. We may therefore consider other parallelisation methods and a specialised data structure for storing the SIFT descriptors, which could shorten the matching time.
ResNet Classification
Our deep learning method currently predicts the geo-coordinates by regression. Inspired by the geolocation model offered by Wolfram, which classifies the location, we are considering whether a classification model would give better results. We could simply define our own grid and assign each image to a grid cell. Alternatively, Google offers the S2 Geometry Library, which represents all data on a three-dimensional sphere (similar to a globe). This makes it possible to build a worldwide geographic database with no seams or singularities, using a single coordinate system, and with low distortion everywhere compared to the true shape of the Earth. It offers cells at different levels, whose average area ranges from 85011012.19 km² to 0.74 cm². With the S2 library, each image can be assigned to a cell, and we can then train our geolocation classification model. However, choosing the cell level that gives the best results requires more experiments.
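The "own grid" option mentioned above can be sketched in a few lines. This is a hypothetical illustration, not a planned implementation: the bounding box is passed as `(lat_min, lat_max, lon_min, lon_max)`, the grid is a plain equal-angle grid rather than S2 cells, and the function names are assumptions.

```python
def grid_cell(lat, lon, bbox, n_rows, n_cols):
    """Map a coordinate to a class index in an n_rows x n_cols grid over bbox."""
    lat_min, lat_max, lon_min, lon_max = bbox
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        raise ValueError("coordinate lies outside the grid")
    # Clamp so that points on the upper edges fall into the last row/column.
    row = min(int((lat - lat_min) / (lat_max - lat_min) * n_rows), n_rows - 1)
    col = min(int((lon - lon_min) / (lon_max - lon_min) * n_cols), n_cols - 1)
    return row * n_cols + col


def cell_center(cell, bbox, n_rows, n_cols):
    """Return the (lat, lon) centre of a grid cell, usable as the prediction."""
    lat_min, lat_max, lon_min, lon_max = bbox
    row, col = divmod(cell, n_cols)
    lat = lat_min + (row + 0.5) * (lat_max - lat_min) / n_rows
    lon = lon_min + (col + 0.5) * (lon_max - lon_min) / n_cols
    return lat, lon
```

A classifier trained on the `grid_cell` labels would output a cell index, and `cell_center` converts that index back to coordinates. The grid resolution then plays the same role as the S2 cell level: finer cells allow more precise predictions but leave fewer training images per class.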
Multi-source Data
Currently, we do not have a large dataset of images with geo-coordinates. We may try to get more geo-tagged images from other social media such as Facebook or Instagram. We could also let users upload images with geo-coordinates. However, one problem remains for this kind of multi-source dataset: how can we check the accuracy of the geo-coordinates? Inaccurate coordinates could affect our results dramatically.
Links
Paintings/Photos geolocalisation GitHub