Paintings / Photos geolocalisation

From FDHwiki

Revision as of 09:47, 13 December 2020

Introduction

The goal of this project is to locate a given painting or photo of Venice on a map. We use two methods: SIFT, which finds matching key points between images, and a deep learning model. On the final website, a user can upload an image and see its predicted location on the map.

Motivation

Travelling is a near-universal hobby today. Platforms such as Instagram and Flickr let people post their travel photos and share them with strangers. When bloggers attach geographic information to their posts, other users can search for pictures of a specific location and decide whether they want to travel there. But what about an amazing picture with no location indicated? We address this problem in our project. We aim for a solution that locates an image on a map, so that someone who finds a gorgeous picture without geographic information can use our method to find the place and plan a trip there.

Our method should also work on realistic paintings, since the features in those paintings should be similar to those in photos. Art lovers can therefore use our method to locate a painting and visit the scene themselves.

The scope of our project is restricted to Venice.

Project Plan and Milestones

Date Task Completion
By Week 3
  • Brainstorm project ideas and come up with at least one feasible, innovative idea.
  • Prepare slides for initial project idea presentation.
By Week 8
  • Study related work on geolocalisation.
  • Determine the methods to be used.
  • Obtain geo-tagged images from Flickr as the training dataset.
  • Prepare slides for midterm presentation.
By Week 9
  • Implement the SIFT method and try to locate an image based on its feature points.
  • Improve the SIFT method by using multiprocessing.
  • Get results using the SIFT method.
By Week 10
  • Evaluate the results from SIFT.
  • Construct our first regression model based on ResNet101 to obtain preliminary results.
By Week 11
  • Try to fine-tune the first model.
  • Try other possible deep learning models.
  • Finalize the deep learning results.
By Week 12
  • Evaluate the results from deep learning.
  • Implement the website using the Streamlit Python package; the deep learning method will be used on the website.
By Week 13
  • Clean up the code and push it to the GitHub repository.
  • Write project report.
  • Prepare slides for final presentation.
--
By Week 14
  • Finish presentation slides and report writing.
  • Presentation rehearsal and final presentation.

Methodology

Data collection

We use the Python package flickrapi to crawl photos of Venice with geo-coordinates from Flickr. To exclude photos of events and portraits taken in Venice, we set the keywords to "Venice, building". Because the keyword "Venice" can also appear on photos taken elsewhere, we define a latitude and longitude range for Venice and discard returned photos whose geo-coordinates fall outside it. This step produces a text file containing the geo-coordinates of the photos and their URLs.

We repeated the first step several times and found that the number of returned images differed each time. We therefore merged the resulting text files and deleted the duplicate entries. We then used the requests package to download the collected images from the URLs obtained in the previous step, while generating a label text file that maps image file names to their geo-coordinates.

In the end, we obtained 2387 images of Venice buildings with geo-coordinates.
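The filtering and merging steps can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the bounding box values, the `(url, lat, lon)` record layout and the function names `in_bbox` and `merge_runs` are all assumptions, and the photo URL is assumed to identify a photo uniquely.

```python
# Hypothetical bounding box around Venice as (min_lon, min_lat, max_lon, max_lat).
# The values are illustrative only, not the range used in the project.
VENICE_BBOX = (12.30, 45.42, 12.37, 45.45)


def in_bbox(lat, lon, bbox=VENICE_BBOX):
    """Return True if the point lies inside the latitude/longitude range."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def merge_runs(runs, bbox=VENICE_BBOX):
    """Merge several crawl results into one dict {url: (lat, lon)}.

    Each run is a list of (url, lat, lon) tuples. Photos seen in an
    earlier run and photos outside the bounding box are dropped.
    """
    merged = {}
    for run in runs:
        for url, lat, lon in run:
            if url not in merged and in_bbox(lat, lon, bbox):
                merged[url] = (lat, lon)
    return merged


run1 = [("u1", 45.43, 12.33), ("u2", 48.85, 2.35)]   # u2 is outside Venice
run2 = [("u1", 45.43, 12.33), ("u3", 45.44, 12.35)]  # u1 is a duplicate
merged = merge_runs([run1, run2])
```

Keying the merged result on the URL makes deduplication a constant-time lookup, which matters little for a few thousand photos but keeps the merge simple.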

Figure 1: Images Distribution

SIFT

SIFT (scale-invariant feature transform) is a feature detection algorithm that detects and describes local features in images. We use it to detect and describe key points both in the image to be geolocalised and in the images with known geo-coordinates. With these key points, we find the most similar image and use it to geolocalise the query image.

  • Dataset splitting

To check the feasibility of our method, we test it on images with known geo-coordinates. We therefore split our dataset into a testing dataset (whose geo-coordinates are to be found) and a matching dataset (with geo-coordinates). Because our dataset is not large and matching without parallelisation is time-consuming, we randomly choose 2% of the dataset as the test dataset.
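A random 2% split can be sketched as follows. The fixed seed and the truncation of the test-set size to an integer are assumptions made for this illustration. With 2387 images, truncation gives 47 test images, which is consistent with the 47 test images mentioned under Future Work.

```python
import random


def split_dataset(items, test_fraction=0.02, seed=0):
    """Randomly split items into (test set, matching set)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[:n_test], shuffled[n_test:]


test_set, matching_set = split_dataset(range(2387))
```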

  • Scale-invariant feature detection and description

We first transform the image into a collection of feature vectors. Keypoints are defined as the local maxima and minima of the difference-of-Gaussians function in scale space, and each keypoint has a descriptor together with its location, scale and orientation. This process can be done easily with the Python library cv2 (OpenCV).

  • Keypoints matching

To find the most similar image with geo-coordinates, we match the keypoints of each test image against those of every image in the matching dataset. For each candidate image, we sum the distances of the 50 best matched pairs. We choose the image with the smallest sum as the most similar image and assign its geo-coordinates to the test image.
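The selection rule can be illustrated in plain Python once the descriptor distances are available. The sketch below assumes that a keypoint matcher (for example one from OpenCV) has already produced a list of match distances for every candidate image. The function names, the data layout and the choice to skip candidates with fewer than `top_k` matches are assumptions, since the source does not say how such candidates are handled.

```python
def match_score(distances, top_k=50):
    """Sum of the top_k smallest match distances for one candidate image."""
    return sum(sorted(distances)[:top_k])


def locate(candidates, top_k=50):
    """Pick the candidate with the smallest score and return its coordinates.

    candidates maps an image name to (match distances, (lat, lon)).
    Candidates with fewer than top_k matches are skipped (an assumption),
    because a short list would otherwise get an unfairly small sum.
    Raises ValueError if no candidate has enough matches.
    """
    scores = {
        name: match_score(distances, top_k)
        for name, (distances, _) in candidates.items()
        if len(distances) >= top_k
    }
    best = min(scores, key=scores.get)
    return best, candidates[best][1]


# Toy example with top_k=2 instead of 50 to keep the lists short.
candidates = {
    "a.jpg": ([10.0, 20.0, 30.0], (45.43, 12.33)),
    "b.jpg": ([5.0, 8.0, 40.0], (45.44, 12.34)),
    "c.jpg": ([1.0], (45.00, 12.00)),  # too few matches, skipped
}
best_name, best_coords = locate(candidates, top_k=2)
```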

  • Error analysis

For each matched pair, we calculate the MSE (mean squared error) of latitude and longitude. To assess our result, we visualise the distribution of the MSE and give a 95% CI for its median by bootstrapping.
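One plausible reading of the per-pair MSE is the mean of the squared latitude error and the squared longitude error. The sketch below assumes that definition and assumes coordinates in decimal degrees, so the result is in squared degrees. The source does not spell out the exact formula.

```python
def coordinate_mse(predicted, actual):
    """Mean of the squared latitude and longitude errors (squared degrees)."""
    (pred_lat, pred_lon), (true_lat, true_lon) = predicted, actual
    return ((pred_lat - true_lat) ** 2 + (pred_lon - true_lon) ** 2) / 2


# An error of 0.01 degrees in both coordinates gives an MSE of about 1e-04.
example_mse = coordinate_mse((45.44, 12.34), (45.43, 12.33))
```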

Deep Learning

The idea of using a deep learning model to find the geo-coordinates of an image is inspired by Wolfram. However, instead of classification, we use a regression model to predict the latitude and longitude of an image directly. We implement the model with the TensorFlow Keras module in Python, which provides the architectures of several CNN models together with pre-trained weights. This is essential because we are not sure that the data we collected is enough to train a model from scratch. We take a model pre-trained on ImageNet, freeze the weights of its main layers, and modify the input and output layers to make the network suitable for our purpose. During training, only the weights of the layers we modified are updated.
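The transfer-learning setup described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's model: the 224x224 input size, the global average pooling, the single `Dense(2)` output head, the Adam optimizer and the MSE loss are all choices made here, and images are assumed to be preprocessed before they reach the model.

```python
import tensorflow as tf


def build_regressor(weights="imagenet", input_shape=(224, 224, 3)):
    """Frozen ResNet101 backbone with a 2-unit regression head (lat, lon)."""
    base = tf.keras.applications.ResNet101(
        include_top=False,       # drop the ImageNet classification layer
        weights=weights,         # "imagenet" downloads pre-trained weights
        input_shape=input_shape,
        pooling="avg",           # global average pooling -> one feature vector
    )
    base.trainable = False       # freeze the pre-trained layers

    inputs = tf.keras.Input(shape=input_shape)
    features = base(inputs, training=False)
    outputs = tf.keras.layers.Dense(2, name="lat_lon")(features)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```

With the backbone frozen, only the kernel and bias of the new output layer are updated during training, which keeps the number of trainable parameters small relative to the size of the dataset.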

  • Model Selection

In the Wolfram project, ResNet101 was trained on YFCC100m geo-tagged data and was shown to give quite satisfactory predictions. In another project, from the University of Cambridge, a modified GoogLeNet is used to predict the camera position of a given image. Therefore, when selecting a model for our project, we mainly tried Inception and ResNet models.

Model | Learning curve | Average distance | Min distance | Max distance
ResNet50 | ResNet50.JPG | 727 m | 37 m | 2869 m
ResNet50V2 | Resnet50V2.JPG | 1777 m | 134 m | 11804 m
ResNet101 | Resnet101.JPG | 695 m | 33 m | 3490 m
ResNet101V2 | Resnet101V2.JPG | 1333 m | 49 m | 8218 m
InceptionResNetV2 | InceptionResnetV2.JPG | 1633 m | 126 m | 7461 m
InceptionV3 | InceptionV3.JPG | 1799 m | 28 m | 5492 m

We compare the six models. ResNet50 and ResNet101 converge faster than the other models and give better predictions. We finally chose ResNet101, as it converges even faster than ResNet50 and, given Wolfram's experience with it, we think it is suitable for our purpose.
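Distance figures such as the average, minimum and maximum reported for each model can be computed from predicted and true coordinates. The source does not state which distance formula was used. The sketch below assumes the haversine great-circle distance with a mean Earth radius of 6371 km, and the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_summary(predictions, truths):
    """Average, min and max distance over paired (lat, lon) tuples."""
    distances = [
        haversine_m(p[0], p[1], t[0], t[1])
        for p, t in zip(predictions, truths)
    ]
    return {
        "average": sum(distances) / len(distances),
        "min": min(distances),
        "max": max(distances),
    }


summary = distance_summary(
    predictions=[(45.0, 12.0), (46.0, 12.0)],
    truths=[(45.0, 12.0), (45.0, 12.0)],
)
```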

  • Fine-Tuning

Web Implementation

Assessment

SIFT Results

  • Sample of matching results
1625.jpg
1587.jpg
1613.jpg
1545.jpg
583.jpg
573.jpg
1031.jpg
1030.jpg


  • Distribution of MSE

We plot the kernel density estimate and the histogram of the MSE to see how it is distributed.

Figure 2: MSE kdeplot
Figure 3: MSE distplot
  • Bootstrapping

Because our dataset is not large, we use bootstrapping to estimate the distribution of the median MSE. With 10,000 bootstrap iterations, the 95% CI of the median is [1.199e-05, 9.385e-05] and the 50% CI is [3.817e-05, 6.038e-05]. Combined with our visualisation results, we estimate that the MSE of SIFT geolocalisation is on the order of 1e-05, which corresponds to a distance error on the order of a kilometre.
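A percentile bootstrap for the median, together with a rough conversion from MSE to distance, can be sketched as follows. Several things here are assumptions: the percentile method itself, the seed, the reading of the MSE as squared degrees, and the approximation of 111 km per degree (which holds for latitude; a degree of longitude is shorter at Venice's latitude). The sample values are synthetic and are not the project's data.

```python
import math
import random
import statistics


def bootstrap_median_ci(values, n_boot=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = max(round((1 - level) / 2 * n_boot), 0)
    hi_idx = min(round((1 + level) / 2 * n_boot) - 1, n_boot - 1)
    return medians[lo_idx], medians[hi_idx]


def mse_to_km(mse_deg2, km_per_degree=111.0):
    """Rough distance for a given MSE in squared degrees (latitude scale)."""
    return math.sqrt(mse_deg2) * km_per_degree


# Synthetic MSE values for 47 test images, for illustration only.
sample = [1e-5 * i for i in range(1, 48)]
ci_low, ci_high = bootstrap_median_ci(sample, n_boot=2000)
```

Under these assumptions, an MSE of 5e-05 corresponds to a root error of about 0.007 degrees, or roughly 0.8 km.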

Deep Learning Results

Figure 4: Distance Distribution

Future Work

Multiprocessing SIFT

Our SIFT matching method is time-consuming. We tested 47 images, and it took about 3 minutes to find the most similar image for each one. With a larger matching dataset, this method may not be practical because of the long processing time. We tried simple CPU multiprocessing, but the running time did not improve dramatically. We may therefore consider other multiprocessing methods and a specialised data structure for storing the SIFT descriptors, which could shorten the matching time.

ResNet Classification

Our deep learning method predicts the geo-coordinates by regression. Inspired by the geolocation model offered by Wolfram, which classifies the location, we are considering whether a classification model would give better results. We could simply use our own grid to assign each image to a cell. Alternatively, Google offers the S2 Geometry Library, which represents all data on a three-dimensional sphere (similar to a globe). This makes it possible to build a worldwide geographic database with no seams or singularities, using a single coordinate system, and with low distortion everywhere compared to the true shape of the Earth. It offers cells at different levels, whose average area ranges from 85011012.19 km2 to 0.74 cm2. With the S2 library, each image can be assigned to a cell, and we can then train our geolocation classification model. However, choosing the cell level that gives the best results requires more experiments.
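The simpler "own grid" option can be sketched without any external library. The bounding box, the grid resolution and the function names below are hypothetical. The sketch uses a regular latitude/longitude grid, not S2 cells.

```python
def grid_cell(lat, lon, bbox, n_rows, n_cols):
    """Map a point to a class label in [0, n_rows * n_cols)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        raise ValueError("point lies outside the bounding box")
    # Points on the upper edges are clamped into the last row/column.
    row = min(int((lat - min_lat) / (max_lat - min_lat) * n_rows), n_rows - 1)
    col = min(int((lon - min_lon) / (max_lon - min_lon) * n_cols), n_cols - 1)
    return row * n_cols + col


def cell_center(label, bbox, n_rows, n_cols):
    """Return the (lat, lon) centre of a cell, e.g. to report a prediction."""
    min_lon, min_lat, max_lon, max_lat = bbox
    row, col = divmod(label, n_cols)
    lat = min_lat + (row + 0.5) * (max_lat - min_lat) / n_rows
    lon = min_lon + (col + 0.5) * (max_lon - min_lon) / n_cols
    return lat, lon


# Hypothetical 4 x 4 grid over an illustrative bounding box.
bbox = (12.30, 45.42, 12.38, 45.46)  # (min_lon, min_lat, max_lon, max_lat)
label = grid_cell(45.445, 12.35, bbox, n_rows=4, n_cols=4)
center_lat, center_lon = cell_center(0, bbox, n_rows=4, n_cols=4)
```

A finer grid lowers the best achievable distance error but leaves fewer training images per class, which is the same trade-off as choosing an S2 cell level.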

Multi-source Data

Currently, we do not have a large dataset of images with geo-coordinates. We may try to get more geo-tagged images from other social media such as Facebook or Instagram. We could also let users upload images with geo-coordinates. However, one problem remains for such a multi-source dataset: how can we check the accuracy of the geo-coordinates? Inaccurate coordinates could affect our results dramatically.

Links

Paintings/Photos geolocalisation GitHub

References

  1. Estimating the Location of Images Using Apache MXNet and Multimedia Commons Dataset on AWS EC2
  2. Wolfram Neural Net Repository
  3. Visual Localisation project from University of Cambridge