The project aims to give students an opportunity to experiment with the unsupervised classification of Earth observation data. In this case, the most suitable unsupervised learning approach appears to be clustering. The assignment does not restrict the choice to any particular mission. The goal is experimentation, not necessarily a result applicable in the real world.
The desired result of the project is to group similar observations together without using any labels originally assigned to the product data (images). Another possible output is an embedding that clusters together images capturing similar objects.
The input data should consist of enough products to allow drawing conclusive results from this experiment. The exact number of images/products is not specified here; students should demonstrate an attempt to create a sufficiently large toy dataset. If the source product files capture a large area, the images may be divided into smaller tiles, which can be treated as separate products for the purposes of this project/exercise.
It is up to the students to choose the classification approach, that is, whether to perform pixel-based or object-based classification. The complication in this assignment is that the unsupervised classification/clustering should be performed across multiple product images.
However, directly clustering images represented by the pixel values in all bands might not be useful because of the task’s large computational demands, the difficulty the clustering method would have in capturing differences between the entries, and so on. Before the actual clustering, some dimensionality reduction of the data representing each entry might be required, through either feature selection or feature extraction. In a pixel-based classification approach, the feature extraction might be a method such as principal component analysis. If students choose an approach inspired by object-based classification, techniques such as polygonization might be important (the count and sizes of polygons, etc., could serve as the extracted features).
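As a minimal sketch of the pixel-based variant, principal component analysis can compress the per-pixel band values before clustering. The image below is a synthetic random array standing in for a real product; the band count (12) and component count (3) are arbitrary assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical multiband product: 100x100 pixels, 12 spectral bands.
rng = np.random.default_rng(0)
image = rng.random((100, 100, 12))

# Reshape to one row per pixel, one column per band.
pixels = image.reshape(-1, 12)

# Keep only the first 3 principal components of the band values.
pca = PCA(n_components=3)
reduced = pca.fit_transform(pixels)
print(reduced.shape)  # (10000, 3)
```

The reduced array can then be fed directly to a clustering algorithm in place of the full 12-band representation.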
Some of the problems caused by the dataset’s size could be mitigated using “out-of-core” algorithms; Scikit-learn provides several, for instance Mini-Batch K-Means and Incremental PCA. These models can be recognized by having a “partial_fit” method. Another way to deal with a large dataset might be subsampling the initial dataset.
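A rough sketch of the out-of-core pattern with the two Scikit-learn models named above, feeding data in batches via `partial_fit` (the batches here are random stand-ins for real pixel data; batch size, band count, and cluster count are assumptions):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)
# Stand-in for data streamed from disk: four batches of 500 pixels x 12 bands.
batches = [rng.random((500, 12)) for _ in range(4)]

# First pass: fit the PCA incrementally, one batch at a time.
ipca = IncrementalPCA(n_components=4)
for batch in batches:
    ipca.partial_fit(batch)

# Second pass: project each batch and update the cluster centers incrementally.
kmeans = MiniBatchKMeans(n_clusters=5, random_state=0, n_init=3)
for batch in batches:
    kmeans.partial_fit(ipca.transform(batch))

labels = kmeans.predict(ipca.transform(batches[0]))
print(labels.shape)  # (500,)
```

In a real pipeline each batch would be read from one product file (or tile), so the whole dataset never has to fit into memory at once.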
Some steps of the work
- Review of the state of the art – at least about three pages (preferably more) reviewing how such a task is approached in practice. The page count assumes a standard word-processor page with the default font style and size.
- Construct a toy dataset – a dataset that contains a variety of observations: land, sea, urban areas, agricultural land, forests, …
Use at least 20 different product files if the product images are split into tiles. This step should be supported by a program that downloads the required data and transforms it into the desired form.
- Choose how the clustered data will be represented and select dimensionality reduction methods – raw pixel values, principal components of the pixel values, object-based classification with the extraction of polygons, line fitting, etc. Experiment with different methods on images from your toy dataset.
- Implement your solution – The aim is mainly to utilize third-party libraries and algorithms. The main coding activity in this project should be writing the code that “chains” together data transformation, feature extraction, and clustering.
- Prepare code for visualization of the results – Think about how best to present the results of your experiment: in textual form, graphical form, or both.
- Debug your code on small parts of the toy dataset
- Run the processing on the whole toy dataset
- Prepare a presentation of your results
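The tiling suggested in the dataset-construction step can be sketched as follows. This is a minimal example: a zero-filled array stands in for a real product, the tile size (256) and band count (4) are arbitrary, and tiles that would extend past the image edge are simply discarded:

```python
import numpy as np

def split_into_tiles(image, tile_size):
    """Split an (H, W, bands) array into non-overlapping square tiles.

    Edge regions smaller than tile_size are discarded for simplicity.
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return tiles

# Hypothetical 1000x1000 product with 4 bands, cut into 256x256 tiles.
product = np.zeros((1000, 1000, 4))
tiles = split_into_tiles(product, 256)
print(len(tiles))  # 9 (a 3x3 grid of full tiles fits)
```

Each tile can then be stored and treated as an independent product, as the assignment allows.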
Initial inspiration could be provided by the article “Unsupervised Learning in Satellite Imagery Using Python” by Syam Kakarla. Note, however, that it classifies only a single image. One would need to observe the memory requirements of the task and deal with any resulting issues; out-of-core methods or subsampling might be necessary. It is also necessary to preserve the association of the entries (pixel values) with the original product identifiers when the pixel values are serialized and passed to the clustering algorithm.
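One simple way to preserve the pixel-to-product association is to keep a parallel array of product identifiers alongside the serialized pixel rows. A minimal sketch with two tiny synthetic products (the product names, image sizes, and cluster count are all assumptions made for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical tiny products: two 4x4 images with 3 bands each.
rng = np.random.default_rng(1)
products = {"PRODUCT_A": rng.random((4, 4, 3)),
            "PRODUCT_B": rng.random((4, 4, 3))}

rows, ids = [], []
for product_id, image in products.items():
    pixels = image.reshape(-1, 3)
    rows.append(pixels)
    # Record the source product of every serialized pixel row.
    ids.extend([product_id] * len(pixels))

X = np.vstack(rows)          # (32, 3): all pixels from all products
ids = np.array(ids)          # (32,): parallel array of product identifiers

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Cluster labels can now be traced back to their product of origin.
mask = ids == "PRODUCT_A"
print(labels[mask].shape)  # (16,) labels belonging to PRODUCT_A
```

Because `ids` and `labels` share the same row order, any per-product summary (e.g. the cluster composition of each image) falls out of a simple boolean mask.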
A remote computer for long-running code can be provided to students after discussion.
This project is assigned to Michal Bardzák and Peter Kincel.