Contributing¶

The TorchGeo project welcomes contributions and suggestions! If you think you’ve found a bug or would like to suggest a new feature, you can open an issue on GitHub. TorchGeo is an open-source community-supported project, so we try to address issues in order of severity or impact. If you feel confident, the fastest way to make changes to TorchGeo is to submit a pull request. This guide explains everything you need to know about contributing to TorchGeo.

Note

TorchGeo is a library for geospatial datasets, transforms, and models. If you would like to add a new transform or model that doesn’t involve geospatial data or isn’t specific to the remote sensing domain, you’re better off adding it to a general purpose computer vision library like torchvision or Kornia.

Git¶

All development is done on GitHub. If you would like to submit a pull request, you’ll first want to fork https://github.com/microsoft/torchgeo. Then, clone the repository using:

$ git clone https://github.com/<your-username>/torchgeo.git

From there, you can make any changes you want. Once you are satisfied with your changes, you can commit them and push them back to your fork. If you want to make multiple changes, it’s best to create separate branches and pull requests for each change:

$ git checkout main
$ git branch <descriptive-branch-name>
$ git checkout <descriptive-branch-name>
$ git add <files-you-changed...>
$ git commit -m "descriptive commit message"
$ git push

For changes to Python code, you’ll need to ensure that your code is well-tested and all linters pass. When you’re ready, you can open a pull request on GitHub. All pull requests should be made against the main branch. If it’s a bug fix, we will backport it to a release branch for you.

Licensing¶

TorchGeo is licensed under the MIT License. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://opensource.microsoft.com/cla/.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

If your pull request adds any new files containing code, including *.py and *.ipynb files, you’ll need to add the following comment to the top of the file:

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

Tests¶

TorchGeo uses GitHub Actions for Continuous Integration. We run a suite of unit tests on every commit to ensure that pull requests don’t break anything. If you submit a pull request that adds or modifies any Python code, we require unit tests for that code before the pull request can be merged.

For example, if you add a new dataset in torchgeo/datasets/foo.py, you’ll need to create corresponding unit tests in tests/datasets/test_foo.py. The easiest way to do this is to find unit tests for similar datasets and modify them for your dataset. These tests can then be run with pytest:

$ pytest --cov=torchgeo/datasets --cov-report=term-missing tests/datasets/test_foo.py
========================= test session starts =========================
platform darwin -- Python 3.10.11, pytest-6.2.4, py-1.9.0, pluggy-0.13.0
rootdir: ~/torchgeo, configfile: pyproject.toml
plugins: mock-1.11.1, anyio-3.2.1, cov-2.8.1, nbmake-0.5
collected 7 items

tests/datasets/test_foo.py .......                              [100%]

---------- coverage: platform darwin, python 3.10.11-final-0 -----------
Name                                      Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------
torchgeo/datasets/__init__.py                26      0   100%
torchgeo/datasets/foo.py                    177     62    65%   376-403, 429-496, 504-509
...
-----------------------------------------------------------------------
TOTAL                                      1709    920    46%

========================== 7 passed in 6.20s ==========================

From this output, you can see that all tests pass, but many lines of code in torchgeo/datasets/foo.py are not being tested, including 376–403, 429–496, etc. In order for this pull request to be merged, additional tests will need to be added until there is 100% test coverage.

These tests require pytest and pytest-cov to be installed.

Note

If you add a new dataset, the tests will require some form of data to run. This data should be stored in tests/data/<dataset>. Please don’t include real data, as this may violate the license the data is distributed under, and can involve very large file sizes. Instead, create fake data examples using the instructions found here.

Linters¶

In order to remain PEP-8 compliant and maintain a high-quality codebase, we use a couple of linting tools:

ruff for code formatting
mypy for static type analysis

These tools should be used from the root of the project to ensure that our configuration files are found. Ruff is relatively easy to use, and will automatically fix most issues it encounters:

$ ruff check
$ ruff format

Mypy won’t fix your code for you, but will warn you about potential issues with your code:

$ mypy .

If you’ve never used mypy before or aren’t familiar with Python type hints, this check can be particularly daunting. Don’t hesitate to ask for help with resolving any of these warnings on your pull request.

You can also use git pre-commit hooks to automatically run these checks before each commit. pre-commit is a tool that automatically runs linters locally, so that you don’t have to remember to run them manually and then have your code flagged by CI. You can set up pre-commit with:

$ pip install pre-commit
$ pre-commit install
$ pre-commit run --all-files

Now, every time you run git commit, pre-commit will run and let you know if any of the files that you changed fail the linters. If pre-commit passes then your code should be ready (style-wise) for a pull request. Note that you will need to run pre-commit run --all-files if any of the hooks in .pre-commit-config.yaml change, see here.

Documentation¶

All of our documentation is hosted on Read the Docs. If you make non-trivial changes to the documentation, it helps to build the documentation yourself locally. To do this, make sure the dependencies are installed:

$ pip install .[docs]
$ cd docs
$ pip install -r requirements.txt

Then run the following commands:

$ make clean
$ make html

The resulting HTML files can be found in _build/html. Open index.html in your browser to navigate the project documentation. If you fix something, make sure to run make clean before running make html or Sphinx won’t rebuild all of the documentation.

Tutorials¶

TorchGeo has a number of tutorials included in the documentation that can be run in Google Colab. These Jupyter notebooks are tested before each release to make sure that they still run properly. To test these locally, install pytest and nbmake and run:

$ pytest --nbmake docs/tutorials

Datasets¶

A major component of TorchGeo is the large collection of torchgeo.datasets that have been implemented. Adding new datasets to this list is a great way to contribute to the library. A brief checklist to follow when implementing a new dataset:

Implement the dataset extending either GeoDataset or NonGeoDataset
Add the dataset definition to torchgeo/datasets/__init__.py
Add a data.py script to tests/data/<new dataset>/ that generates test data with the same directory structure/file naming conventions as the new dataset
Add appropriate tests with 100% test coverage to tests/datasets/
Add the dataset to docs/api/datasets.rst
Add the dataset metadata to either docs/api/geo_datasets.csv or docs/api/non_geo_datasets.csv

A good way to get started is by looking at some of the existing implementations that are most closely related to the dataset that you are implementing (e.g. if you are implementing a semantic segmentation dataset, looking at the LandCover.ai dataset implementation would be a good starting point).

I/O Benchmarking¶

For PRs that may affect GeoDataset sampling speed, you can test the performance impact as follows. On the main branch (before) and on your PR branch (after), run the following commands:

$ python -m torchgeo fit --config tests/conf/io_raw.yaml
$ python -m torchgeo fit --config tests/conf/io_preprocessed.yaml

This code will download a small (1 GB) dataset consisting of a single Landsat 9 scene and CDL file. It will then profile the speed at which various samplers work for both raw data (original downloaded files) and preprocessed data (same CRS, res, TAP, COG). The important output to look out for is the total time taken by train_dataloader_next (RandomGeoSampler) and val_next (GridGeoSampler). With this, you can create a table on your PR like:

state	raw (random)	raw (grid)	preprocessed (random)	preprocessed (grid)
before	17.223	10.974	15.685	4.6075
after	17.360	11.032	9.613	4.6673

In this example, we see a 60% speed-up for RandomGeoSampler on preprocessed data. All other numbers are more or less the same across multiple runs.