In cytometry, visualizing all the markers (i.e., features, which can be interpreted as dimensions) can be challenging because cells can express multiple markers at once. Dimensionality reduction is the process of taking high-dimensional data and projecting it into a low-dimensional space while retaining as much information as possible. This allows cells with similar marker expression to be visualized close to each other, normally in a 2D space. This article shows how to set up a UMAP (Uniform Manifold Approximation and Projection) in OMIQ.
UMAP uses Riemannian geometry and algebraic topology to model the data as a manifold with a fuzzy topological structure. UMAP then searches for a low-dimensional embedding of the data whose fuzzy topological structure is as close as possible to that of the original.
McInnes, L., Healy, J., and Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 (2018). https://doi.org/10.48550/arXiv.1802.03426
1. Add a UMAP Task
Click Add new child task and select UMAP from the task selector. In this example, we have subsampled to live cells for our UMAP task.
Your exact workflow branch may look different from the example above. The important thing is that your workflow follows a logical ordering of tasks.
2. Set Up the UMAP Task
2.1 Select Files and Features
Select the Files you want to include for your UMAP.
Include all the files that you would want to directly compare in the same UMAP run as each run will create a unique visualization and result.
Select the Features you want to use for the dimensionality reduction.
Each feature you select will affect how the algorithm computes the result. You do not necessarily have to include all features. Often, it will make sense to exclude certain markers if they will not help inform your results (input heterogeneity will equal output heterogeneity).
2.2 Enter UMAP Settings
Feel free to change the default settings for your analysis goal. New to dimensionality reduction? Try out the default settings first and see how changing the hyperparameters below affects your result.
Neighbors: Sets the number of nearest neighbors that UMAP uses to approximate the manifold. Low values instruct the UMAP algorithm to focus more on local structure (e.g. CCR7 levels between CD4+ T cells) at the expense of the bigger picture, while high values instruct the UMAP algorithm to focus more on the global structure (e.g. B cells compared to monocytes) at the expense of the fine details of the data.
Minimum Distance: Controls how the points are displayed in the UMAP (i.e. how tightly UMAP packs the points together).
Low values allow for the points to be closer to each other (packs the points together within the islands of the embedding) while larger values will prevent UMAP from packing the points together, focusing on preservation of the broad topological structures.
Components: Determines the number of parameters the UMAP result will generate (umap_1, umap_2, umap_3, etc.). Two UMAP parameters would be considered the most traditional display.
Unlike other dimensionality reduction algorithms, such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualization in 2- or 3- dimensions.
Metric: Controls how the distance is computed in the ambient space of the input data. Choose between Euclidean, Cosine, and Correlation.
Distance Metrics:
- Euclidean: Measures the straight-line distance between two points in space.
- Cosine: Considers the angle between vectors.
- Correlation: Looks at how the features are correlated and groups objects with high feature correlation, even if they are far apart in Euclidean distance.
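The difference between the three metrics can be seen with a plain numpy sketch. The two vectors below are hypothetical marker profiles of two cells whose expression is proportional: they are far apart in Euclidean distance, but identical in direction and perfectly correlated.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two points.
    return np.linalg.norm(a - b)

def cosine_dist(a, b):
    # 1 - cosine similarity: depends only on the angle between vectors.
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation_dist(a, b):
    # Like cosine distance, but on mean-centered vectors.
    a_c, b_c = a - a.mean(), b - b.mean()
    return 1 - np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

a = np.array([1.0, 2.0, 3.0])    # hypothetical marker profile, cell A
b = np.array([10.0, 20.0, 30.0]) # same profile scaled 10x, cell B

print(round(euclidean(a, b), 3))         # 33.675 -- far apart in space
print(round(cosine_dist(a, b), 3))       # 0.0 -- same angle
print(round(correlation_dist(a, b), 3))  # 0.0 -- perfectly correlated
```

So with the Cosine or Correlation metric these two cells would be treated as near neighbors, while the Euclidean metric would keep them apart.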
Learning Rate: The speed at which UMAP optimizes the embedding.
Machine learning algorithms learn from the data at a speed represented by the learning rate, which determines how much the algorithm adjusts its own parameters at each step of the optimization phase. Although we recommend leaving the default automatic learning rate, you can set it manually by typing the desired number. If the learning rate is set too low or too high, the territories for the different cell types won't be properly separated: a higher learning rate means the algorithm takes bigger steps at each stage of learning but may overshoot the optimal solution, while a lower learning rate means it takes smaller steps but may get stuck before reaching the optimal solution.
Epochs: The number of training cycles UMAP uses to optimize the embedding.
Random Seed: A number used to initialize the UMAP operation. Changing this is optional. The UMAP algorithm is stochastic, so to make a run reproducible you can set a fixed seed: with the same dataset, the same settings, and the same Random Seed value, the same result will be achieved.
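The seed principle is general to stochastic algorithms, not specific to UMAP; a minimal numpy illustration:

```python
import numpy as np

# Same seed -> same pseudo-random sequence -> a stochastic run repeats
# exactly; a different seed gives a different (but still valid) result.
a = np.random.default_rng(seed=7).normal(size=3)
b = np.random.default_rng(seed=7).normal(size=3)
c = np.random.default_rng(seed=8).normal(size=3)

print(np.allclose(a, b))  # True: identical sequences
print(np.allclose(a, c))  # False: different seed, different sequence
```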
Embedding Initialization: This relates to how the low dimension embedding is initialized. You can choose between Spectral and Random.
- Spectral: Uses a spectral embedding of the fuzzy 1-skeleton.
- Random: Assigns the initial embedding positions at random.
2.3 Run
Click Run UMAP. This will take you to the Status tab and you can watch the progress. However, you are free to go back to your workflow or do whatever you please while this runs in the cloud. The status can also be seen in the Workflow itself, or you can have an email sent to you when it is completed.
3. Review the Results
You can view the results of the UMAP by going to the Results tab and checking that they are as expected.
The single plot above is only for a quick sanity check of results. You shouldn't use it for anything more than that.
For plotting and data visualization, go back out to the workflow and add a Figure as a child of this analysis. To learn more about this, please see our resource Dimension Reduction Visualization.