In cytometry, visualizing all the markers (i.e., features, which can be interpreted as dimensions) can be challenging because cells can express multiple markers at once. Dimensionality reduction is the process of taking high-dimensional data and projecting it into a low-dimensional space while retaining as much information as possible. It allows cells with similar marker expression to be visualized, normally in a 2D space, by placing them close to each other. This article shows how to set up a PaCMAP (Pairwise Controlled Manifold Approximation Projection) in OMIQ.
PaCMAP helps preserve both the local and global structure when performing dimensionality reduction. To achieve this, PaCMAP uses three kinds of point pairs: near pairs, mid-near pairs, and further pairs.
Wang, Y., Huang, H., Rudin, C., and Shaposhnik, Y. Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Journal of Machine Learning Research 22(201):1-73 (2021). https://jmlr.org/papers/v22/20-1061.html
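To make the three pair types more concrete, the sketch below writes out the per-pair loss terms as we understand them from Wang et al. (2021). It is an illustration only, not OMIQ's implementation: the function names are made up here, and the weights are placeholders (in the paper they change over the course of the optimization).

```python
def scaled_sq_dist(a, b):
    """Squared Euclidean distance between two embedded points, plus 1."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) + 1.0

def near_term(a, b):
    # Near pairs: the loss grows as the two points move apart (attraction).
    d = scaled_sq_dist(a, b)
    return d / (10.0 + d)

def mid_near_term(a, b):
    # Mid-near pairs: a much weaker attraction acting over a longer range.
    d = scaled_sq_dist(a, b)
    return d / (10000.0 + d)

def further_term(a, b):
    # Further pairs: the loss shrinks as the two points move apart (repulsion).
    d = scaled_sq_dist(a, b)
    return 1.0 / (1.0 + d)

def pacmap_loss(embedding, near, mid_near, further,
                w_near=1.0, w_mid=1.0, w_far=1.0):
    """Sum the weighted terms; each pair list holds (i, j) index tuples."""
    total = 0.0
    for i, j in near:
        total += w_near * near_term(embedding[i], embedding[j])
    for i, j in mid_near:
        total += w_mid * mid_near_term(embedding[i], embedding[j])
    for i, j in further:
        total += w_far * further_term(embedding[i], embedding[j])
    return total

# Hypothetical 2D embedding of three cells.
embedding = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
print(pacmap_loss(embedding, near=[(0, 1)], mid_near=[], further=[(0, 2)]))
```

Pulling near pairs together lowers the first term, while pushing further pairs apart lowers the last term. This is how the pair types balance local and global structure.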
1. Add a PaCMAP Task
Click Add new child task and select PaCMAP from the task selector. In this example, we have subsampled to live cells for our PaCMAP task.
Your exact workflow branch may look different than the example above. The important thing is that your workflow follows a logical ordering of tasks.
2. Set Up the PaCMAP Task
2.1 Select Files and Features
Select the Files you want to include for your PaCMAP.
Include all the files that you want to compare directly in the same PaCMAP run, as each run creates a unique visualization and result.
Select the Features you want to use for the dimensionality reduction.
Each feature you select will affect how the algorithm computes the result. You do not necessarily have to include all features. Often, it will make sense to exclude certain markers if they will not help inform your results (input heterogeneity will equal output heterogeneity).
2.2 Enter PaCMAP Settings
Feel free to change the default settings for your analysis goal. New to dimensionality reduction? Try out the default settings first and see how changing the hyperparameters below affects your plot.
Nearest Neighbors: Sets the number of nearest neighbors for the k-nearest neighbors graph.
Num Output Dimensions: Determines the number of parameters the PaCMAP result will generate (pacmap_1, pacmap_2, pacmap_3, etc.). Two PaCMAP parameters are the most traditional display.
Distance Metric: Controls how the distance is computed in the ambient space of the input data. You can choose between Euclidean, Manhattan, Angular, and Hamming.
Distance Metrics:
- Euclidean: Measures the straight-line distance between two points in space.
- Manhattan: Computes the sum of absolute differences along each dimension, often reflecting grid-like movement.
- Angular: Considers the angle between vectors.
- Hamming: Counts the number of positions at which two sequences of equal length differ.
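The four metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not OMIQ's code: the function names and the two example cells are hypothetical, and "angular" is written here as cosine distance (1 minus the cosine of the angle), which is one common convention. The exact angular formula OMIQ uses may differ.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def angular(a, b):
    """Cosine distance: 0 when the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions at which the two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Two hypothetical cells measured on three markers.
cell_a = [1.0, 2.0, 3.0]
cell_b = [4.0, 6.0, 3.0]

print(euclidean(cell_a, cell_b))  # 5.0
print(manhattan(cell_a, cell_b))  # 7.0
print(hamming(cell_a, cell_b))    # 2
print(angular(cell_a, cell_b))
```

Under this definition, the angular metric ignores overall intensity: a cell and a copy of it with every marker doubled have an angular distance of essentially zero, while their Euclidean distance is not zero.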
Mid-Near Pairs to Nearest Neighbors Ratio: This sets how many mid-near pairs (pairs of points at a medium distance from each other) are used for each point, relative to the Nearest Neighbors setting. This can help space out the clusters/groups of the resulting dimensionality reduction plot.
Increasing the mid-near pairs to nearest neighbors ratio will focus on the overall global structure. Decreasing the mid-near pairs to nearest neighbors ratio will focus on more local structure.
Further Pairs to Nearest Neighbors Ratio: This sets how many further pairs (pairs of points that are far from each other) are used for each point, relative to the Nearest Neighbors setting. This helps preserve the overall global structure of the resulting dimensionality reduction plot.
Increasing the further pairs to nearest neighbors ratio will focus on the global structure and may blur local structure. Decreasing the further pairs to nearest neighbors ratio will focus on more local structure.
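Because both settings are ratios, the number of pairs they produce depends on the Nearest Neighbors value. The sketch below shows the arithmetic, assuming the pair counts are obtained by multiplying the ratio by the number of nearest neighbors and rounding. That is how we understand the open-source pacmap Python package to do it, and OMIQ may differ in detail. The function name and the example values are illustrative, not necessarily OMIQ's defaults.

```python
def pair_counts(n_neighbors, mid_near_ratio, further_ratio):
    """Return (near, mid-near, further) pair counts per point."""
    n_mid_near = int(round(n_neighbors * mid_near_ratio))
    n_further = int(round(n_neighbors * further_ratio))
    return n_neighbors, n_mid_near, n_further

print(pair_counts(10, 0.5, 2.0))  # (10, 5, 20)
print(pair_counts(10, 1.0, 2.0))  # (10, 10, 20)
print(pair_counts(20, 0.5, 2.0))  # (20, 10, 40)
```

Under this assumption, raising Nearest Neighbors while keeping the ratios fixed also raises the number of mid-near and further pairs.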
Random Seed: A number that is used to initialize the PaCMAP operation. This is optional to change. The PaCMAP algorithm is stochastic. To make it reproducible, a fixed Seed may be set. If the same dataset and settings are used, by retaining the same Random Seed value, the same result will be achieved.
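The effect of a fixed seed can be illustrated with Python's standard random module. This is a generic sketch of seeded randomness, not PaCMAP itself: the function name, the number of cells, and the spread of the starting positions are assumptions chosen for the example.

```python
import random

def random_start(n_cells, n_dims, seed):
    """Draw small random starting coordinates from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1e-4) for _ in range(n_dims)]
            for _ in range(n_cells)]

run_a = random_start(5, 2, seed=42)
run_b = random_start(5, 2, seed=42)
run_c = random_start(5, 2, seed=7)

print(run_a == run_b)  # True: same seed, same starting positions
print(run_a == run_c)  # False: different seed, different positions
```

The same idea applies to the full algorithm: with identical data, settings, and seed, every random choice is repeated, so the final plot is repeated too.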
Learning Rate: The speed at which PaCMAP optimizes the embedding.
Machine learning algorithms will learn from the data at a specific speed represented by the learning rate. The learning rate determines how the algorithm adjusts its own parameters at each step of the optimization phase. Although we recommend leaving the default automatic learning rate, you can set it manually just by typing the desired number. If the learning rate is set too low or too high, the specific territories for the different cell types won’t be properly separated. A higher learning rate means the algorithm takes bigger steps in each stage of learning but may overshoot the optimal solution. A lower learning rate means that the algorithm takes smaller steps but may result in the process getting stuck and not reaching the optimal solution.
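The trade-off described above can be seen on a toy problem. The sketch below uses plain gradient descent to minimize f(x) = x², whose best solution is x = 0. It is not PaCMAP's optimizer, and the learning rates are chosen only to make the three behaviors visible.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimize f(x) = x**2 (gradient 2x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With a learning rate of 0.001 the result has barely moved from the starting value of 10 after 50 steps. With 0.1 it ends very close to 0. With 1.1 every step overshoots by more than it corrects, so the value grows instead of shrinking.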
Num Iterations: This is the number of iterations that PaCMAP will do to optimize the embedding of the data.
KNN Method: This sets which k-nearest neighbors method PaCMAP will use. You can choose between HNSW and ANNOY.
- HNSW (Hierarchical Navigable Small World) approximates the nearest neighbors and builds a multi-layer graph structure. Choose this for high dimensional, large datasets.
- ANNOY (Approximate Nearest Neighbors Oh Yeah) is an algorithm that approximates nearest neighbors through a tree-based index that allows for quick searches in high-dimensional space.
Embedding Initialization Method: This relates to how the low-dimensional embedding is initialized. You can choose between Random and Custom.
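Both methods approximate the result of an exact nearest-neighbor search, which compares every point with every other point and becomes slow on large datasets. For reference, a minimal exact (brute-force) search looks like the sketch below. It is neither HNSW nor ANNOY, and the function name and example points are hypothetical.

```python
import math

def exact_knn(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, closest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Two tight groups of hypothetical cells in a 2D marker space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
print(exact_knn(points, k=2))
# [[1, 2], [0, 2], [0, 1], [4, 2], [3, 2]]
```

The work grows with the square of the number of cells, which is why approximate indexes such as HNSW and ANNOY are used for large datasets.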
- Random: Assigns the initial embedding positions at random.
- Custom: Allows you to choose features as a Pre-init Embedding X and Pre-init Embedding Y that will be used as the initial embedding positions.
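Conceptually, the Custom option starts each cell at the coordinates given by two existing features instead of at a random position. The sketch below shows the idea on a small table. The feature names (including "pca_1" and "pca_2"), the values, and the function name are hypothetical, and OMIQ may rescale the chosen features before using them.

```python
def custom_start(rows, feature_names, x_feature, y_feature):
    """Use two existing feature columns as the initial (x, y) positions."""
    x_idx = feature_names.index(x_feature)
    y_idx = feature_names.index(y_feature)
    return [(row[x_idx], row[y_idx]) for row in rows]

# Hypothetical table: two markers plus two precomputed coordinates per cell.
feature_names = ["CD3", "CD19", "pca_1", "pca_2"]
rows = [
    [5.2, 0.1, -1.5, 0.3],
    [0.2, 4.8, 2.0, -0.7],
    [4.9, 0.3, -1.4, 0.4],
]

print(custom_start(rows, feature_names, "pca_1", "pca_2"))
# [(-1.5, 0.3), (2.0, -0.7), (-1.4, 0.4)]
```

Starting from the same custom coordinates each time removes one source of run-to-run variation in the initial layout.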
2.3 Run
Click Run PaCMAP. This will take you to the Status tab, where you can watch the progress. You are free to go back to your workflow or do anything else while this runs in the cloud. The status can also be seen in the Workflow itself, or you can have an email sent to you when the run is completed.
3. Review the Results
To view the results of the PaCMAP, go to the Results tab and check that they are as expected.
The single plot above is only for a quick sanity check of results. You shouldn't use it for anything more than that.
For plotting and data visualization, go back out to the workflow and add a Figure as a child of this analysis. To learn more about this, please see our resource Dimension Reduction Visualization.