In cytometry, visualizing all the markers (i.e., features, which can be interpreted as dimensions) at once can be challenging because cells can express multiple markers simultaneously. Dimensionality reduction is the process of taking high-dimensional data and projecting it into a low-dimensional space while retaining as much information as possible. It allows cells with similar marker expression to be visualized, normally in a 2D space, by placing them close to each other. This article shows how to set up an EmbedSOM task in OMIQ.
EmbedSOM uses unsupervised manifold learning with a self-organizing map (SOM) to project high-dimensional data into a low-dimensional space. It identifies landmark points in the SOM and embeds each high-dimensional data point relative to those landmarks.
Kratochvíl, M., Koladiya, A., and Vondrášek, J. Generalized EmbedSOM on Quadtree-Structured Self-Organizing Maps. F1000Research 8 (2019). https://doi.org/10.12688/f1000research.21642.2
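The general idea can be illustrated with a toy sketch in Python. This is not OMIQ's or the EmbedSOM package's actual implementation (EmbedSOM uses a smoother, more sophisticated projection); all function names here are hypothetical and the code only illustrates the two stages: train a small grid of landmarks, then place each cell at a position weighted by its nearest landmarks.

```python
import math
import random

def train_som(data, xdim, ydim, rlen, seed=1):
    # Toy SOM training: a grid of xdim*ydim landmark vectors is
    # repeatedly pulled toward the data points (not the real EmbedSOM code).
    rng = random.Random(seed)
    codes = [list(rng.choice(data)) for _ in range(xdim * ydim)]
    grid = [(i % xdim, i // xdim) for i in range(xdim * ydim)]
    for it in range(rlen):
        radius = max(xdim, ydim) * (1 - it / rlen)  # shrinking neighborhood
        alpha = 0.1 * (1 - it / rlen) + 0.01        # shrinking learning rate
        for p in data:
            # best-matching unit: nearest landmark in marker space
            bmu = min(range(len(codes)),
                      key=lambda j: sum((a - c) ** 2
                                        for a, c in zip(p, codes[j])))
            bx, by = grid[bmu]
            # pull the BMU's grid neighborhood toward the point
            for j, (gx, gy) in enumerate(grid):
                if abs(gx - bx) + abs(gy - by) <= radius:
                    codes[j] = [c + alpha * (a - c)
                                for a, c in zip(p, codes[j])]
    return codes, grid

def embed_point(point, codes, grid, k=4):
    # Place a cell at the distance-weighted average of the grid
    # positions of its k nearest landmarks.
    dists = sorted((math.dist(point, c), g) for c, g in zip(codes, grid))
    near = dists[:k]
    w = [1.0 / (d + 1e-9) for d, _ in near]
    s = sum(w)
    x = sum(wi * g[0] for wi, (_, g) in zip(w, near)) / s
    y = sum(wi * g[1] for wi, (_, g) in zip(w, near)) / s
    return x, y

# Two synthetic "populations" far apart in marker space
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)] \
     + [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(100)]
codes, grid = train_som(data, xdim=6, ydim=6, rlen=10)
emb_a = embed_point((0.0, 0.0), codes, grid)
emb_b = embed_point((5.0, 5.0), codes, grid)
```

Points from the two populations land at different positions on the 6 × 6 grid, which is the intuition behind the 2D plot EmbedSOM produces.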
1. Add an EmbedSOM Task
Click Add new child task and select EmbedSOM from the task selector. In this example, we have subsampled to live cells for our EmbedSOM task.
Your exact workflow branch may look different than the example above. The important thing is that your workflow follows a logical ordering of tasks.
2. Setup the EmbedSOM Task
2.1 Select Files and Features
Select the Files you want to include for your EmbedSOM.
Include all the files that you would want to directly compare in the same EmbedSOM run as each run will create a unique visualization and result.
Select the Features you want to use for the dimensionality reduction.
Each feature you select affects how the algorithm computes the result. You do not have to include every feature; it often makes sense to exclude markers that will not help answer your question, because any heterogeneity present in the input will be reflected in the output.
2.2 Enter EmbedSOM Settings
Feel free to change the default settings to match your analysis goal. New to dimensionality reduction? Try the default settings first, then see how changing the hyperparameters below affects your result.
Smooth: This determines how the embedding weighs close versus far SOM nodes when approximating each cell's neighborhood.
The default value of 0 is good in most cases. Increasing this (positive values) produces a "smoother" embedding that emphasizes global structure at the expense of local structure. Decreasing this (negative values) produces a "sharper" embedding that emphasizes local structure at the expense of global structure.
k: This determines how many SOM vertices (landmarks) to use in embedding.
The default is 0, which sets the algorithm to automatic mode. If not using automatic mode, a value between 10 and 50 is recommended. A higher value means more landmarks are used (more precise) and emphasizes global structure (e.g., B cells compared to monocytes). A lower value means fewer landmarks are used and emphasizes local structure (e.g., CCR7 levels among CD4+ T cells). Please note that increasing the k value also increases the computational load.
Adjust: This determines how much non-local (distant) events influence the embedding.
Increasing the value lessens the effect of non-local events, giving more focus to local structure.
xdim: This determines the number of SOM nodes on the x-axis (width).
ydim: This determines the number of SOM nodes on the y-axis (height).
The xdim and ydim values together determine the size of the SOM (xdim × ydim nodes). In the example image above, xdim is 10 and ydim is 10, meaning the SOM has 100 nodes.
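The arithmetic is simply xdim × ydim; a few illustrative grid sizes (these specific values are examples, not recommendations):

```python
# Total SOM landmarks for a few grid sizes: larger grids give finer
# resolution but increase training cost.
grid_sizes = {(x, y): x * y for x, y in [(8, 8), (10, 10), (16, 16)]}
for (x, y), n in grid_sizes.items():
    print(f"{x} x {y} SOM -> {n} landmark nodes")
# 8 x 8 SOM -> 64 landmark nodes
# 10 x 10 SOM -> 100 landmark nodes
# 16 x 16 SOM -> 256 landmark nodes
```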
rlen (# of training iterations): This determines the number of iterations to train the SOM.
Distance Metric: This sets how distances between data points and the nodes in the SOM are measured. You can choose from Euclidean, Manhattan, Chebyshev, and Cosine.
Distance Metrics:
- Euclidean: Measures the straight-line distance between two points in space.
- Manhattan: Computes the sum of absolute differences along each dimension, often reflecting grid-like movement.
- Chebyshev: Measures the largest difference between two points along any single dimension.
- Cosine: Measures the angle between two vectors, ignoring their magnitudes.
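The four metrics can be sketched in a few lines of Python (generic definitions for illustration, not OMIQ's internal code):

```python
import math

def euclidean(a, b):
    # straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # largest difference along any single dimension
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction, ignores magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

p, q = (0, 0), (3, 4)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7
print(chebyshev(p, q))   # 4
print(cosine_distance((1, 0), (0, 1)))  # 1.0 (orthogonal vectors)
```

Note how the same pair of points gets a different distance under each metric, which is why the choice can change which SOM node a cell maps to.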
Random Seed: A number used to initialize the EmbedSOM run; changing it is optional. The EmbedSOM algorithm is stochastic, so to make a run reproducible you can set a fixed seed. With the same dataset, settings, and Random Seed value, you will get the same result.
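The effect of a fixed seed can be demonstrated with a small Python sketch (`stochastic_step` is a hypothetical stand-in for any random step in an embedding run, not an OMIQ function):

```python
import random

def stochastic_step(n_points, seed=None):
    # Hypothetical stand-in for a random step in an embedding run;
    # a fixed seed makes the "randomness" repeatable.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_points)]

run1 = stochastic_step(5, seed=42)
run2 = stochastic_step(5, seed=42)
run3 = stochastic_step(5, seed=7)
print(run1 == run2)  # True: same seed gives identical output
print(run1 == run3)  # False: a different seed gives a different output
```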
2.3 Run
Click Run EmbedSOM. This will take you to the Status tab, where you can watch the progress. However, you are free to go back to your workflow or do whatever you please while this runs in the cloud. The status can also be seen in the Workflow itself, or you can have an email sent to you when it is completed.
3. Review the Results
You can view the results of the EmbedSOM by going to the Results tab and checking whether they are as expected.
The single plot above is only for a quick sanity check of results. You shouldn't use it for anything more than that.
For plotting and data visualization, go back out to the workflow and add a Figure as a child of this analysis. To learn more about this, please see our resource Dimension Reduction Visualization.