In cytometry, visualizing all the markers (i.e., features, which can be interpreted as dimensions) can be challenging because cells can express multiple markers at once. Dimensionality reduction is the process of taking high-dimensional data and projecting it into a low-dimensional space while retaining as much information as possible. It allows for the visualization of cells with similar marker expression, normally in a 2D space, by placing cells that are closely related in marker expression near each other. This article shows how to set up a t-SNE-CUDA task in OMIQ.
t-SNE-CUDA utilizes CUDA, a parallel computing platform and application programming interface that uses the graphics processing unit (GPU), for an accelerated t-SNE implementation. Further, the t-SNE used by t-SNE-CUDA is based on the Barnes-Hut implementation of t-SNE and approximate nearest neighbors, allowing for faster embedding.
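t-SNE-CUDA itself runs on OMIQ's GPU servers, but the same kind of Barnes-Hut embedding can be sketched locally with scikit-learn. A minimal sketch, assuming hypothetical data (the event counts, marker counts, and cluster layout below are illustrative, not OMIQ defaults):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for a cytometry matrix: 600 "cells" x 10 "markers",
# drawn from two shifted Gaussians so the embedding has visible structure.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 10)),
    rng.normal(4.0, 1.0, size=(300, 10)),
])

# Barnes-Hut t-SNE is the CPU analogue of the approximation t-SNE-CUDA uses.
embedding = TSNE(
    n_components=2,        # project 10 markers down to a 2D map
    perplexity=30.0,       # neighborhood size; must be < number of events
    method="barnes_hut",   # tree-based approximation of the exact gradient
    random_state=0,
).fit_transform(X)

print(embedding.shape)  # (600, 2): one 2D coordinate per cell
```

Each row of `embedding` is one cell's position on the 2D map, which is exactly what OMIQ appends to your data as new t-SNE axes.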
Chan, D.M., Rao, R., Huang, F., & Canny, J. t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data. 30th International Symposium on Computer Architecture and High Performance Computing: pp. 330-338 (2018). https://doi.org/10.1109/CAHPC.2018.8645912
Chan, D.M., Rao, R., Huang, F., & Canny, J. GPU Accelerated t-Distributed Stochastic Neighbor Embedding. Journal of Parallel and Distributed Computing 131:1-13 (2019). https://doi.org/10.1016/j.jpdc.2019.04.008
1. Add a t-SNE-CUDA Task
Click Add new child task and select t-SNE-CUDA from the task selector. In this example, we have subsampled to live cells for our t-SNE-CUDA task.
Your exact workflow branch may look different than the example above. The important thing is that your workflow follows a logical ordering of tasks.
2. Setup the t-SNE-CUDA Task
2.1 Select Files and Features
Select the Files you want to include for your t-SNE-CUDA.
Include all the files that you want to directly compare in the same t-SNE-CUDA run, as each run creates a unique visualization and result.
Select the Features you want to use for the dimensionality reduction.
Each feature you select will affect how the algorithm computes the result. You do not necessarily have to include all features. Often, it will make sense to exclude certain markers if they will not help inform your results (input heterogeneity will equal output heterogeneity).
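In code terms, feature selection amounts to keeping only the marker columns that should drive the embedding. A small sketch with hypothetical marker names (these are placeholders, not OMIQ channel names):

```python
import pandas as pd

# Hypothetical event table; column names are illustrative only.
events = pd.DataFrame({
    "Time":  [0.1, 0.2, 0.3],
    "FSC-A": [52000, 61000, 48000],
    "CD3":   [1.2, 3.4, 0.2],
    "CD19":  [0.1, 0.2, 2.8],
})

# Keep only the markers that should inform the result; technical channels
# such as Time and scatter add heterogeneity without biological meaning here.
markers = ["CD3", "CD19"]
X = events[markers].to_numpy()
print(X.shape)  # (3, 2): 3 events, 2 selected features
```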
2.2 Enter t-SNE-CUDA Settings
Feel free to change the default settings for your analysis goal. New to dimensionality reduction? Try out the default settings first and see how changing the hyperparameters below affects your result.
Iterations: The number of iterations that t-SNE-CUDA will run to optimize the embedding of the data.
Early Exaggeration Iters: This is the number of iterations used for early exaggeration.
Early Exaggeration Factor: The factor used in the early exaggeration phase. Early exaggeration increases the attractive forces between similar data points, which improves the convergence of the optimization and helps create separation between clusters.
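Mechanically, early exaggeration just multiplies the pairwise affinities by the factor during the early iterations. A toy sketch with a hypothetical affinity matrix:

```python
import numpy as np

# Toy pairwise-affinity matrix P, of the kind t-SNE builds internally
# from the high-dimensional distances (values here are illustrative).
P = np.array([
    [0.0,  0.5,  0.25],
    [0.5,  0.0,  0.25],
    [0.25, 0.25, 0.0],
])

# During the early-exaggeration phase the affinities are multiplied by the
# factor, strengthening attraction between similar points so that tight
# clusters form before the map relaxes into its final layout.
factor = 12.0
P_early = P * factor

# Attraction between points 0 and 1 is 12x stronger during this phase.
print(P_early[0, 1] / P[0, 1])  # 12.0
```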
Perplexity: Though distinct, this has a similar impact to changing the number of nearest neighbors for each data point. The algorithm uses the perplexity to calculate how similar data points are in the high-dimensional space before projecting them into the low-dimensional space. Larger datasets may need a higher perplexity. Low perplexity focuses on local structure (e.g. CCR7 levels between CD4+ T cells) while high perplexity focuses on global structure (e.g. B cells compared to monocytes).
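Formally, perplexity is 2 raised to the Shannon entropy of a point's neighbor distribution, so it behaves like a smooth "effective neighbor count". A short sketch (the distributions are made up for illustration):

```python
import numpy as np

# Perplexity = 2**H(p), where H is the Shannon entropy (in bits) of a
# point's distribution over its neighbors.
def perplexity(p):
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return 2 ** (-np.sum(p * np.log2(p)))

# A point attending equally to 4 neighbors has perplexity exactly 4 ...
uniform = np.full(4, 0.25)
print(perplexity(uniform))  # 4.0

# ... while a point dominated by one neighbor has perplexity near 1.
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(perplexity(peaked))
```

This is why raising the perplexity setting effectively widens each cell's neighborhood and shifts the map's emphasis toward global structure.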
Theta: Theta controls how similar the Barnes-Hut implementation of t-SNE is to the original t-SNE algorithm (a lower value means it is more similar).
The Barnes-Hut implementation of t-SNE was created to allow the algorithm to be used on larger datasets (more than a few thousand events total) with faster run times, so decreasing theta is generally not recommended. Please note that changing the value of theta may result in groups of events or observations being separated on the map that don't have meaningful differences in marker expression.
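The role of theta can be illustrated with the Barnes-Hut opening criterion itself: a group (tree cell) of width s at distance d from a point is summarized as a single "super point" when s / d is below theta. A minimal sketch of that test (the numbers are illustrative):

```python
# Barnes-Hut opening test: a tree cell of width cell_width at distance
# `distance` from the current point is summarized as one aggregate point
# when cell_width / distance < theta; otherwise the cell is opened and
# its children are visited (closer to exact t-SNE, but slower).
def summarize_cell(cell_width, distance, theta):
    return cell_width / distance < theta

# With a typical theta such as 0.5, a far-away cell is summarized ...
print(summarize_cell(1.0, 10.0, 0.5))   # True
# ... while a tiny theta forces the tree open, approaching exact t-SNE.
print(summarize_cell(1.0, 10.0, 0.05))  # False
```

This is why a lower theta is more faithful to the original algorithm but slower: fewer distant groups get summarized, so more pairwise interactions are computed exactly.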
Learning Rate: The speed at which t-SNE-CUDA optimizes embedding.
Machine learning algorithms learn from the data at a speed set by the learning rate, which determines how much the algorithm adjusts its own parameters at each step of the optimization phase. Although we recommend leaving the default automatic learning rate, you can set it manually by typing the desired number. If the learning rate is set too low or too high, the territories for the different cell types won't be properly separated: a higher learning rate means the algorithm takes bigger steps at each stage of learning but may overshoot the optimal solution, while a lower learning rate means it takes smaller steps but may get stuck before reaching the optimal solution.
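The overshoot-versus-stall trade-off can be seen on the simplest possible optimization: gradient descent on f(y) = y², whose minimum is at 0. This is a generic sketch, not t-SNE-CUDA's actual optimizer:

```python
# Minimize f(y) = y**2 by gradient descent; the gradient is 2*y, so each
# update is y <- y - learning_rate * 2 * y.
def descend(learning_rate, steps=50, y0=10.0):
    y = y0
    for _ in range(steps):
        y = y - learning_rate * 2 * y
    return abs(y)  # distance from the optimum at 0

# A moderate rate converges toward the minimum ...
print(descend(0.1) < 1e-3)  # True
# ... while too large a rate overshoots further on every step and diverges.
print(descend(1.5) > 1e3)   # True
```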
Num Nearest Neighbors: Sets the number of nearest neighbors that t-SNE-CUDA will use for embedding. Low values focus on local structure (e.g. CCR7 levels between CD4+ T cells) while high values focus on global structure (e.g. B cells compared to monocytes).
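Restricting attraction to each point's k nearest neighbors is what the approximate nearest-neighbor step provides. A CPU sketch of the same kind of k-NN query using scikit-learn (exact rather than approximate, and with made-up data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: 100 events with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Query the 3 nearest neighbors of every event; note that each point is
# returned as its own closest neighbor at distance 0.
knn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = knn.kneighbors(X)

print(indices.shape)  # (100, 3): 3 neighbor indices per event
```

A small k, as in this sketch, mirrors a low Num Nearest Neighbors setting: each cell only "feels" its immediate neighborhood, emphasizing local structure.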
Print Progress Iters: The number of iterations before the algorithm prints out the progress.
2.3 Run
Click Run t-SNE-CUDA. This will take you to the Status tab and you can watch the progress. However, you are free to go back to your workflow or do whatever you please while this runs in the cloud. The status can also be seen in the Workflow itself, or you can have an email sent to you when it is completed.
3. Review the Results
You can view the results of t-SNE-CUDA by going to the Results tab and checking whether the results are as expected.
The single plot above is only for a quick sanity check of results. You shouldn't use it for anything more than that.
For plotting and data visualization, go back out to the workflow and add a Figure as a child of this analysis. To learn more about this, please see our resource Dimension Reduction Visualization.