AWS SageMaker Algorithms

Overview of AWS SageMaker Built-in Algorithms

Amazon SageMaker provides a suite of built-in algorithms to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly. 

If you are new to SageMaker, choosing the right algorithm for your use case can be challenging.

The following table is a quick cheat sheet: start from an example problem or use case and find a built-in SageMaker algorithm suited to that problem type.

SageMaker Algorithm | Description | Domain | Problem Types | Use Cases | Key SageMaker Related Details
BlazingText | The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. | Textual Analysis | Text classification | Natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. | The BlazingText algorithm is not parallelizable. It expects a single preprocessed text file with space-separated tokens.
DeepAR Forecasting | The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). | Supervised Learning | Time-series forecasting | Based on historical data for a behavior, predict future behavior: predict sales on a new product based on previous sales data. | DeepAR supports two data channels. The required train channel describes the training dataset. The optional test channel describes a dataset that the algorithm uses to evaluate model accuracy after training.
Factorization Machines | The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high-dimensional sparse datasets economically. | Supervised Learning | Regression; binary classification | Factorization machines are a good choice for tasks dealing with high-dimensional sparse datasets, such as click prediction and item recommendation. | CSV is not recommended because the data is typically highly sparse. Training supports only the recordIO-protobuf format; inference supports the application/json and application/x-recordio-protobuf formats.
Image Classification | The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. | Image Processing | Image and multi-label classification | Label/tag an image based on the content of the image: alerts about adult content in an image. | The recommended input format is Apache MXNet RecordIO. However, you can also use raw images in .jpg or .png format. It uses a convolutional neural network (ResNet).
IP Insights | Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. | Unsupervised Learning | IP anomaly detection | Protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor. | The SageMaker IP Insights algorithm can run on both GPU and CPU instances. It can also learn vector representations of IP addresses, known as embeddings.
K-Means | K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. | Unsupervised Learning | Clustering or grouping | Group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories. | The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations.
K-Nearest Neighbors (k-NN) | The Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based algorithm. | Supervised Learning | Regression; binary/multi-class classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | k-NN supports the text/csv and application/x-recordio-protobuf data formats.
Latent Dirichlet Allocation (LDA) | The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. | Unsupervised Learning | Topic modeling | Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document. | LDA supports both recordIO-wrapped protobuf (dense and sparse) and CSV file formats.
Linear Learner | Linear models are supervised learning algorithms used for solving either classification or regression problems. | Supervised Learning | Regression; binary/multi-class classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/x-recordio-protobuf, and text/csv formats.
Neural Topic Model (NTM) | Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. | Unsupervised Learning | Topic modeling | Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document. | For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/jsonlines, application/x-recordio-protobuf, and text/csv formats.
Object2Vec | The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. | Supervised Learning | Embeddings: convert high-dimensional objects into low-dimensional space | Improve the data embeddings of the high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets. | -
Object Detection | The Amazon SageMaker Object Detection algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. | Image Processing | Object detection and classification | Detect people and objects in an image: police review a large photo gallery for a missing person. | Amazon SageMaker Object Detection uses the Single Shot MultiBox Detector (SSD) algorithm.
Principal Component Analysis (PCA) | PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. | Unsupervised Learning | Feature engineering: dimensionality reduction | Drop those columns from a dataset that have a weak relation with the label/target variable: the color of a car when predicting its mileage. | For inference, PCA supports text/csv, application/json, and application/x-recordio-protobuf. Results are returned in either application/json or application/x-recordio-protobuf format with a vector of “projections.”
Random Cut Forest (RCF) | Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. | Unsupervised Learning | Anomaly detection | Detect abnormal behavior in an application: spot when an IoT sensor is sending abnormal readings. | -
Semantic Segmentation | The SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. | Image Processing | Computer vision | Tag every pixel of an image individually with a category: self-driving cars prepare to identify objects in their way. | The SageMaker semantic segmentation algorithm only supports GPU instances for training.
Sequence-to-Sequence | Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. | Textual Analysis | Machine translation; text summarization; speech-to-text | Convert text from one language to another: Spanish to English. Summarize a long text corpus: an abstract for a research paper. Convert audio files to text: transcribe call center conversations for further analysis. | Amazon SageMaker seq2seq is only supported on GPU instance types.
XGBoost | XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. | Supervised Learning | Regression; binary/multi-class classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | -
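
The lookup that the table supports (start from a problem type, end at a candidate algorithm) can be sketched in a few lines of Python. This is only an illustration: the dictionary `ALGORITHMS_BY_PROBLEM_TYPE` and the function `suggest_algorithms` are hypothetical names, not part of any SageMaker SDK, and the entries are a hand-copied subset of the table above.

```python
from typing import Dict, List

# Hypothetical lookup table, NOT a SageMaker API.
# Keys are lowercase problem types; values are algorithm names copied
# from the cheat sheet above (only a subset of rows is included).
ALGORITHMS_BY_PROBLEM_TYPE: Dict[str, List[str]] = {
    "text classification": ["BlazingText"],
    "time-series forecasting": ["DeepAR Forecasting"],
    "regression": [
        "Factorization Machines",
        "K-Nearest Neighbors (k-NN)",
        "Linear Learner",
        "XGBoost",
    ],
    "clustering": ["K-Means"],
    "topic modeling": [
        "Latent Dirichlet Allocation (LDA)",
        "Neural Topic Model (NTM)",
    ],
    "ip anomaly detection": ["IP Insights"],
    "anomaly detection": ["Random Cut Forest (RCF)"],
    "dimensionality reduction": ["Principal Component Analysis (PCA)"],
    "object detection": ["Object Detection"],
    "machine translation": ["Sequence-to-Sequence"],
}


def suggest_algorithms(problem_type: str) -> List[str]:
    """Return candidate built-in algorithms for a problem type.

    Matching is case-insensitive and ignores surrounding whitespace.
    An unknown problem type yields an empty list.
    """
    key = problem_type.strip().lower()
    return list(ALGORITHMS_BY_PROBLEM_TYPE.get(key, []))


if __name__ == "__main__":
    for name in suggest_algorithms("Topic Modeling"):
        print(name)
```

A lookup like this only narrows the candidates; the "Key SageMaker Related Details" column (input formats, instance types) still decides which candidate fits a given dataset.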
 
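
The BlazingText row notes that the algorithm expects a single preprocessed text file with space-separated tokens. A minimal sketch of producing such a file is shown below. The tokenization rule (lowercasing and keeping runs of letters, digits, and apostrophes), the one-sentence-per-line layout, and the names `to_space_separated_tokens`, `build_corpus_lines`, and `write_corpus` are assumptions made for illustration; real pipelines often use a dedicated tokenizer.

```python
import re
from typing import Iterable, List


def to_space_separated_tokens(sentence: str) -> str:
    """Lowercase a sentence and join its tokens with single spaces.

    Assumption: a token is a run of ASCII letters, digits, or apostrophes.
    Punctuation and other characters are dropped.
    """
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return " ".join(tokens)


def build_corpus_lines(sentences: Iterable[str]) -> List[str]:
    """Preprocess sentences, skipping any that contain no tokens."""
    lines = [to_space_separated_tokens(s) for s in sentences]
    return [line for line in lines if line]


def write_corpus(sentences: Iterable[str], path: str) -> None:
    """Write all preprocessed sentences to ONE text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in build_corpus_lines(sentences):
            handle.write(line + "\n")


if __name__ == "__main__":
    sample = ["SageMaker trains models.", "Tokens are space-separated!"]
    for line in build_corpus_lines(sample):
        print(line)
```

The resulting file would then be uploaded to the location used as the training input; that step is omitted here because it depends on the account and bucket setup.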