Overview of AWS SageMaker Built-in Algorithms

Amazon SageMaker provides a suite of built-in algorithms to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly.

If you are new to SageMaker, choosing the right algorithm for your use case can be challenging.

The following table is a quick cheat sheet: start with an example problem or use case and find a built-in SageMaker algorithm that fits that problem type.

SageMaker Algorithm	Description	Domain	Problem Types	Use Cases	Key SageMaker-Related Details
BlazingText	The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.	Textual Analysis	Text classification	Natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.	BlazingText is not parallelizable. It expects a single preprocessed text file with space-separated tokens.
DeepAR Forecasting	The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).	Supervised Learning	Time-series forecasting	Based on historical data for a behavior, predict future behavior: predict sales for a new product based on previous sales data.	DeepAR supports two data channels. The required train channel describes the training dataset. The optional test channel describes a dataset that the algorithm uses to evaluate model accuracy after training.
Factorization Machines	The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.	Supervised Learning	Regression; binary/multi-class classification	Factorization machines are a good choice for tasks dealing with high-dimensional sparse datasets, such as click prediction and item recommendation.	CSV is not recommended because the data is highly sparse. Training uses the recordIO-protobuf format; inference supports the application/json and application/x-recordio-protobuf formats.
Image Classification	The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification.	Image Processing	Image and multi-label classification	Label/tag an image based on the content of the image: alerts about adult content in an image	The recommended input format is Apache MXNet RecordIO. However, you can also use raw images in .jpg or .png format. It uses a convolutional neural network (ResNet).
IP Insights	Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.	Unsupervised Learning	IP anomaly detection	Protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor	The SageMaker IP Insights algorithm can run on both GPU and CPU instances. The SageMaker IP Insights algorithm can also learn vector representations of IP addresses, known as embeddings.
K-Means	K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.	Unsupervised Learning	Clustering or grouping	Group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories	The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations.
K-Nearest Neighbors (k-NN)	Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based algorithm.	Supervised Learning	Regression; binary/multi-class classification	Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house.	k-NN supports text/csv and application/x-recordio-protobuf data formats.
Latent Dirichlet Allocation (LDA)	The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.	Unsupervised Learning	Topic modeling	Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document.	LDA supports both recordIO-wrapped-protobuf (dense and sparse) and CSV file formats.
Linear Learner	Linear models are supervised learning algorithms used for solving either classification or regression problems.	Supervised Learning	Regression; binary/multi-class classification	Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house.	For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/x-recordio-protobuf, and text/csv formats.
Neural Topic Model (NTM)	Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.	Unsupervised Learning	Topic modeling	Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document.	For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/jsonlines, application/x-recordio-protobuf, and text/csv formats.
Object2Vec	The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects.	Supervised Learning	Embeddings: convert high-dimensional objects into low-dimensional space.	Improve the data embeddings of the high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets
Object Detection	The Amazon SageMaker Object Detection algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.	Image Processing	Object detection and classification	Detect people and objects in an image: police review a large photo gallery for a missing person	Amazon SageMaker Object Detection uses the Single Shot MultiBox Detector (SSD) algorithm.
Principal Component Analysis (PCA)	PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another.	Unsupervised Learning	Feature engineering: dimensionality reduction	Drop those columns from a dataset that have a weak relation with the label/target variable: the color of a car when predicting its mileage.	For inference, PCA supports text/csv, application/json, and application/x-recordio-protobuf. Results are returned in either application/json or application/x-recordio-protobuf format with a vector of “projections.”
Random Cut Forest (RCF)	Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set.	Unsupervised Learning	Anomaly detection	Detect abnormal behavior in application: spot when an IoT sensor is sending abnormal readings
Semantic Segmentation	The SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes.	Image Processing	Computer vision	Tag every pixel of an image individually with a category: self-driving cars prepare to identify objects in their way	The SageMaker semantic segmentation algorithm only supports GPU instances for training.
Sequence-to-Sequence	Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.	Textual Analysis	Machine translation; text summarization; speech-to-text	Convert text from one language to another: Spanish to English. Summarize a long text corpus: an abstract for a research paper. Convert audio files to text: transcribe call center conversations for further analysis.	Amazon SageMaker seq2seq is supported only on GPU instance types.
XGBoost	XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.	Supervised Learning	Regression; binary/multi-class classification	Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house.
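
The lookup the table supports (start from a problem type, end at a candidate algorithm) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the lower-case problem-type labels, and the function name algorithms_for are assumptions made for this example, not part of any SageMaker API. The entries are transcribed from the Problem Types column above.

```python
# Problem types transcribed from the cheat-sheet table above. The dictionary
# layout, the lower-case labels, and the function name are illustrative
# assumptions, not part of any SageMaker API.
SUPERVISED_TABULAR = [
    "regression",
    "binary classification",
    "multi-class classification",
]

PROBLEM_TYPES_BY_ALGORITHM = {
    "BlazingText": ["text classification"],
    "DeepAR Forecasting": ["time-series forecasting"],
    "Factorization Machines": SUPERVISED_TABULAR,
    "Image Classification": ["image classification", "multi-label classification"],
    "IP Insights": ["ip anomaly detection"],
    "K-Means": ["clustering"],
    "K-Nearest Neighbors (k-NN)": SUPERVISED_TABULAR,
    "Latent Dirichlet Allocation (LDA)": ["topic modeling"],
    "Linear Learner": SUPERVISED_TABULAR,
    "Neural Topic Model (NTM)": ["topic modeling"],
    "Object2Vec": ["embeddings"],
    "Object Detection": ["object detection"],
    "Principal Component Analysis (PCA)": ["dimensionality reduction"],
    "Random Cut Forest (RCF)": ["anomaly detection"],
    "Semantic Segmentation": ["computer vision"],
    "Sequence-to-Sequence": [
        "machine translation",
        "text summarization",
        "speech-to-text",
    ],
    "XGBoost": SUPERVISED_TABULAR,
}


def algorithms_for(problem_type):
    """Return the algorithms whose table row lists the given problem type."""
    wanted = problem_type.strip().lower()
    return sorted(
        name
        for name, problem_types in PROBLEM_TYPES_BY_ALGORITHM.items()
        if wanted in problem_types
    )


if __name__ == "__main__":
    for query in ("topic modeling", "regression", "anomaly detection"):
        print(query, "->", algorithms_for(query))
```

The match is exact on the normalized label, so "anomaly detection" returns only Random Cut Forest and not IP Insights, whose row lists the narrower "IP anomaly detection". A lookup like this only narrows the candidates; the Key SageMaker-Related Details column (input formats, instance types) still decides which candidate is practical.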
