Overview of AWS SageMaker Built-in Algorithms
Amazon SageMaker provides a suite of built-in algorithms to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly.
If you are new to SageMaker, choosing the right algorithm for your use case can be challenging.
The following table is a quick cheat sheet: start from an example problem or use case and find a built-in SageMaker algorithm that suits that problem type.
SageMaker Algorithm | Description | Domain | Problem Types | Use Cases | Key SageMaker-Related Details |
BlazingText | The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. | Textual Analysis | Text classification | Natural language processing (NLP) tasks such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification. | The BlazingText algorithm is not parallelizable. It expects a single preprocessed text file with space-separated tokens. |
DeepAR Forecasting | The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). | Supervised Learning | Time-series forecasting | Based on historical data for a behavior, predict future behavior: predict sales of a new product based on previous sales data. | DeepAR supports two data channels. The required train channel describes the training dataset. The optional test channel describes a dataset that the algorithm uses to evaluate model accuracy after training. |
Factorization Machines | The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets. | Supervised Learning | Regression; binary classification | Factorization machines are a good choice for tasks dealing with high-dimensional sparse datasets, such as click prediction and item recommendation. | CSV is not a practical option because the data is typically highly sparse. Training uses the recordIO-protobuf format; inference supports the application/json and application/x-recordio-protobuf formats. |
Image Classification | The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. | Image Processing | Image and multi-label classification | Label/tag an image based on the content of the image: raise an alert about adult content in an image. | The recommended input format is Apache MXNet RecordIO. However, you can also use raw images in .jpg or .png format. It uses a convolutional neural network (ResNet). |
IP Insights | Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. | Unsupervised Learning | IP anomaly detection | Protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor | The SageMaker IP Insights algorithm can run on both GPU and CPU instances. The SageMaker IP Insights algorithm can also learn vector representations of IP addresses, known as embeddings. |
K-Means | K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. | Unsupervised Learning | Clustering or grouping | Group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories | The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. |
K-Nearest Neighbors (k-NN) | Amazon SageMaker k-nearest neighbors (k-NN) algorithm is an index-based algorithm. | Supervised Learning | Regression; binary/multiclass classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | k-NN supports text/csv and application/x-recordio-protobuf data formats. |
Latent Dirichlet Allocation (LDA) | The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. | Unsupervised Learning | Topic modeling | Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document. | LDA supports both recordIO-wrapped-protobuf (dense and sparse) and CSV file formats. |
Linear Learner | Linear models are supervised learning algorithms used for solving either classification or regression problems. | Supervised Learning | Regression; binary/multiclass classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/x-recordio-protobuf, and text/csv formats. |
Neural Topic Model (NTM) | Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. | Unsupervised Learning | Topic modeling | Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document. | For training, supports both recordIO-wrapped protobuf and CSV formats. For inference, supports the application/json, application/jsonlines, application/x-recordio-protobuf, and text/csv formats. |
Object2Vec | The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. | Supervised Learning | Embeddings: convert high-dimensional objects into low-dimensional space. | Improve the data embeddings of the high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets | |
Object Detection | The Amazon SageMaker Object Detection algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. | Image Processing | Object detection and classification | Detect people and objects in an image: police review a large photo gallery for a missing person. | Amazon SageMaker Object Detection uses the Single Shot MultiBox Detector (SSD) algorithm. |
Principal Component Analysis (PCA) | PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. | Unsupervised Learning | Feature engineering: dimensionality reduction | Drop those columns from a dataset that have a weak relation with the label/target variable: the color of a car when predicting its mileage. | For inference, PCA supports text/csv, application/json, and application/x-recordio-protobuf. Results are returned in either application/json or application/x-recordio-protobuf format with a vector of “projections.” |
Random Cut Forest (RCF) | Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. | Unsupervised Learning | Anomaly detection | Detect abnormal behavior in an application: spot when an IoT sensor is sending abnormal readings. | |
Semantic Segmentation | The SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. | Image Processing | Computer vision | Tag every pixel of an image individually with a category: self-driving cars prepare to identify objects in their way | The SageMaker semantic segmentation algorithm only supports GPU instances for training. |
Sequence-to-Sequence | Amazon SageMaker Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. | Textual Analysis | Machine translation; text summarization; speech-to-text | Convert text from one language to another: Spanish to English. Summarize a long text corpus: an abstract for a research paper. Convert audio files to text: transcribe call center conversations for further analysis. | Amazon SageMaker seq2seq is only supported on GPU instance types. |
XGBoost | XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. | Supervised Learning | Regression; binary/multiclass classification | Predict if an item belongs to a category: an email spam filter. Predict a numeric/continuous value: estimate the value of a house. | |
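
The table above can also be read as a lookup from problem type to candidate algorithms. The following Python sketch is one way to express that idea. The names CHEAT_SHEET and algorithms_for are hypothetical, and the problem-type labels are simplified from the Problem Types column for illustration; none of this is part of any SageMaker API.

```python
from collections import defaultdict

# Hypothetical mapping transcribed from the table above:
# algorithm name -> simplified problem-type labels (illustrative only).
CHEAT_SHEET = {
    "BlazingText": ["text classification"],
    "DeepAR Forecasting": ["time-series forecasting"],
    "Factorization Machines": ["regression", "classification"],
    "Image Classification": ["image classification"],
    "IP Insights": ["ip anomaly detection"],
    "K-Means": ["clustering"],
    "K-Nearest Neighbors (k-NN)": ["regression", "classification"],
    "Latent Dirichlet Allocation (LDA)": ["topic modeling"],
    "Linear Learner": ["regression", "classification"],
    "Neural Topic Model (NTM)": ["topic modeling"],
    "Object2Vec": ["embeddings"],
    "Object Detection": ["object detection"],
    "Principal Component Analysis (PCA)": ["dimensionality reduction"],
    "Random Cut Forest (RCF)": ["anomaly detection"],
    "Semantic Segmentation": ["computer vision"],
    "Sequence-to-Sequence": [
        "machine translation",
        "text summarization",
        "speech-to-text",
    ],
    "XGBoost": ["regression", "classification"],
}

# Invert the mapping once: problem type -> list of algorithm names.
_BY_PROBLEM = defaultdict(list)
for _algorithm, _problem_types in CHEAT_SHEET.items():
    for _problem_type in _problem_types:
        _BY_PROBLEM[_problem_type].append(_algorithm)


def algorithms_for(problem_type):
    """Return the algorithms listed for a problem type, sorted by name.

    Matching ignores case and surrounding whitespace. Unknown problem
    types yield an empty list.
    """
    key = problem_type.strip().lower()
    return sorted(_BY_PROBLEM.get(key, []))


if __name__ == "__main__":
    for name in algorithms_for("Topic modeling"):
        print(name)
```

Calling algorithms_for("regression") in this sketch returns Factorization Machines, K-Nearest Neighbors (k-NN), Linear Learner, and XGBoost. The table remains the reference; the lookup only narrows the candidates, and the Use Cases and details columns still decide which one fits.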