Saira Kennedy

Types of Machine Learning – Part #2 in the Intro to AI/ML Series

December 9, 2020

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019, may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html

Original Post Information:

"authorDisplayName": "Saira Kennedy",
"publish": "2018-09-28T07:00:00.000Z",
"tags": "machine-learning"

Based on MapR Academy course, Introduction to Artificial Intelligence and Machine Learning

In this post – second in the Intro to AI/ML Series – we discuss the different methods of machine learning and some of the most common algorithms available for your projects. Read the first blog in this series here.

Types of Machine Learning

There are a few different types of machine learning, but they generally fall into these main groups. Supervised and unsupervised learning are the primary learning types, along with semi-supervised learning.

Other methods sit in the middle or outskirts of these methodologies, such as reinforcement learning, which describes a machine that creates a cyclical learning cycle as it continuously trains itself from its own results. These results are then fed back into itself as input data. We won't be going into much detail on these other techniques, but further information can be found online.

First, let's understand what differentiates each of these learning types from each other and how they work.

Supervised Learning: Defined

Supervised learning uses labeled data to train machines to learn the relationships between given inputs and outputs.

A label is a known description given to objects in the data, which trains the machine on what to look for. Labels also provide the structure of the algorithm output, as any result must be one of these labels. Therefore, you can think of labels as a schema, defining the possible output that we want the machine to look for.

Think of supervised learning as the algorithm to use when data scientists have labeled input data and when the type of behavior to predict is known. We want the machine to learn the patterns used to classify this data and apply those patterns to classify new data.

How Supervised Learning Works

First, labeled or classified data is loaded into the system. The preparation of labeled data makes this step the most time-consuming, as it is often done by a human trainer.
The model is trained and connections to inputs and outputs are made.
As new data is introduced, the algorithm is applied.
Output is categorized data.

In the previous post, where we provided the cat example on labeled data, the trained labels include ears, nose, tail, paws, and cat, which the algorithm then applies to presented data, in this case an image of a cat, and returns the results of known output as "cat," yes.

Pros and Cons of Supervised Learning

Supervised learning always has a clear objective and can be easily measured for accuracy. The training of the machine is also tightly controlled, which leads to very specific behavioral outcomes.

On the downside, it is often very labor-intensive, as all data needs to be labeled before the model is trained, which can take hundreds of hours of specialized human effort. The costs can become astronomical. This creates an overall slower training process and may also limit the data that it can work with.

Pros	Cons
Very clear objective	Often labor-intensive
Easy to measure accuracy	Limited data to work with
Controlled training	Limited insights

Finally, insights may be more limited, as the predicted behavior is described in advance. There is no freedom for the machine to explore other possibilities, as we will see with unsupervised learning.

In supervised learning, there are primarily two categories of algorithms: classification and regression.

A classification algorithm organizes input data as belonging to one of several predefined classes. This algorithm is the most useful for providing categorical results that fit within the predefined labels. It is very effective with well-calculated if-then rules and distinguishes one class of objects from another.

Type	Algorithm or Task
Supervised	Classification (used to predict a categorical result)
Supervised	Regression (used to predict the output value given the input value)

Supervised Algorithms

Classification: Used to predict a categorical result

Some common use cases for classification algorithms include credit card fraud detection and email spam detection, both of which are binary classification problems, meaning there are only two possible output values. Data is labeled, for example, as fraud/non-fraud or spam/non-spam.

Generally, if the question we are asking of a model is open-ended or if the potential answers are not categorical, then we aren't dealing with a classification problem, but more likely a regression one.

A regression algorithm attempts to predict the output value given the input value. Regression problems are predictive of a continuous numerical, as opposed to categorical, result.

Think of this continuous value as a range or average, something that is estimating the relationship between variables.

Regression: Used to predict the output value given the input value

For example, this type of algorithm can be used to determine how profitable a credit card model is. It is also used to predict customer or employee churn models.

Regression algorithms determine the strength of correlation between two attributes, allowing you to find a predictive range of likelihood.

Supervised Algorithms Table

This table depicts some of the most common algorithms used with supervised learning types. It is important to understand that many machine learning data models will use more than one, and sometimes many, different algorithms for a project.

Type	Algorithm or Task
Classification	Naïve Bayes
Classification	Logistic Regression
Classification	Support Vector Machines (SVMs)
Regression	Linear Regression
Both	Decision Trees/Random Forest
Both	k-Nearest Neighbors (k-NN)
Both	Gradient Boosting Algorithms

Unsupervised Learning: Defined

While supervised learning involves having labeled data to find input-output relationships during the training phase, unsupervised learning has no knowledge of the output label. In this type of ML, the machine finds groups and patterns in the data on its own, and there is no specific outcome or target to predict.

Think of unsupervised learning as the algorithm to use when we don't know how to classify the data and we want the machine to classify or group it for us.

How Unsupervised Learning Works

Here are the steps of how the unsupervised learning algorithm works:

First, unlabeled raw data is loaded into the system.
Next, the algorithm analyzes the data.
It looks for patterns on its own.
Then, it identifies and groups patterns of behavior and provides output results.

Pros and Cons of Unsupervised Learning

Compared to supervised learning, unsupervised learning projects are much faster to implement, as no data labeling is required. In this regard, it uses fewer human resources. It also interprets data on its own and has the potential to provide unique, disruptive insights for a business to consider.

However, unsupervised learning can be difficult to measure for accuracy because there is no expected result to compare it to. It can require more experimentation and tuning to get meaningful results.

Lastly, unsupervised learning does not natively handle high-dimensional data, or massively large datasets with considerable variance, well. This is known as the curse of dimensionality. In some cases, the dimensions, or number of variables, may need to be reduced for it to work effectively. This requires human-intensive data cleansing.

Pros	Cons
Very fast to start	Difficult to measure accuracy
Disruptive insights	Requires more experience
	Curse of dimensionality

Common Use Cases for Unsupervised Algorithms

Let's take a look at a common use case example using cluster analysis. Cluster analysis has the goal of organizing raw data into related groups and is often used for anomaly detection.

A security company uses it to identify unusual patterns in network traffic, indicating potential signs of a security breach or intrusion.

Unsupervised Learning Use Case – Anomaly Detection

Recall the steps of how this type of algorithm works.

First, the security company streams in raw network traffic data.
Next, the algorithm analyzes the data on its own and looks for unusual patterns.
Then, it identifies patterns of behavior as either normal or suspect.
When suspect behavior is identified, the output is provided and the company is notified.

With this example using anomaly detection, a scatter plot may return results looking something like the image below. The green dots indicate behavior that is grouped together as normal, and the red dots show the potential outliers that are sent back as suspect.

This table depicts some unsupervised learning algorithms. The most common algorithm here is k-means, for cluster analysis, which is what we've just focused on with our security use case example on anomaly detection.

Type	Algorithm or Task
Unsupervised	_k_-Means: Cluster Analysis
Unsupervised	Association Rule Learning
Unsupervised	Dimensionality Reduction Techniques (PCA, SVD)

Semi-Supervised Learning: Defined

Semi-supervised learning includes a combination of supervised and unsupervised learning types together. Usually, this means that only a part of the provided input data is labeled, which the machine is trained on. It then learns to create additional labels and classifiers for raw data, on its own, which in turn gets added back to the original training data set.

How Semi-Unsupervised Learning Works

As a combination of the previous two learning types, let's look at how a common self-training algorithm, from the semi-supervised learning method, works:

First, an initial set of labeled input training data is loaded into the system.
The model is trained on the data. Then, a new data set of unlabeled data is presented.
The algorithm infers new labels and classifiers to apply to the new data. High-confidence data, or data that scores well based on the algorithm, is added back to the original labeled data set. From here, the machine progressively adapts and learns in an iterative process.
In some cases, when the labels and the rule-based engine conflicts, a human is needed for verification.

Semi-Supervised Algorithms Table

This table depicts some semi-supervised learning algorithms.

Type	Algorithm or Task
Semi-Supervised	Self-Training Algorithms
Semi-Supervised	Generative Model – Gaussian Mixture Model
Semi-Supervised	Graph-Based Algorithms – Label Propagation

Check your Knowledge

Classify the items listed below as a supervised, semi-supervised, or unsupervised learning method:

Answer Key:

Works on both labeled data and raw data: Semi-supervised

Easiest data preparation method: Unsupervised

Only uses labeled input data: Supervised

Infers patterns on its own: Unsupervised

Output is predefined: Supervised

Can be used to automate data labeling: Semi-supervised

More Resources: Machine Learning Libraries

Where do we go from here

Keep your eyes out for the next post in this series, discussing real world use cases for AI and ML.

Types of Machine Learning – Part #2 in the Intro to AI/ML Series

Original Post Information:

Types of Machine Learning

Supervised Learning: Defined

How Supervised Learning Works

Pros and Cons of Supervised Learning

Supervised Algorithms

Classification: Used to predict a categorical result

Regression: Used to predict the output value given the input value

Supervised Algorithms Table

Unsupervised Learning: Defined

How Unsupervised Learning Works

Pros and Cons of Unsupervised Learning

Common Use Cases for Unsupervised Algorithms

Unsupervised Learning Use Case – Anomaly Detection

Semi-Supervised Learning: Defined

How Semi-Unsupervised Learning Works

Semi-Supervised Algorithms Table

Check your Knowledge

More Resources: Machine Learning Libraries

Where do we go from here

Tags

Related

3 ways a data fabric enables a data-first approach

A Functional Approach to Logging in Apache Spark

Getting Started with DataTaps in Kubernetes Pods

Accessing HPE Ezmeral Data Fabric Object Storage from Spring Boot S3 Micro Service deployed in K3s cluster

An Inside Look at the Components of a Recommendation Engine

Analyzing Flight Delays with Apache Spark GraphFrames and MapR Database

Apache Spark as a Distributed SQL Engine

Apache Spark Machine Learning Tutorial

HPE Developer Newsletter

HPE Developer

About HPE