Saira Kennedy

Types of Machine Learning – Part #2 in the Intro to AI/ML Series

December 9, 2020

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019, may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html

Original Post Information:

"authorDisplayName": "Saira Kennedy",
"publish": "2018-09-28T07:00:00.000Z",
"tags": "machine-learning"

Based on the MapR Academy course, Introduction to Artificial Intelligence and Machine Learning

In this post – second in the Intro to AI/ML Series – we discuss the different methods of machine learning and some of the most common algorithms available for your projects. Read the first blog in this series here.

Types of Machine Learning

There are several types of machine learning, but most fall into a few main groups. Supervised and unsupervised learning are the primary types, with semi-supervised learning combining elements of both.

Other methods sit in the middle or on the outskirts of these methodologies, such as reinforcement learning, in which a machine learns in a continuous cycle: the results of its own actions are fed back to it as input data for further training. We won't be going into much detail on these other techniques, but further information can be found online.

First, let's understand what differentiates each of these learning types from each other and how they work.

Supervised Learning: Defined

Supervised learning uses labeled data to train machines to learn the relationships between given inputs and outputs.

A label is a known description given to objects in the data, which trains the machine on what to look for. Labels also provide the structure of the algorithm output, as any result must be one of these labels. Therefore, you can think of labels as a schema, defining the possible output that we want the machine to look for.

Think of supervised learning as the algorithm to use when data scientists have labeled input data and when the type of behavior to predict is known. We want the machine to learn the patterns used to classify this data and apply those patterns to classify new data.

How Supervised Learning Works

  • First, labeled or classified data is loaded into the system. The preparation of labeled data makes this step the most time-consuming, as it is often done by a human trainer.
  • The model is trained and connections between inputs and outputs are made.
  • As new data is introduced, the algorithm is applied.
  • Output is categorized data.
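The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two numeric features, their values, and the choice of a nearest-neighbor rule are hypothetical and are not taken from the course.

```python
import math

# Step 1: labeled training data.
# Assumption: two made-up features per example (ear pointiness, whisker length).
training_data = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.2, 0.1), "not cat"),
    ((0.1, 0.3), "not cat"),
]

def train(labeled_examples):
    """Step 2: for nearest-neighbor, training simply stores the labeled examples."""
    return list(labeled_examples)

def predict(model, features):
    """Step 3: apply the model to new data by copying the label of the closest example."""
    closest = min(model, key=lambda example: math.dist(example[0], features))
    return closest[1]

model = train(training_data)

# Step 4: the output is always one of the known labels.
prediction = predict(model, (0.85, 0.75))
print(prediction)
```

Whatever input is presented, the result can only be "cat" or "not cat", which is the sense in which labels act as a schema for the output.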

In the cat example from the previous post, the trained labels include ears, nose, tail, paws, and cat. The algorithm applies these labels to the data it is presented with, in this case an image of a cat, and returns the known output: yes, this is a "cat."

Pros and Cons of Supervised Learning

Supervised learning always has a clear objective and can be easily measured for accuracy. The training of the machine is also tightly controlled, which leads to very specific behavioral outcomes.

On the downside, it is often very labor-intensive, as all data needs to be labeled before the model is trained, which can take hundreds of hours of specialized human effort. The costs can become astronomical. This creates an overall slower training process and may also limit the data that it can work with.

| Pros | Cons |
| --- | --- |
| Very clear objective | Often labor-intensive |
| Easy to measure accuracy | Limited data to work with |
| Controlled training | Limited insights |

Finally, insights may be more limited, as the predicted behavior is described in advance. There is no freedom for the machine to explore other possibilities, as we will see with unsupervised learning.
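The "easy to measure accuracy" point can be made concrete. Because every example carries a known label, predictions can be compared against labels held back from training. The sketch below assumes hypothetical label lists, and the `accuracy` helper is illustrative only.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the known labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical held-out labels and the predictions a model made for them.
true_labels = ["cat", "cat", "not cat", "not cat", "cat"]
predicted_labels = ["cat", "not cat", "not cat", "not cat", "cat"]

score = accuracy(true_labels, predicted_labels)
print(score)  # 4 of the 5 predictions match
```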

In supervised learning, there are primarily two categories of algorithms: classification and regression.

A classification algorithm organizes input data as belonging to one of several predefined classes. This algorithm is the most useful for providing categorical results that fit within the predefined labels. It is very effective with well-calculated if-then rules and distinguishes one class of objects from another.

| Type | Algorithm or Task |
| --- | --- |
| Supervised | Classification (used to predict a categorical result) |
| Supervised | Regression (used to predict the output value given the input value) |

Supervised Algorithms

Classification: Used to predict a categorical result

Some common use cases for classification algorithms include credit card fraud detection and email spam detection, both of which are binary classification problems, meaning there are only two possible output values. Data is labeled, for example, as fraud/non-fraud or spam/non-spam.
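As a rough illustration of binary classification, here is a small Naïve Bayes spam filter in plain Python. The training messages are invented, and the word-count model with add-one smoothing is one common textbook formulation, not necessarily what a production spam filter uses.

```python
import math
from collections import Counter

# Hypothetical labeled messages; a real project would need far more data.
training_messages = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch on monday with the team", "non-spam"),
]

def train_naive_bayes(messages):
    """Count words per label and messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word] + 1  # add-one (Laplace) smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_naive_bayes(training_messages)
first = classify(model, "claim your free prize")
second = classify(model, "agenda for the team meeting")
print(first, second)
```

There are only two possible outputs, spam or non-spam, which is what makes this a binary classification problem.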

Generally, if the question we are asking of a model is open-ended or if the potential answers are not categorical, then we aren't dealing with a classification problem, but more likely a regression one.

A regression algorithm attempts to predict the output value given the input value. Regression problems predict a continuous numerical result, as opposed to a categorical one.

Think of this continuous value as a range or average, something that is estimating the relationship between variables.

Regression: Used to predict the output value given the input value

For example, this type of algorithm can be used to estimate how profitable a credit card model is. It is also used to model customer or employee churn.

Regression algorithms determine the strength of correlation between two attributes, allowing you to find a predictive range of likelihood.
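To make this concrete, the sketch below fits a straight line by ordinary least squares and reports the Pearson correlation between two attributes. The data points are made up for illustration, and simple linear regression is only one of many regression techniques.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def correlation(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data: an input attribute and a continuous output.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(xs, ys)
r = correlation(xs, ys)
predicted = slope * 6.0 + intercept  # continuous prediction for a new input
print(slope, intercept, r, predicted)
```

Unlike the classification case, the prediction is a number on a continuous scale rather than one of a fixed set of labels.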

Supervised Algorithms Table

This table depicts some of the most common algorithms used with supervised learning types. It is important to understand that many machine learning data models will use more than one, and sometimes many, different algorithms for a project.

| Type | Algorithm or Task |
| --- | --- |
| Classification | Naïve Bayes |
| Classification | Logistic Regression |
| Classification | Support Vector Machines (SVMs) |
| Regression | Linear Regression |
| Both | Decision Trees/Random Forest |
| Both | k-Nearest Neighbors (k-NN) |
| Both | Gradient Boosting Algorithms |

Unsupervised Learning: Defined

While supervised learning involves having labeled data to find input-output relationships during the training phase, unsupervised learning has no knowledge of the output label. In this type of ML, the machine finds groups and patterns in the data on its own, and there is no specific outcome or target to predict.

Think of unsupervised learning as the algorithm to use when we don't know how to classify the data and we want the machine to classify or group it for us.

How Unsupervised Learning Works

Here are the steps of how the unsupervised learning algorithm works:

  • First, unlabeled raw data is loaded into the system.
  • Next, the algorithm analyzes the data.
  • It looks for patterns on its own.
  • Then, it identifies and groups patterns of behavior and provides output results.
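As one possible illustration of these steps, here is a bare-bones k-means clustering routine in plain Python. The points are invented, and seeding the centroids with the first k points is an assumption made to keep the run deterministic; real implementations usually choose starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Basic k-means: returns (centroids, cluster index for each point)."""
    # Assumption: the first k points serve as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Unlabeled raw data: no point carries a label.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]

centroids, assignments = k_means(points, k=2)
print(assignments)
```

The algorithm is never told what the groups mean. It only reports which points belong together, and interpreting the groups is left to a person.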

Pros and Cons of Unsupervised Learning

Compared to supervised learning, unsupervised learning projects are much faster to implement, as no data labeling is required. In this regard, it uses fewer human resources. It also interprets data on its own and has the potential to provide unique, disruptive insights for a business to consider.

However, unsupervised learning can be difficult to measure for accuracy because there is no expected result to compare it to. It can require more experimentation and tuning to get meaningful results.

Lastly, unsupervised learning does not natively handle high-dimensional data well, meaning datasets with a very large number of variables and considerable variance. This problem is known as the curse of dimensionality. In some cases, the number of dimensions, or variables, may need to be reduced for the algorithm to work effectively. This requires human-intensive data cleansing.
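One commonly cited symptom of the curse of dimensionality is that distances between points start to look alike as the number of variables grows, which makes distance-based grouping less meaningful. The sketch below illustrates this with uniformly random points. The `relative_contrast` helper, the point counts, and the dimensions chosen are assumptions for illustration only.

```python
import math
import random

def relative_contrast(dimensions, num_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a query point to random points."""
    rng = random.Random(seed)
    query = tuple(rng.random() for _ in range(dimensions))
    distances = [
        math.dist(query, tuple(rng.random() for _ in range(dimensions)))
        for _ in range(num_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low_dim = relative_contrast(2)     # few variables: near and far points differ a lot
high_dim = relative_contrast(500)  # many variables: all points look about equally far
print(low_dim, high_dim)
```

When the contrast shrinks like this, reducing the number of variables before clustering is one way to restore meaningful distances.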

| Pros | Cons |
| --- | --- |
| Very fast to start | Difficult to measure accuracy |
| Disruptive insights | Requires more experience |
| | Curse of dimensionality |

Common Use Cases for Unsupervised Algorithms

Let's take a look at a common use case example using cluster analysis. Cluster analysis has the goal of organizing raw data into related groups and is often used for anomaly detection.

A security company uses it to identify unusual patterns in network traffic, indicating potential signs of a security breach or intrusion.

Unsupervised Learning Use Case – Anomaly Detection

Recall the steps of how this type of algorithm works.

  • First, the security company streams in raw network traffic data.
  • Next, the algorithm analyzes the data on its own and looks for unusual patterns.
  • Then, it identifies patterns of behavior as either normal or suspect.
  • When suspect behavior is identified, the output is provided and the company is notified.
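A simplified version of this flow can be sketched as follows. It assumes two hypothetical traffic features (requests per minute and megabytes transferred) and flags points that sit unusually far from the center of the data. This distance rule is a stand-in for a full cluster analysis, not the method any particular security product uses.

```python
import math
import statistics

def find_suspects(points, threshold=2.0):
    """Flag points whose distance from the overall centroid is unusually large."""
    centroid = tuple(statistics.mean(axis) for axis in zip(*points))
    distances = [math.dist(p, centroid) for p in points]
    # Assumption: "unusual" means more than `threshold` standard deviations
    # above the average distance.
    cutoff = statistics.mean(distances) + threshold * statistics.pstdev(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]

# Hypothetical traffic samples: (requests per minute, MB transferred).
traffic = [
    (10.0, 1.0), (12.0, 1.2), (11.0, 0.9), (9.0, 1.1),
    (10.0, 1.0), (11.0, 1.1), (12.0, 0.8), (10.0, 1.2),
    (95.0, 40.0),
]

suspects = find_suspects(traffic)
if suspects:
    print("Suspect behavior found:", suspects)  # stands in for notifying the company
```

No sample was labeled as normal or suspect in advance. The split comes entirely from how the data is distributed.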

With this example using anomaly detection, a scatter plot of the results would show green dots for behavior that is grouped together as normal and red dots for the potential outliers that are sent back as suspect.

This table depicts some unsupervised learning algorithms. The most common algorithm here is k-means, for cluster analysis, which is what we've just focused on with our security use case example on anomaly detection.

| Type | Algorithm or Task |
| --- | --- |
| Unsupervised | k-Means: Cluster Analysis |
| Unsupervised | Association Rule Learning |
| Unsupervised | Dimensionality Reduction Techniques (PCA, SVD) |

Semi-Supervised Learning: Defined

Semi-supervised learning combines the supervised and unsupervised learning types. Usually, this means that only part of the provided input data is labeled, and the machine is trained on that part. It then learns to create additional labels and classifiers for raw data on its own, and these in turn get added back to the original training data set.

How Semi-Supervised Learning Works

Because it combines the previous two learning types, let's look at how a common self-training algorithm from the semi-supervised learning method works:

  • First, an initial set of labeled input training data is loaded into the system.
  • The model is trained on the data. Then, a new data set of unlabeled data is presented.
  • The algorithm infers new labels and classifiers to apply to the new data. High-confidence data, or data that scores well based on the algorithm, is added back to the original labeled data set. From here, the machine progressively adapts and learns in an iterative process.
  • In some cases, when the labels and the rule-based engine conflict, a human is needed for verification.
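The self-training loop described above might look like the following sketch. The nearest-centroid classifier, the confidence formula, the 0.8 threshold, and the data are all assumptions chosen to keep the example short. They are not taken from the course.

```python
import math

def centroids_by_label(labeled):
    """Train a nearest-centroid classifier: one mean point per label."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in groups.items()
    }

def predict_with_confidence(model, features):
    """Return (label, confidence). Assumes at least two distinct labels."""
    ranked = sorted(model, key=lambda label: math.dist(features, model[label]))
    nearest = math.dist(features, model[ranked[0]])
    runner_up = math.dist(features, model[ranked[1]])
    # Hypothetical confidence score: close to 1 when one centroid is much nearer.
    return ranked[0], 1.0 - nearest / (nearest + runner_up)

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly add high-confidence predictions back into the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids_by_label(labeled)
        confident, remaining = [], []
        for features in unlabeled:
            label, confidence = predict_with_confidence(model, features)
            if confidence >= threshold:
                confident.append((features, label))
            else:
                remaining.append(features)
        if not confident:
            break  # nothing new was learned, so stop iterating
        labeled.extend(confident)
        unlabeled = remaining
    return labeled, unlabeled

initial_labeled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
new_unlabeled = [(1.5, 1.0), (8.5, 9.5), (5.0, 5.0), (2.0, 2.5)]

final_labeled, still_unlabeled = self_train(initial_labeled, new_unlabeled)
print(len(final_labeled), still_unlabeled)
```

Points that never reach the confidence threshold stay unlabeled, and those are the cases where a human reviewer would step in.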

Semi-Supervised Algorithms Table

This table depicts some semi-supervised learning algorithms.

| Type | Algorithm or Task |
| --- | --- |
| Semi-Supervised | Self-Training Algorithms |
| Semi-Supervised | Generative Model: Gaussian Mixture Model |
| Semi-Supervised | Graph-Based Algorithms: Label Propagation |

Check your Knowledge

Classify the items listed below as a supervised, semi-supervised, or unsupervised learning method:

Answer Key:

Works on both labeled data and raw data: Semi-supervised

Easiest data preparation method: Unsupervised

Only uses labeled input data: Supervised

Infers patterns on its own: Unsupervised

Output is predefined: Supervised

Can be used to automate data labeling: Semi-supervised

More Resources:  Machine Learning Libraries

Where do we go from here?

Keep an eye out for the next post in this series, which discusses real-world use cases for AI and ML.
