MACHINE LEARNING - GLOSSARY - COMPUTER SCIENCE CAFÉ

MACHINE LEARNING GLOSSARY

GLOSSARY OF TERMS FOR MACHINE LEARNING AND RECOMMENDATION SYSTEMS

Behavioural data refers to data collected by observing and recording the actions, decisions, and interactions of individuals or groups. In the context of IB Computer Science, behavioural data is used for purposes such as user modeling, personalization, and recommendation systems. It can be collected from sources such as website logs, mobile apps, or social media platforms, and it is used to understand and predict what users do, improve the user experience, and target advertisements.

Cloud delivery models refer to the ways in which cloud computing services are provided to customers. There are three main cloud delivery models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). IaaS provides virtualized computing resources, PaaS provides a platform for developing, running and managing applications, and SaaS provides access to software applications that are hosted and managed by the provider. Understanding these models can help in the decision-making process of choosing the most suitable service to meet the requirements of a particular use case.

Cloud deployment models refer to the way the cloud infrastructure is configured and accessed. There are three main cloud deployment models: public cloud, private cloud, and hybrid cloud. A public cloud is run by a third-party provider and is accessible to the public; a private cloud is operated solely for one organization; a hybrid cloud combines the two. Understanding the different deployment models can help in the decision-making process of choosing the most suitable service to meet the requirements of a particular use case.

Collaborative filtering is a method used in recommendation systems to make predictions about a user's preferences based on the preferences of similar users. It can be used to make personalized recommendations to users. Different techniques can be used to implement collaborative filtering, such as k-nearest neighbors or matrix factorization.
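A minimal sketch of user-based collaborative filtering with a k-nearest-neighbours approach may help make the idea concrete. The ratings, user names, and function names below are hypothetical, and cosine similarity over co-rated items is only one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "ana":  {"film_a": 5, "film_b": 4, "film_c": 1},
    "ben":  {"film_a": 4, "film_b": 5, "film_d": 4},
    "cara": {"film_a": 1, "film_c": 5, "film_d": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict_rating(user, item, k=2):
    """Similarity-weighted average rating from the k most similar users who rated the item."""
    neighbours = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours.sort(reverse=True)          # most similar users first
    top = neighbours[:k]
    total_similarity = sum(sim for sim, _ in top)
    if total_similarity == 0:
        return None                        # no usable neighbours
    return sum(sim * rating for sim, rating in top) / total_similarity

predicted = predict_rating("ana", "film_d")
```

In this toy data "ben" rates films much like "ana" does, so his rating of film_d pulls the prediction closer to his score than to "cara"'s.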

Content-based filtering is a method used in recommendation systems to make predictions about a user's preferences based on the characteristics of the items they have interacted with, so a user is recommended items that resemble those they already liked. Different techniques can be used to extract the item characteristics, such as natural language processing for text or image recognition for pictures.
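As a rough illustration, the sketch below builds a user profile from the features of liked items and ranks unseen items by similarity to that profile. The item names and the three hand-written genre features are assumptions made for the example; in practice the features would come from a technique such as text or image analysis.

```python
from math import sqrt

# Hypothetical item features in the order [action, comedy, romance]
items = {
    "film_a": [1, 0, 0],
    "film_b": [1, 1, 0],
    "film_c": [0, 0, 1],
    "film_d": [1, 0, 1],
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(liked, top_n=1):
    """Rank items the user has not seen by similarity to their profile."""
    n_features = len(next(iter(items.values())))
    # The profile is the average feature vector of the liked items.
    profile = [sum(items[i][f] for i in liked) / len(liked) for f in range(n_features)]
    candidates = [i for i in items if i not in liked]
    candidates.sort(key=lambda i: cosine(profile, items[i]), reverse=True)
    return candidates[:top_n]

suggestions = recommend(["film_a", "film_b"])
```

Unlike collaborative filtering, this approach needs no data about other users, only a description of each item.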

Cost function is a mathematical expression that measures the error between a machine learning model's predicted values and the actual values. Training optimizes the model by adjusting its parameters to minimize this error. Different types of cost functions can be used, such as mean squared error (typically for regression) or cross-entropy (typically for classification).
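The two cost functions named above can be sketched in a few lines. The function names are illustrative, and the cross-entropy version shown is the binary form, which assumes every predicted probability lies strictly between 0 and 1.

```python
from math import log

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def binary_cross_entropy(actual, predicted_prob):
    """Average negative log-likelihood for labels in {0, 1}.

    Assumes each predicted probability is strictly between 0 and 1.
    """
    total = sum(
        a * log(p) + (1 - a) * log(1 - p)
        for a, p in zip(actual, predicted_prob)
    )
    return -total / len(actual)

mse = mean_squared_error([3, 5], [2, 7])          # (1 + 4) / 2
bce = binary_cross_entropy([1, 0], [0.9, 0.2])    # confident and mostly right
```

Both values shrink toward zero as the predictions approach the actual values, which is what an optimizer exploits.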

F-measure is a statistical measure that combines precision and recall into a single score for evaluating a machine learning model; its most common form, the F1 score, is the harmonic mean of the two. It balances the trade-off between precision and recall, and can be used to compare the performance of different models. In the context of IB Computer Science, students should know how to calculate F-measure and interpret the results, and how it can be used to compare the performance of different models.
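A short sketch of the calculation follows. It uses the general F-beta formula, where beta = 1 gives the usual F1 score; the precision and recall figures for the two models are made up for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical scores for two models
model_a = f_measure(precision=0.9, recall=0.5)   # precise but misses many positives
model_b = f_measure(precision=0.7, recall=0.7)   # balanced
```

Because the harmonic mean is dragged down by the smaller of the two values, the balanced model scores higher here even though its precision is lower.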

Hyperparameter is a value that is set before the training of a machine learning model and controls the behavior of the model. Examples of hyperparameters include the learning rate in a neural network, the number of trees in a random forest, or the number of clusters in a k-means algorithm. In the context of IB Computer Science, students should know how to choose appropriate values for different types of models, and how to use techniques such as cross-validation to tune the hyperparameters for optimal performance.
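One way to picture hyperparameter tuning is to try several candidate values and keep the one that scores best under cross-validation. The sketch below tunes k for a very small k-nearest-neighbour classifier using leave-one-out cross-validation; the one-dimensional dataset, the candidate values, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical 1-D dataset: (feature value, class label)
data = [(1.0, "a"), (1.2, "a"), (1.4, "a"),
        (3.0, "b"), (3.2, "b"), (3.4, "b"), (2.1, "b")]

def knn_predict(train, x, k):
    """Majority class among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(k):
    """Hold out each point in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if knn_predict(train, x, k) == label:
            correct += 1
    return correct / len(data)

# Evaluate each candidate value of the hyperparameter and keep the best.
scores = {k: leave_one_out_accuracy(k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

The key point is that k is never learned from the data by the model itself; it is chosen from the outside by comparing validation scores.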

K-nearest neighbour (k-NN) algorithm is a supervised machine learning algorithm used for classification and regression. It is based on the idea of finding the k data points nearest to a given point and making a prediction based on the majority class (for classification) or the mean value (for regression) of those points. In the context of IB Computer Science, students should know how the k-NN algorithm works, its advantages and disadvantages, and how it can be used for different types of data and problems.
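Both uses of k-NN can be sketched with the same neighbour search. The example assumes Euclidean distance (via `math.dist`, available from Python 3.8) and uses hypothetical two-dimensional data.

```python
from collections import Counter
from math import dist

def k_nearest(train, query, k):
    """Return the k training rows closest to the query by Euclidean distance."""
    return sorted(train, key=lambda row: dist(row[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority class among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)

# Hypothetical labelled points: ((x, y), class)
labelled = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
            ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue")]
# Hypothetical numeric targets: ((x, y), value)
priced = [((1, 1), 10.0), ((1, 2), 12.0), ((2, 1), 14.0), ((6, 6), 50.0)]

colour = knn_classify(labelled, (2, 2))
value = knn_regress(priced, (1, 1))
```

Note that there is no training step: every prediction scans the stored data, which is simple but becomes slow on large datasets.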

Matrix factorization is a technique used in recommendation systems to factorize a user-item matrix into two lower-dimensional matrices, one representing users and the other representing items. The goal is to find the latent representation of users and items that can explain the observed ratings in the user-item matrix. In the context of IB Computer Science, students should know how matrix factorization handles the sparsity of the user-item matrix, how it provides a low-dimensional representation of users and items, and how it can be regularized to prevent overfitting.
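The sketch below shows one common way to fit the two matrices: stochastic gradient descent over the observed ratings only, with a regularization term. The ratings, the number of latent factors, and the learning-rate and regularization values are assumptions chosen for a toy example, not recommended settings.

```python
import random

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:            # only observed cells are visited
            pred = sum(P[u][f] * Q[i][f] for f in range(n_factors))
            err = r - pred
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]
                # Gradient step with regularization (reg shrinks the factors)
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating is the dot product of the user and item factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical sparse ratings: 8 of the 12 user-item cells are observed.
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
            (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
P, Q = factorize(observed, n_users=4, n_items=3)
missing_guess = predict(P, Q, 0, 2)        # a cell that was never rated
```

Because training loops over observed ratings only, the empty cells cost nothing, yet `predict` can still fill them in from the learned factors.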

Mean absolute error (MAE) is a measure of the average magnitude of the errors in a set of predictions, without considering their direction. In the context of IB Computer Science, students should know how MAE can be used to evaluate the performance of a model, and how it compares to other error measures such as Root-Mean-Square Error (RMSE).
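The calculation can be sketched as follows; the function name and the sample values are illustrative.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring whether each one is over or under."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Errors of -1, +2 and 0 contribute 1, 2 and 0 to the total.
mae = mean_absolute_error([3, 5, 2], [2, 7, 2])
```

Every error counts in proportion to its size, so a single large miss affects MAE less than it affects a squared-error measure.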

Overfitting is a phenomenon in machine learning where a model fits the training data too closely, including its noise, and as a result performs poorly on new, unseen data. A typical sign is a low error on the training data combined with a much higher error on validation data. In the context of IB Computer Science, students should know how to identify overfitting and how to control it with techniques such as regularization.

Popularity bias is a common problem in recommendation systems where popular items are recommended more often than less popular items. This can occur when the algorithm used in the recommendation system is based on the popularity of items rather than the individual preferences of users. In the context of IB Computer Science, students should know how popularity bias can negatively affect the performance of a recommendation system, and how to recognize and address it by using techniques that promote diversity and novelty in the recommendations.

Precision is a measure of the accuracy of a classifier in terms of the proportion of true positive predictions out of all positive predictions. In the context of IB Computer Science, students should know how precision is used to evaluate the performance of a classifier, and how it can be used to compare the performance of different models.

Recall is a measure of the completeness of a classifier in terms of the proportion of true positive predictions out of all actual positive instances. In the context of IB Computer Science, students should know how recall is used to evaluate the performance of a classifier, and how it can be used to compare the performance of different models. Together precision and recall can be used to evaluate the trade-off of a classifier in terms of accuracy and completeness.
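Precision and recall can both be computed from the counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 as the positive class; the label lists are made up.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
```

Here the classifier makes three positive predictions, two of them correct, and finds two of the four actual positives, so precision is 2/3 and recall is 1/2.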

Reinforcement learning is a type of machine learning that focuses on training agents to make decisions in an environment by rewarding or punishing them based on their actions. In the context of IB Computer Science, students should know the fundamental concepts of reinforcement learning, such as the Markov Decision Process, Q-learning, and policy gradient methods, and how they can be used to train agents to play games, control robots, or optimize decision-making systems.
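As an illustration of Q-learning, the sketch below trains an agent in a tiny made-up environment: a corridor of five cells in which the agent starts at the left end and receives a reward of 1 for reaching the right end. The environment, the reward scheme, and the parameter values (learning rate alpha, discount gamma, exploration rate epsilon) are all assumptions for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, 1)   # action 0 moves left, action 1 moves right

def step(state, action_index):
    """Apply an action; the walls keep the agent inside the corridor."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:           # explore
                action = rng.randrange(2)
            else:                                # exploit, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best future value
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy for each non-terminal state (1 means "move right")
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every cell, which is the shortest route to the reward.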

The right to anonymity refers to the ability of individuals to protect their identity and remain anonymous in their interactions and transactions. In the context of IB Computer Science, students should know about the importance of anonymity in online interactions, particularly in terms of protecting personal information and sensitive data, and the various technologies and techniques that can be used to support anonymity, such as VPNs, Tor, and blockchain (which in many systems is pseudonymous rather than fully anonymous).

The right to privacy refers to the ability of individuals to control the collection, use, and dissemination of their personal information. In the context of IB Computer Science, students should know about the legal and ethical considerations surrounding privacy, including data protection laws and regulations, and the technologies and techniques that can be used to protect privacy, such as encryption, access control, and data minimization.

Root-mean-square error (RMSE) is a measure of the difference between predicted values and actual values in a dataset. It is commonly used to evaluate the performance of a machine learning model. RMSE is calculated as the square root of the mean of the squared differences between predicted and actual values. In the context of IB Computer Science, students should know how to calculate RMSE, interpret the results, and compare it with other error measures such as Mean Absolute Error (MAE).
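The calculation described above can be sketched directly; the function name and sample values are illustrative.

```python
from math import sqrt

def root_mean_square_error(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

small_errors = root_mean_square_error([3, 5, 2], [2, 7, 2])    # errors 1, 2, 0
one_big_miss = root_mean_square_error([3, 5, 2], [3, 5, 5])    # errors 0, 0, 3
```

Both cases have the same total absolute error of 3, but squaring makes the single large miss weigh more, so the second RMSE is higher.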

Stochastic gradient descent (SGD) is an optimization algorithm used to minimize a cost function in machine learning by iteratively updating the model parameters in the direction of the negative gradient. It is called "stochastic" because each update is computed from a single randomly selected example or a small random subset of the training data (a mini-batch), rather than from the entire dataset. In the context of IB Computer Science, students should know about the advantages and disadvantages of SGD compared to other optimization algorithms such as batch gradient descent, and how to choose the appropriate learning rate and batch size for optimal performance.
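A minimal sketch of mini-batch SGD is shown below for a straight-line model y = w*x + b with a mean-squared-error cost. The data, learning rate, batch size, and epoch count are assumptions picked so that the toy example converges; they are not general recommendations.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.02, batch_size=2, epochs=500, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w                       # step against the gradient
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_regression(xs, ys)
```

Setting `batch_size` to the full dataset size would turn this into batch gradient descent, while `batch_size=1` gives the purest form of SGD.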

Training data refers to the set of data used to train a machine learning model. This data is used to optimize the model's parameters so that it can make accurate predictions on new, unseen data. In the context of IB Computer Science, students should know about the importance of having a representative and diverse set of training data, and how to prepare and preprocess the data to ensure it is suitable for training. They should also know how techniques such as cross-validation are used to check that the model generalizes well to new data and to detect overfitting.
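The splitting step behind k-fold cross-validation can be sketched as follows. The function name is hypothetical; it only produces index lists, leaving the choice of model and scoring to the caller.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_splits(10, k=5))
```

Each sample is used for validation exactly once and for training in the other folds, so the averaged validation score reflects performance on data the model did not see during that round of training.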