COMPUTER SCIENCE CAFÉ
  • WORKBOOKS
  • BLOCKY GAMES
  • GCSE
    • CAMBRIDGE GCSE
  • IB
  • A LEVEL
  • LEARN TO CODE
  • ROBOTICS ENGINEERING
  • MORE
    • CLASS PROJECTS
    • Classroom Discussions
    • Useful Links
    • SUBSCRIBE
    • ABOUT US
    • CONTACT US
    • PRIVACY POLICY

DATABASES | DATABASE MODELS AND ANALYSIS

ON THIS PAGE
SECTION 1 | CHARACTERISTICS OF DIFFERENT DATABASE MODELS
SECTION 2 | RELATIONAL DATABASE MODEL
SECTION 3 | OBJECT ORIENTED DATABASE MODEL
SECTION 4 | NETWORK DATABASE MODEL
SECTION 5 | SPATIAL DATABASE MODEL
SECTION 6 | MULTI-DIMENSIONAL DATABASE MODEL
SECTION 7 | OBJECT-ORIENTED DATABASE VS RELATIONAL DATABASE
SECTION 8 | DATA WAREHOUSE
SECTION 9 | TIME DEPENDENT
SECTION 10 | REAL TIME UPDATES
SECTION 11 | DATA WAREHOUSING ADVANTAGES AND DISADVANTAGES
SECTION 12 | EXTRACT, TRANSFORM, LOAD
SECTION 13 | DATA MINING
SECTION 14 | DATA MINING USES
SECTION 15 | PREDICTIVE MODELING
SECTION 16 | DATABASE SEGMENTATION
SECTION 17 | LINK ANALYSIS
SECTION 18 | DEVIATION DETECTION
ALSO IN THIS TOPIC
 BASIC CONCEPTS
 RELATIONAL DATABASES 2.1 - 2.8
 RELATIONAL DATABASES 2.9 - 2.20
 DATABASE MANAGEMENT
 YOU ARE HERE | DATABASE MODELS AND ANALYSIS
DATABASE KEY TERMINOLOGY

DATABASE ANSWERS

Topics from the International Baccalaureate (IB) 2014 Computer Science Guide. 
SECTION 1 | CHARACTERISTICS OF DIFFERENT DATABASE MODELS
In this section we look at different database models and the characteristics of each, including:
  • Relational Database Model
  • Object Oriented Database Model
  • Network Database Model
  • Spatial Database Model
  • Multi-Dimensional Database Model
SECTION 2 | RELATIONAL DATABASE MODEL
A relational database model is a type of database model that organises data into one or more tables, where each table represents a collection of related data. Each table is made up of rows and columns, where each row represents a unique record and each column represents a specific attribute or field of the record.

Think of a relational database model like a spreadsheet, where each sheet represents a table, and each row represents a record and each column represents a specific attribute of that record. For example, a table for a customer database might have columns for the customer's name, address, phone number, and email address, with each row representing a unique customer record.

Relational databases are widely used in a variety of applications, such as inventory management, accounting, and e-commerce. They are flexible, scalable, and efficient, and allow for easy data retrieval and manipulation through the use of SQL (Structured Query Language) queries.

Characteristics: Some key characteristics of a relational database model include the use of related tables, primary keys, relationships, normalisation and SQL.
Tables
Tables: A relational database model stores data in one or more tables, where each table represents a collection of related data. Each table consists of rows and columns, where each row represents a unique record and each column represents a specific attribute or field of the record.
Primary Keys
Primary Keys: Each table in a relational database model has a primary key, which is a unique identifier for each record in the table. This allows for easy retrieval and manipulation of data within the table.
Relationships
Relationships: In a relational database model, relationships between tables are established through foreign keys, which create a one-to-many relationship between tables. This allows data to be efficiently organised and related across multiple tables.
Normalisation
Normalisation: Relational databases use normalisation techniques to reduce data redundancy and improve data consistency. Normalisation involves breaking up tables into smaller, more specialised tables, which reduces the likelihood of data duplication and inconsistency.
Structured Query Language (SQL)
Structured Query Language (SQL): Relational databases use SQL as a standardised language for querying and manipulating data. SQL allows for easy retrieval, filtering, sorting, and aggregation of data within tables and across tables.
A relational database model organises data into tables with primary keys and establishes relationships between tables through foreign keys. Normalisation techniques are used to improve data consistency and reduce redundancy, and SQL is used as a standardised language for querying and manipulating data.
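These ideas can be sketched with Python's built-in sqlite3 module. The tables, columns, and sample values below are illustrative only, not taken from any real airline system:

```python
import sqlite3

# In-memory database; in practice this would be a file or a database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables: a primary key in Customer, a foreign key in Booking.
cur.execute("""CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,
    Name TEXT,
    Email TEXT)""")
cur.execute("""CREATE TABLE Booking (
    BookingID INTEGER PRIMARY KEY,
    CustomerID INTEGER,
    FlightNo TEXT,
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID))""")

cur.execute("INSERT INTO Customer VALUES (1, 'Ada Lovelace', 'ada@example.com')")
cur.execute("INSERT INTO Booking VALUES (10, 1, 'AA101')")

# A join follows the foreign-key relationship across the two tables.
cur.execute("""SELECT Customer.Name, Booking.FlightNo
               FROM Customer JOIN Booking
               ON Customer.CustomerID = Booking.CustomerID""")
rows = cur.fetchall()
print(rows)  # [('Ada Lovelace', 'AA101')]
```

The join shows the key idea: each table stores one kind of record, and the foreign key lets SQL recombine them on demand.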

American Airlines is an example of a company that uses a relational database model:
  • American Airlines: American Airlines uses a relational database model to store data about its flight schedules, routes, and bookings. Each of these entities is represented as a separate table, with columns representing the various attributes of the entity. This allows American Airlines to efficiently manage and retrieve data about its flights, and provide real-time information to passengers about flight status and delays.

Companies that deal with large amounts of structured data, such as inventory management, accounting, e-commerce, or airline reservations, may find a relational database model to be a useful tool for organising and analysing their data. By using a relational database model, organisations can efficiently store and manage data, and make better-informed decisions based on the relationships between different types of data.
SECTION 3 | OBJECT ORIENTED DATABASE MODEL
Object-oriented databases (OODBs) are databases that store data in the form of objects, which can contain both data and behaviour. OODBs support Encapsulation, Inheritance and Polymorphism. Software that supports OODBs includes ObjectStore, GemStone/S, and Versant.

Example datasets that could be well-suited for an OODB include those that have complex relationships between data, such as those found in social networks, medical records, or financial transactions.

Some companies that might use OODBs include those in the financial industry, where complex financial transactions require a flexible and efficient database solution. For example, Goldman Sachs has used GemStone/S to manage their risk management and trading systems, and JPMorgan Chase has used ObjectStore to manage their credit risk and pricing models. Other companies that may use OODBs include those in the healthcare industry, where complex patient data requires a flexible and secure database solution.

In summary, encapsulation is about hiding internal state, inheritance is about sharing common functionality, and polymorphism is about being adaptable to different situations. These are all important concepts in object-oriented databases that help to make them powerful and flexible tools for managing complex data.
Encapsulation
Encapsulation in object-oriented databases means that objects can hide their internal state (i.e., data) and only expose certain methods for interacting with that state. This helps to ensure that the data remains valid and consistent. Think of encapsulation like a vending machine – you don't need to know how the machine works on the inside, you just need to know which buttons to press to get what you want.
Inheritance
Inheritance in object-oriented databases allows objects to inherit behaviour and properties from other objects, resulting in a more efficient and flexible codebase. This means that objects can share common functionality, which reduces the amount of code that needs to be written and makes it easier to maintain and update the code. Think of inheritance like a family tree – children inherit certain characteristics from their parents, such as eye colour or height.
Polymorphism
Polymorphism in object-oriented databases means that objects can take on multiple forms depending on the context in which they are used. This allows objects to be more flexible and adaptable, as they can be used in a variety of different ways. Think of polymorphism like a Swiss Army Knife – it has multiple tools that can be used in different situations, such as a knife, a can opener, and a screwdriver.
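The three concepts above can be illustrated with a short Python sketch, using plain classes to stand in for database objects (the class names are hypothetical):

```python
class Record:
    """Base class: encapsulates its state and exposes it only through methods."""
    def __init__(self, record_id):
        self._id = record_id          # leading underscore: internal state

    def describe(self):
        return f"Record {self._id}"

class PatientRecord(Record):
    """Inheritance: reuses Record's state and behaviour, adding its own."""
    def __init__(self, record_id, name):
        super().__init__(record_id)
        self._name = name

    def describe(self):               # polymorphism: overrides describe()
        return f"Patient {self._name} (record {self._id})"

records = [Record(1), PatientRecord(2, "Ada")]
# The same method call behaves differently depending on the object's type.
summaries = [r.describe() for r in records]
```

An OODB stores objects like these directly, methods and all, rather than flattening them into rows and columns.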
SECTION 4 | NETWORK DATABASE MODEL
A networked database model is a type of database model that organises data into a graph-like structure, where data is represented as nodes (also called records) and relationships between them. In this model, nodes can have multiple relationships to other nodes, and relationships can have attributes that describe the nature of the connection.

Think of a networked database model like a social network, where each person is represented as a node, and the connections between them (such as friends or family members) are represented as relationships. Each person (node) can have multiple relationships to other people (nodes), and each relationship can have attributes that describe the nature of the connection (such as how long they have been friends, or what type of family relationship they have).

Networked database models are well-suited for complex data structures that have many-to-many relationships between entities, such as social networks, genealogy, or supply chain management. By representing data in a graph-like structure, networked databases can efficiently store and retrieve complex relationships between data, and allow for more flexible and powerful analysis of the data.

Characteristics: A Network Database Model includes Nodes and Relationships, Multiple Parents, Hierarchical Organisation, Graph Theory and Flexibility.
Nodes and Relationships
Nodes and Relationships: A networked database model represents data as a graph-like structure consisting of nodes (records) and relationships between them. Each node can have multiple relationships to other nodes, and relationships can have attributes that describe the nature of the connection.
Multiple Parents
Multiple Parents: Unlike in a relational database model, nodes in a networked database model can have multiple parents, which means that they can belong to multiple groups or categories. This makes it easier to represent complex relationships between data.
Hierarchical Organisation
Hierarchical Organisation: In a networked database model, nodes can be organised hierarchically, with some nodes being higher up in the hierarchy than others. This allows for efficient navigation and retrieval of data within the network.
Graph Theory
Graph Theory: Networked databases are based on graph theory, which is a mathematical framework for analysing networks of nodes and relationships. This means that networked databases can leverage algorithms and techniques from graph theory to efficiently traverse the data and extract insights.
Flexibility
Flexibility: The networked database model is more flexible than the relational database model, allowing for more complex and dynamic relationships between data. This makes it well-suited for complex data structures that have many-to-many relationships between entities.
Example use: A couple of examples of companies that use a network database model and the type of data they store:
  • Amazon: Amazon uses a network database model to store data about its product catalog and customer recommendations. Each product is represented as a node, and relationships between products are based on similarities or recommendations. Customers are also represented as nodes, and relationships between customers are based on their purchasing behaviour.
  • LinkedIn: LinkedIn uses a network database model to store data about its members and their connections. Each member is represented as a node, and connections between members (such as friends, colleagues, or alumni) are represented as relationships. This allows LinkedIn to efficiently store and retrieve complex relationship data, and provide valuable insights and recommendations to its users.

In general, companies that deal with complex data structures that have many-to-many relationships between entities, such as social networks, supply chain management, or genealogy, may find a network database model to be a useful tool for organising and analysing their data.
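The node-and-relationship structure can be sketched in plain Python; the people and relationship attributes below are invented for illustration, and a real graph database would index this far more efficiently:

```python
# Nodes keyed by id; each relationship carries attributes describing it.
nodes = {"alice": {}, "bob": {}, "carol": {}}
edges = [
    ("alice", "bob",   {"type": "colleague", "since": 2019}),
    ("alice", "carol", {"type": "friend",    "since": 2015}),
    ("bob",   "carol", {"type": "friend",    "since": 2021}),
]

def neighbours(node):
    """Follow relationships in either direction (many-to-many)."""
    out = []
    for a, b, attrs in edges:
        if a == node:
            out.append((b, attrs))
        elif b == node:
            out.append((a, attrs))
    return out

# A simple traversal: who are alice's friends (as opposed to colleagues)?
friends_of_alice = [n for n, attrs in neighbours("alice") if attrs["type"] == "friend"]
```

Because relationships are stored explicitly with their own attributes, queries like "friends of friends" become graph traversals rather than repeated table joins.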
SECTION 5 | SPATIAL DATABASE MODEL
A spatial database model is a type of database model that is used to store and manage spatial data, which is data that has a location or geographic component. In other words, a spatial database model is designed to handle data that is tied to a specific location on the earth's surface.

Think of a spatial database model like a digital map that stores information about the location of various features, such as cities, roads, rivers, and landmarks. Each feature is represented as a point, line, or polygon, and can have attributes that describe various characteristics of the feature, such as its name, population, or elevation.

Spatial database models are used in a variety of applications, such as geographic information systems (GIS), environmental management, transportation planning, and urban planning. By using a spatial database model, organisations can store and analyse spatial data more efficiently and accurately, and make better-informed decisions based on the geographic relationships between different features.

In summary, a spatial database model is a specialised type of database model that is designed to store and manage spatial data. It includes special data types, indexing techniques, and operations that are optimised for working with geographic data. By using a spatial database model, organisations can efficiently store and analyse spatial data, and make better-informed decisions based on the geographic relationships between different features.

Characteristics: A spatial database model specialises in storing and querying data about geographic locations, and has specific functions to handle spatial data such as distance, area, and intersection. It often encompasses key features such as: geographic location, spatial data types, spatial indexing, geographic operations and integration with other data.
Geographic Location
Geographic Location: A spatial database model stores and manages data that is tied to a specific geographic location on the earth's surface. This allows for efficient analysis and retrieval of data based on its location.
Spatial Data Types
Spatial Data Types: A spatial database model includes special data types that are designed to represent spatial data, such as points, lines, and polygons. These data types can be used to store and manipulate geographic features such as cities, roads, rivers, and land parcels.
Spatial Indexing
Spatial Indexing: A spatial database model uses special indexing techniques to efficiently store and retrieve spatial data. These indexing techniques are designed to take advantage of the unique characteristics of spatial data, such as proximity and topology.
Geographic Operations
Geographic Operations: A spatial database model includes specialised operations that can be used to analyse and manipulate spatial data. These operations can include calculations of distances and areas, buffering, overlay analysis, and network analysis.
Integration with Other Data
Integration with Other Data: A spatial database model can integrate spatial data with other types of data, such as demographic data, environmental data, or business data. This allows organisations to make more informed decisions based on the geographic relationships between different types of data.
A couple of examples of companies that use a spatial database model and the type of data they store:
  • Google: Google uses a spatial database model to power its popular Google Maps application. The spatial database stores data about the geographic locations of roads, buildings, parks, and other features, as well as information about traffic, transit schedules, and other real-time data. This allows Google Maps to provide accurate and up-to-date information about the location of various features and to calculate efficient routes for navigation.
  • National Weather Service: The National Weather Service uses a spatial database model to store and analyse weather data, such as temperature, precipitation, wind speed, and pressure. The spatial database stores this data based on the location and time of the measurement, allowing meteorologists to analyse and predict weather patterns across different regions and time periods.

In general, companies that deal with spatial data, such as maps, satellite imagery, environmental data, or urban planning, may find a spatial database model to be a useful tool for organising and analysing their data. By using a spatial database model, organisations can efficiently store and analyse spatial data, and make better-informed decisions based on the geographic relationships between different features.
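A simple spatial query — "which features lie within a given distance of a point?" — can be sketched in Python using the standard haversine (great-circle distance) formula. The city coordinates are approximate and purely illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points on the earth's surface."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # mean earth radius ~6371 km

# Features stored as (name, latitude, longitude) — the 'point' spatial data type.
cities = [("London", 51.5074, -0.1278), ("Paris", 48.8566, 2.3522),
          ("Edinburgh", 55.9533, -3.1883)]

def within(lat, lon, km):
    """A basic spatial query: all stored points within km of a location."""
    return [name for name, la, lo in cities if haversine_km(lat, lon, la, lo) <= km]

near_london = within(51.5074, -0.1278, 500)   # London itself and Paris
```

A real spatial database would answer this with a spatial index (such as an R-tree) instead of scanning every point, which is what makes such queries fast at scale.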
SECTION 6 | MULTI-DIMENSIONAL DATABASE MODEL
A multi-dimensional model is a type of database model that organises data into a multidimensional cube, where each dimension represents a different aspect of the data. In other words, a multi-dimensional model is designed to handle data that has multiple dimensions, such as time, geography, and product type.

Think of a multi-dimensional model like a Rubik's Cube, where each side represents a different dimension of the data. For example, a multi-dimensional model for a sales database might have dimensions for time (such as year, quarter, and month), product (such as category and SKU), and geography (such as region and store location). Each cell in the cube represents a unique combination of these dimensions, and can contain data such as sales revenue, profit margin, or inventory levels.

Multi-dimensional models are used in a variety of applications, such as business intelligence, financial analysis, and forecasting. By organising data into a multi-dimensional cube, organisations can quickly and easily analyse the data from different perspectives, and identify trends and insights that might not be apparent from a traditional table-based view.


Characteristics: Some key characteristics of a multi-dimensional database model include dimensions, measures, aggregation, slice and dice, and OLAP.

Dimensions: A multi-dimensional database model organises data into multiple dimensions, which represent different aspects of the data. For example, a sales database might have dimensions for time, product, and geography.

Measures: Within each cell of the multi-dimensional cube, a multi-dimensional database model stores one or more measures, which represent the data that is being analysed. For example, a sales database might have measures for revenue, profit margin, or inventory levels.

Aggregation: A multi-dimensional database model supports aggregation of data across multiple dimensions, allowing users to analyse data from different perspectives and levels of detail. For example, users can aggregate sales data by year, by region, or by product category.

Slice and Dice: A multi-dimensional database model supports "slice and dice" operations, which allow users to selectively filter and view data based on specific dimensions or subsets of dimensions. This allows users to focus on specific aspects of the data that are relevant to their analysis.

OLAP: Multi-dimensional databases use Online Analytical Processing (OLAP) technology to efficiently analyse and retrieve data. OLAP allows for fast retrieval and manipulation of data, and allows users to interactively explore data from multiple dimensions.
A couple of examples of companies that use a multi-dimensional database model and the type of data they store:
  • Walmart: Walmart uses a multi-dimensional database model to store and analyse sales data for its retail stores. The multi-dimensional database allows Walmart to quickly analyse sales data across different dimensions, such as time, geography, and product category. This enables Walmart to make data-driven decisions about inventory management, pricing, and marketing.
  • Netflix: Netflix uses a multi-dimensional database model to store and analyse data about its streaming video service. The multi-dimensional database allows Netflix to analyse data across dimensions such as viewer demographics, viewing habits, and content preferences. This enables Netflix to make data-driven decisions about which content to produce and recommend to its users.

Companies that deal with large amounts of data across multiple dimensions, such as sales, marketing, or customer behaviour, may find a multi-dimensional database model to be a useful tool for organising and analysing their data. By using a multi-dimensional database model, organisations can efficiently store and manage data, and make better-informed decisions based on the relationships between different dimensions of the data.
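The cube, slice and aggregation ideas can be sketched in plain Python; the dimension names and revenue figures below are invented for illustration:

```python
# A small sales 'cube': each cell is keyed by (year, region, product)
# and holds one measure (revenue).
cube = {
    (2023, "North", "Laptop"): 120, (2023, "North", "Phone"): 80,
    (2023, "South", "Laptop"): 90,  (2023, "South", "Phone"): 60,
    (2024, "North", "Laptop"): 150, (2024, "South", "Phone"): 70,
}

DIMS = ("year", "region", "product")

def slice_cube(**fixed):
    """'Slice and dice': keep only cells matching the fixed dimension values."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] == val for d, val in fixed.items())}

def aggregate(cells):
    """Aggregate the revenue measure over the selected cells."""
    return sum(cells.values())

north_2023 = aggregate(slice_cube(year=2023, region="North"))  # 120 + 80
laptops_all_years = aggregate(slice_cube(product="Laptop"))    # 120 + 90 + 150
```

An OLAP engine does exactly this kind of filtering and summing, but with pre-computed aggregates so that each query returns almost instantly.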
SECTION 7 | OBJECT-ORIENTED DATABASE VS RELATIONAL DATABASE
The use of object-oriented databases (OODBs) versus relational databases (RDBs) is a topic of ongoing debate in the database community. Here's an evaluation of the pros and cons of OODBs and RDBs:

ADVANTAGES OF AN OBJECT-ORIENTED DATABASE
  • Flexibility: OODBs are more flexible than RDBs because they allow for complex and hierarchical data structures. This is because OODBs allow for the storage of complex objects that can have methods and properties.
  • Performance: OODBs can have better performance than RDBs when dealing with complex queries and large volumes of data. This is because OODBs can use more advanced indexing and caching techniques than RDBs.
  • Reduced Mapping Overhead: OODBs eliminate the need for mapping objects to relational structures, which can simplify development and reduce the amount of code needed to handle data.

DISADVANTAGES OF AN OBJECT-ORIENTED DATABASE
  • Complexity: OODBs are more complex than RDBs because they require knowledge of both object-oriented programming and database design.
  • Limited Tool Support: There are fewer tools and frameworks available for working with OODBs than RDBs, which can make development and maintenance more difficult.
  • Scalability: OODBs may have scalability issues because they are not as widely used as RDBs and may not be able to handle as many simultaneous users.

ADVANTAGES OF A RELATIONAL DATABASE
  • Familiarity: RDBs are widely used and well-understood, making them easier to use and maintain for many developers.
  • Tool Support: There are a wide range of tools and frameworks available for working with RDBs, which can make development and maintenance easier.
  • Scalability: RDBs can be highly scalable because they are widely used and have been optimised for performance and scalability.

DISADVANTAGES OF A RELATIONAL DATABASE
  • Limited Flexibility: RDBs are limited in their ability to handle complex and hierarchical data structures.
  • Performance: RDBs can have performance issues when dealing with complex queries and large volumes of data.
  • Mapping Overhead: RDBs require mapping objects to relational structures, which can increase development time and introduce complexity.

Both OODBs and RDBs have their pros and cons. OODBs offer flexibility and performance benefits, but are more complex and have limited tool support. RDBs are widely used and familiar, but have limited flexibility and can have performance issues. Ultimately, the choice between OODBs and RDBs depends on the specific requirements of the application and the preferences of the development team.
SECTION 8 | DATA WAREHOUSE
A data warehouse is a central store of data from various sources; incoming data must abide by the data warehouse rules and cannot be changed once stored in the warehouse. Whilst multiple backups may be made, the data warehouse should become a single point of access to clean, quality data. A data warehouse would generally follow these 5 rules.
1: Subject oriented - Broken down into subject-specific areas.
2: Integrated - Follows common rules for data conventions.
3: Time-variant - Contains historical data and a means to query data by date.
4: Non-volatile - Cannot be changed once stored in the data warehouse.
5: Summarised - The data should be summarised ready for easy analysis.
Data warehousing is the process of collecting, storing, and managing large amounts of data from multiple sources for the purpose of creating a centralised repository for decision-making and business intelligence. It involves creating a data warehouse that integrates data from different sources, such as transactional databases, to create a single, unified view of the data. The data is then organised and optimised for fast querying and analysis, allowing organisations to make informed decisions based on data-driven insights.

Data warehousing is suitable for a range of situations where large amounts of data need to be analysed for business intelligence and decision-making purposes. Some examples include:
  • Business reporting: Data warehouses are often used to support business reporting by providing a centralised repository for data from multiple sources.
  • Data integration: Data warehouses are used to integrate data from disparate sources into a single, unified view of the data.
  • Data mining: Data warehouses enable organisations to perform data mining, or the process of uncovering hidden patterns and relationships in large amounts of data.
  • Customer behaviour analysis: Retailers, for example, may use a data warehouse to analyse customer purchasing patterns, preferences and behaviour.
  • Financial analysis: Financial institutions may use a data warehouse to analyse large amounts of financial data to identify trends and make informed decisions.
  • Supply chain management: Supply chain managers may use a data warehouse to analyse supply chain data to identify areas for improvement and make decisions that increase efficiency and reduce costs.
  • Marketing analysis: Marketing departments may use a data warehouse to analyse customer data to understand market trends and target customers more effectively.
SECTION 9 | TIME DEPENDENT
Data warehousing is time dependent because it stores historical data as well as current data. In a data warehouse, data is organised into time periods, such as days, weeks, or months, and each period contains data from that time frame.

The time dependency of a data warehouse is important because it allows for trend analysis and historical reporting. By storing historical data, the data warehouse can provide insight into how data has changed over time, and how trends have evolved. This can help organisations make informed decisions based on past performance and predict future trends.

For example, a retail company may use a data warehouse to track sales by store and product over time. By analysing historical sales data, the company can identify trends in sales, such as which products are selling well and which stores are performing best. This information can be used to make informed decisions about inventory management, marketing, and store operations.

In addition, data warehouses often contain data from multiple sources, such as transactional databases and external data sources. By organising this data by time periods, the data warehouse can provide a unified and consistent view of the data over time, even as the underlying data sources change.

Data warehousing is time dependent because it stores historical data as well as current data, allowing for trend analysis and historical reporting. By organising data into time periods, data warehouses can provide a unified and consistent view of data over time, which can help organisations make informed decisions and predict future trends.
SECTION 10 | REAL TIME UPDATES
Updating data in a data warehouse in real-time involves continuously incorporating new data into the warehouse as it is generated. This requires a fast, efficient, and reliable method for capturing, transforming, and loading the data into the warehouse.

There are several ways to achieve real-time updates in a data warehouse, including:
  • Stream processing: This involves capturing data in real-time as it is generated, transforming it on the fly, and loading it directly into the warehouse. This method is suitable for high-velocity data streams, such as log data, social media feeds, and IoT sensor data.
  • Change data capture (CDC): This involves monitoring source systems for changes and capturing only the changes as they occur, rather than reading the entire source data set. The changes are then transformed and loaded into the warehouse.
  • Batch processing with near real-time updates: This involves processing data in batch mode and updating the warehouse periodically, but with a minimal delay. This method is suitable for large data sets that can be processed in batch mode, such as sales data or financial data.

The choice of method for real-time updates in a data warehouse will depend on the volume, velocity, and complexity of the data, as well as the requirements of the business. In some cases, a combination of methods may be used to achieve the desired level of real-time updates.
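The change data capture approach can be illustrated with a toy Python sketch, where per-row version numbers stand in for a real source system's change log:

```python
# Source rows carry a version that increases whenever the row changes.
source = {
    1: {"name": "Ada",   "version": 3},
    2: {"name": "Grace", "version": 5},
    3: {"name": "Alan",  "version": 1},
}
warehouse = {}
last_synced_version = 0

def sync():
    """Capture only rows changed since the last sync, and load them."""
    global last_synced_version
    changes = {k: v for k, v in source.items()
               if v["version"] > last_synced_version}
    warehouse.update(changes)
    if changes:
        last_synced_version = max(v["version"] for v in changes.values())
    return len(changes)

first = sync()                                # all three rows are new
source[3] = {"name": "Alan", "version": 6}    # one row changes at the source
second = sync()                               # only the changed row is captured
```

The point of CDC is visible in the second sync: instead of re-reading the whole source, only the single changed row is transformed and loaded.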
SECTION 11 | DATA WAREHOUSING ADVANTAGES AND DISADVANTAGES
Data warehousing provides several advantages, including:
  • Improved Decision Making: Data warehousing enables organisations to store and analyse vast amounts of historical data, providing insights that can be used to make informed business decisions.
  • Centralised Data: By centralising data from multiple sources, data warehousing makes it easier for organisations to access and analyse their data. This reduces the time and effort required to consolidate data from disparate sources, improving the accuracy and reliability of the data.
  • Data Consistency: Data warehousing enforces data standards, ensuring that data is consistent across the organisation. This eliminates the need for manual data reconciliation, reducing the risk of errors and improving the quality of the data.
  • Data Integration: Data warehousing integrates data from different sources, enabling organisations to analyse data that would otherwise be difficult to access. This enables organisations to gain a more comprehensive view of their operations, providing valuable insights into business performance.
  • Scalability: Data warehousing can handle large volumes of data, making it possible for organisations to store and analyse data as their business grows. This allows organisations to keep pace with their changing data requirements, ensuring that their data management systems remain effective.
  • Improved Performance: Data warehousing utilises specialised technologies, such as data marts and OLAP cubes, to improve the performance of data analysis. This enables organisations to quickly and efficiently analyse their data, improving the speed and accuracy of decision making.

Data warehousing also has several disadvantages, including:
  • High Cost: Data warehousing can be a costly undertaking, requiring significant investments in hardware, software, and personnel. This can be a challenge for smaller organisations or organisations with limited budgets.
  • Complexity: Data warehousing can be complex, requiring specialised skills and knowledge to implement and maintain. This can make it challenging for organisations to manage their data warehousing systems, leading to errors and reduced data quality.
  • Data Latency: Data warehousing typically relies on batch processing to update the warehouse, which can result in data latency. This means that the data in the warehouse may not reflect the most current state of the business, reducing its usefulness for real-time decision making.
  • Data Integration Challenges: Integrating data from multiple sources can be challenging, particularly if the data has different structures or formats. This can result in errors, data quality issues, and a need for extensive data cleaning and transformation.
  • Maintenance Requirements: Data warehousing requires ongoing maintenance, including data backup and recovery, performance tuning, and hardware upgrades. This can be time-consuming and costly, and can take resources away from other critical business tasks.
  • Data Privacy and Security: Data warehousing can raise privacy and security concerns, as it requires storing and analysing large amounts of sensitive data. This can result in the risk of data breaches and the unauthorised access of confidential information.
SECTION 12 | EXTRACT, TRANSFORM, LOAD 
ETL (Extract, Transform, Load) processes are an essential part of data warehousing. The ETL process is used to extract data from various sources, transform it into a format that can be used by the data warehouse, and load it into the data warehouse.

The need for ETL processes in data warehousing arises from the fact that data is often stored in multiple locations and in different formats, making it difficult to integrate and analyse. ETL processes provide a way to extract data from these disparate sources and transform it into a consistent format that can be used by the data warehouse. This process ensures that the data is accurate, complete, and consistent, which is essential for making informed decisions based on the data.

One important aspect of ETL processes is data cleaning, which involves identifying and correcting errors, inconsistencies, and discrepancies in the data. Here are some ways that ETL processes can be used to clean up data for a data warehouse:
  • Removing Duplicates: ETL processes can identify and remove duplicate records from the data, ensuring that each record is unique and accurate.
  • Standardising Formats: ETL processes can standardise the format of data, such as dates or addresses, to ensure consistency across the data.
  • Data Validation: ETL processes can validate data to ensure that it meets certain criteria, such as data type or range, to ensure accuracy and completeness.
  • Correcting Errors: ETL processes can correct errors in the data, such as misspelled names or incorrect values, to ensure accuracy.
  • Handling Missing Data: ETL processes can handle missing data by either filling in missing values with a default value or dropping records with missing data, depending on the requirements of the data warehouse.

By cleaning up the data before loading it into the data warehouse, ETL processes ensure that the data is accurate, consistent, and complete, which is essential for making informed decisions based on the data.
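The cleaning steps above can be sketched in code. The following is a minimal illustration, not a production ETL pipeline: the customer fields, date formats, and validation rules are all invented for the example.

```python
# A minimal sketch of ETL data cleaning: deduplication, format
# standardisation, validation, error correction, and missing-data handling.
# Field names, formats, and thresholds are assumptions for illustration.
from datetime import datetime

raw_records = [
    {"id": 1, "name": "Alice", "joined": "2023-01-05", "age": "34"},
    {"id": 1, "name": "Alice", "joined": "2023-01-05", "age": "34"},   # duplicate
    {"id": 2, "name": "bob",   "joined": "05/02/2023", "age": "210"},  # bad format, bad age
    {"id": 3, "name": "Carol", "joined": "2023-03-10", "age": None},   # missing value
]

def standardise_date(value):
    """Accept either ISO or DD/MM/YYYY input and emit ISO (assumed convention)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None

def clean(records):
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:                 # remove duplicates
            continue
        seen.add(rec["id"])
        rec = dict(rec)
        rec["name"] = rec["name"].title()              # correct casing errors
        rec["joined"] = standardise_date(rec["joined"])  # standardise formats
        age = rec["age"]
        if age is None:                       # handle missing data
            rec["age"] = 0                    # assumed default value
        elif not (0 < int(age) < 120):        # validate against an assumed range
            rec["age"] = 0
        else:
            rec["age"] = int(age)
        cleaned.append(rec)
    return cleaned

cleaned = clean(raw_records)
```

After cleaning, the duplicate record is gone, "bob" is corrected to "Bob", the non-ISO date is standardised, and the invalid and missing ages are replaced with the default.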
SECTION 13 | DATA MINING
Data mining is the process of discovering patterns and insights from large datasets. There are several techniques used in data mining to discover different types of patterns. Here are five commonly used techniques in data mining:
  • Cluster Analysis: Cluster analysis is a technique used to group similar items or records together based on their characteristics. It is used to identify natural groupings in the data and can be used for market segmentation or customer profiling. For example, a retailer may use cluster analysis to group customers based on their purchasing behaviour.
  • Association: Association is a technique used to discover relationships between variables or items in the data. It is used to find co-occurrences of items or events and can be used for market basket analysis or recommendation engines. For example, a retailer may use association analysis to identify which products are commonly purchased together.
  • Classification: Classification is a technique used to predict the value of a categorical variable based on other variables in the data. It is used for making predictions and can be used for credit risk analysis or fraud detection. For example, a bank may use classification to predict whether a loan applicant is likely to default on a loan.
  • Sequential Patterns: Sequential pattern mining discovers patterns that occur over time, such as a sequence of events or a pattern of behaviour. It is used to find patterns in temporal data and can be used for predicting future events. For example, a healthcare provider may use sequential pattern analysis to predict the likelihood of a patient developing a certain condition based on their medical history.
  • Forecasting: Forecasting is a technique used to predict future values of a variable based on its past values. It is used for making predictions and can be used for sales forecasting or demand forecasting. For example, a retailer may use forecasting to predict sales for a particular product based on past sales data.

Different techniques are used in data mining to discover patterns and insights from large datasets, including cluster analysis, association, classification, sequential patterns, and forecasting. Each technique is used for a different purpose and can provide valuable insights into the data.
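One of these techniques, association (market basket analysis), can be sketched very simply: count how often pairs of items appear together in the same transaction. The baskets and product names below are invented for illustration.

```python
# A minimal sketch of association analysis: count co-occurrences of item
# pairs across transactions and compute their support (the fraction of
# transactions containing the pair). The basket data is invented.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of all transactions that contain the pair
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
top_pair, top_count = pair_counts.most_common(1)[0]
```

Here the most frequent pair is bread and butter, appearing in three of the four baskets (support 0.75), which is exactly the kind of co-occurrence a retailer would use for shelf placement or recommendations.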
SECTION 14 | DATA MINING USES
Data mining can be useful in a wide range of industries and applications. Examples of data mining you might expect to see at IB level are:

FRAUD DETECTION IN BANKING
Data mining can be used to detect fraudulent credit card use in banking. By analysing patterns in credit card transactions, data mining algorithms can identify unusual spending patterns that may indicate fraud. For example, if a credit card is suddenly used to make multiple large purchases in different locations, this could be a sign of fraudulent activity. Data mining can also be used to identify groups of customers with a high risk of fraud, such as those with a history of chargebacks or those who have recently opened a new account.

TARGETED MARKETING IN RETAIL
Data mining can be used to identify subsets of the population likely to respond to a particular promotion in retail. By analysing customer data, such as purchase history and demographics, data mining algorithms can identify groups of customers who are likely to be interested in a particular product or promotion. This can help retailers target their marketing efforts more effectively and increase the likelihood of a sale. For example, a retailer may use data mining to identify customers who have recently purchased a certain type of product and send them a targeted promotion for a related product.

In both of these examples, data mining is used to analyse large datasets to identify patterns and insights that can help organisations make informed decisions. By using data mining techniques, organisations can identify risks, opportunities, and trends that may not be immediately apparent from the data.
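The fraud example above can be sketched as a simple rule: flag any card used in more than a set number of different cities within a short time window. The transactions, thresholds, and field layout below are invented for illustration; real systems use far richer features and learned models.

```python
# A minimal sketch of the fraud pattern described above: flag a card that is
# used in several different cities within a short time window.
# Data format: (card, amount, city, time "HH:MM"). All values are invented.
transactions = [
    ("card_A", 10.0,  "London",   "09:00"),
    ("card_A", 12.5,  "London",   "13:30"),
    ("card_B", 900.0, "Paris",    "10:00"),
    ("card_B", 850.0, "New York", "10:20"),
    ("card_B", 700.0, "Tokyo",    "10:45"),
]

def minutes(hhmm):
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def flag_suspicious(txns, window=60, max_cities=2):
    """Flag cards used in more than max_cities cities within window minutes."""
    flagged = set()
    by_card = {}
    for card, _, city, time in txns:
        by_card.setdefault(card, []).append((minutes(time), city))
    for card, events in by_card.items():
        events.sort()
        for t0, _ in events:
            cities = {c for t, c in events if t0 <= t <= t0 + window}
            if len(cities) > max_cities:
                flagged.add(card)
    return flagged

suspicious = flag_suspicious(transactions)
```

Card B is flagged because it appears in three cities within 45 minutes, while card A's repeated same-city purchases are left alone.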
SECTION 15 | PREDICTIVE MODELING
Predictive modeling is a technique used to analyse data and make predictions about future events or trends. It involves using statistical algorithms and machine learning techniques to build models that can predict the outcome of future events based on historical data.

One of the key techniques used in predictive modeling is decision tree induction. Decision tree induction involves constructing a tree-like model of decisions and their possible consequences. Each decision point in the tree is based on a particular attribute of the data, and the branches represent the possible values of that attribute. The leaves of the tree represent the predicted outcome or classification. Decision tree induction can be used for classification, regression, or survival analysis, depending on the type of data being analysed.

Another technique used in predictive modeling is back-propagation in neural networks. Back-propagation is a supervised learning technique used in artificial neural networks to train the network to make predictions. It involves adjusting the weights of the network based on the error between the predicted output and the actual output. Back-propagation is used to optimise the performance of neural networks for tasks such as image recognition, natural language processing, and predictive analytics.
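The weight-adjustment idea behind back-propagation can be shown on the smallest possible case: a single sigmoid neuron learning the logical AND function. Real networks have many layers of such units, but the update rule, adjusting each weight in proportion to its contribution to the error, is the same in spirit. The learning rate and epoch count below are arbitrary choices for the sketch.

```python
# A minimal sketch of error-driven weight updates on one sigmoid neuron
# learning AND. With a cross-entropy loss the gradient with respect to the
# neuron's input is simply (prediction - target), which is propagated back
# to each weight. Hyperparameters are arbitrary illustration values.
import math

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, b = 0.0, 0.0, 0.0        # start from zero weights (deterministic)
lr = 0.5                         # learning rate

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for _ in range(5000):
    for (x1, x2), target in data:
        pred = sigmoid(w1 * x1 + w2 * x2 + b)
        error = pred - target    # gradient of the loss w.r.t. the neuron input
        w1 -= lr * error * x1    # adjust each weight by its share of the error
        w2 -= lr * error * x2
        b  -= lr * error

predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for (x1, x2), _ in data]
```

After training, the neuron outputs (when rounded) match the AND truth table: only the input (1, 1) produces 1.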

Predictive modeling can be used in a wide range of applications, including customer segmentation, fraud detection, and risk analysis. For example, a bank may use predictive modeling to predict the likelihood of a loan default based on a customer's credit history, income, and other factors. A retailer may use predictive modeling to identify which customers are most likely to respond to a particular promotion based on their purchase history and demographics.

In summary, predictive modeling is a technique used to analyse data and make predictions about future events or trends. It involves using statistical algorithms and machine learning techniques to build models that can predict the outcome of future events based on historical data. Decision tree induction and back-propagation in neural networks are two examples of techniques used in predictive modeling. These techniques can be applied in a wide range of applications to improve decision-making and inform business strategies.
SECTION 16 | DATABASE SEGMENTATION
Database segmentation is the process of dividing a database into smaller, more manageable segments or subsets. There are several reasons why database segmentation may be needed:
  • Performance: Large databases can be slow to query and update, especially if they contain millions or billions of records. By segmenting the database into smaller subsets, queries and updates can be performed more quickly, improving performance.
  • Security: Segmentation can be used to limit access to sensitive data. For example, an organisation may segment their database so that only authorised personnel have access to certain records or fields.
  • Maintenance: Smaller databases are easier to maintain and backup than larger databases. By segmenting the database into smaller subsets, organisations can simplify maintenance tasks, such as backup and recovery.
  • Scalability: As a database grows in size, it may become necessary to segment it into smaller subsets to accommodate growth. This can also help to distribute the load across multiple servers or clusters, improving scalability.
  • Compliance: Some industries or regulations require data to be segmented or partitioned. For example, the Payment Card Industry Data Security Standard (PCI DSS) requires cardholder data to be isolated in a secure, segmented environment, separate from the rest of the organisation's systems.

By segmenting a database, organisations can improve performance, simplify maintenance tasks, and ensure compliance with industry regulations.
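The simplest form of segmentation, horizontal partitioning, can be sketched as splitting records into subsets by a key such as region, so that each segment can be queried, secured, or backed up independently. The customer records and field names below are invented.

```python
# A minimal sketch of horizontal database segmentation: records are split
# into smaller subsets by a partition key (here, region). Data is invented.
customers = [
    {"id": 1, "region": "EU",   "name": "Anna"},
    {"id": 2, "region": "ASIA", "name": "Kenji"},
    {"id": 3, "region": "EU",   "name": "Marta"},
    {"id": 4, "region": "US",   "name": "Dave"},
]

def segment_by(records, key):
    """Group records into segments keyed by the chosen attribute."""
    segments = {}
    for rec in records:
        segments.setdefault(rec[key], []).append(rec)
    return segments

segments = segment_by(customers, "region")
```

A query for EU customers now only has to scan the EU segment rather than the whole table, which is the performance benefit described above; in a real DBMS this is done declaratively with table partitioning rather than in application code.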
SECTION 17 | LINK ANALYSIS
In the context of associations between individual records in a data set, the purpose of link analysis is to identify and analyse the relationships and patterns that connect those records: to find the connections between records and to understand the nature and significance of those connections.

Link analysis can be particularly useful in analysing large and complex data sets, such as those found in social media or customer data. By analysing the relationships between individual records, link analysis can provide insights into customer behaviour, social networks, and other patterns that may be relevant to a particular analysis or investigation.

Some examples of link analysis in the context of associations between individual records in a data set include:
​
  • Identifying Key Relationships: Link analysis can be used to identify key relationships between records, such as frequent co-occurrences or shared attributes. This information can be used to understand the nature of the relationships between records and to identify key patterns or trends.
  • Cluster Analysis: Link analysis can be used to group records into clusters or subgroups based on their relationships. This can help to identify patterns or trends within the data set and to understand the structure of the data.
  • Fraud Detection: Link analysis can be used to detect patterns of fraud or other suspicious activity within a data set. By analysing the relationships between individual records, link analysis can identify anomalies or unusual patterns that may be indicative of fraudulent behaviour.
  • Customer Analysis: Link analysis can be used to analyse the behaviour of customers, such as their purchasing patterns or interactions with other customers. By analysing the relationships between individual records, link analysis can provide insights into customer behaviour and preferences.

By analysing these relationships, link analysis can provide valuable insights into the nature of the data and the patterns and trends within it.
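At its core, link analysis treats the records as nodes in a graph and their associations as edges. The sketch below, using invented interaction data, shows two of the ideas above: finding a key relationship (the most-connected record) and grouping records into a connected cluster.

```python
# A minimal sketch of link analysis: build a graph from pairwise
# associations between records, find the most-connected record (a "hub"),
# and collect a connected group. The interaction data is invented.
from collections import defaultdict

interactions = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]

graph = defaultdict(set)
for a, b in interactions:
    graph[a].add(b)
    graph[b].add(a)

# Key relationship: the record with the most links
hub = max(graph, key=lambda node: len(graph[node]))

def connected_group(start):
    """Collect every record reachable from start (a simple cluster)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node] - seen)
    return seen
```

Records A, B and C form one tightly linked group while D and E form a separate one, which is the kind of structure a fraud or customer analysis would then examine more closely.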
SECTION 18 | DEVIATION DETECTION
Deviation detection in databases refers to the process of identifying when the values in a database deviate from expected or normal values. Deviation detection is used to identify unusual or unexpected data values that may indicate errors, outliers, or other anomalies in the data. This is important in ensuring the accuracy and integrity of data and in detecting issues that may impact the quality of decisions made based on the data. Deviation detection can be performed through statistical methods, such as mean and standard deviation calculations, or through machine learning algorithms that can identify patterns and anomalies in the data. Deviation detection can be performed in real-time, as data is entered into the database, or as part of a regular data quality assessment process. The goal is to identify deviations early, so that they can be corrected or investigated before they impact decision-making processes or cause more serious issues.
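The statistical approach mentioned above can be sketched with the standard mean-and-standard-deviation test: any value more than a chosen number of standard deviations from the mean is flagged. The order counts and the threshold below are invented for illustration.

```python
# A minimal sketch of statistical deviation detection: flag values more
# than `threshold` standard deviations from the mean. Data is invented.
import statistics

def find_deviations(values, threshold=2.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

daily_orders = [98, 102, 101, 99, 100, 97, 103, 250]   # 250 looks anomalous
anomalies = find_deviations(daily_orders)
```

The spike of 250 is flagged while the normal day-to-day variation is not; in practice such a deviation would trigger an investigation into whether it is a data-entry error or a genuine (and interesting) event.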