7.7 DATABASE WAREHOUSING

DATABASES
7.6 DATABASE WAREHOUSING

This section looks at some of the more advanced terminology and theory used in IB and A-Level Computer Science.

DATA WAREHOUSING

A data warehouse is a host of data from various sources, the data inflow must abide be the data warehouse rules and cannot be changed once stored in the warehouse. Whilst multiple backups may be made the data warehouse should become a single point of access to clean/quality data. A data warehouse would generally follow these 5 rules.

1: Subject oriented - Broken down in to subject specific areas.
2: Integrated - Follows common rules for data conventions
3: Time-Variant - Contains historical data and a means to query data by date.
4: Non-Volatile - Cannot be changed once stored in the data warehouse.
5: Summarised - The data should be summarised ready for easy analysis.

Data warehousing is the process of collecting, storing, and managing large amounts of data from multiple sources for the purpose of creating a centralized repository for decision-making and business intelligence. It involves creating a data warehouse that integrates data from different sources, such as transactional databases, to create a single, unified view of the data. The data is then organized and optimized for fast querying and analysis, allowing organizations to make informed decisions based on data-driven insights.

Data warehousing is suitable for a range of situations where large amounts of data need to be analyzed for business intelligence and decision-making purposes. Some examples include:

Business reporting: Data warehouses are often used to support business reporting by providing a centralized repository for data from multiple sources.
Data integration: Data warehouses are used to integrate data from disparate sources into a single, unified view of the data.
Data mining: Data warehouses enable organizations to perform data mining, or the process of uncovering hidden patterns and relationships in large amounts of data.
Customer behavior analysis: Retailers, for example, may use a data warehouse to analyze customer purchasing patterns, preferences and behavior.
Financial analysis: Financial institutions may use a data warehouse to analyze large amounts of financial data to identify trends and make informed decisions.
Supply chain management: Supply chain managers may use a data warehouse to analyze supply chain data to identify areas for improvement and make decisions that increase efficiency and reduce costs.
Marketing analysis: Marketing departments may use a data warehouse to analyze customer data to understand market trends and target customers more effectively.

CHARACTERISTICS OF DIFFERENT DATABASE MODELS

RELATIONAL DATABASE MODEL
Characteristics: Uses tables to represent data and relationships between data, enforces rules of normalization, and supports SQL as the query language.
Example: MySQL, Oracle, Microsoft SQL Server

OBJECT ORIENTED DATABASE MODEL
Characteristics: Uses objects and classes to model data, supports inheritance and encapsulation, and has methods to interact with data.
Example: MongoDB, CouchDB, Oracle NoSQL

NETWORK DATABASE MODEL
Characteristics: Uses a graph structure to represent data and relationships, allows many-to-many relationships, and uses CODASYL database language.
Example: Integrated Data Store (IDS)

SPATIAL DATABASE MODEL
Characteristics: Specializes in storing and querying data about geographic locations, has specific functions to handle spatial data such as distance, area, and intersection.
Example: PostGIS, SQL Server Spatial, Oracle Spatial

MULTI DIMENSIONAL DATABASE MODEL
Characteristics: Uses a multi-dimensional data structure, often in the form of a cube, to model and query data, supports OLAP (Online Analytical Processing) operations.
Example: Oracle Essbase, Microsoft Analysis Services, SAP HANA.

MULTI-DIMENSIONAL DATA WEAREHOUSING

Multi-dimensional data warehousing refers to the design and implementation of data warehouses that are optimized to store and analyze multi-dimensional data. This type of data typically includes measures and dimensions, such as time, geography, products, and customers, which can be used to explore patterns, trends, and relationships within the data. Multi-dimensional data warehousing allows organizations to analyze complex data in an efficient and flexible way, making it easier to gain insights and make informed decisions. The architecture typically includes a relational database, an OLAP (Online Analytical Processing) cube, and a front-end tool, such as a reporting or business intelligence tool, that can be used to interact with the data and create meaningful reports and visualizations.

DATA SHARING

Data sharing in the context of data warehousing refers to the practice of making data available to multiple users, systems, or organizations. The goal of data sharing is to increase data accessibility and improve decision-making by making data more readily available to those who need it. Data sharing can be achieved through various means, including data warehousing, data lakes, or cloud-based data storage solutions. Data sharing can also be facilitated through data exchange and integration, such as the use of common data models and data exchange protocols, or the use of data integration tools. Data sharing can help organizations to increase efficiency, reduce data duplication, and improve collaboration, but it also raises security and privacy concerns, and it requires careful management and governance to ensure that data is used correctly and in compliance with relevant laws and regulations.

DEVIATION DETECTION

Deviation detection in databases refers to the process of identifying when the values in a database deviate from expected or normal values. Deviation detection is used to identify unusual or unexpected data values that may indicate errors, outliers, or other anomalies in the data. This is important in ensuring the accuracy and integrity of data and in detecting issues that may impact the quality of decisions made based on the data. Deviation detection can be performed through statistical methods, such as mean and standard deviation calculations, or through machine learning algorithms that can identify patterns and anomalies in the data. Deviation detection can be performed in real-time, as data is entered into the database, or as part of a regular data quality assessment process. The goal is to identify deviations early, so that they can be corrected or investigated before they impact decision-making processes or cause more serious issues.

DDL AND DML

DDL and DML are two types of SQL (Structured Query Language) statements used in databases.

DDL (Data Definition Language) statements are used to define the database schema (the structure), create, alter or drop database objects such as tables, views, indexes, and procedures. Examples of DDL statements include CREATE TABLE, ALTER TABLE, DROP VIEW, and CREATE INDEX. DDL statements are executed to modify the structure of a database and do not modify or manipulate the data stored in the database.

DML (Data Manipulation Language) statements are used to modify or manipulate the data stored in a database(the content). Examples of DML statements include SELECT, INSERT, UPDATE, and DELETE. DML statements are executed to insert, update, delete or retrieve data from the database and do not modify the database schema.

In summary, DDL statements are used to manage the structure of a database, while DML statements are used to manage the data stored in a database.

TIME DEPENDENCY OF DATA WAREHOUSING

Data warehousing is time dependent because it involves the storage and analysis of historical data, which changes over time. The data in a data warehouse is usually updated at regular intervals, such as daily, weekly or monthly, and the analysis of this data is usually done over a specific period, such as a month, quarter or year. The time aspect is critical in data warehousing as it enables organizations to track changes and trends in their data over time, and make informed decisions based on this historical analysis. Additionally, data warehousing often involves the use of time-based dimensions, such as date, time and fiscal periods, to enable multi-dimensional analysis and to facilitate the aggregation and comparison of data over time.

REAL TIME UPDATES

Updating data in a data warehouse in real-time involves continuously incorporating new data into the warehouse as it is generated. This requires a fast, efficient, and reliable method for capturing, transforming, and loading the data into the warehouse.

There are several ways to achieve real-time updates in a data warehouse, including:

Stream processing: This involves capturing data in real-time as it is generated, transforming it on the fly, and loading it directly into the warehouse. This method is suitable for high-velocity data streams, such as log data, social media feeds, and IoT sensor data.
Change data capture (CDC): This involves monitoring source systems for changes and capturing only the changes as they occur, rather than reading the entire source data set. The changes are then transformed and loaded into the warehouse.
Batch processing with near real-time updates: This involves processing data in batch mode and updating the warehouse periodically, but with a minimal delay. This method is suitable for large data sets that can be processed in batch mode, such as sales data or financial data.

The choice of method for real-time updates in a data warehouse will depend on the volume, velocity, and complexity of the data, as well as the requirements of the business. In some cases, a combination of methods may be used to achieve the desired level of real-time updates.

DATA WAREHOUSING ADVANTAGES AND DISADVANTAGES

Data warehousing provides several advantages, including:

Improved Decision Making: Data warehousing enables organizations to store and analyse vast amounts of historical data, providing insights that can be used to make informed business decisions.
Centralized Data: By centralizing data from multiple sources, data warehousing makes it easier for organizations to access and analyze their data. This reduces the time and effort required to consolidate data from disparate sources, improving the accuracy and reliability of the data.
Data Consistency: Data warehousing enforces data standards, ensuring that data is consistent across the organization. This eliminates the need for manual data reconciliation, reducing the risk of errors and improving the quality of the data.
Data Integration: Data warehousing integrates data from different sources, enabling organizations to analyze data that would otherwise be difficult to access. This enables organizations to gain a more comprehensive view of their operations, providing valuable insights into business performance.
Scalability: Data warehousing can handle large volumes of data, making it possible for organizations to store and analyze data as their business grows. This allows organizations to keep pace with their changing data requirements, ensuring that their data management systems remain effective.
Improved Performance: Data warehousing utilizes specialized technologies, such as data marts and OLAP cubes, to improve the performance of data analysis. This enables organizations to quickly and efficiently analyse their data, improving the speed and accuracy of decision making.

Data warehousing also has several disadvantages, including:

High Cost: Data warehousing can be a costly undertaking, requiring significant investments in hardware, software, and personnel. This can be a challenge for smaller organizations or organizations with limited budgets.
Complexity: Data warehousing can be complex, requiring specialized skills and knowledge to implement and maintain. This can make it challenging for organizations to manage their data warehousing systems, leading to errors and reduced data quality.
Data Latency: Data warehousing relies on batch processing to update the warehouse, which can result in data latency. This means that the data in the warehouse may not reflect the most current state of the business, reducing its usefulness for real-time decision making.
Data Integration Challenges: Integrating data from multiple sources can be challenging, particularly if the data has different structures or formats. This can result in errors, data quality issues, and a need for extensive data cleaning and transformation.
Maintenance Requirements: Data warehousing requires ongoing maintenance, including data backup and recovery, performance tuning, and hardware upgrades. This can be time-consuming and costly, and can take resources away from other critical business tasks.
Data Privacy and Security: Data warehousing can raise privacy and security concerns, as it requires storing and analyzing large amounts of sensitive data. This can result in the risk of data breaches and the unauthorized access of confidential information.

What is data warehousing and how does it differ from a traditional database?
How is data collected and integrated in a data warehouse?
What are the different types of data warehousing architectures and their advantages/disadvantages?
What are the benefits of using a data warehouse in an organization?
How is data stored and organized in a data warehouse and why is it important?
What is dimensional modeling and how is it used in data warehousing?
What is the role of ETL (extract, transform, load) in data warehousing?
How is data security maintained in a data warehouse environment?
What are some common data warehousing tools and technologies used for data analysis?
How is data warehousing used for business intelligence and decision-making?

KEY TERMS

DATA CONSISTENCY: Ensuing data within a category is of the same format.
DATA INTEGRITY: Ensuring the data is correct, relevant, accurate and up to date.
ISOLATION: Isolates the changes made by a transaction from the rest of the database until the transaction is committed.
TRANSACTION: A modification made to the data within a database, often referred to as a 'unit of work performed'.
DATA DICTIONARY: A document containing details of the database, such as access rights, field types and parameters, table constraints.
ATOMIC DATA: Data is broken down into it smallest parts. For example fullName should be - firstName, middleName, lastName.
ACID: An acronym standing for Atomicity, Consistency, Isolation and Durability.
ATOMICITY: A rule that dictates that either the entire transaction is made or none at all, in other words, no partial transaction should be made, if one part fails it all fails.
DEVIATION DETECTION: The process of identifying when the values in a database deviate from expected or normal values.
DATA SHARING: The practice of making data available to multiple users, systems, or organizations.
DDL: Data Definition Language statements are used to manage the structure of a database
DML: Data Manipulation Language statements are used to manage the data stored in a database.

USEFUL LINKS

Java T Point - A great site that gives in-depth explanations of many database principles (IB and A-Level)
Free MySQL Hosting - Host your DB online