The Queen's University of Belfast
Parallel Computer Centre
- The Relational Model
- revolutionised transaction processing systems
- DBMS gave access to the data stored
- OLTPs are good at putting data into databases
- The data explosion
- Increase in use of electronic data gathering devices e.g. point-of-sale, remote sensing devices etc.
- Data storage became easier and cheaper with increasing computing power
- DBMS gave access to the data stored but no analysis of data
- Analysis required to unearth the hidden relationships within the data i.e. for decision support
- Size of databases has increased e.g. VLDBs; automated techniques are needed for analysis as databases have grown beyond manual extraction
- typical scientific user knew nothing of commercial business applications
- the business database programmers knew nothing of massively parallel principles
- solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers
What is data mining?
the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful
Clementine User Guide
- Data mining encompasses a number of different technical approaches, such as:
- data summarization,
- learning classification rules,
- finding dependency networks,
- analysing changes, and
- detecting anomalies.
- Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
- The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
- It is possible to `strike gold' in unexpected places as the data mining software extracts patterns not previously discernible, or so obvious that no-one had noticed them before.
- Mining analogy:
- large volumes of data are sifted in an attempt to find something worthwhile
- in a mining operation large amounts of low grade materials are sifted through in order to find something of value.
Comparison of Data Mining and DBMS
- DBMS - queries based on the data held e.g.
- last month's sales for each product
- sales grouped by customer age etc.
- list of customers who lapsed their policy
- Data Mining - infer knowledge from the data held to answer queries e.g.
- what characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies?
- why is the Cleveland division so profitable?
Characteristics of a data mining system
- Large quantities of data
- volume of data so great it has to be analysed by automated techniques e.g. POS, satellite information, credit card transactions etc.
- Noisy, incomplete data
- imprecise data is characteristic of all data collection
- databases - usually contaminated by errors, cannot assume that the data they contain is entirely correct e.g. some attributes rely on subjective or measurement judgements
- Complex data structure - conventional statistical analysis not possible
- Heterogeneous data stored in legacy systems
Who needs data mining?
Who(ever) has information fastest and uses it wins
former president of Coca-Cola
- Businesses are looking for new ways to let end users find the data they need to:
- make decisions
- serve customers and
- gain the competitive edge
Philadelphia Police & Fire Credit Union
- Used data mining to maximise their membership base i.e.
- looked at the multiple relationships with members such as consumer loans, annuities, credit cards etc.
- Information Harvesters software was used to identify most and least profitable members to the organization, most attractive loan candidates etc.
- Major discovery which was counter-intuitive
- members who had filed for bankruptcy are more inclined to clear debts with the Credit Union than outside lenders
- Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
- Finance - stock market prediction, credit assessment, fraud detection etc.
- Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behaviour' etc.
- Knowledge Acquisition
- Scientific discovery - superconductivity research, etc.
- Engineering - automotive diagnostic expert systems, fault detection etc.
Data Mining Goals
- DM system learns from examples or the data how to partition or classify the data i.e. it formulates classification rules
- Example - customer database in a bank
- Question - Is a new customer applying for a loan a good investment or not?
- Typical rule formulated -
if STATUS = married and INCOME > 10000
and HOUSE_OWNER = yes
then INVESTMENT_TYPE = good
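A mined rule like the one above can be read as executable logic. A minimal sketch, using the attribute names from the slide (the applicant record and helper function name are invented for illustration):

```python
# Apply the mined classification rule to a customer record.
# Attribute names (STATUS, INCOME, HOUSE_OWNER) come from the slide;
# the data and the fallback label are illustrative only.

def classify_investment(customer):
    """Return 'good' if the customer matches the mined rule, else 'unknown'."""
    if (customer["STATUS"] == "married"
            and customer["INCOME"] > 10000
            and customer["HOUSE_OWNER"] == "yes"):
        return "good"
    return "unknown"

applicant = {"STATUS": "married", "INCOME": 25000, "HOUSE_OWNER": "yes"}
print(classify_investment(applicant))  # -> good
```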
- Rules that associate one attribute of a relation to another
- Set oriented approaches are the most efficient means of discovering such rules
- Example - supermarket database
- 72% of all the records that contain items A and B also contain item C
- the specific percentage of occurrences, 72%, is the confidence factor of the rule
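The confidence factor is a direct ratio over the records. A small sketch, assuming set-valued transactions (the transactions below are invented, chosen so the rule A, B -> C comes out at 75% rather than the 72% in the example):

```python
# confidence(A,B -> C) = count(records containing A, B and C)
#                      / count(records containing A and B)
# Transactions are invented for illustration.

transactions = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"},
    {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]

def confidence(antecedent, consequent, data):
    matches = [t for t in data if antecedent <= t]   # records with A and B
    if not matches:
        return 0.0
    return sum(1 for t in matches if consequent <= t) / len(matches)

print(confidence({"A", "B"}, {"C"}, transactions))   # 3 of 4 -> 0.75
```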
- Sequential pattern functions analyse collections of related records and detect frequently occurring patterns over a period of time
- Difference between sequence rules and other rules is the temporal factor
- Example - retailers database
- Can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
- Example - natural disasters database
- Discovery could be that when there is an earthquake in Los Angeles the next day Mount Kilimanjaro erupts
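One simple way to realise the retailer example is to count, over time-ordered purchase histories, which items appear before the target purchase. A sketch with invented histories and item names:

```python
# For each customer's time-ordered purchase history, count how often an
# item appears somewhere before a target purchase ("microwave").
# The histories are invented for illustration.

from collections import Counter

histories = [
    ["kettle", "toaster", "microwave"],
    ["toaster", "microwave"],
    ["kettle", "blender"],
]

def items_preceding(target, sequences):
    counts = Counter()
    for seq in sequences:
        if target in seq:
            # count each distinct item occurring before the target
            counts.update(set(seq[:seq.index(target)]))
    return counts

print(items_preceding("microwave", histories))  # toaster twice, kettle once
```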
Data Mining and Machine Learning
- Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about finding understandable knowledge
- Machine Learning (ML) is concerned with improving performance of an agent
- training a neural network to balance a pole is part of ML, but not of KDD
- Efficiency of the algorithm and scalability is more important in DM or KDD
- DM is concerned with very large, real-world databases
- ML typically looks at smaller data sets
- ML has laboratory type examples for the training set
- DM deals with `real world' data
- Real world data tend to have problems such as:
- missing values
- dynamic data
- pre-existing data
Data Mining Process
- Data pre-processing
- heterogeneity resolution
- data cleansing
- data warehousing
- Data Mining Tools applied
- extraction of patterns from the pre-processed data
- Interpretation and evaluation
- user bias i.e. can direct DM tools to areas of interest
- attributes of interest in databases
- goal of discovery
- domain knowledge
- prior knowledge or belief about the domain
Issues in Data Mining
- Noisy data
- Missing values
- Static data
- Sparse data
- Dynamic data
- Algorithm efficiency
- Size and complexity of data
- Set oriented database methods
- Neural networks
- Rule Induction
- Set oriented approaches/Databases
- make use of DBMSs to discover knowledge, although SQL is limiting
- can be used in several data mining stages
- data cleansing i.e. the removal of erroneous or irrelevant data known as outliers
- EDA, exploratory data analysis e.g. frequency counts, histograms etc.
- data selection - sampling facilities reduce the scale of computation
- attribute re-definition e.g. Body Mass Index, BMI, which is Weight/Height²
- data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
- data visualisation - enhances EDA and makes patterns more visible e.g. NETMAP, a commercial data mining tool, uses this technique
- Clustering i.e. Cluster Analysis
- Clustering and segmentation is basically partitioning the database so that each partition or group is similar according to some criteria or metric
- Clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of molecules
- Data mining applications make use of clustering according to similarity e.g. to segment a client/customer base
- It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
- Can be used for profile generation for target marketing i.e. where previous response to mailing campaigns can be used to generate a profile of people who responded and this can be used to predict response and filter mailing lists to achieve the best response
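Segmentation of this kind can be sketched with a simple clustering algorithm. Below is a minimal pure-Python k-means over one customer attribute, annual spend (the figures and the two starting centres are invented; real tools cluster over many attributes at once):

```python
# Minimal 1-D k-means: assign each value to the nearest centre, then
# move each centre to the mean of its cluster, and repeat.
# Spend figures and initial centres are illustrative only.

def kmeans_1d(values, centres, iterations=10):
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

spend = [120, 150, 130, 900, 950, 880]
centres, clusters = kmeans_1d(spend, centres=[100, 1000])
print(centres)   # two segment centres: low spenders vs high spenders
```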
Knowledge Representation Methods
- Neural Networks
- a trained neural network can be thought of as an "expert" in the category of information it has been given to analyse
- provides projections given new situations of interest and answers "what if" questions
- problems include:
- the resulting network is viewed as a black box
- no explanation of the results is given i.e. difficult for the user to interpret the results
- difficult to incorporate user intervention
- slow to train due to their iterative nature
A neural net can be trained to identify the risk of cancer from a number of factors
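In the spirit of that caption, a toy single-neuron (logistic) model can be trained by gradient descent to score risk from two factors. Everything here is invented for illustration: the factor names, the scaled data, and the learning rate; a real network would have hidden layers and far more data.

```python
# Toy single logistic neuron trained by per-sample gradient descent.
# Inputs are two made-up, pre-scaled risk factors; labels 0/1 are invented.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0.1, 0.2), 0), ((0.2, 0.1), 0),   # low-factor, low-risk cases
        ((0.8, 0.9), 1), ((0.9, 0.8), 1)]   # high-factor, high-risk cases

random.seed(0)
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
b = 0.0
for _ in range(2000):
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - y                     # gradient of log-loss wrt z
        w[0] -= 0.5 * err * x1
        w[1] -= 0.5 * err * x2
        b -= 0.5 * err

print(sigmoid(w[0] * 0.85 + w[1] * 0.85 + b) > 0.5)  # high factors -> high score
```

The trained weights are exactly the "black box" the bullet list warns about: the model scores new cases but offers no human-readable explanation.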
- Decision trees
- used to represent knowledge
- built using a training set of data and can then be used to classify new objects
- problems are:
- opaque structure - difficult to understand
- missing data can cause performance problems
- they become cumbersome for large data sets
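Classifying with an already-built tree is a walk from root to leaf. A sketch with the tree represented as nested tuples (the attributes, values, and labels are invented, loosely echoing the policy lapse/renewal example later in these notes):

```python
# A decision tree as (attribute, {value: subtree-or-label}); a plain
# string is a leaf carrying the class label. Tree and records invented.

tree = ("HOUSE_OWNER", {
    "yes": ("INCOME_BAND", {"high": "renew", "low": "lapse"}),
    "no": "lapse",
})

def classify(tree, record):
    if isinstance(tree, str):      # leaf: return the class label
        return tree
    attribute, branches = tree
    return classify(branches[record[attribute]], record)

print(classify(tree, {"HOUSE_OWNER": "yes", "INCOME_BAND": "high"}))  # renew
```

Note how a missing attribute value in a record would raise a KeyError here, which is the "missing data can cause performance problems" point in miniature.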
- Rules
- probably the most common form of representation
- tend to be simple and intuitive
- unstructured and less rigid
- problems are:
- difficult to maintain
- inadequate to represent many types of knowledge
- Frames
- templates for holding clusters of related knowledge about a very particular subject
- a natural way to represent knowledge
- has a taxonomy approach
- problem is
- more complex than rule representation
- Data Warehousing
- On-line Analytical Processing, OLAP
- A data warehouse can be defined as any centralised data repository which can be queried for business benefit
- warehousing makes it possible to
- extract archived operational data
- overcome inconsistencies between different legacy data formats
- integrate data throughout an enterprise, regardless of location, format, or communication requirements
- incorporate additional or expert information
Characteristics of a data warehouse
defined by Bill Inmon (IS guru)
- subject-oriented - data organized by subject instead of application e.g.
- an insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)
- contains only the information necessary for decision support processing
- integrated - encoding of data is often inconsistent e.g.
- gender might be coded as "m" and "f" or 0 and 1 but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
- time-variant - the data warehouse contains a place for storing data that are five to 10 years old, or older e.g.
- this data is used for comparisons, trends, and forecasting
- these data are not updated
- non-volatile - data are not updated or changed in any way once they enter the data warehouse
- data are only loaded and accessed
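The "integrated" characteristic above amounts to mapping each operational encoding onto one warehouse convention as data is loaded. A small sketch (the particular source encodings and the "U"-for-unknown convention are invented):

```python
# Normalise inconsistent operational encodings of gender onto a single
# warehouse convention. The mapping table is illustrative only.

GENDER_MAP = {"m": "M", "f": "F", "0": "M", "1": "F",
              "male": "M", "female": "F"}

def normalise_gender(raw):
    # "U" marks values no source convention accounts for
    return GENDER_MAP.get(str(raw).strip().lower(), "U")

source_rows = ["m", "F", 0, 1, "Female", "?"]
print([normalise_gender(r) for r in source_rows])  # ['M', 'F', 'M', 'F', 'F', 'U']
```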
- insulate data - i.e. the current operational information
- preserves the security and integrity of mission-critical OLTP applications
- gives access to the broadest possible base of data
- retrieve data - from a variety of heterogeneous operational databases
- data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)
- metadata - information describing the model and definition of the source data elements
- data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.
- transfer - processed data transferred to the data warehouse, a large database on a high performance box
Structure of a data warehouse
- a central store against which the queries are run
- uses very simple data structures with very few assumptions about the relationships between data
- a data mart is a small warehouse which provides subsets of the main store, summarised information
- depending on the requirements of a specific group/department
- marts often use multidimensional databases which can speed up query processing as they can have data structures which reflect the most likely questions
Data Warehouse model
Structure of data inside the data warehouse
An example of levels of summarization of data
Criteria for a data warehouse
- Load Performance
- require incremental loading of new data on a periodic basis
- must not artificially constrain the volume of data
- Load Processing
- data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update
- Data Quality Management
- ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size
- Query Performance
- must not be slowed or inhibited by the performance of the data warehouse RDBMS
- Terabyte Scalability
- Data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations. It must support modular and parallel management.
- Mass User Scalability
- Access to warehouse data must not be limited to an elite few; the warehouse has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
- Networked Data Warehouse
- Data warehouses rarely exist in isolation, users must be able to look at and work with multiple warehouses from a single client workstation
- Warehouse Administration
- large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
- The RDBMS must Integrate Dimensional Analysis
- dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools
- Advanced Query Functionality
- End users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data
Problems with data warehousing
- the rush of companies to jump on the bandwagon as
these companies have slapped `data warehouse' labels on traditional transaction-processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category
Chris Erickson, Red Brick
Data warehousing & OLTP
Similarities and Differences
- OLTP systems designed to maximise transaction capacity but they:
- cannot be repositories of facts and historical data for business analysis
- cannot quickly answer ad hoc queries
- rapid retrieval is almost impossible
- data is inconsistent and changing, duplicate entries exist, entries can be missing
- OLTP offers large amounts of raw data which is not easily understood
- Typical OLTP query is a simple aggregation e.g.
- what is the current account balance for this customer?
Data warehouse systems
- Data warehouses are interested in query processing as opposed to transaction processing
- Typical business analysis query e.g.
- which product line sells best in middle-America and how does this correlate to demographic data?
On-line Analytical Processing
- Problem is how to process larger and larger databases
- OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
- Fast response is crucial in OLAP
- Difference between OLAP and OLTP
- OLTP servers handle mission-critical production data accessed through simple queries
- OLAP servers handle management-critical data accessed through an iterative analytical investigation
Common analytical operations
- Consolidation - involves the aggregation of data i.e. simple roll-ups or complex expressions involving inter-related data
- e.g. sales offices can be rolled-up to districts and districts rolled-up to regions
- Drill-Down - can go in the reverse direction i.e. automatically display detail data which comprises consolidated data
- "Slicing and Dicing" - ability to look at the data base from different viewpoints e.g.
- one slice of the sales database might show all sales of product type within regions;
- another slice might show all sales by sales channel within each product type
- often performed along a time axis in order to analyse trends and find patterns
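These operations can be sketched over a tiny fact table. The office-to-district hierarchy, product names, and sales figures below are invented; real OLAP servers do this over pre-computed multidimensional structures rather than row scans:

```python
# Consolidation (roll-up) and slicing over a small fact table.
# Offices, districts, products, and figures are illustrative only.

from collections import defaultdict

facts = [  # (office, product, sales)
    ("Belfast", "auto", 100), ("Belfast", "life", 50),
    ("Derry", "auto", 70), ("London", "life", 200),
]
district = {"Belfast": "NI", "Derry": "NI", "London": "GB"}

def rollup_by_district(rows):
    """Consolidation: aggregate office-level sales up to districts."""
    totals = defaultdict(int)
    for office, _product, sales in rows:
        totals[district[office]] += sales
    return dict(totals)

def slice_by_product(rows, product):
    """One slice of the cube: all rows for a single product type."""
    return [r for r in rows if r[1] == product]

print(rollup_by_district(facts))        # {'NI': 220, 'GB': 200}
print(slice_by_product(facts, "auto"))  # the Belfast and Derry auto rows
```

Drill-down is simply the reverse mapping: starting from a district total and listing the office-level rows that make it up.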
Expert systems using data mining
- Expert systems are models of real world processes
- Much of the information is available straight from the process e.g.
- in production systems, data is collected for monitoring the system
- knowledge can be extracted using data mining tools
- experts can verify the knowledge
- TIGON project - detection and diagnosis of an industrial gas turbine engine
- Most significant development was retooling database software for massively parallel (MPP) environments
- Parallel processors can easily assign small, independent transactions to different processors.
- With more processors, more transactions can be executed without reducing throughput
- same concept applies to executing multiple independent SQL statements i.e. a set of SQL statements can be broken up and allocated to different processors to increase speed
- Multiple data streams allow several operations to proceed simultaneously e.g.
- a customer table, can be spread across multiple disks, and independent threads can search each subset of the customer data
- data is partitioned into multiple subsets and performance is increased, the I/O subsystems feed data from the disks to the appropriate threads or streams
- An essential part of designing a database for parallel processing is the partitioning scheme
- large databases are indexed - independent indexes must also be partitioned to maximize performance
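The partitioned-scan idea above can be sketched with threads standing in for independent I/O streams. The customer table, the predicate, and the four-way split are all invented for illustration; a real parallel DBMS partitions across disks and processors, not Python threads:

```python
# Partition a table into subsets and search each subset in its own
# thread, then merge the results. Data and predicate are illustrative.

from concurrent.futures import ThreadPoolExecutor

customers = [{"id": i, "balance": i * 10} for i in range(1000)]

def search(partition, predicate):
    return [c for c in partition if predicate(c)]

def parallel_search(table, predicate, partitions=4):
    size = len(table) // partitions
    chunks = [table[i * size:(i + 1) * size] for i in range(partitions - 1)]
    chunks.append(table[(partitions - 1) * size:])   # remainder in last chunk
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(search, chunks, [predicate] * partitions)
    return [c for part in results for c in part]

rich = parallel_search(customers, lambda c: c["balance"] > 9900)
print(len(rich))   # the few customers over the threshold
```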
- Commercial Developments
- Oracle was first to market a parallel database ORACLE7 RDBMS
- Red Brick has a strong showing - VPT is a DBMS tuned for data warehouse applications
- IBM is still the world's largest producer of database management software; 80% of the FORTUNE 500, including the top 100 companies, rely on the DB2 database
- INFORMIX is on-line with 8.0
- Sybase and System 10
- Information Harvester - software on the Convex Exemplar; market researchers in retail, insurance, financial and telecommunications firms will be able to analyse large data sets in a short time.
Information Harvester Inc.
- Founded 1994, based in Cambridge, Mass
- Makes use of conventional statistical analysis techniques by building upon a proprietary tree-based learning algorithm
- generates expert-system-like rules from datasets, initially presented in forms such as numbers, dates, categories, codes, etc.
- Examples of use:
- Healthcare - Michael Reese Medical Associates (MRMA) employed data mining software from Information Harvesting and Vantage Point as a tool for gaining advantage in contract negotiations
- Finance - The Philadelphia Police and Fire Federal Credit Union used data mining to maximize their membership base
Red Brick Company
- California based, specializes in products used for fast, accurate business decisions on large client/server databases
- VPT - Very large data warehouse support, Parallel query processing, Time based data management
- database server - SQL with decision support extensions
- TMU (table management utility) - transforms data to a warehouse schema
- gateway technology supporting client/server access to the warehouse
- Examples of use:
- H.E.B.- Category management in retailing
- Hewlett-Packard: "Discovering" Data To Manage Worldwide Support
- Reference - http://www.redbrick.com
- IBM provides a number of decision support tools giving a powerful but easy-to-use interface to the data warehouse
- IBM Information Warehouse Solutions - a choice of decision support tools that best meet the needs of the end users
- Customer Partnership Program e.g.
- Visa and IBM announced an agreement in May 1995 signalling their intention to work together
- changes the way in which Visa and its member banks exchange information worldwide i.e. the proposed structure will facilitate the timely delivery of information and critical decision support tools directly to member financial institutions' desktops worldwide
- Reference - http://www.ibm.com
Data mining projects
UU - Jordanstown
- Data mining in the N.Ireland Housing Executive
- Knee disorders classification
- Fault diagnosis in a telecommunication network
- A self-learning urology patient audit system
- Policy lapse/renewal prediction
- House price prediction
Policy lapse/renewal prediction
- Problem - predicting whether a motor insurance policy will be lapsed or renewed
- 34 attributes stored for each policy
- 14 attributes were deemed relevant
- 2 attributes were derived from underlying attributes
- Predictive accuracy
- In a period of 3 weeks, the same accuracy was achieved as statistical models developed by the insurance company, which had taken much longer to develop
The Mining Kernel System
- Based on the interdisciplinary approach of data mining
- Data pre-processing functionality i.e.
- statistical operations for removing outliers
- data dimensionality reduction
- dealing with missing data
- Algorithms provided for
- Facility to specify what is interesting to the user and to present only interesting rules
- Facility to incorporate domain knowledge into the knowledge discovery process
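One of the pre-processing steps listed above, removing outliers, can be sketched with a simple z-score rule. The threshold, the readings, and the method are illustrative assumptions; the Mining Kernel System's actual statistical operations are not specified here:

```python
# Drop values more than z_max standard deviations from the mean.
# Threshold and data are invented for illustration.

def remove_outliers(values, z_max=2.0):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= z_max]

readings = [10, 11, 9, 10, 12, 95]   # 95 is an obvious outlier
print(remove_outliers(readings))     # the 95 is dropped
```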
- Data mining has a lot of potential
- Diversity in the field of application
- Estimated market for data mining is $500 million
All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
Generated with CERN WebMaker