Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.
The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used in the creation of new hypotheses to test against the larger data populations.
Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology has increased data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s).
Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[1] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as Choice Modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.
There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like the R Project, Weka, KNIME, RapidMiner and others have become an informal standard for defining data-mining processes. Notably, all these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that these can be shared between different statistical applications.[2] PMML is an XML-based language developed by the Data Mining Group (DMG),[3] an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.[3][4][5]
Research and evolution
In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey which traces and analyses this evolution.[6]
The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[7] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[8] Other Computer Science conferences on data mining include:
- DMIN - International Conference on Data Mining;[9]
- DMKD - Research Issues on Data Mining and Knowledge Discovery;
- ECML-PKDD - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases;
- ICDM - IEEE International Conference on Data Mining;[10]
- MLDM - Machine Learning and Data Mining in Pattern Recognition;
- SDM - SIAM International Conference on Data Mining;
- EDM - International Conference on Educational Data Mining;
- ECDM - European Conference on Data Mining;
- PAKDD - The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Process
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse. Pre-processing is essential for analysing multivariate datasets before clustering or data mining.
The target set is then cleaned. Cleaning removes the observations with noise and missing data.
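As a rough illustration of the cleaning step, the following sketch drops every observation that has a missing value. The `clean` helper and the sample height/weight records are hypothetical, and the sketch handles only missing data; detecting noisy observations would need domain-specific rules that the text does not specify.

```python
import math


def clean(observations):
    """Keep only the observations that have no missing (None or NaN) values."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [obs for obs in observations if not any(is_missing(v) for v in obs)]


# Hypothetical raw observations: (height in metres, weight in kilograms).
raw = [
    (1.70, 65.0),
    (1.82, None),          # missing weight
    (float("nan"), 72.5),  # missing height
    (1.65, 58.2),
]

cleaned = clean(raw)
print(cleaned)
```

Discarding whole observations is the simplest policy; it is an assumption of this sketch, and other policies (such as filling in missing values) are equally possible.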
The clean data are reduced into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data (at one bit per pixel). This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The features selected will depend on the objectives; selecting the right features is fundamental to successful data mining.
The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
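The division into a training set and a test set can be sketched as below. The function name `train_test_split`, the 25% test fraction and the fixed random seed are assumptions made for this illustration; the text does not prescribe any particular proportion.

```python
import random


def train_test_split(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the feature vectors and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(vectors)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical feature vectors, one per observation.
vectors = [(i, i * i) for i in range(20)]
training_set, test_set = train_test_split(vectors)
print(len(training_set), len(test_set))
```

Shuffling before splitting avoids a test set made up only of the last observations collected, which could differ systematically from the rest.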
Data mining
Data mining commonly involves four classes of tasks:[11]
- Clustering - is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
- Classification - is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
- Regression - Attempts to find a function which models the data with the least error.
- Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
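The market basket example in the last item can be illustrated with a minimal sketch that counts item pairs and keeps the rules that pass two thresholds. This is a brute-force count over pairs only, not a production algorithm; the function name `pair_rules`, the baskets and the threshold values are all hypothetical. Here support is the fraction of baskets containing both items, and confidence is the fraction of baskets containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5, min_confidence=0.7):
    """Return sorted rules (antecedent, consequent, support, confidence) over item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets only the rule "milk -> butter" passes both thresholds: every basket that contains milk also contains butter, while the reverse holds in only two of three cases.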
Results validation
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
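The spam example can be made concrete with a small sketch that trains on one set of emails and measures accuracy on another. It uses a one-nearest-neighbour classifier purely as a stand-in for "a data mining algorithm"; the feature choice (number of links, number of exclamation marks) and all of the data are hypothetical.

```python
def nearest_neighbour_predict(training_set, vector):
    """Return the label of the training example closest to `vector`."""
    def squared_distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, vector))

    return min(training_set, key=squared_distance)[1]


def accuracy(training_set, evaluation_set):
    """Fraction of the evaluation examples whose label is predicted correctly."""
    correct = sum(
        1
        for features, label in evaluation_set
        if nearest_neighbour_predict(training_set, features) == label
    )
    return correct / len(evaluation_set)


# Hypothetical feature vectors: ((links, exclamation marks), label).
training_set = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((0, 1), "legitimate"),
    ((5, 4), "spam"),
    ((6, 3), "spam"),
    ((4, 6), "spam"),
]
test_set = [
    ((1, 1), "legitimate"),
    ((5, 5), "spam"),
    ((2, 2), "spam"),  # a spam email that resembles legitimate mail
    ((7, 2), "spam"),
]

train_accuracy = accuracy(training_set, training_set)
test_accuracy = accuracy(training_set, test_set)
print(train_accuracy, test_accuracy)
```

The classifier reproduces its training set perfectly but misclassifies one of the four unseen emails, which is why accuracy is measured on data the algorithm was not trained on.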
If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards then the final step is to interpret the learnt patterns and turn them into knowledge.
Notable uses
Games
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess) with any beginning configuration, small-board dots-and-boxes, small-board-hex, and certain endgames in chess, dots-and-boxes, and hex, a new area for data mining has been opened up: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.