Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.

The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). Relationships found this way may hold within the sample yet have no statistical significance in the wider population. These techniques can, however, be used to create new hypotheses to test against the larger data populations.

Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[1] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)

A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as Choice Modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.
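One simple way to spot the kind of interrelation that leads to collinearity is to compute the correlation between pairs of variables before mining. The sketch below is only an illustration: the variable names and values are hypothetical, and the Pearson coefficient is one choice among several possible diagnostics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations of shopping behaviour (invented for illustration).
visits = [1, 2, 3, 4, 5, 6]
items_bought = [2, 4, 6, 8, 10, 12]      # moves in lockstep with visits
basket_value = [30, 12, 45, 20, 38, 25]  # only weakly related to visits

r_collinear = pearson(visits, items_bought)
r_other = pearson(visits, basket_value)
print(round(r_collinear, 3), round(r_other, 3))
```

A coefficient close to 1 or -1 suggests that two variables carry nearly the same information, so a model fitted on both may attribute effects to either one arbitrarily.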

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like the R Project, Weka, KNIME, RapidMiner and others have become an informal standard for defining data-mining processes. Notably, all these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that these can be shared between different statistical applications.[2] PMML is an XML-based language developed by the Data Mining Group (DMG),[3] an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.[3][4][5]

Research and evolution

In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey which traces and analyzes this evolution.[6]

The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD).[citation needed] Since 1989 they have hosted an annual international conference and published its proceedings,[7] and since 1999 have published a biannual academic journal titled "SIGKDD Explorations".[8] Other Computer Science conferences on data mining include:

Process

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a datamart or data warehouse. Pre-processing is essential for analysing multivariate datasets before clustering or data mining.

The target set is then cleaned. Cleaning removes the observations with noise and missing data.
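The cleaning step can be sketched as follows. This is a minimal illustration rather than a prescribed method: the record layout (dicts with hypothetical `height_cm` and `weight_kg` fields), the use of `None` for missing values, and the plausible-range limits standing in for noise detection are all assumptions made for the example.

```python
# Hypothetical plausible ranges; real limits depend on the domain.
PLAUSIBLE_RANGES = {"height_cm": (50.0, 250.0), "weight_kg": (2.0, 300.0)}


def is_clean(record):
    """Return True if every expected field is present and within range."""
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:  # missing data
            return False
        if not (low <= value <= high):  # crude stand-in for noise detection
            return False
    return True


def clean(records):
    """Keep only the observations that pass is_clean."""
    return [r for r in records if is_clean(r)]


raw = [
    {"height_cm": 172.0, "weight_kg": 68.5},
    {"height_cm": None, "weight_kg": 80.0},    # missing value
    {"height_cm": 1720.0, "weight_kg": 70.0},  # implausible reading
    {"height_cm": 160.0, "weight_kg": 55.0},
]
cleaned = clean(raw)
print(len(cleaned), "of", len(raw), "observations kept")
```

Dropping whole observations is the simplest policy and matches the description above; in practice, discarding too many rows can itself make the target set less representative.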

The clean data are reduced into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The features selected will depend on the objectives; selecting the "right" features is fundamental to successful data mining.
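The reduction described above can be illustrated with a toy sketch. It does not perform real face detection: it assumes, purely for illustration, that the eyes lie in the upper-left and upper-right quadrants and the mouth in the lower half, and it uses the centroid of black pixels in each region as the location code. The image contents are invented.

```python
SIZE = 100  # image is SIZE x SIZE pixels, each 0 (white) or 1 (black)


def centroid(image, rows, cols):
    """Mean (row, col) of black pixels inside the given region, or None."""
    points = [(r, c) for r in rows for c in cols if image[r][c] == 1]
    if not points:
        return None
    n = len(points)
    return (sum(r for r, _ in points) // n, sum(c for _, c in points) // n)


def feature_vector(image):
    """Three location codes: left eye, right eye, mouth (assumed regions)."""
    half = SIZE // 2
    return [
        centroid(image, range(0, half), range(0, half)),     # upper left
        centroid(image, range(0, half), range(half, SIZE)),  # upper right
        centroid(image, range(half, SIZE), range(0, SIZE)),  # lower half
    ]


# Toy image with three dark blobs standing in for eyes and mouth.
image = [[0] * SIZE for _ in range(SIZE)]
for r, c in [(30, 30), (30, 31), (30, 70), (30, 71),
             (75, 49), (75, 50), (75, 51)]:
    image[r][c] = 1

raw_values = SIZE * SIZE          # one bit per pixel
features = feature_vector(image)  # three (row, col) codes
print(raw_values, features)
```

Each observation shrinks from `raw_values` one-bit pixels to three coordinate pairs, which is the size reduction the paragraph describes.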

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
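A minimal sketch of this division is shown below. The 25% test fraction, the fixed random seed, and the toy vectors are assumptions chosen for the example; the text does not prescribe any particular proportion.

```python
import random


def split(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the vectors and split it into (training, test)."""
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


vectors = [(i, i % 2) for i in range(20)]  # toy feature vectors
training_set, test_set = split(vectors)
print(len(training_set), len(test_set))
```

Shuffling before splitting avoids an accidental ordering in the data (for example, records sorted by date) ending up entirely in one of the two sets.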

Data mining

Data mining commonly involves four classes of tasks:[11]

See also structured data analysis.

Results validation

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data which the data mining algorithm was not trained on. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails which it had not been trained on. The accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves, which plot the true positive rate (sensitivity) against the false positive rate (1 − specificity) of a binary classifier as its discrimination threshold is varied.
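The spam example can be sketched as follows. The "learnt pattern" here is a deliberately naive hypothetical rule (flag any email containing the word "free"), and the test emails are invented; the point is only to show how accuracy, the true positive rate and the false positive rate are computed on data held out from training.

```python
def evaluate(predicted, actual):
    """Return (accuracy, true positive rate, false positive rate).

    Labels are booleans; True means "spam".
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity
    fpr = fp / (fp + tn) if fp + tn else 0.0  # 1 - specificity
    return accuracy, tpr, fpr


def looks_like_spam(text):
    """Hypothetical learnt pattern: flag emails mentioning 'free'."""
    return "free" in text.lower()


# Invented held-out test set: (email text, is it really spam?)
test_emails = [
    ("Win a FREE prize now", True),
    ("Meeting moved to 3pm", False),
    ("Free shipping on your order", False),
    ("Cheap pills, no prescription", True),
    ("Claim your free gift", True),
]

predicted = [looks_like_spam(text) for text, _ in test_emails]
actual = [label for _, label in test_emails]
accuracy, tpr, fpr = evaluate(predicted, actual)
print(accuracy, tpr, fpr)
```

The pair (`fpr`, `tpr`) is a single point on a ROC curve; a classifier that outputs a score rather than a yes/no answer yields the full curve by repeating this calculation at each threshold.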

If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards then the final step is to interpret the learnt patterns and turn them into knowledge.

Notable uses

Games

Since the early 1960s, the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), has opened up a new area for data mining. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

The above information uses material from Wikipedia and is licensed under the GNU Free Documentation License.