DATA MINING AND STATISTICAL METHODS USED FOR SCANNING CATEGORICAL DATA

Data mining uncovers patterns in data using predictive techniques. These patterns play a critical role in decision making because they reveal areas for process improvement. Statistical techniques such as the chi-square test for association are widely used in the medical field. Yet interpreting some of the results obtained with these techniques can be very difficult. The association is often non-linear, which can mask the most informative part of the analysis. In this research a new approach is adopted: the raw data are scanned for any possible association (linear or non-linear). Data mining methods and statistical inference are the base tools of this work.


INTRODUCTION
In the cross-classification of categorical data, researchers are usually interested in searching the cross-classification table for any potential relationship between groups of the cross-classified variables. Such a relationship is statistically denoted as an association.
Data mining is often concerned with the meaning and quality of the information embedded in a given set of data. Ideas from information measurement and statistical analysis are merged here in order to establish a linkage between these two tools [1].
Such a linkage gives users a straightforward interpretation that unmasks the type and degree of association, and hence enhances the meaning of the results obtained.
Most analysts separate data mining software into two groups: data mining tools and data mining applications. Data mining tools provide a number of techniques that can be applied to any business problem. Whether or not we are aware of them, data mining applications influence our daily lives. For example, almost every financial transaction is processed by a data mining application to detect fraud. Both data mining tools and data mining applications are valuable, however. An instantiation of any subset of the variables in X is referred to as an event pattern. In this research, only discrete, finite, multi-valued random variables are considered. Using multi-valued discrete variables to represent a physical phenomenon, a concept, or an object is common in a variety of fields such as medicine, business and economics. For example, in a medical diagnosis problem, gender and condition may be two variables of interest. A particular patient always has one and only one gender, but may be assigned none, one or more disease(s) [2].

Increasingly, organizations are using data mining tools and
The concept of data mining based on data patterns is to identify event patterns and determine whether or not they are statistically significant. One approach to identifying statistically significant information is based on event association [3].
Significant association may be determined by a statistical hypothesis test based on a mutual information measure or on residual analysis. As reported elsewhere, mutual information in the information-theoretic sense (scaled by the sample size) is asymptotically chi-square distributed. This result has been extended elsewhere [4] to model the residual as a normally distributed random variable. Statistical hypothesis tests based on residual analysis may therefore be used as a conceptual tool to discover data patterns with significant event associations. Another recently discovered result [5] is an algebraic linkage between information measures and statistical analysis, which suggests yet another approach for detecting event associations based on the symbol probability ratio.
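The two tests mentioned above can be illustrated with a minimal sketch. It assumes the usual textbook forms: the likelihood-ratio statistic G = 2N·I (with I the mutual information in natural-log units), which is asymptotically chi-square, and Haberman-style adjusted residuals, which are approximately standard normal under independence. The function names and the 2x2 counts are hypothetical and are not taken from the paper's data.

```python
import math

def marginals(table):
    """Return the grand total and the row/column proportions of a count table."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    return n, row_p, col_p

def mutual_information(table):
    """Mutual information (in nats) between the row and column variables."""
    n, row_p, col_p = marginals(table)
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p = count / n
                mi += p * math.log(p / (row_p[i] * col_p[j]))
    return mi

def adjusted_residuals(table):
    """Adjusted residual of each cell; roughly N(0, 1) if there is no association."""
    n, row_p, col_p = marginals(table)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = n * row_p[i] * col_p[j]
            variance = expected * (1 - row_p[i]) * (1 - col_p[j])
            out_row.append((observed - expected) / math.sqrt(variance))
        result.append(out_row)
    return result

# Hypothetical counts for illustration only (NOT the data of Table 1).
example = [[10, 20], [30, 40]]
n = sum(sum(row) for row in example)
g_statistic = 2 * n * mutual_information(example)  # compare with a chi-square quantile
residuals = adjusted_residuals(example)            # compare each |value| with 1.96
```

A cell whose adjusted residual exceeds about 1.96 in absolute value would be flagged as a significant event association at the 5% level under these assumptions.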

2. Data Handling and Algorithm
To clarify the algorithm, a real set of data is used (Table 1). These data were part of the data set collected by Holmquest et al. [6] in an investigation into observer reliability in the histological classification of carcinoma in situ and related lesions of the uterine cervix [7].

The algorithm
Before plunging into the calculation of the chi-square statistic and its associated kappa [8], it is useful to state the steps of the

b. Calculation of kappa:
From the data of Table 1, the following components are calculated:
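As a hedged illustration of a kappa calculation, the sketch below assumes that kappa refers to Cohen's kappa, with an optional linearly weighted variant. Its components are the observed agreement, the agreement expected by chance from the marginal totals, and their normalized difference. The function names and the example counts are hypothetical and do not reproduce Table 1.

```python
def linear_weights(k):
    """Agreement weights: 1 on the diagonal, decreasing linearly with distance."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

def kappa(table, weights=None):
    """Cohen's kappa for a square table of counts (rater 1 in rows, rater 2 in columns).

    With weights=None only exact agreement counts (unweighted kappa).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    if weights is None:
        weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    # Observed (weighted) agreement between the two raters.
    observed = sum(weights[i][j] * table[i][j] / n
                   for i in range(k) for j in range(k))
    # Agreement expected by chance from the marginal proportions.
    expected = sum(weights[i][j] * row_p[i] * col_p[j]
                   for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)

# Hypothetical 2x2 agreement counts, for illustration only.
example = [[20, 5], [10, 15]]
unweighted = kappa(example)  # observed 0.7, chance 0.5, so kappa is about 0.4
```

For ordered diagnostic categories, passing `linear_weights(k)` gives partial credit to near-misses, which is one common way to define a weighted kappa.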

3. Data model
In this research, a data table was constructed containing the counts for the different diagnosis categories across the 118 cases dealt with, as shown in Table 1. As Figure 1 shows:
1. Statistical tools, namely the chi-square statistic and the kappa metric, were applied to Table 1 to obtain Table 2.
2. The standard deviation was then calculated.
3. A database named pathologists.mdb was designed and built, containing 118 records corresponding to the cases. The database is expandable to include further records. The database records were filtered in order to deal only with the cases of positive mass under study.
The results of steps 1 and 2, together with the table below [8], led to a report containing the essential information used in making the decision that leads to a diagnosis of the infection level of the studied cases.
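The filtering and cross-classification steps described in this section can be sketched as follows. The study used a Microsoft database file; this sketch substitutes an in-memory SQLite database purely for illustration, and the table name, column names and records are all hypothetical.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for the case database (assumed schema, not the original one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases ("
    "case_id INTEGER PRIMARY KEY, mass_finding TEXT, "
    "pathologist1 INTEGER, pathologist2 INTEGER)"
)

# Hypothetical records: (id, mass finding, grade by rater 1, grade by rater 2).
records = [
    (1, "positive", 1, 1),
    (2, "negative", 1, 2),
    (3, "positive", 2, 2),
    (4, "positive", 2, 3),
    (5, "negative", 3, 3),
]
conn.executemany("INSERT INTO cases VALUES (?, ?, ?, ?)", records)

# Keep only the cases with positive mass findings.
positive = conn.execute(
    "SELECT pathologist1, pathologist2 FROM cases WHERE mass_finding = ?",
    ("positive",),
).fetchall()

# Cross-classify the two pathologists' grades: (grade1, grade2) -> count.
cross_table = Counter(positive)
conn.close()
```

The resulting counts are the cells of a cross-classification table of the kind that the chi-square and kappa calculations take as input.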

4. Discussion of Results
In this paper, 118 cases of real data were analysed, as shown in Table 1 and the database pathologists.mdb.
After examining the statistical concepts, it was noticed that the weighted kappa metric is important for satisfactorily classifying and partitioning. The 95% confidence interval indicates that the value of kappa is expected to lie within the range (0.33, 0.61).
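To make the interval concrete, the sketch below assumes that it is a symmetric normal-approximation interval of the form kappa ± 1.96 × SE. The paper does not state this, so the point estimate and standard error recovered here are implied values under that assumption, not reported results. The function names are hypothetical.

```python
def kappa_confidence_interval(kappa_hat, std_error, z=1.96):
    """Normal-approximation confidence interval for kappa."""
    return kappa_hat - z * std_error, kappa_hat + z * std_error

def implied_estimate(lower, upper, z=1.96):
    """Midpoint and standard error implied by a symmetric interval."""
    return (lower + upper) / 2, (upper - lower) / (2 * z)

# Values implied by the reported interval (0.33, 0.61), under the assumption above.
kappa_hat, std_error = implied_estimate(0.33, 0.61)
lower, upper = kappa_confidence_interval(kappa_hat, std_error)
```

Under this reading, the interval corresponds to a point estimate near 0.47 with a standard error of roughly 0.07.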

Conclusion
The study has reached the following conclusions:
1-Using statistical concepts and tools as supporting tools in data mining helps uncover information that is not easily revealed by conventional methods.
2-Several interesting results were found in this research.
First, the concept of data patterns allows us to view an early step of data mining as a process of finding significant event associations. This process lends itself to a set of data patterns for discovering an inference model, where such a model encapsulates the significant behavior of the data as measured by statistical analysis as well as by information measures.
3-Data mining uncovers patterns in data using predictive techniques. These patterns play a critical role in decision making because they reveal areas for process improvement. Using data mining, organizations can increase the profitability of their interactions with customers, detect fraud, and improve risk management. The patterns uncovered using data mining help organizations make better and timelier decisions.

P-ISSN 1991-8941 E-ISSN 2706-6703 Journal of University of Anbar for Pure Science (JUAPS) Open Access 2007, (1), (2): 126-134
data mining applications together in an integrated environment for predictive analytics. Assume a concept of event patterns as an embodiment of information. Consider a set of mutually exclusive random variables X = {Xi : i = 1 … k}.
information about the real patients as posed by the example mentioned in Everitt B. S. [6].
3. Cases with negative mass findings were ignored in the study; only cases with positive mass findings were included. This was done using the filter statement available within the Microsoft database program.
4. A different procedure was implemented to handle the cross-classification table of beliefs from pathologists 1 and 2 (Table 1).
5. A Crystal report containing the cross-classified table and the results of both the chi-square statistic and kappa was designed. The report also contained a decision statement on the magnitude of kappa, based on comparing the calculated value of kappa with its theoretical range of values.
The detailed procedure for calculating the chi-square test and kappa is as follows:
a. Calculation of the chi-square statistic: $\chi^2 = \sum (O - E)^2 / E$, where the sum runs over every row and each of the k columns of the cross-classified table, O is the observed frequency, and E is the expected frequency, calculated by multiplying the corresponding row and column totals and dividing the result by the grand total.
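The chi-square calculation described in step a can be sketched as below. This is a minimal illustration assuming the standard Pearson statistic; the function name and the example counts are hypothetical and are not the data of Table 1. The degrees of freedom, (rows - 1) × (columns - 1), are the usual textbook value and are included for convenience.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency: row total times column total over grand total.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical 2x2 counts, for illustration only.
example = [[10, 20], [30, 40]]
stat, dof = chi_square(example)  # stat is about 0.794 with 1 degree of freedom
```

The statistic would then be compared with the chi-square critical value for the given degrees of freedom (3.84 at the 5% level for 1 degree of freedom).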