Tests of Significance: Uses and Limitations

Abstract

Statistical tools are important in decision making. Their use in everyday problems has led to a number of discoveries, conclusions and gains in knowledge. This use ranges from direct calculation with general statistical formulas to formulas built into statistical software to speed up decision making. Significance tests, the statistical tools for testing hypotheses, are powerful, but only if they are used correctly and with a good understanding of their concepts and limitations. Some researchers have used these tests wrongly, leading to wrong conclusions. This paper looks at the different significance tests (both parametric and non-parametric), their uses, when they should be used and their limitations. It also evaluates the use of statistical significance tests in Information Retrieval and then examines the significance tests used by researchers in papers submitted to the Special Interest Group on Information Retrieval (SIGIR) in 2006, 2007 and 2008. For the combined period 2006-2008, a proportion of the submitted papers used statistical tests, and a proportion of these tests were used wrongly.

Key Words: Significance Test, Information Retrieval, Parametric Tests, Non-parametric Tests, Hypothesis Testing

Chapter One

1.0 Introduction

Statistical methods play a very important role in all aspects of research, from data collection and recording through analysis to drawing conclusions and inferences. The credibility of the research results and conclusions depends on every one of these steps; a fault in any of them can render worthless a research project carried out over several years at a cost of millions of shillings. Simply running a test and crunching figures does not show that statistics has been used properly in a given study; the researcher should be able to justify why he or she used that specific test or method.
Misuse of significance tests is not new in science. According to Campbell (1974), there are different types of statistical misuse:

1.0.1 Discarding unfavorable portion of data

This occurs when the researcher selects only the portion of the data that produces the results he/she requires and discards the rest. After a well-conducted study, the researcher might obtain values that are not consistent with what he/she was expecting, and might decide to ignore this section of the data during the analysis so as to get the "expected results". This is wrong, since the inconsistent data could open up new lines of thought in that particular field: if the irregularities are checked and the reasons for their occurrence explained, more ideas about that area can be explored.

1.0.2 Overgeneralization

Sometimes the conclusions from a study hold only for that particular research problem, but the researcher blindly generalizes the results to other kinds of research, similar or dissimilar. Overgeneralization is a common mistake in current research. After successfully completing a study in a particular field, a researcher might be tempted to extend its conclusions to other fields of study without regard to the different characteristics of the populations involved and the assumptions made about them.

1.0.3 Non representative sample

This arises when the researcher selects a sample that produces results geared towards his/her liking. The sample selected for a particular study should truly represent the entire population, and the sample units should be selected in an unbiased manner.

1.0.4 Consciously manipulating data

This occurs when a researcher consciously changes the collected data in order to reach a particular conclusion.
This is mainly noticed when the researcher knows exactly what the client's aims are and changes part of the data so that the study strongly supports those aims. For example, a researcher carrying out a regression analysis who sees many outliers in the scatter plot might decide to change some values so that the points fall on, or very close to, a straight line. This leads to results that appeal to the client and to other users but do not give a clear indication of what is really happening in the population at large.

1.0.5 False correlation

This is observed when the researcher claims that one factor causes another while in reality both factors are caused by a hidden factor that was not identified during the study. Correlation studies are common in the social sciences and are sometimes not adequately approached, which leads to unsatisfactory results. In a correlation study that checks whether variable X causes variable Y, there are four possibilities. First, X causes Y; second, Y causes X; third, X and Y are both caused by another unidentified variable, say Z; and lastly, the correlation between X and Y occurred purely by chance. All these possibilities should be checked in this kind of study to avoid rushing into wrong conclusions.

False causality can be eliminated by using two groups for the same experiment, that is, the "control group" (the one receiving a placebo) and the "treatment group" (the one receiving the treatment). Even though this method is effective, implementing it raises many challenges. There are ethical issues, for example when one group of patients is given a placebo (a drug with no effect) without their knowledge while the other group is given the real drug. One question comes to mind: is it ethical to do this to the first group?
Carrying out the experiment in parallel for two different groups can also prove to be very expensive.

1.0.6 Overloaded questions

The questions used in a survey can strongly affect its outcome. The structure of the questions in a questionnaire, and the way they are formulated and asked, can influence how the respondent answers. Long, wordy questions can bore a respondent, who might then fill in the questionnaire in a hurry just to finish it, without caring about the answers provided. The framing of questions can also produce leading questions, which steer the respondent towards a particular answer, for example: "The government is not offering security to its citizens, do you agree to this? (Yes or No)"

Statistical significance testing has been in use for more than 300 years (Huberty, 1993). Despite this long history, it has been criticized from many directions, and many researchers have written about its problems. Harlow et al. (1997) discussed the controversy in significance testing in depth. Carver (1993) expressed dislike of significance tests and advocated that researchers stop using them. In his book, How to Lie with Statistics, Huff (1954) described in depth the errors, both intentional and unintentional, and the misinterpretations made in statistical analyses. Some publishers, e.g. the American Psychological Association (APA), recommended minimal use of statistical significance tests by researchers submitting papers for publication (APA, 1996), though without prohibiting the tests. Despite the relentless criticism, other researchers have not given up on statistical significance testing but have encouraged users of the tests to understand them well before drawing conclusions from them.
Mohr (1990) discussed and supported the use of these tests, while warning researchers to know the limitations of each test and to apply the tests correctly so as to make correct inferences and conclusions. Burr (1960) supported the use of statistical significance tests but asked researchers to make allowance for the existence of statistical errors in the data.

Amidst these controversies, statistical significance testing has been applied in many areas of research, and remarkable achievements have been recorded. One such area is information retrieval (IR), where significance tests have been used to compare different retrieval algorithms.

1.1.0 Information retrieval

Information retrieval is defined as the science of searching databases, the World Wide Web and other document collections for information on a particular subject. To get information, the user enters keywords to be used for searching; a set of objects containing the keywords is returned, from which the user can pick the one that provides the required information. The user usually refines the search progressively by narrowing it down and using more specific words.

Information retrieval has developed as a highly dynamic and empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of new techniques on representative document collections. There are many algorithms for information retrieval. It is important to measure the performance of different information retrieval systems so as to know which one returns the required information most effectively. To measure information retrieval effectiveness, three test items are required:

(i) A collection of documents on which the different retrieval methods will be run and compared.
(ii) A test collection of information needs, expressible as queries.

(iii) A collection of "relevance judgments" that indicate whether each returned result is relevant or irrelevant to the person doing the search.

A question might arise as to which collection of objects should be used to test different systems. Several standard test collections are widely used. These include:

(i) Text Retrieval Conference (TREC). This is a standard collection comprising 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages. Individual test collections are defined over different subsets of this data.

(ii) GOV2. This was developed by the U.S. National Institute of Standards and Technology (NIST). It is a collection of 25 million web pages.

(iii) NII Test Collections for IR Systems (NTCIR). This is also a large test collection, focusing mainly on East Asian languages and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.

(iv) Cross Language Evaluation Forum (CLEF). This test collection is mainly focused on European languages and cross-language information retrieval.

(v) 20 Newsgroups. This text collection was collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.

(vi) The Cranfield collection. This is the oldest test collection allowing precise quantitative measures of information retrieval effectiveness, but it is nowadays too small for anything but the most elementary pilot experiments.
It was collected in the United Kingdom starting in the late 1950s, and it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.

There are several measures of the performance of retrieval systems, including Precision, Recall, Fall-Out, E-measure and F-measure, and researchers continue to propose new ones. A brief description of each measure follows.

1.1.1 Recall

Recall in information retrieval is defined as the number of relevant documents returned from a search divided by the total number of relevant documents in the database. Recall can also be seen as evaluating how well the retrieval method finds the required information. Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then

$\text{Recall} = \frac{|A \cap B|}{|B|}$ (1.1)

As an example, suppose a database contains 500 documents, of which 100 contain relevant information required by a researcher, so that the number of documents not required is 400. If the researcher uses a system to search this database and it returns 100 documents, all of which are relevant, then the recall is given by

$\text{Recall} = \frac{100}{100} = 1$

Suppose instead that out of 120 returned documents 30 are irrelevant, so that 90 are relevant; then the recall would be

$\text{Recall} = \frac{90}{100} = 0.9$

1.1.2 Precision

Precision is defined as the number of relevant documents retrieved by the system divided by the total number of documents retrieved in that search. It evaluates how well the retrieval method filters out unwanted information. Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then

$\text{Precision} = \frac{|A \cap B|}{|A|}$ (1.2)

As an example, suppose a database contains 500 documents, of which 100 contain relevant information required by a researcher, so that the number of documents not required is 400.
If the researcher uses a system to search this database and it returns 100 documents, all of which are relevant, then the precision is given by

$\text{Precision} = \frac{100}{100} = 1$

Suppose instead that out of 120 returned documents 30 are irrelevant, so that 90 are relevant; then the precision would be

$\text{Precision} = \frac{90}{120} = 0.75$

Both precision and recall are based on one term: relevance. The Oxford dictionary defines relevance as "connected to the issue being discussed". Yolanda Jones (2004) identified three types of relevance, namely:

Subject relevance: the connection between the subject submitted via a query and the subject covered by the returned texts.

Situational relevance: the connection between the situation being considered and the texts returned by the database system.

Motivational relevance: the connection between the motivations of a researcher and the texts returned by the database system.

There are two measures of relevance:

Novelty Ratio: the proportion of items returned from a search and acknowledged by the user as relevant of which the user was previously unaware.

Coverage Ratio: the proportion of the relevant documents that the user was aware of before starting the search that are returned by the search.

Precision and recall affect each other: an increase in recall typically decreases precision. Increasing a system's ability to retrieve more documents increases recall, but the system will also retrieve more irrelevant documents, which reduces its precision. A trade-off between the two measures is therefore required to ensure better search results.

Precision and recall rely on the following assumptions:

A system either returns a document or does not.

A document is either relevant or not relevant, with nothing in between.
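The recall and precision examples in sections 1.1.1 and 1.1.2 can be checked with a short sketch. It is only an illustration under stated assumptions: documents are represented by integer identifiers, the function names `recall` and `precision` are hypothetical, and the particular identifiers chosen for the relevant and returned documents are arbitrary.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    return len(retrieved & relevant) / len(retrieved)


# Assumed layout: a database of 500 documents with ids 0..499,
# of which ids 0..99 are the 100 relevant documents.
relevant = set(range(100))

# First search: 100 documents returned, all of them relevant.
first_search = set(range(100))

# Second search: 120 documents returned, 30 of them irrelevant
# (90 relevant ids plus 30 ids taken from the irrelevant part).
second_search = set(range(90)) | set(range(100, 130))

print(recall(first_search, relevant), precision(first_search, relevant))    # 1.0 1.0
print(recall(second_search, relevant), precision(second_search, relevant))  # 0.9 0.75
```

Representing the retrieved and relevant documents as sets keeps the code close to the set-based definitions of the two measures, since the intersection gives the relevant documents that were actually returned.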
New methods that rank the degree of relevance of the documents are being introduced by researchers.

1.1.3 Receiver Operating Characteristics (ROC) Curve

This is the plot of the true positive rate, or sensitivity, against the false positive rate, or (1 − specificity). Sensitivity is just another term for recall. The false positive rate is given by $fp/(fp + tn)$, where $fp$ is the number of false positives and $tn$ is the number of true negatives. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by $tn/(fp + tn)$, was not seen as a very useful idea. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0).

1.1.4 F-measure and E-measure

The F-measure is defined as the weighted harmonic mean of recall and precision. Writing $P$ for precision and $R$ for recall, it is defined as

$F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R}$ (1.3)

where $\beta$ is the weight. If $\beta$ is assumed to be 1, then

$F_1 = \frac{2 P R}{P + R}$ (1.4)

The E-measure is given by (1.5). The E-measure has a maximum value of 1.0, 1.0 being the best.

1.1.5 Fall-Out

This is defined as the proportion of irrelevant documents that are returned in a search out of all the irrelevant documents in the collection:

$\text{Fall-out} = \frac{|A \cap \overline{B}|}{|\overline{B}|}$ (1.6)

where $A$ is the set of all retrieved objects and $\overline{B}$ is the set of all non-relevant objects. It can also be defined as the probability of a system retrieving an irrelevant document.

These are just a few of the methods of measuring the performance of search systems. After evaluating one system, there arises the problem of comparing two systems or algorithms: is this system better than the other one? To answer this question, scientists in information retrieval use statistical significance tests to establish whether a difference in system performance is unlikely to have arisen by chance. These tests are used to assess, at a stated level of significance, whether one system is better than another.

Statement of the problem

Statistical inference tools such as statistical significance tests are important in decision making. Their use has been on the rise in different areas of research.
With this rise, novice users apply these tools in questionable ways. Many researchers do not understand the basic concepts of statistics, which leads to misuse of the tools. Any conclusion reached in a study may be considered unsound if the statistical tests used in it were poorly applied. More light needs to be shed on this area to ensure correct use of these tests. Researchers in Information Retrieval also use these tests to compare systems and algorithms: are the conclusions from these tests truly correct? Are there other ways of comparison that minimize the use of statistical tests?

Objectives of the study

The objectives of this study are to:

(i) Investigate the use and misuse of statistical significance tests in scientific papers submitted by researchers to SIGIR.

(ii) Shed light on the different statistical significance tests, their uses, assumptions and limitations.

(iii) Identify the most important statistical concepts that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

(iv) Investigate the reality of the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

(v) Investigate the statistical significance tests used by researchers in Information Retrieval.

(vi) Discover the availability of statistical concepts and methods that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.

Chapter Two

This chapter is divided into three major parts. The first, on sample selection and sample size, discusses methods of selecting a sample and the size of the sample to be used in a given study. The second deals with statistical analysis methods and procedures, mainly in significance testing. The third discusses other statistical methods that can be used in place of statistical significance tests.
2.0 Sample Selection and Sample Size

2.0.1 Sample selection

Sampling plays a major role in research. According to Cochran (1977), sampling is the process of selecting a portion of the population and using the information derived from this portion to make inferences about the entire population. Sampling has several advantages, namely:

(i) Reduced cost. It is much more expensive to carry out a census than to collect information from a small portion of the population. Because only a small number of measurements are made, only a few people need to be hired, whereas a complete census requires a large labor force.

(ii) Greater speed (less time). Since only a few items are measured, the time needed for measurement is reduced and the data can be summarized more quickly than when measurements are taken for the whole population.

(iii) Greater accuracy. Since only a few units are considered, the researchers can be very thorough, whereas covering the entire population may leave them fatigued partway through, leading to careless data collection and poor analysis.

The choice of sampling units in a given study may affect the credibility of the whole study. The researcher must make sure that the sample is not biased, that is, that it represents the whole population. There are several methods of selecting the samples to be used in a study. A researcher should always make sure that the sample drawn is large enough to be representative of the population as a whole and at the same time manageable. In this section the two major types of sampling, random and non-random, are examined.

2.0.1.1 Random sampling

In random sampling, all the items or individuals in the population have an equal chance of being selected into the sample.
This procedure ensures that no bias is introduced during the selection of sample units, since an item's selection is purely by chance and does not depend on the person assigned to draw the sample. There are five major random sampling techniques, namely simple random sampling, multi-stage sampling, stratified sampling, cluster sampling and systematic sampling. The following sections discuss each of these.

2.0.1.1.1 Simple random sampling

In simple random sampling, each item in the population has an equal chance of being included in the sample. Usually each sampling unit is assigned a unique number, numbers are then generated using a random number generator, and a sampling unit is included in the sample if its number is generated. One advantage of simple random sampling is its simplicity and ease of application when dealing with small populations. However, every entity in the population has to be listed and given a unique number before the random numbers are read off. This makes the method tedious and cumbersome, especially where large populations are involved.

2.0.1.1.2 Stratified sampling

In stratified random sampling, the entire population is first divided into disjoint subpopulations. Each sampling unit belongs to one and only one subpopulation. These subpopulations are called strata; they may be of different sizes, the units within a stratum are homogeneous, and each stratum differs from the other strata. It is from these strata that samples are drawn for a particular study. Examples of commonly used strata include states, provinces, age, sex, religion, academic ability and marital status. Stratification is most useful when the stratifying variables are simple to work with, easy to observe and closely related to the topic of the survey (Sheskin, 1997). Stratification can be used to select more of one group than another.
This may be done if it is felt that the responses vary more in one group than in another. If the researcher knows that every entity in a group has much the same value, only a small sample is needed to get information for that group, whereas in another group the values may differ widely and a bigger sample is needed. To combine group-level information into an answer for the whole population, the researcher has to take account of the proportion selected from each group. This method is mainly used when information is required for particular subdivisions of the population, when administrative convenience is an issue, and when the sampling problems differ greatly in different portions of the population under study.

2.0.1.1.3 Systematic sampling

Systematic sampling is quite different from the other methods of sampling. Suppose the population contains N units and a sample of n units is required. A number, call it k, is generated using a random number generator, a first unit (represented as a number) is drawn at random from the population, and the researcher then picks every kth unit thereafter. For example, if k is 20 and the first unit drawn is 5, the subsequent units will be 25, 45, 65, 85 and so on. The implication is that the whole sample is determined by the first item alone, since the rest are obtained sequentially. This type is called an every kth systematic sample.

This technique can also be used when questioning people in a sample survey. A researcher might select every 15th person who enters a particular store, after selecting a person at random as a starting point, or interview the shopkeepers of every 3rd shop in a street, after selecting a starting shop at random. It may be that a researcher wants to select a sample of fixed size. In this case, it is first necessary to know the size of the whole population from which the sample is being selected.
The appropriate sampling interval, I, is then calculated by dividing the population size, N, by the required sample size, n, that is, $I = N/n$. This method is advantageous because it is easy and can be more precise than simple random sampling. It is also simpler to select one random number and then every kth member on the list than to select as many random numbers as the sample size. It also gives a good spread right across the population. A disadvantage is that the researcher needs a starting list in order to know the population size and calculate the sampling interval.

2.0.1.1.4 Cluster sampling

According to the Australian Bureau of Statistics, cluster sampling divides the population into groups, or clusters. A number of clusters are selected randomly to represent the population, and then all units within the selected clusters are included in the sample. No units from non-selected clusters are included in the sample; they are represented by those from the selected clusters. This differs from stratified sampling, where some units are selected from each group. The units within each cluster are heterogeneous (that is, the sampling units inside a cluster differ from one another), while the clusters themselves resemble one another.

Cluster sampling has several advantages, which include reduced costs, simplified field work and more convenient administration. Instead of having a sample scattered over the entire coverage region, the sample is concentrated in relatively few collection points (clusters). Cluster sampling generally provides results that are less accurate than those of stratified random sampling.

2.0.1.1.5 Multi-stage sampling

Multi-stage sampling is like cluster sampling, but involves selecting a sample within each chosen cluster rather than including all units in the cluster. The Australian Bureau of Statistics describes multi-stage sampling as selecting a sample in at least two stages.
In the first stage, large groups or clusters are selected. These clusters are designed to contain more population units than are required for the final sample. In the second stage, population units are chosen from the selected clusters to derive a final sample. If more than two stages are used, the process of choosing population units within clusters continues until the final sample is achieved. If two stages are used, the procedure is called two-stage sampling; if three stages are used, it is called three-stage sampling, and so on.

2.0.2 Determination of sample size to be used

2.1 Statistical Analysis

In this section, different statistical tests are discussed in detail in their general form, followed by a discussion of how those used in IR are applied to information retrieval. Only some of these tests are used to compare systems and/or algorithms. In this paper we look at three aspects of statistical analysis, namely:

(i) Summarizing data using a single value.

(ii) Summarizing variability.

(iii) Summarizing data using an interval (no specific value).

In the first case we have the mean, mode, median etc.; in the second case we look at variability in the data; and in the third case we look at confidence intervals and parametric and nonparametric tests of hypotheses.

2.1.1 Summarizing data using a single value

In this case, the data being analyzed are represented by a single value. Examples are discussed below.

2.1.1.1 Mean

There are three different kinds of mean:

(i) Arithmetic mean

(ii) Geometric mean

(iii) Harmonic mean

(i) Arithmetic mean

This is computed by summing all the observations and then dividing by the number of observations collected. Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The arithmetic mean is defined as

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

The arithmetic mean is used when:

The collected data are numeric.
The data have only one mode (unimodal).

The data are not skewed, i.e. not concentrated towards extreme values.

The data do not have many outliers (very extreme values).

The arithmetic mean is not used when:

The data are categorical.

The data are extremely skewed.

(ii) Geometric mean

This is defined as the product of the observations raised to the power $1/n$. Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The geometric mean is defined as

$G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$

The geometric mean is used when:

The observations are numeric.

The quantity of interest is the product of the observations.

(iii) Harmonic mean

This is defined as the number of observations divided by the sum of the reciprocals of the observations. Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The harmonic mean is defined as

$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$

The harmonic mean is used when:

An average can be justified for the reciprocals of the observations.

2.1.1.2 Median

This is defined as the middle value of the observations. The observations are first arranged in ascending or descending order, and then the middle value is taken as the median.

The median is used when:

The observations are skewed.

The observations have a single mode.

The observations are numerical.

The median is not used when:

We are interested in the total value.

2.1.1.3 Mode

This is defined as the value in the given dataset that has the highest frequency of occurrence.

The mode is used when:

The dataset is categorical.

The dataset is both numeric and multimodal.

2.1.2 Summarizing variability

Variability in data can be summarized using the following measures:

2.1.2.1 Sample variance

Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X with arithmetic mean $\bar{x}$; then the sample variance, $s^2$, is given by

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

The standard deviation, $s$, is used when:

The data are normally distributed.

2.1.2.2 The C