Carrying out a Metrics investigation: methodological guidelines
There are seven principal steps in carrying out a metrics study, the first five of which constitute the design stage of the investigation.
These will be described in turn, using two recently completed studies as examples to illustrate each step:
- An investigation of customer transactions in area housing offices
- An analysis of crime hot-spots to inform the development of a crime reduction strategy
The purpose and scope of a metrics study must be defined as clearly as possible. For instance, in the Housing study the aims were:
- To analyse the frequency and duration of each type of customer transaction
- To analyse the staff skills and the nature of front-back office interaction required to resolve each query
The purpose of the study was to decide which front-office functions could be transferred to a new customer contact centre, to be delivered through a one-stop shop arrangement or a call centre. The topic of the study was therefore "Customer Transactions". The metrics data would bear on the feasibility of transferring particular transactions to the call centre (e.g. how much expertise did they require?), on the resourcing implications (i.e. staffing levels and training), and on the possible benefits to be gained.
The goal of a metrics study is the production of a set of Indicators which carry key information upon which important decisions (e.g. regarding resources or IT support) will be based.
Computation of these indicators will require the gathering and processing of raw data. Defining the indicators will dictate the raw data requirements. It is vital to think through this relationship rigorously. Exactly what indicators are required? What raw data is needed? Failure to consider this carefully will lead either to the collection of redundant material, or to wasted effort if it is realised after the empirical phase that certain key data items have not been collected. This design step is problematic because it is often impossible to know the full requirements of the study in advance of doing it; moreover, there may be constraints on the information available in the field situation that are not apparent at this pre-empirical stage. Hence the importance of design step 5, the pilot study (see below).
In the Housing study, the indicators were relatively clear:
- Average time for each transaction type
- Frequency of occurrence of each transaction
- Typical back office support for each transaction in relation to staff skill level
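As a rough sketch of how the first two indicators might be derived from recorded observations, assume each observation is a (transaction type, handling time in minutes) pair. The sample records and the function name `summarise_transactions` are invented for illustration and are not data from the study.

```python
from collections import defaultdict

def summarise_transactions(records):
    """Return {transaction_type: (count, share_of_total, mean_minutes)}."""
    minutes_by_type = defaultdict(list)
    for transaction_type, minutes in records:
        minutes_by_type[transaction_type].append(minutes)
    total = len(records)
    return {
        t: (len(m), len(m) / total, sum(m) / len(m))
        for t, m in minutes_by_type.items()
    }

# Hypothetical observations, not figures from the Housing study.
sample_records = [
    ("repairs", 6.0), ("repairs", 4.0), ("lettings", 12.0),
    ("housing benefit", 9.0), ("repairs", 5.0),
]
indicators = summarise_transactions(sample_records)
for name, (count, share, mean_minutes) in indicators.items():
    print(f"{name}: n={count}, share={share:.0%}, mean={mean_minutes:.1f} min")
```

The same grouping could equally be done with a spreadsheet pivot table; the point is only that each indicator is a simple function of the raw records.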
In the crime study, the indicators were also clearly defined:
- Rate of occurrence of each major crime type per local government ward, normalised in terms of either population level or number of households
- Levels of certain key risk factors: e.g. social deprivation
Considering the latter, the following raw data items are required in order to produce the indicators:
- Counts of each crime type per ward
- Population levels and number of households
- Deprivation indices, e.g. proportion of lower socio-economic groups
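The normalisation step can be sketched as follows. The ward names, counts and the "per 1,000" scaling are assumptions made for illustration; the study itself does not state which scaling was used.

```python
def rate_per_thousand(crime_count, base):
    """Crimes per 1,000 units of the base (residents or households)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 1000 * crime_count / base

# Hypothetical ward figures, for illustration only.
wards = {
    "Ward A": {"burglaries": 120, "population": 8000, "households": 3200},
    "Ward B": {"burglaries": 90, "population": 4500, "households": 2000},
}
ward_rates = {
    name: {
        "per_1000_population": rate_per_thousand(w["burglaries"], w["population"]),
        "per_1000_households": rate_per_thousand(w["burglaries"], w["households"]),
    }
    for name, w in wards.items()
}
for name, rates in ward_rates.items():
    print(name, rates)
```

In this invented example Ward B records fewer burglaries than Ward A but has the higher rate on both bases, which is why raw counts alone are not a suitable indicator.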
Metrics investigations typically involve the definition and measurement of a set of categories of work in relation to the primary topic of interest. In the Housing study, for instance, the topic of the investigation was customer transactions, and a set of categories was therefore sought which encompassed the range of types of interaction that could occur. The raw data and the derived indicators constituted measurements of various relevant aspects (attributes) of those categories, e.g. transaction completion time.
Defining such a taxonomy is often problematic as the detailed categories may not be known precisely at the outset of the study. It is therefore vital to discuss the types of activity that constitute the taxonomy with operational managers and staff. In the case of the crime project, the crime categories were pre-given (a Home Office taxonomy exists of types of crime); in the Housing project, although the high level categories (repairs, council tax enquiry, tenancy management, lettings, housing benefit) were known in advance, much of the detail remained to be determined. A series of discussions were held with staff to determine a provisional set of customer transactions. Many subcategories were generated, too many indeed, and some merging was then required to reduce the taxonomy to a manageable scale.
Defining the right taxonomy is a difficult intellectual process, with discussions typically cycling through a process of expansion and consolidation. In the housing study, for instance, many sub-categories of enquiry regarding the progress of repairs (e.g. waiting for inspection, waiting for contractors, type of repair etc.) were at first generated. After reflection, it was decided that such detailed differentiation was unnecessary, and the subcategories were eventually collapsed to a single catch-all, entitled "Follow-up". Determining the level of requisite detail is always a tricky question, to which the answer is often, rather unhelpfully, "not too much but not too little either". The judgement of the investigator, guided by a sense of the goals of the study, is required to set the right balance. Clearly it is better to err on the side of more detail rather than less, to avoid finding oneself without key information once the field-work has been completed.
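One practical consequence of erring on the side of detail is that fine-grained labels can always be merged afterwards, whereas merged data cannot be split. A minimal sketch of such a consolidation, assuming the provisional sub-categories were recorded as text labels (the mapping name `MERGE_MAP` and the label "new repair request" are hypothetical):

```python
# Sub-category labels taken from the repairs example; the mapping itself is illustrative.
MERGE_MAP = {
    "waiting for inspection": "Follow-up",
    "waiting for contractors": "Follow-up",
    "type of repair": "Follow-up",
}

def consolidate(label):
    """Map a detailed sub-category to its merged category, if one is defined."""
    return MERGE_MAP.get(label, label)

recorded = ["waiting for inspection", "new repair request", "waiting for contractors"]
merged = [consolidate(label) for label in recorded]
print(merged)
```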
Having produced a set of categories of activity, it is necessary to determine which aspects (attributes) of the category need to be measured and to produce a set of operational definitions to guide the recording process. This is often referred to as operationalisation. For the Housing study, four attributes were agreed on for each customer transaction:
- Staff skill level: 3 levels were agreed (L= a few weeks or less, M=several months, H=one or more years)
- Time of day: for plotting diurnal trends in demand
- Handling time: in minutes
- Back-office interaction: F= none required to resolve, S=some support but resolved ultimately by front-office, B=handed over to back-office
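The four attributes above could be captured in a small record structure that rejects codes outside the agreed definitions. This is only a sketch: the class and field names are assumptions, and the study itself used a paper form.

```python
from dataclasses import dataclass
from datetime import time

SKILL_LEVELS = {"L", "M", "H"}       # a few weeks or less / several months / one or more years
BACK_OFFICE_CODES = {"F", "S", "B"}  # none required / some support / handed over

@dataclass(frozen=True)
class TransactionRecord:
    transaction_type: str
    skill_level: str
    time_of_day: time
    handling_minutes: float
    back_office: str

    def __post_init__(self):
        # Enforce the operational definitions at the point of data entry.
        if self.skill_level not in SKILL_LEVELS:
            raise ValueError(f"unknown skill level: {self.skill_level!r}")
        if self.back_office not in BACK_OFFICE_CODES:
            raise ValueError(f"unknown back-office code: {self.back_office!r}")
        if self.handling_minutes < 0:
            raise ValueError("handling time cannot be negative")

# Hypothetical example entry.
example = TransactionRecord("repairs", "M", time(10, 30), 7.5, "S")
print(example)
```

Validating codes as they are entered is one way of catching recording errors before analysis begins.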
Data collection will involve the design of a suitable form. Depending on the complexity and detail of information required, this could take the form of one page for each instance of the category of work being documented, or a more condensed "List format" could be used allowing several instances to be recorded per page.
After initial experiments with a one page pro-forma, a list format was adopted for the Housing study, allowing ten transactions to be captured per page.
Who should capture the data also needs to be considered. Where possible, there are clear advantages in having staff fill out the records as part of their jobs, although the additional workload this creates is an obvious problem. Sometimes, as in the crime example, routine mechanisms are already in place for capturing the data. Typically this is not the case, and where systems do exist they usually need to be supplemented by field work.
Decisions regarding the timing and location of the observational work are referred to as the sampling protocol. The design of the protocol is critical and will be governed by two key considerations:
- The need to gain a representative view of work in the target environment
- The need to gather enough data to produce indicators that are stable and reliable enough to provide a solid base for the decisions motivating the study
The former consideration will involve deciding which work-sites should be visited, when, how often, which staff should be observed, and so on. In the Housing study, to gain a representative profile, the following decisions were made:
- To visit two sites: a busy central office and a quieter peripheral office
- To gather data throughout the whole working day
- To sample a mix of known busy days (Mondays at the beginning of the month) and quieter days
How much data to collect is a complex matter. It is beyond the scope of this document to address this question in depth, save to provide some very broad guidelines; the reader is referred to a statistical textbook for a full discussion of the issues. "An introduction to Psychological Research and Statistics, third edition", by Andrew Tilley (Pineapple Press, Brisbane) is recommended as a highly readable introduction. The book also contains a useful discussion of issues relating to experimental design, which are highly pertinent to the design of metrics studies.
In general, the more observations that are collected, the more accurate and reliable the indicators will be. This seems intuitively correct, but to understand why, we need to take on board the idea of "sampling variation", a key concept in any statistical investigation.
Carrying out a metrics study involves gathering a sample of observations with the aim of measuring some target indicator, e.g. the proportion of burglaries as a ratio of total crime. Let us call this indicator, RB. At this point, it is critical to consider a distinction between what statisticians call sample estimates and population parameters. The latter idea refers to some underlying true value of the indicator that is fixed and stable. This value is of course not known, and the purpose of the sample is to estimate it. Imagine having collected the sample and calculated the indicator from the data, yielding a value RB1. RB and RB1 are not the same thing. RB1 is the sample estimate and will not be the same as the true value of RB, because it is subject to the vagaries of the sampling process.
Imagine gathering another sample for a different time period. Even though the real value of RB may not have changed, RB2 will not be the same as RB1, because a different sample has been gathered. This variability of statistical estimates is known as sampling variation. Larger samples are more reliable simply because sampling variation is less. If we collected lots of samples, then the estimates calculated from the larger samples would move around from sample to sample less than those derived from the smaller samples.
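Sampling variation can be seen directly by simulation. The sketch below assumes a true rate of 0.1 and draws 500 repeated samples at each of two sample sizes; the function name, the seed and the number of repetitions are all arbitrary choices made for illustration.

```python
import random
import statistics

def simulate_estimates(true_rate, sample_size, n_samples, rng):
    """Draw repeated samples and return the estimated rate from each one."""
    estimates = []
    for _ in range(n_samples):
        hits = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
        estimates.append(hits / sample_size)
    return estimates

rng = random.Random(42)  # fixed seed so the run is repeatable
small = simulate_estimates(0.1, 100, 500, rng)
large = simulate_estimates(0.1, 1000, 500, rng)

# The standard deviation of the estimates measures how much they move around.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"n=100:  estimates range {min(small):.3f} to {max(small):.3f}")
print(f"n=1000: estimates range {min(large):.3f} to {max(large):.3f}")
```

Although the underlying rate never changes, every sample gives a slightly different estimate, and the estimates from the larger samples cluster much more tightly around the true value.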
Going back to the burglary example, assume that the burglary rate is to be estimated from a sample of 100 crimes. Assume that the real value of RB equals 0.1, and that our sample reveals 11 burglaries, i.e. RB1=0.11. Had we gathered more than one sample, statistical theory tells us that the sample estimates would vary from as low as 0.04 to as high as 0.17, all with the same population value. If we increased the sample size to 1000, this range would reduce to roughly 0.08 to 0.12. More observations mean greater accuracy, though the cost of this can be prohibitive. To reduce sampling variation by half, four times as many observations are required.
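The four-to-one rule follows from the usual formula for the standard error of a sample proportion, in which sample size appears under a square root. As a worked sketch using the assumed true rate of 0.1 (the symbols p, n and SE are notation introduced here for illustration):

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \qquad
\mathrm{SE}_{n=100} = \sqrt{\frac{0.1 \times 0.9}{100}} = 0.03, \qquad
\mathrm{SE}_{n=400} = \sqrt{\frac{0.1 \times 0.9}{400}} = 0.015
```

Quadrupling the sample from 100 to 400 halves the standard error from 0.03 to 0.015.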
The question of how much data to collect also depends on what is called the level of measurement. Where measurement involves counting categories (so-called nominal measurement) to estimate proportions, quite large numbers of observations can be required, of the order of several hundred if there are a large number of categories. It is important to have a minimum of 5 observations per category, and ideally many more if reliable estimates are required. In the Housing study, there were approximately 15 categories. This implies the need for many hundreds of observations to be collected to attain a workable degree of accuracy. And accuracy is vital given that key business decisions may be based on the results of the study. A decision, for instance, to deploy major police resources in a particular area on the grounds that the burglary rate is 13% compared to the general average of 10% would be highly irresponsible if based on a sample of size 100.
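The burglary example can be checked with a rough interval calculation. The sketch below uses the normal approximation to a 95% interval for a proportion, which is an assumption of this illustration rather than a method named in the text; the function name and the second sample size of 2000 are likewise illustrative.

```python
import math

def approx_95_interval(observed_rate, sample_size):
    """Normal-approximation 95% interval for a sample proportion."""
    se = math.sqrt(observed_rate * (1 - observed_rate) / sample_size)
    margin = 1.96 * se
    return observed_rate - margin, observed_rate + margin

general_average = 0.10

# Observed burglary rate of 13% from a sample of 100 crimes.
low_100, high_100 = approx_95_interval(0.13, 100)
compatible_100 = low_100 <= general_average <= high_100

# The same observed rate from a hypothetical sample of 2000 crimes.
low_2000, high_2000 = approx_95_interval(0.13, 2000)
compatible_2000 = low_2000 <= general_average <= high_2000

print(f"n=100:  {low_100:.3f} to {high_100:.3f}, includes 10%: {compatible_100}")
print(f"n=2000: {low_2000:.3f} to {high_2000:.3f}, includes 10%: {compatible_2000}")
```

With only 100 observations the interval around 13% comfortably includes the general average of 10%, so the apparent difference could easily be sampling variation; with a much larger sample the same observed rate would be distinguishable from 10%.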
If the level of measurement is more quantitative (technically called interval-level measurement), then fewer observations will generally be required. For example, length of transaction is an interval-scale measurement. A dozen or so observations may suffice to provide a usable estimate.
We will say no more on the topic of sample size here. All that was intended was to introduce the issue in broad terms and to emphasise how important it is. The investigator is advised to seek advice from a trained statistician or to consult an introductory statistical textbook. The chapter on "statistical power" is the one to focus on.
The need for a dry run has been emphasised at several points throughout the foregoing steps. It is vital that the data-collection procedures are tested out in a short pilot experiment, and refined as a result of this experience.
Having validated the study instruments, the investigator takes to the field and gathers data. The Housing data was recorded on a paper form, and then transcribed to an Excel spreadsheet.
Having calculated whatever indicators are required based on the raw data, a variety of statistical techniques should be considered in order to analyse the indicators. Statistical methods fall into two main categories:
- Descriptive statistics (methods for summarising the main characteristics of a set of data, e.g. how spread out the data are)
- Inferential statistics (methods for making decisions based on data)
Again the reader is referred to a suitable statistical textbook for a full account of concepts and methods. All that will be given here is a couple of key definitions and some pointers to important issues.
Descriptive statistics include different methods for measuring the "middle value" of a set of numbers (the mean or the median are the normal candidates) or its spread (the range and standard deviation are commonly used here). Pictorial methods such as histograms and stem-and-leaf displays are also widely used and are available in spreadsheet packages such as Excel.
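These summary measures are straightforward to compute outside a spreadsheet as well. The sketch below uses Python's standard `statistics` module on a set of invented handling times; the figures are hypothetical and chosen to include one unusually long transaction.

```python
import statistics

# Hypothetical handling times in minutes, for illustration only.
handling_minutes = [3, 5, 4, 12, 6, 5, 7, 4, 30, 5]

mean_minutes = statistics.mean(handling_minutes)
median_minutes = statistics.median(handling_minutes)
value_range = max(handling_minutes) - min(handling_minutes)
std_dev = statistics.stdev(handling_minutes)  # sample standard deviation

print(f"mean={mean_minutes:.1f}, median={median_minutes:.1f}, "
      f"range={value_range}, sd={std_dev:.1f}")
```

Here the single 30-minute transaction pulls the mean well above the median, which illustrates why it is worth looking at more than one measure of the "middle value".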
Displaying patterns in data and summarising key properties is one thing. Making decisions based on the data is another, and there are many tools available. Inferential statistics is a vast field, and again the reader is referred either to a textbook or to a professional statistician. Here we will simply introduce one key notion, the idea of statistical significance, and illustrate with a simple, widely used statistical test why this is such a vital idea.
In the Housing example, we used a pivot table to break down the need for back office support against the skill level of the operator. The hypothesis under investigation is that experienced operatives require less back office support. The table shows three levels of operator experience (High, Medium, Low), and transactions have been categorised in three groups (Y=referred to back office, S=completed by front office with some back office support, and N=no back office support needed). Entries in the table indicate the number of transactions recorded for each combination of skill level (column) and need for support (row). At first sight it may appear that the skilled staff require less support (109/153 = 71% of transactions completed without referral) compared to less experienced staff, where the completion rate is 66% (54/82).
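A cross-tabulation of this kind can be built from the raw records by counting each combination of codes. The sketch below assumes each record is a (skill level, support code) pair; the records shown are invented and do not reproduce the study's table.

```python
from collections import Counter

def cross_tabulate(records):
    """Count transactions for each (support_code, skill_level) combination."""
    return Counter((support, skill) for skill, support in records)

# Hypothetical records as (skill level, support code) pairs.
records = [
    ("High", "N"), ("High", "N"), ("High", "S"),
    ("Medium", "N"), ("Medium", "Y"),
    ("Low", "Y"), ("Low", "N"), ("Low", "S"),
]
counts = cross_tabulate(records)

skills = ["High", "Medium", "Low"]   # columns
supports = ["Y", "S", "N"]           # rows
print("     " + "".join(f"{s:>8}" for s in skills))
for support in supports:
    print(f"{support:<5}" + "".join(f"{counts[(support, s)]:>8}" for s in skills))
```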
But is this difference a real one? Or, to put the question differently, how else might it have occurred? Another possibility is that the underlying rate is the same and that the apparent differences are due to sampling variation. This ambiguity arises in all situations where we are working with samples, and a battery of statistical tests has been developed to evaluate this possibility; they test whether a result is real (i.e. statistically significant) or whether it is due to chance variation.
The chi-square test is the test of choice for problems of this kind, i.e. those involving cross-tabulation. The test has been performed on the worksheet and a value of 0.66 has been produced. This result is not significant, i.e. there is no reason to believe that skill level makes a real difference. Had a business decision been based on the untested data, then this decision would have been founded on insecure and spurious evidence, possibly at considerable cost to the organisation. This vignette is salutary in drawing attention to the need to carry out rigorous analysis using established statistical methods.
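To illustrate the mechanics of the test, the sketch below applies a Pearson chi-square test to a simplified two-by-two table built only from the two proportions quoted above (109 of 153 and 54 of 82 completed without referral). This is not a reproduction of the worksheet's test, which used the full table, so the result will differ from the value reported there; no continuity correction is applied, and the function name is an assumption.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if skill level and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed_count - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: skilled staff, less experienced staff.
# Columns: completed without referral, all other outcomes.
observed = [[109, 153 - 109], [54, 82 - 54]]
statistic, p_value = chi_square_2x2(observed)
print(f"chi-square={statistic:.2f}, p={p_value:.2f}")
```

On this simplified table the p-value is well above the conventional 0.05 threshold, which is consistent with the conclusion that the apparent difference between the two groups could be due to chance.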
The report for the Housing study in Eccles Housing Office is provided as an example of good practice (viewable via the downloads pages).