Monday, April 02, 2001

Review questions, final examination

Sampling:

1. What are the main advantages of data collection by sampling, rather than doing a complete census?

2. What are the main differences between probability and non-probability samples?

3. Describe the main types of non-probability samples, and the conditions under which their use might be warranted.

4. Describe the basic procedures of simple random and systematic sampling, and their possible advantages and disadvantages.

5. What do we mean by proportionate and disproportionate stratified sampling, and under which conditions might we apply these techniques?

6. Describe cluster sampling, and its advantages and disadvantages.

7. What do we mean by a “sample frame”? Give a few examples, and describe their potential defects.

8. What are the dangers of non-response in a sample survey?

9. Describe what is meant by sampling “with probabilities proportionate to size”?

Experimental design:

1. Explain the purpose of random assignment of experimental subjects.

2. Explain the main features of a double-blind medical experiment.

3. Explain the main features of the classical design experiment (pretest-posttest with a control group).

4. Explain the main features of “true” experiments, and the major subtypes.

5. Explain the main features of “pre-experiments,” list the major subtypes, and their main shortcomings.

6. Describe two types of quasi-experiments.

7. What is meant by “internal validity,” and what are the main threats to it?

8. What is meant by “external validity,” and what are the main threats to it?

Evaluation Research:

1. Describe the similarities between outcome evaluation and experimental research.

2. What is meant by “needs assessment,” and what are the data types we can use in such a study?

3. What are “social indicators,” and how can they be used?

4. In what ways do concept formation, measurement and sampling differ in applied research, when compared to theoretical research?


Qualitative research:

1. Describe the controversy surrounding Freeman’s criticism of Margaret Mead.

2. What are Kvale’s 12 steps that make up the “mode of understanding” in the qualitative interview?

3. Describe the role of the participant observer in qualitative research.

4. What is an example of “triangulation” in qualitative research?

5. What are the general steps in the design of a qualitative field study?

Methods of analyzing available data:

1. What are the main characteristics of a content analysis study? Give two examples.

2. What are the main components of a content analysis study?

3. Describe Holsti’s requirements for a content analysis study.

4. List the two main types of unobtrusive measures, and give examples.

Univariate, bivariate, and multivariate analyses:

1. When we do a cross-tabulation, why do we percentage down and compare across?

2. Describe the following possible outcomes of trivariate analyses, and give examples:
spurious relationships; replication; specification; interpretations and intervening variables; a suppressor variable; distorter variables.

Friday, March 16, 2001

This is a cancellation notice for classes scheduled Monday March 19, 2001.

Monday, March 05, 2001

MONDAY, MARCH 5, 2001

DUE TO UNFORESEEN CIRCUMSTANCES TODAY'S CLASS IS CANCELLED.
WEDNESDAY'S CLASS (MARCH 7) WILL PROCEED AS PLANNED.
PLEASE READ THE CHAPTER ON EXPERIMENTAL DESIGN.

Thursday, March 01, 2001

SOCI 2127/POLI 3007: SECOND LIST OF REVIEW QUESTIONS FOR THE MIDTERM TEST


1. What are the four main ways of administering surveys?
2. Name the main advantages and main disadvantages of each strategy.
3. What are the major guidelines we have to follow in question wording?
4. How can we maximize response rates for mail surveys?
5. What is CATI? How does it work?
6. Describe the characteristics and operation of a focus group.
7. What types of topics can be addressed in survey research?
8. Name five major biases and errors that can occur in question formulation in surveys.

Wednesday, February 21, 2001

REVIEW QUESTIONS, MIDTERM TEST, PART 1.

More midterm test questions will be added next week, based on the assigned readings and my lectures.


1. Describe the main features of the following research approaches: social surveys; social experiments; qualitative field studies.

2. What would you consider the main advantages and disadvantages of these methods?

3. What are the eleven steps in the development of a survey?

4. Describe the problem of causality and time in social surveys, and the survey designs which address this issue (panel analysis, etc.)

5. Describe the various phases of Wallace's model of science.

6. What does Kuhn mean by a "paradigm" and a "scientific revolution"?

7. What are the advantages of associations between variables and concepts?

8. Describe the main characteristics of scientific theories.

9. Describe what we mean by an "abstract concept," and the main advantages and disadvantages of using such concepts.

10. What are the conditions for the establishment of a causal relationship, and what is meant by a spurious relationship?

11. Describe three types of definitions.

12. Describe the main types of reliability.

13. What is the general approach of criterion validation?

14. What is face validity, and why is it unsatisfactory?

15. Describe the general process of construct validation, using the "Becoming Modern" study.

16. What is the relationship between reliability and validity?

17. Describe the four levels of measurement.

Wednesday, February 07, 2001

Walter Schwager: Variables and levels of measurement

In doing social research it is current practice to store the data gathered in a computer, so the data can be analyzed using various software programs such as SPSS (Statistical Package for the Social Sciences). To prepare data for computer storage the various values on a variable are usually given numerical codes. Thus we may assign the category “male” the code 1 on the variable sex, and “female” the code 2. In developing coded information, then, we assign a numerical code to each of the values or categories of our variables. Allocating numerical codes to data allows us to store information in a computer more efficiently. But another advantage of doing so is actually more important: the use of numbers enables us to use the powerful and elegant language of mathematics in dealing with these data. This advantage is associated with a drawback, however. A major peril is that the use of numbers frequently causes an unjustified feeling of exactness and reliability in dealing with the research data. As Moroney put it: "It is an easy and fatal step to think that the accuracy of our arithmetic is equivalent to the accuracy of our knowledge about the problem at hand." So the use of numbers brings major advantages, but also potential dangers. The question we shall address in this section is: what do these numbers mean? What arithmetical or mathematical characteristics are associated with our use of numbers for the values or categories of different variables? What interpretations of these numbers are warranted?
A brief example may help to clarify the issue. You are undoubtedly familiar with the notion of an "average", or more accurately, the arithmetic mean. If our sample consists of 5 individuals who, on the variable age, have the following values (in years): 20, 26, 30, 34 and 40, what is their mean age? We find the mean by adding up the numbers in the series, and dividing their total by the number of elements. In our example the mean age would be:

(20 + 26 + 30 + 34 + 40)/5 = 150/5 = 30

But now take the variable of religion. Let us assume that the five individuals referred to have the following religions, with in brackets the numerical code for that category:
one Protestant (1);
one Roman Catholic (2);
one Jew (3);
one with No Religion (4);
and one with an "Other" religion (5).
What is the "average" religion for this sample of 5?

Well, we can proceed with our computations in the same way: add up all the values, and divide the total by the number of cases. In this instance:

(1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3

As we know, 3 refers to the Jewish category. Does this mean that the "average" religion is Jewish? And how would we have interpreted a result that would have given us a "mean" religion of, say, 2.1? What does an "average" religion refer to, anyway?
Let's scrutinize what we just did: we gave numerical codes to religious categories, but those numerical codes were no more than labels. We could with equal justification have assigned entirely different numbers in an entirely different sequence: Protestant (8), Roman Catholic (1), Jewish (6), No Religion (0), Other (7). The only restriction is that each category should be assigned a unique code number, so we could not confuse it with another category. If we had given these other numbers to the categories, the "average" religion would have been quite different:

(8 + 1 + 6 + 0 + 7)/5 = 22/5 = 4.4

The result that we obtained depended totally on our arbitrary assignment of given codes, and therefore cannot be interpreted in any meaningful fashion.
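The two computations above can be reproduced in a few lines. The following is only a minimal sketch in Python; the variable names are illustrative and not part of the original notes. It shows that the mean of the ages is the same whatever we do, while the "mean" religion changes as soon as the arbitrary codes are changed.

```python
from statistics import mean

# The five respondents from the example in the text.
ages = [20, 26, 30, 34, 40]
religions = ["Protestant", "Roman Catholic", "Jewish", "No Religion", "Other"]

# Two equally legitimate (because arbitrary) coding schemes from the text.
coding_a = {"Protestant": 1, "Roman Catholic": 2, "Jewish": 3,
            "No Religion": 4, "Other": 5}
coding_b = {"Protestant": 8, "Roman Catholic": 1, "Jewish": 6,
            "No Religion": 0, "Other": 7}

mean_age = mean(ages)
mean_religion_a = mean(coding_a[r] for r in religions)
mean_religion_b = mean(coding_b[r] for r in religions)

print("mean age:", mean_age)
print("'mean' religion, first coding:", mean_religion_a)
print("'mean' religion, second coding:", mean_religion_b)
```

The respondents are identical in both runs; only the labels changed, yet the "mean" moved from 3 to 4.4.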
In the example where we computed the mean for the variable “age”, the mean referred to a value that could be interpreted: a mean of 30 means that the average age is 30 years. But a "mean" religion makes as much sense as an average telephone number for a sample, or a mean car license number. In other words: the notion of a mean age makes mathematical sense, whereas the notion of a mean religion does not. The numbers attached as codes to religious categories are no more than labels: we only know that cases having the same code are the same on religion, and those having different codes have different religions.
You can add and subtract years, and say that someone who is 14 is half as old as someone else who is 28, and twice as old as someone who is 7 years old. But can you say that a Protestant (1) is half as religious as a Roman Catholic (with code 2) and one-fourth as religious as someone with No Religion (with code 4)? That statement would not make sense, as the results are, once again, purely caused by our arbitrary assignment of numerical codes to religious categories. Differences between these numbers (as indicated by subtracting them) do not refer to differences in "religiosity". In other words: you cannot add, subtract, divide or multiply the numerical codes attached to the categories of the variable "religion".
This clarifies the statement in the first paragraph of this section: what arithmetical or mathematical characteristics are associated with our use of numbers for the values or categories of different variables? The various ways in which we can use numbers are called levels of measurement, and each level is called a scale. The four levels of measurement that we shall discuss here are the following:
1. nominal scales;
2. ordinal scales;
3. interval scales;
4. ratio scales.

These four levels form a kind of ladder. The bottom level, nominal scales, is the most rudimentary; each subsequent level becomes more refined, but includes all the characteristics of the preceding one. You may be glad to hear that you already know all there is to know about the most sophisticated level of measurement, that of ratio scales.
The fact that for nominal scales we cannot apply what we generally consider "standard" mathematical operations points to the following problem. Whether we can apply a certain mathematical operation to some aspect of reality is a question which can only be answered by checking whether the assumptions of that operation fit the characteristics of that situation. Many students have found this hard to grasp. When asked: is 1 plus 2 always 3? nearly everybody answers affirmatively. But what about one cup of coffee, to which we add two spoonfuls of sugar? This is not a trick question: it demonstrates that the addition of units, according to arithmetical rules, is only possible if the units remain the same, and are not decreased or increased in number due to the physical aspects of the actual addition operation. This requirement is generally satisfied when we deal with cookies or apples, or even dollars or years; it is not when we deal with coffee and sugar, or even a male and a female rabbit, given a couple of months. Therefore a mathematical operation can only be applied if the assumptions of that operation are satisfied by the subject matter that the operation is applied to! This fit between the requirements of a mathematical operation and the characteristics of some subject matter is called isomorphism: similarity of form.
We shall now proceed to a more systematic discussion of these four levels of measurement, and the basic questions we shall be addressing for each one are the following:
a. what are the implications of the way in which numbers are used for each level of measurement?
b. associated with this is the following problem: what mathematical operations are permitted for each level of measurement?
c. this in turn leads to the final problem: what statistical measures are appropriate for each level of measurement? We shall deal with this final question in an introductory manner only.
Before starting on this discussion, first a word about terminology. We are, in this topic, always discussing what we can do with the numbers that represent various values on a given variable, as in the examples above, and what these numbers represent or mean. Such a variable is said to be at a certain measurement level, or to be a certain scale. The variable "religion" is at the nominal level of measurement, and can be said to be a nominal scale. A nominal scale may be any variable at that level. A ratio scale is a variable at the ratio level of measurement; as we shall see later, that might be "age", or "years married." (Unfortunately the term "scale" also refers to instruments to measure attitudes, so some confusion may arise; so beware.)

Nominal Scales, or: when is 1 plus 2 not 3?

In the example of religion, as we discussed a moment ago, the allocation of numbers was merely a labelling exercise: we assign a (numerical) name to a given category. This is why we call this use of numbers: measurement at the level of categorical or nominal scales (from the Latin "nomen": name).
What characteristics are associated with this way of using numbers? Only those of similarities and differences: a unit of analysis with a given code is similar (on that variable!) to all other units with the same code, and it differs on that variable from all units of analysis with a different code. (In algebraic terms, a=a, and b=b; and a is not equal to b, and vice versa.) Put technically, the numerical codes identify equivalence classes, as all the elements within a certain category are equivalent: equal in value.
The allocation of numbers is purely arbitrary, however, as we already discussed. As one author put it, "any two numbers may be interchanged without affecting anything but the notation." As long as we keep the numbers distinct for different categories, and we assign the same number to all cases within the same category, we may allocate any numbers we wish.
What mathematical operations are permitted for nominal scales? Apart from operations dealing with similarities and dissimilarities, none. The only operations allowed for equivalence classes are frequency counts: e.g., how many Protestants do we have in our sample? Let's review this systematically.
1. We can count the number of cases with a given code, e.g. the number of Protestants, or the number of Catholics;
2. Can we compare these numerical codes in terms of more or less? In other words, can we rank them? No, as we have assigned them in an arbitrary fashion. (In our example No religion -code 4- would be "more" on some fictional variable than the preceding three categories: Protestant, Catholic, and Jewish, with codes 1, 2, and 3!)
3. Can we add or subtract these code numbers? No, as we have assigned them totally arbitrarily, and what would additions and subtractions mean here? Although we all tend to believe that 1 + 2 = 3, at the level of nominal scales we cannot add one Protestant to one Roman Catholic to make one member of the Jewish faith; that would be an odd kind of interreligious procreation. Thus at this level of measurement 1 + 2 ≠ 3!
4. Can we multiply or divide these numbers? Again, the answer is no. Our assignment of numbers has been arbitrary, and what would it mean to do the following division: 2/2 = 1? Something like the following: RC/RC = Protestant?

In summary it can be stated that at the level of nominal scales we can only count (heads); we cannot rank, add, subtract, multiply or divide.
The statistical measures that are appropriate at the level of nominal scales are those that are based on head counts only: percentages, proportions, and modes or modal categories, as well as frequencies.
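The head-count measures just listed can be sketched as follows. This is a minimal illustration in Python, not a prescribed procedure; the sample of eight respondents is invented for the purpose, and the category names are used directly so that no arithmetic on codes is even possible.

```python
from collections import Counter

# Hypothetical sample of eight respondents (invented for illustration).
sample = ["Protestant", "Roman Catholic", "Protestant", "Jewish",
          "No Religion", "Protestant", "Roman Catholic", "Other"]

# Frequencies: how many cases fall in each category.
frequencies = Counter(sample)
n = len(sample)

# Proportions and percentages are derived from the counts alone.
proportions = {category: count / n for category, count in frequencies.items()}
percentages = {category: 100 * p for category, p in proportions.items()}

# The mode (modal category) is the category with the highest count.
modal_category, modal_count = frequencies.most_common(1)[0]

for category, count in frequencies.most_common():
    print(f"{category:15s} {count:2d} {percentages[category]:5.1f}%")
print("modal category:", modal_category)
```

Every figure printed here comes from counting cases; at no point are the categories ranked, added or divided.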
The notion of nominal scales is puzzling at first sight, so you may want to have a look at some other nominal or categorical level variables. Some of the most important categorical variables in sociology are: sex, ethnicity, race, religion, occupation, party affiliation and marital status.
Nominal level variables are also known as categorical variables, as the values on them are distinct categories. They are also known as qualitative variables, in contrast to the next three types, which are lumped together as quantitative variables. (The Baker text considers ordinal variables as qualitative as well.)

The ranking of numbers: ordinal scales

In many situations we use variables with values that can be ranked in terms of more or less, or of greater or smaller. The educational achievement of a respondent’s mother or father can be fitted into one of the following categories:

What is the highest level of formal education completed by your parents?

EDUCATION                          MOTHER   FATHER
No schooling .....................    1        1
Some Elementary schooling ........    2        2
Completed Elementary school ......    3        3
Some Secondary school ............    4        4
Completed Secondary school .......    5        5
Some University or College .......    6        6
University degree or degrees .....    7        7
Other (write in) .................    8        8
    Mother: ____________________
    Father: ____________________
Don't know .......................    9        9

The first observation we can make is that the numerical codes can be interpreted in terms of similarity and dissimilarity, as in nominal scales. (After we have completed our discussion of levels of measurement you will note that each higher level of measurement has all the characteristics of preceding levels of measurement, plus some new ones.) But the codes also make sense in terms of more and less: "no schooling" (1) is clearly less than, say, "some elementary" (2); (5), "completed secondary school", is clearly more than (4), "some secondary school."
So in what way do ordinal scales differ from nominal scales? The codes of an ordinal scale can be ranked in terms of more or less on a given variable. (With the exception of the 8 -other- and 9 -don't know- categories this applies to the example above, as shown.) This ranking possibility results in a rank order, and therefore the term ordinal scales.
What can we say about the size of the differences between two values on an ordinal scale? In general, little or nothing. How would you compare the difference between 2 and 1, or 7 and 6, in the example just given? Because the differences are unequal or unknown, we cannot compare them in mathematical terms: we cannot add or subtract, so (7-6) cannot be assumed equal to (2-1).
For the same reason we can also not multiply or divide these numbers, as is discussed below.
What statistical measures can we apply to ordinal scales? Basically the same as for nominal scales, plus the ones based on ranking. These include percentiles (and quartiles) and such measures as the median. If you have ranked a class of 15 students with scores on a music test, the score of the middle student -the 8th in this case- is the median value.
Many of the variables in social science research are of an ordinal kind: job prestige, educational level, a country's level of industrialization, and so on. The largest collection probably consists of individual attitudes and aptitudes: the strength of your opinion in favour or against capital punishment, economic nationalism, sexual equality, or your scores on IQ tests, classroom tests, academic subjects and so on. If a student gets a score of 70 on an academic test, we can presumably say that she has a higher score than someone with a score of 35, but can we say that the difference between a score of 70 and 35 is the same as that between 0 and 35? We cannot.
This also implies that we cannot multiply or divide numbers at the level of an ordinal scale: Return to our example for a moment: would you be able to say that 6/3 = 2? In other words, would you be able to say that someone with some university or college education has twice as much schooling as someone who finished elementary school? That would not be a very meaningful statement to make.
To summarize our discussion more systematically, we can state that:
A. the mathematical connotations of numbers used at the level of ordinal scales are:
1. those of similarity or dissimilarity, as for nominal scales;
2. those of ranking, resulting in rank orders;

B. the mathematical operations permitted for ordinal scales are:
1. those of counting: how many elements are in category 1, for example;
2. those of ranking, or comparisons in terms of more or less;

C. The mathematical operations that are not permitted are those of:
1. subtraction and addition;
2. multiplication;
3. division.

D. the statistical measures appropriate at the level of ordinal scales are:
1. those associated with head counts: percentages, proportions, and frequencies; modes and modal categories;
2. those associated with rank orders: the median, and percentiles, to give only two examples.
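The summary above can be illustrated with the parental education question. The sketch below is in Python and rests on stated assumptions: the list of mothers' codes is invented, and the choice of `median_low` (which always returns an actual category rather than an average of two codes) is ours, made so that no addition or division of ordinal codes is needed.

```python
from statistics import median_low

# Labels for the rankable codes of the education question in the text.
EDUCATION_LABELS = {
    1: "No schooling",
    2: "Some Elementary schooling",
    3: "Completed Elementary school",
    4: "Some Secondary school",
    5: "Completed Secondary school",
    6: "Some University or College",
    7: "University degree or degrees",
}
# "Other" (8) and "Don't know" (9) cannot be ranked, as noted in the text.
NON_RANKABLE = {8, 9}

# Hypothetical codes for the mothers of eleven respondents.
mother_codes = [1, 3, 5, 5, 6, 7, 2, 9, 8, 4, 5]

# Keep only rankable codes and put them in rank order.
ranked = sorted(code for code in mother_codes if code not in NON_RANKABLE)

# The median is the category of the middle case in the rank order.
median_code = median_low(ranked)
median_label = EDUCATION_LABELS[median_code]

print("rank order:", ranked)
print("median category:", median_code, "-", median_label)
```

Note that the median is reported as a category, not as a quantity of schooling: the code 5 only tells us where the middle case sits in the rank order.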

The mathematical characteristics of an ordinal system include the requirement of transitivity: if a is larger than b, and b is larger than c, then a must be larger than c; i.e., c cannot at the same time be larger than a. In reality this transitivity requirement may be violated. The simplest example concerns sports teams: Team A may beat Team B (i.e. be better); Team B may beat Team C; but Team C may beat Team A! This is an example of intransitivity. (In such a situation the criteria for an ordinal scale are not fulfilled. But in sports, the rules of competition ensure that the final standings do not show such intransitivity.)
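The transitivity requirement can be checked mechanically. The function below is a small sketch in Python; the name `is_transitive` and the representation of results as (winner, loser) pairs are our own assumptions, chosen only to mirror the sports example.

```python
from itertools import permutations

def is_transitive(beats):
    """Return True if the 'beats' relation is transitive.

    `beats` is a set of (winner, loser) pairs. The relation is transitive
    when a beats b and b beats c always implies that a beats c.
    """
    teams = {team for pair in beats for team in pair}
    for a, b, c in permutations(teams, 3):
        if (a, b) in beats and (b, c) in beats and (a, c) not in beats:
            return False
    return True

# The intransitive situation from the text: A beats B, B beats C, C beats A.
cycle = {("A", "B"), ("B", "C"), ("C", "A")}

# A consistent rank order: A beats B, B beats C, and A also beats C.
rank_order = {("A", "B"), ("B", "C"), ("A", "C")}

print("cycle is transitive:", is_transitive(cycle))
print("rank order is transitive:", is_transitive(rank_order))
```

Only the second set of results can be turned into an ordinal scale of team strength.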
In the natural sciences a few examples of ordinal scales still exist, including the Beaufort scale of wind velocity, (1: leaves move slightly; 10: buildings blow over), the Richter scale for earthquake strength, and the Mohs scale for the hardness of minerals.
In the social sciences many variables are at the level of ordinal scales, but because much more powerful and useful statistics are available for the next level of measurement, most of these ordinal variables are treated as interval level variables. The consequences of this are debated in the profession, but these debates are of no great concern to us for the moment.

Interval Scales, or: When is 10 not twice 5?

The clearest example to illustrate interval scales comes from the measurement of temperature. In comparing Fahrenheit and Celsius scales, for instance, we can state that, roughly speaking,
34 degrees F = 1 degree C; and
68 degrees F = 20 degrees C.
How do these temperatures compare to each other? Well, in terms of Fahrenheit, 68/34=2, so it is tempting to say that one temperature is twice as warm as the other. But how do they compare in terms of the Celsius or centigrade scale? Now, 20/1=20, so here we might want to say that one temperature is twenty times warmer than the other. How come that the two measurement systems give us two different results? After all, our mathematical computations have been correct. Why are our conclusions contradictory?
Could it be that we get these peculiar results because we employ different measurement systems? No, because in measuring lengths in imperial and metric measures, two different systems, we still end up with the same results:
2 yards = 1.83 metres; and
4 yards = 3.66 metres.

How do these two lengths compare? Well, in yards, clearly, 4/2 = 2, and in metres, 3.66/1.83 = 2 as well. So changing measures of length did not influence the results here. Nor does it for measures of surface and volume. You can check that yourselves by doing the conversion on a simple example, if you want.
The explanation for our conflicting temperature results is that the two scales we compared have different zero points: they start counting at different points. These two scales, plus a third one, the Kelvin scale, can be illustrated as in Figure 5.2. The line represents the variable "temperature".


TEMPERATURE POINTS ON THREE SCALES
___________________________________________________________
Degrees Celsius        -273     -18       0      38     100
Degrees Fahrenheit     -460       0      32     100     212
Degrees Kelvin            0     255     273     311     373
___________________________________________________________

A GRAPHIC COMPARISON OF THREE TEMPERATURE SCALES

The Kelvin scale starts at the "natural" zero of absolute zero, but the Celsius and Fahrenheit scales start at relatively arbitrary points along the temperature line. That is why you cannot divide temperatures on these two scales. (You would also run into problems with negative temperatures on these scales.)
These scales do have the quality that each unit difference on the scale is equivalent to each other unit; in other words, one degree Celsius difference is always the same, no matter where it is located. Thus the difference between 10 F and 5 F is the same as that between 15 F and 10 F, for example. The intervals between successive numbers are the same, or as it is sometimes put, they are equidistant. That is different from the situation in the preceding level of measurement, ordinal scaling: there a point difference might mean, in one case, the difference between "some elementary schooling" and "no schooling" (codes 2 and 1 on question 40 in the preceding section), or between "some university or college", and "completed secondary school" (codes 6 and 5). So in ordinal scaling the implications or meaning of a difference of a point depend on where on the scale you are. At this new level of measurement, however, each point difference can be seen as an interval of equal size, hence the name "interval scale." (The reasons why these intervals are the same are rather complex, so we'll bypass that discussion.)
Because these intervals are the same, it is meaningful to compare differences by adding and subtracting numbers on the same scale. We can state, for example, that the difference between 10 and 5 degrees Celsius is the same as that between, say, 33 and 28. After all, (10-5)=5, and (33-28)=5.
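The temperature and length comparisons above can be verified numerically. The following Python sketch uses the standard conversion formulas; the function and variable names are our own. One caveat: the text rounds 34 degrees F to 1 degree C, whereas the exact conversion gives about 1.1 degrees C, so the Celsius ratio computed here comes out near 18 rather than 20. The conclusion is unaffected.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

YARD_IN_METRES = 0.9144  # exact by definition

# Ratios of the same two temperatures on three scales.
warm_f, cool_f = 68.0, 34.0
warm_c, cool_c = fahrenheit_to_celsius(warm_f), fahrenheit_to_celsius(cool_f)
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

ratio_f = warm_f / cool_f   # depends on Fahrenheit's arbitrary zero
ratio_c = warm_c / cool_c   # depends on Celsius's arbitrary zero
ratio_k = warm_k / cool_k   # Kelvin has a natural zero

# Ratios of the same two lengths in two units: both have a natural zero.
ratio_yards = 4 / 2
ratio_metres = (4 * YARD_IN_METRES) / (2 * YARD_IN_METRES)

# Differences, by contrast, behave consistently on interval scales:
# equal Fahrenheit differences correspond to equal Celsius differences.
diff_low = fahrenheit_to_celsius(10) - fahrenheit_to_celsius(5)
diff_high = fahrenheit_to_celsius(15) - fahrenheit_to_celsius(10)

print(f"temperature ratios: F {ratio_f:.2f}, C {ratio_c:.2f}, K {ratio_k:.2f}")
print(f"length ratios: yards {ratio_yards:.2f}, metres {ratio_metres:.2f}")
print(f"Celsius differences: {diff_low:.3f} and {diff_high:.3f}")
```

The length ratios agree because both units share the same natural zero; the Fahrenheit and Celsius ratios disagree because their zero points are arbitrary. Only the Kelvin ratio can be read as "so many times as warm".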
What are the permitted mathematical operations for an interval scale? First of all, we have to mention those that apply also to nominal and ordinal scales, and introduce the new feature specific to interval scales:
1. similarity and dissimilarity;
2. ranking;
3. addition and subtraction.

We cannot, however, multiply or divide, as our comparison of Fahrenheit and Celsius scales illustrated.

The statistical measures applicable to interval measures are first the ones we have encountered already:
1. those based on counts: frequencies, percentages, proportions; modes and modal categories;
2. those based on ranks: percentiles, medians, etc.

But for interval scales there are important new additions:
3. statistical measures based on equal intervals: the (arithmetic) mean, and its associated measures of dispersion: the variance and the standard deviation.
4. we can now also use standard correlational techniques.
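To make the measures of dispersion concrete, here is a short Python sketch using the five ages from the beginning of these notes. It assumes the population formulas (dividing by n); the sample formulas, which divide by n - 1, would give somewhat larger values. The variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance

ages = [20, 26, 30, 34, 40]

mean_age = mean(ages)

# Variance by hand: the mean of the squared deviations from the mean.
# This is only meaningful because the intervals between values are equal.
squared_deviations = [(age - mean_age) ** 2 for age in ages]
variance_by_hand = sum(squared_deviations) / len(ages)

# The same quantities from the standard library (population versions).
variance_age = pvariance(ages)
sd_age = pstdev(ages)

print("mean:", mean_age)
print("squared deviations:", squared_deviations)
print("variance:", variance_age)
print("standard deviation:", round(sd_age, 2))
```

Each step subtracts one value from another, which is exactly the operation that nominal and ordinal codes do not support.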

Because the mean and the standard deviation, and correlational analyses are very useful statistical tools, social scientists like to use ordinal scales as if they were interval scales, as was stated in the previous section. Most social scientists now seem to accept this practice. ("Pure" examples of interval scales in the social sciences are actually rather rare.)
Finally, do not confuse "interval scales" with a specific type of attitude scale, the Thurstone "equal appearing interval scales." Apart from the similarities in name these two scales are very different.


Ratio scales, or: back to basics

Ratio scales bring us back to familiar arithmetical examples: cookies, apples, or whatever else was used to teach you basic math.
Ratio scales have the characteristics of interval scales, plus the advantages of a natural zero point. Take income, for example: dollars can be added and subtracted, multiplied, divided, and so on. And you can start from a natural zero!
The mathematical operations applicable to ratio scales are all the ones that you are familiar with:
1. counting;
2. ranking;
3. addition and subtraction;
as well as the most important new operations:
4. multiplication and division.

As divisions establish ratios between numbers, this level of measurement is called a "ratio scale".
The statistical measures appropriate to ratio scales are the same ones as we applied to interval scaling, plus a rather obscure one hardly ever used in social science: the geometric mean, which you can safely ignore.
Examples of ratio scales, or variables at the ratio level of measurement in the social sciences, mainly deal with persons, objects or physical characteristics, such as space and time. They include: number of residents in a community; annual income in dollars; number of children per family; number of years married; number of years of education; number of working days; square feet per house; and so on. Sometimes ratio scales deal with the frequency of events, e.g., the frequency of moves over the last ten years.
This tidiness of ratio scales is often lost to some degree when we combine values into groupings, or grouped data. Take this question, for example:

41. To the best of your knowledge, what was your total income in the past year?
Up to $15,000.................................................................................................1
$15,001-$25,000.............................................................................................2
$25,001-$35,000.............................................................................................3
$35,001-$45,000.............................................................................................4
$45,001-$55,000.............................................................................................5
$55,001-$65,000.............................................................................................6
$65,001 and over............................................................................................7
Don't know......................................................................................................9

Given the unequal size of the groupings, especially the ones at both ends, it may be dangerous to treat this variable as a ratio scale. The treatment of grouped data requires some statistical caution, but we cannot go into that now.
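The loss of information caused by grouping can be shown with a brief Python sketch. The function name `income_code` and the example incomes are invented for illustration; the category boundaries are those of question 41.

```python
from bisect import bisect_left

# Upper bounds (in dollars) of categories 1 to 6 in question 41;
# everything above the last bound falls in category 7.
UPPER_BOUNDS = [15_000, 25_000, 35_000, 45_000, 55_000, 65_000]

def income_code(income):
    """Return the category code (1-7) for a known annual income."""
    return bisect_left(UPPER_BOUNDS, income) + 1

# Hypothetical incomes: the dollar amounts are on a ratio scale ...
incomes = [0, 15_000, 15_001, 40_000, 66_000, 500_000]
codes = [income_code(income) for income in incomes]

for income, code in zip(incomes, codes):
    print(f"${income:>9,} -> code {code}")

# ... but the codes are not: $500,000 and $66,000 share code 7,
# although one income is more than seven times the other.
```

Categories 2 to 6 each span $10,000, but the first and last categories have quite different widths, so a one-point difference in code does not correspond to a fixed number of dollars.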

A Summary of Levels of Measurement Scales

What are the characteristics and implications of systems of numbers in social measurement? The answer depends on the kind of variable that these numbers are used with: nominal variables, ordinal variables, interval variables, and ratio variables. These variables are at different levels of measurement, and they form different measurement scales.
Nominal level variables are often called qualitative, as we are dealing here with qualitative differences between various categories, which are otherwise incomparable. In ordinary language you might call these "cheese and chalk" or "apples and oranges" variables. The other three levels of measurement are called quantitative variables, as they measure quantities of some characteristic. (As stated earlier, Baker classifies these scales somewhat differently.)

15. AN OVERVIEW OF THE MAIN TYPES OF RELIABILITY AND VALIDITY

THE MAIN TYPES OF RELIABILITY
The reliability of a measure is defined here as the extent to which its measurement results remain the same under various (supposedly irrelevant) measurement conditions.

I. EQUIVALENCE
GENERAL QUESTION: what influence do various (supposedly irrelevant) measurement conditions have on the data collected, where these data are collected at the same point in time?
GENERAL APPROACH: compare data collected with the same instrument, but under slightly varying conditions. This results in the following main subtypes of equivalence:

IA. Intersubjective agreement: vary human element:
1. inter-interviewer agreement: compare data collected by different interviewers on comparable samples;
2. inter-coder agreement: compare results after different coders have coded the same set of interviews or other materials;
3. inter-rater agreement: compare results after different raters have rated the same set of subjects (or the same material);
4. inter-observer agreement: compare results after different observers have observed the same set of situations.
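
As a rough illustration of inter-coder agreement, the sketch below computes the simple proportion of units on which two coders agree. It also computes Cohen's kappa, a chance-corrected index that is not discussed in these notes but is commonly used for this purpose. The codes, the data, and the function names are invented for the example.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of units to which both coders assigned the same code."""
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(coder_a)
    p_o = percent_agreement(coder_a, coder_b)
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: product of the two coders' marginal proportions, per code.
    codes = set(coder_a) | set(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in codes)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes assigned by two coders to the same ten interviews.
coder_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
coder_b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "no"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In these invented data the coders agree on 8 of the 10 interviews. Kappa is lower than the raw agreement because some agreement would be expected by chance alone.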

IB. Comparability: vary instrument slightly:
1. agreement between single questions: compare results (for the same, or similar samples) of somewhat different question wordings, e.g., split ballot format.
2. alternate forms: compare results for supposedly similar multiple item instruments, composed of different items.

II. STABILITY: repeat measures at different times
GENERAL QUESTION: how do results on the same measure for the same
subject, taken at two points in time, compare? It is assumed that
the variable, and so the measurement results, should have
remained the same.
GENERAL APPROACH: compare data collected at two points in time
for the same respondents, and the same instrument, assuming no
change has occurred in target variable. This results in one main
type of reliability:

IIA. Test-retest reliability: see description just given.
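
A minimal sketch of a test-retest check, assuming the usual practice of correlating the two sets of scores with a Pearson correlation (the notes do not prescribe a particular coefficient). The scores and the helper name pearson_r are hypothetical.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

# Hypothetical scale scores for the same six respondents at two points in time.
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 10]

retest_r = pearson_r(time_1, time_2)
print(f"test-retest correlation = {retest_r:.2f}")
```

A coefficient close to 1 suggests stable results, but only under the assumption stated above that the target variable itself has not changed between the two measurements.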

III. INTERNAL CONSISTENCY: check coherence of items
GENERAL QUESTION: what is the general relationship between the
items of a multiple item instrument, or between the items and
their total score? How do these results fit the theoretical
expectations of the measurement model?
GENERAL APPROACH: study correlations between the items of a
multiple-item instrument, or between items and their total score,
in various ways. This results in the following subtypes:

IIIA. Split-half reliability: split the total instrument in
various halves and compare the results.

1. odd-even: odd-numbered items form one half; even-numbered items the other; compare results for same subjects on two halves.
2. coefficient alpha: calculates the average split-half correlation over all possible halves of the instrument.
3. Kuder-Richardson formula 20: comparable to coefficient alpha.
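
The two main split-half procedures can be sketched as follows. The odd-even function correlates each respondent's total on the odd-numbered items with the total on the even-numbered items. For coefficient alpha the sketch uses the standard computational formula based on item variances and the variance of the total score, which is an assumption about the formula intended here. The response data and function names are invented.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def odd_even_split_half(rows):
    """Correlate totals on odd-numbered items with totals on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in rows]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in rows]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

def coefficient_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: five respondents (rows) answering four items scored 1 to 5.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

split_half_r = odd_even_split_half(responses)
alpha = coefficient_alpha(responses)
print(f"odd-even split-half r = {split_half_r:.2f}, alpha = {alpha:.2f}")
```

With only five respondents the figures mean little in themselves; the point is only to show how the two coefficients are obtained from the same item-by-respondent table.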

IIIB. Guttman scaling: assume Guttman model, and apply scalogram
analysis.

IIIC. Inter-item correlations: study patterns of correlations
between items: techniques such as factor analysis.

IIID. Item-score correlations: analyze correlations between items
and their total score.

1. Likert item selection technique: analyze means for the same item for bottom and top 25 percent of scorers.
2. Phi coefficient: comparable to Likert, but compares bottom and top 50 percent of scores.
3. Item analysis procedures of a more refined type.
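
The Likert item selection technique can be sketched as below: respondents are ranked on their total score, and for each item the mean of the top 25 percent is compared with the mean of the bottom 25 percent. The data, the function name, and the handling of group size are assumptions made for the illustration.

```python
from statistics import mean

def item_discrimination(rows, item_index, fraction=0.25):
    """Mean item score of the top scorers minus that of the bottom scorers."""
    ranked = sorted(rows, key=sum)                 # lowest total scores first
    group_size = max(1, int(len(ranked) * fraction))
    bottom = ranked[:group_size]
    top = ranked[-group_size:]
    top_mean = mean([row[item_index] for row in top])
    bottom_mean = mean([row[item_index] for row in bottom])
    return top_mean - bottom_mean

# Hypothetical data: eight respondents (rows) answering three items scored 1 to 5.
responses = [
    [5, 4, 3],
    [5, 5, 3],
    [4, 4, 3],
    [3, 3, 4],
    [3, 2, 2],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 4],
]

discrimination = [item_discrimination(responses, i) for i in range(3)]
print(discrimination)
```

In these invented data the first two items separate high and low scorers clearly, while the third item does not, so it would be a candidate for removal from the scale.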


THE MAIN TYPES OF VALIDITY AND VALIDATION

VALIDITY: "the validity of a measurement is the extent to which
it measures what it is intended to measure."

Terminological note: The latest recommendations of the American Psychological Association and the American Educational Research Association now consider all types of validity sub-types of construct validation. Thus, they propose that we refer to the criterion-oriented type of construct validation, etc. That changes little in our discussion, however. (The APA and AERA do not recognize face validity as an acceptable type of validation.)

I. CRITERION VALIDATION
GENERAL QUESTION: how adequate is a new measure for measuring an already established concept or practical application, with an existing measure?
GENERAL APPROACH: some accepted measurement procedure already exists for the target concept: the criterion measure. The results for
the new measurement procedure are compared with the results for
the criterion measure for the same sample. This results in criterion
validation, with two subtypes:

IA. concurrent validity: data for new measure and criterion measure collected at the same time, concurrently;

IB. predictive validity: data for criterion measure are collected at a later time than those for the new measure, the predictor measure.

II. CONSTRUCT VALIDATION
GENERAL QUESTION: to what extent will the results of a new
measurement procedure for a new concept, when correlated with
measures of other concepts, confirm the hypotheses that follow
from the theory associated with the introduction of the new
concept?

GENERAL APPROACH: collect data on new measurement procedure for
the target concept, and for theoretically related concepts, and
check whether these results confirm the theoretically predicted
results.

MISCELLANEOUS TYPES OF VALIDITY

FACE VALIDITY

GENERAL QUESTION: does the measurement procedure look adequate
for its intended purpose?

GENERAL APPROACH: prima facie inspection of the instrument for possible shortcomings; no comparison with other data.

CONTENT VALIDITY

GENERAL QUESTION: does the measurement instrument adequately
represent the content area of the target variable, or the target
concept?
GENERAL APPROACH: (a) where the content area is clearly defined,
check whether the measure covers intended content; (b) "expert
judgment": see whether other researchers agree with measure
8. Validity in theoretical research

The aim of theoretical research is the development of conceptual frameworks, linking concepts by a network of causal or statistical associations. The approaches to the problem of validity in theoretical research can only be understood against the background of the aims and purposes of theory development. It is therefore important that you have some grasp of the goals of theory development in dealing with theoretical validity.
Validation procedures in theoretical research largely concentrate on two main issues: 1. Is a new measurement procedure a good way to measure an already well-established, well-measured concept? 2. Do the data resulting from a new measurement procedure for a new concept support the theoretical implications of a new concept, and its related theory? Although these issues are quite general and abstract, they shall be clarified by relevant examples from actual social research.

9. The validation of new measures for old concepts

In every subfield of social science a number of well-established theories can be found. The meaning of the concepts within the theory is relatively clear, and measurement procedures have been developed for the major concepts that are reasonably reliable and valid. As a major example in sociology we can point to social stratification theory, linking the concepts of education, income, and occupation to many other concepts. Other examples of such theories can be found in any introductory text, although no theory will be without its critics. Within social stratification theory the concept of socio-economic status and other concepts have demonstrated their utility, in that many studies have found that these concepts are strongly related to many other concepts. In other theories we can also find concepts that have similarly proven their value. Although these concepts and their measurement procedures are well-established, we may still want to validate new ways of measuring them.
Why would we want new measures for old concepts with satisfactory measures? We may do so for a number of reasons. First of all, the new measure may be easier to use. Short versions of long attitude or personality measures have sometimes been developed, as they would take less time to administer. The results for the new, shorter version can then be compared with the results for the older, longer version, when the two are administered at the same time to a sample of subjects or respondents. Secondly, the new measure may be less reactive or obtrusive than the older measure. An observational measure of socio-economic status may be less intrusive than questions about income and education, for instance. Unobtrusive techniques were intended to produce measures which were less reactive. Ethnicity can often be accurately measured by pronunciation, even where the subject is trying to disguise his identity. (There are at least three incidents reported in history where invading foreigners were identified by their pronunciation of a single word. The most recent example concerns the Netherlands in 1940, where suspected German spies operating behind the Dutch lines had to pronounce the name "Scheveningen.") Thirdly, a new measure may have a wider or different range of applicability. A clear example comes from the area of mental health studies. Achenbach and his associates developed a child problem checklist that could be used to pinpoint potential mental health problems in boys and girls between 11 and 16 years of age. The checklist had to be filled in by parents. Since that original instrument new versions have been developed for younger children, as well as forms that could be filled in by the child's teacher or by the child itself (Achenbach, 1983). As a fourth reason, the new measure may be more precise or reliable. 
Although in daily life most of us do not encounter problems distinguishing men from women, for the Olympic Games more precise sex tests, in the form of chromosome counts are required. (The pronunciation test for ethnicity in Holland was also considered more reliable than other measures.) As the fifth and final reason, the new measure may be higher in systematic import, i.e., it may be linked to other concepts via stronger (and simpler) correlations than the existing measure. This is a claim made for the Wilson C-scale for conservatism, for instance, apart from additional claims that it is simpler to use and more reliable than other measures related to the same concept (Wilson, 1973).
Social science examples of old measures being replaced by new ones tend to show only a few of the advantages just listed. In the physical sciences new measures may replace older measures for most of the reasons listed. The measurement of time, e.g., has moved from sun dials to pendulum clocks, to other kinds of clocks, to the present state of the art, atomic clocks. If you compare a sun dial as a measure of time with a quartz clock, you will find that most of the advantages listed apply.
In validating a new measure for an established concept we proceed in general by correlating the results for the new measure with the results for an already existing measure. As the older measure can be considered as a criterion measure, we can consider such validation as a theoretical application of criterion validation.

10. Validation strategies for new concepts

In the last section we discussed the ways of validating new measures for established concepts with already existing measurement procedures. The next question follows naturally: how do we validate new measures for new concepts, for which no satisfactory measurements have as yet been developed? In general we shall assume that a new theoretical concept is introduced together with a related theory. In such cases we are usually dealing with what often are called cluster concepts, linking many characteristics. Attitudes, to name one prominent kind of cluster concept in the social sciences, are not merely predispositions to act in one specific way toward one specific object. If an individual gives money to the Salvation Army once, we usually do not say on the basis of one such action that he has an altruistic attitude. Such an attitude refers to a general disposition to act, feel, and evaluate in a variety of altruistic ways toward a variety of objects, such as charitable organizations, deprived groups, and so forth. Being prejudiced does not merely imply that one acts once in a contemptuous manner towards one member of a minority group; the concept of prejudice assumes that individuals tend to react in a generalized manner, usually unfavourably, toward most, if not all, members of minority groups, in highly diverse situations. Another type of cluster concepts consists of personality traits.
Cluster concepts can be considered as mini-theories, as they posit that specific regularities will be found; concepts are inconceivable without regularities. A simple example can clarify the point that a concept may have unwarranted assumptions, and thus be invalid. The theory of witchcraft, as developed in Europe in the Middle Ages and later, assumed that various empirical regularities tended to coincide: older women, who had formed a pact with the Devil, were able to slip through exceedingly small holes, fly through the air on broomsticks, goats, and other objects. These women were also capable of inflicting physical and other damage on other human beings, animals, buildings, crops, and other targets by supra-natural means. Because of their propensity to fly witches also were characterized by light body weight. In some European towns, such as Oudewater in Holland, the operationalist consequences of this witchcraft theory can still be seen: large weighing scales, to be used for the identification (measurement) of witches, who were supposedly light. How should we, as social researchers, approach the obvious validation question: is body weight a valid measure for the concept of witchdom? Can weighing scales distinguish validly between witches and non-witches? Unless you believe in witches, the absurdity of these questions should be evident. Confronted with such questions, we should answer that we do not share their assumptions: there is no acceptable evidence that witches exist. In other words: no such cluster of characteristics as low body weight, propensity to fly, etc., has been demonstrated (not to mention the assumption of the devil’s existence). As a result, merely asking whether body weight is a valid measure of being a witch is a misleading way of posing the question; it assumes that witches exist. We can only assess the validity of a cluster concept if we also consider the strength of the evidence supporting the assumptions implicit in that concept. 
That evidence has been called, not surprisingly, justificatory evidence by the philosopher Kaplan. (This points to one of the major differences between pragmatic and theoretical validation. In pragmatic validation, as long as a socially accepted criterion measure exists, you can engage in criterion validation.)
Cluster concepts are frequently called theoretical constructs, and the method of validating them is therefore called construct validation.

11. Construct validation: the example of “Becoming Modern”

In construct validation it is necessary that the introduction of a new cluster concept be accompanied by a set of hypotheses linking that concept to other ones. These other, associated concepts may be causal factors, effects, or other correlates of the concept in question, as our example will demonstrate. We shall illustrate the procedures of construct validation by describing one large study in some detail. The study selected, conducted by Alex Inkeles and his co-workers, notably David Smith and Karen Miller, focused on the concept of individual modernity. The main results were published in a monograph, Becoming Modern, co-authored by Inkeles and Smith (1974), and a series of articles by Inkeles and others. The data for the project consisted of lengthy interviews with 5,500 men in a number of developing countries: Argentina, Bangladesh (then East Pakistan), Chile, India, Israel, and Nigeria. The fieldwork took place in the early sixties.
What was the study topic of Becoming Modern? What was its theoretical perspective?

In the Harvard six-nation study we were guided by a particular theoretical perspective. In most general terms, the main purpose of the research was to test whether, where, and how far individuals come to incorporate as personal attributes qualities which are analogous to or derive from the organizational properties of the institutions and the roles in which these individuals are regularly and deeply involved. To give this model greater specificity, we selected the factory as the embodiment of one major type of modern institution, so that our general question could be rephrased more concretely, as follows: "What are some of the personal qualities which extended service in a factory might inculcate in individuals who moved into such service after growing up in the typical agricultural village of one of the less developed countries?" (Inkeles, 1977: 142-143)

Inkeles and his fellow researchers then performed an analysis of the characteristics of the factory system and the set of qualities they assumed the new factory worker would learn as a result of his occupational role.

Among the qualities we expected... were a sense of personal efficacy, openness to new experience, respect for science and technology, acceptance of the necessity for strict scheduling of time, and a positive orientation toward planning ahead. Each of these characteristics we then designated as components in our definition of the modern man conceived in psychosocial terms.

To that set a series of other attributes, based on an analysis of the expected or required attributes of modern man as a participant citizen, as a family member, and in other roles, was added.

For example, in the family realm we defined as more modern the insistence on selecting one's own spouse rather than accepting a spouse chosen by one's parents or other "elders," the preference for small rather than large families, and the willingness to practice birth control and the actual limitation of family size, as against the passive acceptance of "whatever number of children God might send."

This analysis produced a list of 24 main themes, each considered part of the larger set of qualities defining individual modernity... It reflected a definite theoretical position, which we believe it should have, since that permitted testing whether certain explicit expectations underlying the definition were sound (Inkeles, 1977: 142-143).

These lengthy quotations set out the basic theory of individual modernity, and partially describe the main elements of the modernity cluster or syndrome. A great number of items, 119, were then developed to cover each of these elements. A number of item selection procedures were used to produce an individual modernity scale. (Actually a number of scales were produced by different item selection techniques, but a detailed overview is unnecessary for our purpose.) This scale was subsequently checked for its internal consistency, and these results were satisfactory.
How could the researchers now proceed in testing the validity of their modernity scale? Could they use criterion validation? Clearly not, as there is no single, clear and accepted criterion of individual modernity. This resulted in a dilemma, which Inkeles and Smith describe as follows. (The OM scale is the Overall Modernity scale, combining the various aspects or dimensions of the modernity cluster.) As the authors set out their problem very clearly we shall quote a rather lengthy section of Becoming Modern.

To prove its worth a scale not only must distinguish one individual from another, but must do so accurately. The usual method for establishing the validity of a scale is to apply it to people whose characteristics are already known by some independent criterion, which is why this approach is called the "criterion method of scale validation." If we were devising a test of psychic adjustment, for example, we might compare the scores of patients in a mental hospital with those of individuals whom psychiatrists had rated as well adjusted. If the scale was any good we would expect it correctly to identify the criterion group of hospital patients. Even the method of validation by a criterion group of "known" quality is full of vicissitudes, as the example just given will surely suggest. Our problem, however, was even more serious. There simply is no generally accepted external criterion by which we can certify a man to be modern. Indeed, one objective of our project was precisely to establish who were the modern men.
Our theory of modernization offered a way out, but it also put us on the horns of a dilemma. The theory held that certain institutions and experiences have the capacity to change men in ways which make them more modern. We assumed that the more such experiences a man had been exposed to, the greater would be the degree of his individual modernity, as expressed in his attitudes, values, and behavior. Therefore, if the OM scale was valid it should have assigned higher scores to men who had been much exposed to modernizing institutions and experiences. In other words, our theory indicated that we should take certain objective social characteristics as defining the external criteria by which to test the OM scale. Accordingly, those who were better educated, who worked in industry rather than agriculture, who lived in the city rather than the countryside, and who made above-average use of the mass media should have scored as more modern.

Although this sounded very plausible, we had good reason to hesitate before committing ourselves to this method for testing the validity of the OM scale. The proposed approach suffered from the defect that it assumed the correctness of the very theory we were attempting to test. We therefore faced the prospect of being confronted by a dilemma should we discover that individuals more exposed to modernizing experiences failed to score higher on the OM scale. We had to recognize that if such were the outcome we would be faced with two alternative explanations without being able to choose between them. One alternative would be to argue that the fault was in the OM scale. In other words, one might maintain that the institutions cited did actually change men in ways which made them more modern, but that the OM scale failed to reflect those changes. Adopting that explanation would imply that our theory of change had been correct, but the OM scale was invalid. The second alternative would be to assume that the OM scale was quite good at telling which men were truly modern, but that the institutions cited did not contribute to making them so. Adopting that interpretation would lead to the conclusion that the scale was valid, but that our theory about the causes of individual change had been incorrect.
Although we were distressed by the prospect of great ambiguity should the OM scale fail to be positively associated with modernizing experiences, we saw no alternative for establishing the validity of the scale. And we took comfort in the realization that should the OM score indicate greater modernity among those more exposed to modernizing experiences, we would be a double winner. That result, we felt, would establish simultaneously that the OM scale was valid as judged by an external criterion and that increased exposure to modernizing institutions brought about greater individual modernity. Thus, our causal theory would be proved correct, and the scale established as valid, simultaneously. (Inkeles and Smith, 1974: 119-121; emphases in original.)

Let's review how Inkeles and Smith proposed validating their Overall Modernity (OM) scale. Initially they tested the scale for internal consistency, which can be considered as a step in the trait validation process. After that:
1. first they set out the theory, listing which causal factors or "modernizing experiences" would be associated with high scores on the OM scale;
2. they then checked whether the data indeed supported their theoretical predictions that increased exposure to these modernizing experiences would be associated with high scores on the OM scale;
3. if this were the case, they would be double winners: their theory would be supported, and their measure proven valid;
4. on the other hand, if the results were negative, there would be a number of possibilities:
a. the scale was valid, but the theory incorrect;
b. the scale invalid, but the theory correct;
and, although Inkeles and Smith overlooked this possibility:
c. the scale was invalid, and the theory incorrect.

Inkeles and Smith had specified that nine different experiences would lead to an increase in the OM scores, and they developed specific hypotheses for these nine causal factors. An example is that "Higher OM Scale scores will be associated with higher mass media exposure levels" (Inkeles and Smith, 1974:161). These hypotheses were tested statistically, and in general the predictions were well supported by the results.


12. The strategies of construct validation

The example of the Harvard six-nation study illustrates the main characteristics of construct validation. This set of procedures can be applied when a new concept is proposed that is linked to other concepts in a scientific theory. The justification for this procedure was given by Cronbach and Meehl, when they stated:

Scientifically speaking, to "make clear what something is" means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a construct network (Cronbach and Meehl, 1956).

Construct validation can be applied in the situation where a tentative measure of a proposed new concept has to be tested as to its validity. That concept should be clearly linked to other, measurable concepts via a theoretical network. The procedures of construct validation should follow these general steps:

1. As a preliminary step the reliability of the measuring instrument should be assessed;
2. The hypotheses linking the new measure are then tested by analyzing whatever data are available;
3. Where the results tend to support the hypotheses, both the adequacy of the proposed theory and the validity of the new measure are supported;
4. Where the results do not tend to support the hypotheses put forward, this can be due to:
a. the inadequacy of the proposed new theory;
b. the lack of validity of the new measure; or
c. some combination of these two possibilities.

(We assume here that other aspects of our research are not to blame for the negative results, e.g., the results are not caused by inadequacies of experimental design.)
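
The hypothesis-checking step in the procedure above can be sketched roughly as follows: each related variable is correlated with the new scale, and the sign of the correlation is compared with the direction the theory predicts. All variable names and scores are invented, and a real analysis would use proper significance tests rather than a bare check of signs.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

# Hypothetical data for eight respondents: scores on the new scale, plus
# two variables that the theory links to the new concept.
new_scale = [30, 42, 55, 38, 61, 47, 52, 35]
related = {
    "years_of_education": [2, 6, 10, 4, 12, 8, 9, 3],
    "media_exposure": [1, 3, 5, 2, 6, 3, 5, 2],
}
# Direction of association predicted by the theory (+1 or -1).
predicted_sign = {"years_of_education": +1, "media_exposure": +1}

results = {}
for name, scores in related.items():
    r = pearson_r(new_scale, scores)
    supported = (r > 0) == (predicted_sign[name] > 0)
    results[name] = (r, supported)

all_supported = all(supported for _, supported in results.values())
for name, (r, supported) in results.items():
    print(f"{name}: r = {r:.2f}, prediction supported: {supported}")
```

If all predictions are supported, both the theory and the measure gain support. If they are not, the code cannot tell us whether the theory, the measure, or both are at fault, which is exactly the ambiguity listed under step 4.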

13. The notion of content validity

A very different type of validity from the ones we have just discussed is content validity. Content validity refers to the extent to which a measure, usually a multiple item instrument, represents the specific content area of a given target variable or target concept. As such it is most useful in educational tests, where the curriculum or other guidelines often delimit the content area for a test of, say, grade 13 calculus, rather carefully. Nevertheless content validity is also an important criterion in other areas of applied and pure research. The authors of an inventory of measures of occupational attitudes evaluated all the instruments considered on the "proper sampling of content:"

Proper sampling is not easy to achieve, nor can exact rules be specified for ensuring that it is done properly... Nevertheless, there is little doubt of the critical nature of the general aim in scale construction... In the job satisfaction area, we have given detailed consideration to the analysis of responses to open-ended questions from representative samples which ask the respondent, "What things (do you like best) (don't you like) about your job?" We feel that these responses offer invaluable guidelines to the researcher as to both the universe of factors he should be covering and the weight that should be given to such factors. (Robinson, Athanasiou, Head, 1969:4)

Paul Lazarsfeld has suggested a method, called conceptual
analysis, to increase the content validity of attitude scales and
similar instruments. He suggests the following four steps:

1. "the creation of a rather vague image or construct;"
2. the concept is then elaborated by the specification of
aspects or dimensions of the concept;
3. for each of these dimensions indicators or items are then
developed;
4. the best items for each dimension are then combined in an
overall scale. (Lazarsfeld, 1959)

In the development of the Wilson Conservatism Scale the following dimensions of conservatism were specified, for instance (Wilson and Patterson, 1967:3):

a. Religious fundamentalism
b. Right-wing political orientation
c. Insistence on strict rules and punishment
d. Intolerance of minority groups
e. Preference for conventional art, clothing, and institutions
f. Anti-hedonistic outlook
g. Superstitious resistance to science

In this manner the content area of a particular scale will at least be considered explicitly, rather than being left unconsidered. (In other parts of scale development one would have to consider aspects of item selection, internal consistency, validity, and so on.)
Although the content validity of an instrument can vary from poor to good, its assessment still has to rely on the considered review by researchers. In the end the level of a scale's content validity is therefore a matter of professional judgment.

14. Validity, reliability, and measurement error

What is the relationship between reliability and validity?
There are various ways to answer this question. First of all, if
a measure is valid, it necessarily is reliable; it cannot be
subject to high levels of measurement error, as that would
necessarily decrease its validity. If a scale measures, say, the
concept of prejudice accurately, it cannot be strongly influenced
by extraneous factors.
If we take the first two types of reliability, equivalence
and stability, we can state that low reliability virtually
excludes high validity. If a measure is highly susceptible to
measurement error, it is unlikely to be a good measure for any
concept. (The only exception is that such a measure may be useful
as an indicator of measurement error, say, social desirability
set.) On the other hand, high reliability does not guarantee high
validity: a consistent measure may still be off! In summary, a
valid measure is necessarily reliable; a reliable measure is not
necessarily valid, but an unreliable measure is always invalid. In other terms, reliability is a necessary, but not sufficient condition of validity. (The only combination not possible is unreliable but valid.)
As far as internal consistency forms of reliability and
their relationships to validity are concerned, they are complex
and depend on the measurement model for the target variable. But
we shall not dwell further on this topic.


In sociological methodology a somewhat different approach to the adequacy of data is presently gaining in prominence. In this model, which has largely been derived from psychology, a respondent's score on a test or scale is considered the result of two factors: the respondent's true score on the variable measured, and a separate error score, which is due to various measurement errors and biases. (Sometimes the measurement error is divided into two separate components, the systematic error--bias-- and the random error.) This model is formalized in the following formula:

O(m) = T(s) + E(s),

or: the observed measurement O(m) equals the sum of the true
score T(s) and the error score E(s).
This measurement error approach is very useful in developing
statistical approaches to measurement error, as well as to the influence of measurement error on statistical relationships between variables. Some authors have suggested that this model should replace the older approaches to validity and reliability that have been covered in this chapter. Still it is likely that the latter approaches will remain important, as they offer substantive approaches to the assessment of measurement adequacy.
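
The true score model can be illustrated with a small simulation. The sketch below generates hypothetical true scores, adds a random error and a constant bias, and then computes the share of observed variance that is due to true scores. Treating that ratio as the reliability is the standard definition in classical test theory, not something stated in these notes, and all the numbers are assumptions chosen for the illustration.

```python
import random
from statistics import mean, pvariance

random.seed(2001)  # fixed seed so the illustration is repeatable

n = 5000
true_scores = [random.gauss(50, 10) for _ in range(n)]    # T: true scores
random_errors = [random.gauss(0, 5) for _ in range(n)]    # random part of E
bias = 3.0                                                # systematic part of E (assumed constant)

# Observed score = true score + error score (random error plus bias).
observed = [t + e + bias for t, e in zip(true_scores, random_errors)]

# Share of the observed variance that is due to the true scores.
reliability = pvariance(true_scores) / pvariance(observed)
# A constant bias shifts the mean but does not add variance.
mean_shift = mean(observed) - mean(true_scores)

print(f"reliability = {reliability:.2f}, mean shift = {mean_shift:.2f}")
```

The random error lowers the reliability, while the constant bias leaves it untouched and only shifts the average. This is one way to see why a consistent measure may still be off.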

Thursday, February 01, 2001

ATTENTION; THE CLASS OF MONDAY, FEBRUARY 5, HAS BEEN CANCELLED. PLEASE READ THE MATERIAL BELOW. PLEASE LET YOUR FRIENDS AND CLASSMATES KNOW!

SEE YOU ON WEDNESDAY!