Chapter 2: Description of Samples and Populations

STAT 218 - Week 2, Lecture 3

January 18th, 2024

Let’s Refresh Our Memory

In the last 2 weeks, we

discussed the differences among anecdotal evidence, observational studies, and experimental studies
defined the terms population and sample
examined different types of random sampling
had an introduction to descriptive statistics
- saw some examples of frequency distributions enabling us to summarize categorical data and/or numeric data
  - either as a table or a graph

And OF COURSE

We took our first steps in R!

What about Today?

Today we will see some examples to understand

the shape, center, and dispersion in a data set.

Shapes of Distributions

The shape of a distribution can be represented by a smooth curve that provides an approximation of the histogram.

Central Tendency

Measures of Center

To understand the center or typical value of a data set, we calculate

Mean
Median
Mode

These are also called measures of “Central Tendency”.

Mean

You might be familiar with this term. It is also known as
- arithmetic mean OR sample mean

Tip

Remember we employed a symbolic convention to differentiate between a variable and an observed value of that variable.

\(Y = birthweight\) (Variable)
\(y = 12.8\) lb (Observed Value)

We now denote

the observations in a sample by \(y_1\), \(y_2\), . . . , \(y_n\)
the mean of the sample by the symbol \(\bar{y}\) (read “y-bar”).

Mean

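We calculate the mean by adding up all the observations and dividing by the sample size \(n\):

\(\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}\)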

Median

Imagine what would happen if Bill Gates were in our class and we calculated the average balance of our bank accounts.
- It might not be the best idea to interpret this average, because one extremely large value would pull it far above what a typical student has.
Instead, we can calculate the median, which is the value that splits the ordered data into two equal parts.

How to Find the Median

Arrange the observations in increasing order.
In the array of ordered observations, the median is
- the middle value (if n is odd) or
- midway between the two middle values (if n is even).
We denote the median of the sample by the symbol \(\tilde{y}\) (read “y-tilde”).
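The steps above can be sketched in R as follows. This is only an illustration: `median_by_hand` is a hypothetical helper name, and the input values are made up for the example. In practice, R's built-in `median()` does the same job.

```r
# Hypothetical helper that follows the steps above (not from the slides).
median_by_hand <- function(y) {
  y_sorted <- sort(y)        # arrange the observations in increasing order
  n <- length(y_sorted)
  if (n %% 2 == 1) {
    # n is odd: take the middle value
    y_sorted[(n + 1) / 2]
  } else {
    # n is even: take the point midway between the two middle values
    (y_sorted[n / 2] + y_sorted[n / 2 + 1]) / 2
  }
}

odd_result  <- median_by_hand(c(3, 1, 7))      # made-up data, n = 3
even_result <- median_by_hand(c(3, 1, 7, 5))   # made-up data, n = 4
odd_result
even_result
```

Both results agree with what `median()` returns for the same vectors.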

Mode

The mode of a dataset is the value that occurs with the highest frequency.
It serves as a measure of central tendency, indicating the most prevalent choice or the characteristic that appears most frequently in your sample.

Let’s have another toy example

Assume that we have the following dataset:
- 22, 6, 6, 4, 2

Measures of Center	Data and Calculation	Result
Mean	(2+4+6+6+22)/5	8
Median	2,4,6,6,22	6
Mode	2,4,6,6,22	6
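The table can be checked in R with a short sketch. Base R provides `mean()` and `median()`, but its `mode()` function reports the storage type of an object rather than the statistical mode, so the mode is found here from a frequency table. The variable names are chosen only for this illustration.

```r
y <- c(22, 6, 6, 4, 2)

y_mean   <- mean(y)      # (2 + 4 + 6 + 6 + 22) / 5
y_median <- median(y)    # middle value of 2, 4, 6, 6, 22

# Statistical mode: the value(s) with the highest count in the frequency table
counts <- table(y)
y_mode <- as.numeric(names(counts)[counts == max(counts)])

y_mean     # 8
y_median   # 6
y_mode     # 6
```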

Robustness

A statistic is said to be robust if the value of the statistic is relatively unaffected by changes in a small portion of the data, even if the changes are dramatic ones.
The median is a robust statistic, but the mean is not robust because it can be greatly shifted by changes in even one observation.
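A small sketch in the spirit of the Bill Gates example shows the difference. All balances below are made-up numbers used only for illustration.

```r
# Made-up bank balances (in dollars) for five students
balances <- c(1200, 800, 1500, 950, 1100)

# The same class after one extremely wealthy person joins (hypothetical value)
with_outlier <- c(balances, 1e11)

mean_before   <- mean(balances)
median_before <- median(balances)
mean_after    <- mean(with_outlier)
median_after  <- median(with_outlier)

c(mean_before = mean_before, mean_after = mean_after)
c(median_before = median_before, median_after = median_after)
```

One added observation moves the mean from about a thousand dollars to billions, while the median only shifts from 1100 to 1150.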

Mean vs. Median

Both are useful measures!

Warning

The mean is based on the sum of the observations, so it sometimes makes very little sense.
- In some situations in the life sciences, such as bioassay, survival, and toxicity studies, the mean cannot be computed (until the last patient has died, for instance), whereas the median can be calculated.

The median is more robust than the mean.
- Remember the Bill Gates example!
The mean can sometimes be more efficient than the median.
- The mean takes full advantage of all the information available, which makes it a rock star in classical statistical methods.

Visualizing Mean and Median

Let’s use the Rossman & Chance applet to visualize the mean and median.

Spread of Distributions

Let’s assume we managed to collect data from our squirrels on campus :) Our class was divided into three groups, and each group measured the weights (lbs) of 10 squirrels. Here are the results:

Group 1: 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25

Group 2: 1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 1.5

Group 3: 1.0, 1.4, 1.2, 1.4, 1.1, 1.3, 1.6, 1.0, 1.2, 1.3

Dr. Demirci mentioned that looking at all these numbers is confusing. Can you please calculate the sample mean for each group to summarize the data?

All three groups calculated the same mean, which is 1.25 lbs. Dr. Demirci did not seem happy with this number.
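The three means can be checked in R. This is a minimal sketch; the vector names `group1`, `group2`, and `group3` are chosen here for illustration.

```r
# Squirrel weights (lbs) as recorded by each group
group1 <- rep(1.25, 10)
group2 <- c(rep(1.0, 5), rep(1.5, 5))
group3 <- c(1.0, 1.4, 1.2, 1.4, 1.1, 1.3, 1.6, 1.0, 1.2, 1.3)

# Sample mean of each group
group_means <- sapply(
  list(group1 = group1, group2 = group2, group3 = group3),
  mean
)
group_means
```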

Why?

Spread of Distribution

So far, we’ve explored the shapes and central tendencies of distributions
- but an effective depiction of a distribution should also capture its dispersion
  - whether the observations in the sample are mostly uniform or exhibit significant variation.
We can also report this by using
- Range
- Standard Deviation
- Interquartile Range

Range

The range is a measure of dispersion: the difference between the largest and smallest observations in a sample.
Let’s calculate range
- Group 1: \(1.25 - 1.25 = 0\)
- Group 2: \(1.5 - 1.0 = 0.5\)
- Group 3: \(1.6 - 1.0 = 0.6\)
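The same calculation can be sketched in R. Note that base R's `range()` returns the minimum and maximum as a pair rather than their difference, so the hypothetical helper `range_width` below subtracts them.

```r
group1 <- rep(1.25, 10)
group2 <- c(rep(1.0, 5), rep(1.5, 5))
group3 <- c(1.0, 1.4, 1.2, 1.4, 1.1, 1.3, 1.6, 1.0, 1.2, 1.3)

# Hypothetical helper: largest observation minus smallest observation
range_width <- function(y) max(y) - min(y)

group_ranges <- c(
  group1 = range_width(group1),
  group2 = range_width(group2),
  group3 = range_width(group3)
)
group_ranges
```

An equivalent one-liner is `diff(range(y))`.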

Warning

Range is easy to calculate, but very sensitive to extreme values
- It is not robust.

Standard Deviation and Variance

The variance, denoted \(s^2\), is the square of the standard deviation \(s\).
Let’s calculate standard deviation for each squirrel group!
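A sketch of the calculation in R follows. It assumes the usual sample formula \(s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}\), which is what R's built-in `sd()` computes; the slide itself does not show the formula, so treat this as an assumption. `sd_by_hand` is a hypothetical helper written only to make the steps visible.

```r
group1 <- rep(1.25, 10)
group2 <- c(rep(1.0, 5), rep(1.5, 5))
group3 <- c(1.0, 1.4, 1.2, 1.4, 1.1, 1.3, 1.6, 1.0, 1.2, 1.3)

# Hypothetical helper: sample standard deviation with the n - 1 denominator
sd_by_hand <- function(y) {
  n <- length(y)
  deviations <- y - mean(y)            # distance of each observation from the mean
  sqrt(sum(deviations^2) / (n - 1))    # square root of the sample variance
}

groups <- list(group1 = group1, group2 = group2, group3 = group3)
sd_manual  <- sapply(groups, sd_by_hand)
sd_builtin <- sapply(groups, sd)       # built-in sd() should agree
var_builtin <- sapply(groups, var)     # variance is the SD squared

round(sd_manual, 3)
round(sd_builtin, 3)
```

Although all three groups share the same mean, their standard deviations differ: Group 1 has no spread at all, and Group 2 is more spread out than Group 3.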

Quartile Range & Interquartile Range

Comparison of Measures of Dispersion

The range is simple to understand, but it can be a poor descriptive measure because it depends only on the extreme tails of the distribution.
- It is highly nonrobust.
The interquartile range, by contrast, describes the spread in the central “body” of the distribution.
The standard deviation takes account of all the observations and can roughly be interpreted in terms of the spread of the observations around their mean.
- However, the SD can be inflated by observations in the extreme tails.
The interquartile range is a robust measure, while the SD is not robust.
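This comparison can be illustrated with a small sketch in R. It takes the Group 3 squirrel weights and adds one made-up extreme value (a hypothetical 5.0 lb squirrel). Note that R's `IQR()` uses its default quantile rule, which can give slightly different quartiles from a hand calculation based on the medians of the lower and upper halves.

```r
group3 <- c(1.0, 1.4, 1.2, 1.4, 1.1, 1.3, 1.6, 1.0, 1.2, 1.3)

# Same data plus one made-up extreme observation (illustration only)
with_extreme <- c(group3, 5.0)

# Range width, standard deviation, and interquartile range in one vector
spread_summary <- function(y) {
  c(range = max(y) - min(y), sd = sd(y), iqr = IQR(y))
}

before <- spread_summary(group3)
after  <- spread_summary(with_extreme)

round(rbind(before, after), 3)
```

The range and the SD grow several times larger after the extreme value is added, while the interquartile range barely moves.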
