# Distribution of Sample Proportion

.

In the last section we introduced the idea of a **sample proportion**.

.

Recall that the data is binomial, meaning each data point is either a "success" or a "fail".

.

The **sample proportion** is the fraction of the sample which scores a success on the question being studied.

.

- We use $\hat{p}$ to represent the sample proportion

- $\hat{p}$ is a **sample statistic**, so it varies from sample to sample

- $\hat{p} = \dfrac{\text{num of successes in sample}}{\text{sample size (n)}}$
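As a small illustration of the formula above, the following sketch computes $\hat{p}$ for one sample. The list `sample` is a hypothetical set of ten observations made up for this example; it is not data from this section.

```python
# Hypothetical sample: True marks a "success", False marks a "fail".
sample = [True, False, False, True, False, False, False, True, False, False]

n = len(sample)           # sample size
successes = sum(sample)   # each True counts as 1
p_hat = successes / n     # sample proportion

print(f"n = {n}, successes = {successes}, p_hat = {p_hat}")
```

A different sample from the same population would usually give a different value of `p_hat`, which is why $\hat{p}$ is called a sample statistic.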

.

The usual process in statistics is to select **one** sample from the population and draw a conclusion about the population from the sample.

.

In this section, we will collect a large number of samples from the same population, returning each sample to the population before the next one is drawn.

.

The **sample proportion** then behaves like a **binomial distribution**.

.

### Caution:

- When we studied binomial distributions, we used **n** = the number of trials

- In this topic, we use **n** = sample size

- The two meanings are related: in one sample we are effectively performing **n** trials of a binomial variable

.

### Example 1

It is known that **12%** of students in a school of **1500** students are left-handed.

- the population proportion $p = 0.12$

- the population size $N = 1500$

.

We will use a sample size of $n = 20$ students

.

Let **X** be the number of left-handed students in each sample.

.

… … $\hat{p} = \dfrac{X}{n}$

.

We took **50** samples with $n = 20$ and produced the following frequency table (modelled using random numbers)

In other words:

- there were **5** samples with **0** left-handed students, **10** samples with **1** left-handed student, etc.

- there were no samples with $X > 6$ (more than 6 left-handed students)

.

We can find the **mean** and **standard deviation** of this set of data:

.

… … **Mean**

… … … … $\mu = \dfrac{\Sigma \hat{p} \times f}{\Sigma f}$

… … … … … $= \Sigma \big( \hat{p} \times RF \big)$ … … {RF is Relative Frequency}

… … … … … $= 0.12$

.

… … **Variance**

… … … … $\sigma^2 = E \big( \hat{p}^2 \big) - \mu^2$

… … … … … $= 0.0055$

.

… … **Standard Deviation**

… … … … $\sigma = \sqrt{0.0055}$

… … … … … $= 0.0742$

.

Despite having modelled this with random numbers, the mean sample proportion worked out to be exactly **0.12** which is the same as the population proportion.
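The experiment in Example 1 can be imitated with a short simulation. This is only a sketch: the seed, the function names and the use of Python's `random` module are assumptions made for illustration, and the numbers it prints will not reproduce the frequency table used in this example. It draws **50** samples of **20** students, builds a frequency table, and then applies the same mean and variance formulas.

```python
import math
import random
from collections import Counter

N, LEFT_HANDED, n, NUM_SAMPLES = 1500, 180, 20, 50   # 180 is 12% of 1500

def simulate_frequency_table(seed=1):
    """Draw NUM_SAMPLES samples of size n; return {X: frequency}."""
    rng = random.Random(seed)
    population = [1] * LEFT_HANDED + [0] * (N - LEFT_HANDED)
    # Each sample is drawn without replacement, and the whole population
    # is available again for the next sample.
    counts = [sum(rng.sample(population, n)) for _ in range(NUM_SAMPLES)]
    return Counter(counts)

def summarise(freq):
    """Mean, variance and SD of p_hat = X/n from a frequency table."""
    total = sum(freq.values())
    mean = sum((x / n) * f for x, f in freq.items()) / total
    mean_sq = sum((x / n) ** 2 * f for x, f in freq.items()) / total
    variance = mean_sq - mean ** 2          # E(p_hat^2) - mu^2
    return mean, variance, math.sqrt(variance)

freq = simulate_frequency_table()
mean, variance, sd = summarise(freq)
for x in sorted(freq):
    print(f"X = {x}  p_hat = {x / n:.2f}  f = {freq[x]}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}, sd = {sd:.4f}")
```

Changing the seed gives a different set of 50 samples, so the mean will usually be close to **0.12** rather than exactly equal to it.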

.

## Expected Value and Standard Deviation of Sample Proportion

Larger samples give better estimates of the population proportion, **p**.

.

If the sample is **sufficiently large**, then

- the distribution of **X**, the number of successes, can be treated as a binomial variable

- the distribution of $\hat{p}$ can therefore also be treated as a binomial variable

.

- We know that the sample proportion: $\hat{p} = \dfrac{x}{n}$

.

- For a large sample, the random variable: $\hat{P} = \dfrac{X}{n}$

.

Therefore:

… … $\text{E} \big( \hat{P} \big) = \text{E} \Big( \dfrac{X}{n} \Big)$

.

… … … $= \dfrac{1}{n} \text{E} \big( X \big)$

.

… … … $= \dfrac{1}{n} \times np$

.

… … … $= p$

.

$\text{E} \big( \hat{P} \big) = p$ means that the average of $\hat{p}$ over a large number of samples is expected to equal the population proportion, **p**.

.

Also

… … $\text{Var} \big( \hat{P} \big) = \text{Var} \Big( \dfrac{X}{n} \Big)$

.

… … … … $= \Big( \dfrac{1}{n} \Big)^2 \text{Var} \big( X \big)$

.

… … … … $= \dfrac{1}{n^2} \times np(1-p)$

.

… … … … $= \dfrac{p(1-p)}{n}$

.

hence

… … $\text{SD} \big( \hat{P} \big) = \sqrt{ \dfrac{ p(1-p) }{n} }$
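The two results above can be checked numerically. The sketch below assumes **X** is exactly binomial, sums over every possible value of **X**, and compares the result with the formulas. The function names are made up for this illustration, and the values $n = 20$, $p = 0.12$ are borrowed from Example 1.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def moments_of_p_hat(n, p):
    """E and SD of P_hat = X/n, summed directly over the binomial distribution."""
    mean = sum((x / n) * binomial_pmf(x, n, p) for x in range(n + 1))
    mean_sq = sum((x / n) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, math.sqrt(mean_sq - mean**2)

n, p = 20, 0.12                           # values from Example 1
mean, sd = moments_of_p_hat(n, p)
formula_sd = math.sqrt(p * (1 - p) / n)   # SD formula derived above

print(f"E(P_hat)  by summation = {mean:.4f}, by formula = {p:.4f}")
print(f"SD(P_hat) by summation = {sd:.4f}, by formula = {formula_sd:.4f}")
```

Both routes should agree up to floating-point rounding, which is what the derivation predicts.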

.

### Example 1b

In the example above of left-handed students, where $p = 0.12, \; n = 20$ we get theoretical results of:

.

… … $\text{E} \big( \hat{P} \big) = 0.12$

.

… … $\text{SD} \big( \hat{P} \big) = \sqrt{ \dfrac{ 0.12(1 - 0.12) }{20} } = 0.0727$

.

Compare these values with the experimental results obtained from **50** samples

… … $\mu = 0.12 \qquad \qquad \sigma = 0.0742$

.

## Large Samples

.

The above theory works best when sufficiently large samples are taken.

.

One definition of a **large sample** is that it fits the following **3** rules:

… … $np \geqslant 10$

… … $n(1 - p) \geqslant 10$

… … $10n \geqslant N$

.

### Example 1c

Consider the example above where we took samples of **20** students from a school of **1500** to test for left-handedness $\big(p = 0.12\big)$.

Is this sample sufficiently large? If not, how large should the sample be?

.

**Solution:**

… … Compare **n, p, N** to the three rules listed above.

.

… … … $np = 20 \times 0.12 = 2.4$ … … which is NOT $\geqslant 10$

.

… … … $n(1- p) = 20(1 - 0.12) = 17.6$ … … which is $\geqslant 10$

.

… … … $10n = 10 \times 20 = 200$ … … which is NOT $\geqslant 1500$

.

… … Hence $n = 20$ was **not** sufficiently large according to this set of rules.

.

… … To find how large to make the sample, we need **n** such that $np \geqslant 10$

.

… … … $n \times 0.12 \geqslant 10$

.

… … … $n \geqslant 83.3$

.

… … … Rounding $83.3$ up to the next integer gives $n = 84$

.

… … … Check $n = 84$ against all 3 rules:

.

… … … … $np = 84 \times 0.12 = 10.08$ … … $10.08 \geqslant 10$

.

… … … … $n(1 - p) = 84(1 - 0.12) = 73.92$ … … $73.92 \geqslant 10$

.

… … … … $10n = 10 \times 84 = 840$ … … $840 \text{ is NOT } \geqslant 1500$

.

… … … So $n = 84$ is still not big enough to meet the third rule.

.

… … … Try $n = 150$, obtained from the third rule: $10n \geqslant 1500$ gives $n \geqslant 150$.

.

… … … … $np = 150 \times 0.12 = 18$ … … $18 \geqslant 10$

.

… … … … $n(1 - p) = 150(1 - 0.12) = 132$ … … $132 \geqslant 10$

.

… … … … $10n = 10 \times 150 = 1500$ … … $1500 \geqslant 1500$

… … $n = 150$ meets all 3 rules, hence $n = 150$ is a sufficiently large sample
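The checks in Example 1c can be written as a short sketch. It encodes the three rules exactly as they are listed in this section, and the function names are made up for this illustration. Other texts may state the size conditions differently, so treat this as a way of repeating the arithmetic above rather than a general test.

```python
import math

def large_sample_checks(n, p, N):
    """Evaluate the three rules from this section for sample size n."""
    return {
        "np >= 10": n * p >= 10,
        "n(1 - p) >= 10": n * (1 - p) >= 10,
        "10n >= N": 10 * n >= N,
    }

def smallest_large_sample(p, N):
    """Smallest n that satisfies all three rules as stated in this section.

    Floating-point rounding could matter when a rule is met exactly
    on the boundary; that does not happen for the values used here.
    """
    return max(math.ceil(10 / p), math.ceil(10 / (1 - p)), math.ceil(N / 10))

p, N = 0.12, 1500
for n in (20, 84, 150):
    print(n, large_sample_checks(n, p, N))
print("smallest n meeting all three rules:", smallest_large_sample(p, N))
```

Taking the maximum of the three lower bounds works because each rule, as written here, only requires **n** to be at least some value.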

.

## Theoretical Distribution of Sample Proportion

.

When we know the population size **(N)** and the population proportion **(p)** we can perform the following calculations.

.

The total number of ways a sample of **n** can be selected from a population of **N** is given by $^NC_n$.

.

If the population proportion is **p**, then the number of successes in the population is $N \times p$

.

and the number of fails in the population is $N(1 - p)$

.

If **x** is the number of successes in a sample of size **n**, there will be $(n - x)$ fails.

.

Therefore, the total number of ways we can get:

… … **x** successes out of **Np** possible successes

and … $(n - x)$ fails out of $N(1 - p)$ possible fails

.

is given by: … $^{Np}C_x \times {}^{N(1-p)}C_{n-x}$

.

### Example 2

A large tub contains **20** pieces of fruit of which **6** are apples.

If we consider selecting an apple as a success then $p = \dfrac{6}{20} = 0.3$

If we take a number of random samples where $n = 5$

Let **X** = number of apples in one sample.

… **a)** .. construct a table of the possible number of samples for each value of $X = x$, together with the relative frequencies.

… **b)** .. construct a table of the sampling distribution

… **c)** .. calculate the Theoretical **Expected Value** and **Standard Deviation** for the sample proportion, $\hat{p}$.

.

**Solution**

… **a)** .. construct a table of the possible number of samples for each value of $X = x$, together with the relative frequencies.

… … The total number of possible samples is $^{20}C_5 = 15504$

… … $n = 5$, so in each sample, the number of apples we could get is $\big\{0,\; 1,\; 2,\; 3,\; 4,\; 5 \big\}$.
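The counts for part a) follow from the formula $^{Np}C_x \times {}^{N(1-p)}C_{n-x}$ with $Np = 6$ apples and $N(1-p) = 14$ other pieces of fruit. A minimal sketch of that calculation, using Python's `math.comb` (the variable names are chosen for this illustration):

```python
import math

N, APPLES, n = 20, 6, 5          # tub size, number of apples, sample size
total = math.comb(N, n)          # total number of possible samples

rows = []
for x in range(n + 1):
    # x apples chosen from 6, and (n - x) other fruit chosen from 14
    ways = math.comb(APPLES, x) * math.comb(N - APPLES, n - x)
    rows.append((x, ways, ways / total))

print(f"total samples = {total}")
print(" x   samples   relative frequency")
for x, ways, rel_freq in rows:
    print(f"{x:2d}   {ways:7d}   {rel_freq:.4f}")
```

The number of samples across all values of **x** adds up to $^{20}C_5$, which is a useful check on the table.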

.

… **b)** .. construct a table of the sampling distribution

… … The **sampling distribution** is the probability distribution for the sample proportion.

… … Notice that the **relative frequency** from the above table becomes the **Probability**

.

… **c)** .. calculate the Theoretical **Expected Value** and **Standard Deviation** for the sample proportion, $\hat{p}$.

… … $\text{E} \big( \hat{P} \big) = p = 0.3$

.

… … $\text{SD} \big( \hat{P} \big) = \sqrt{ \dfrac{ p(1-p) }{n} } = 0.2049$
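Parts b) and c) can be sketched in the same way. The code below lists each possible value of $\hat{p} = \dfrac{x}{n}$ with its probability, confirms that the expected value from that table equals **p**, and evaluates the standard deviation formula used above. Exact fractions are used so the expected value comes out without rounding; the variable names are assumptions for this illustration, and the standard deviation is taken from the formula in the text rather than from the table.

```python
import math
from fractions import Fraction

N, APPLES, n = 20, 6, 5
p = Fraction(APPLES, N)          # population proportion, 3/10
total = math.comb(N, n)

# Sampling distribution: each value of p_hat = x/n with its probability.
distribution = {
    Fraction(x, n): Fraction(
        math.comb(APPLES, x) * math.comb(N - APPLES, n - x), total
    )
    for x in range(n + 1)
}

expected = sum(p_hat * prob for p_hat, prob in distribution.items())
sd_formula = math.sqrt(float(p * (1 - p) / n))   # sqrt(p(1 - p)/n)

print("p_hat   probability")
for p_hat, prob in distribution.items():
    print(f"{float(p_hat):.1f}     {float(prob):.4f}")
print(f"E(P_hat) from the table = {float(expected)}")
print(f"SD(P_hat) from the formula = {sd_formula:.4f}")
```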

.

## Approximation to Normal Distribution

If we take enough large samples from a population, the distribution of the sample proportion will approximate a **normal distribution**.

For example, the histogram below was produced using **1000** samples of size $n = 100$ using a random number generator.

.

In the next section, **Confidence Intervals**, we will treat the sample proportion as a normal distribution
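A simulation along the lines described above can be sketched as follows. The population proportion behind the histogram is not stated in this section, so $p = 0.12$ is an assumption borrowed from Example 1, and the seed and the text-based histogram are also choices made only for this illustration.

```python
import random
from collections import Counter

def simulate_proportions(p=0.12, n=100, num_samples=1000, seed=0):
    """Return num_samples sample proportions, each from n success/fail trials."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]

props = simulate_proportions()
histogram = Counter(props)

# Text histogram: one '#' for every 5 samples with that sample proportion.
for value in sorted(histogram):
    print(f"{value:.2f} {'#' * (histogram[value] // 5)}")
print(f"mean of p_hat = {sum(props) / len(props):.4f}")
```

The printed bars should form a roughly bell-shaped outline centred near the assumed value of **p**.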

.