1 Descriptive statistics

Author

Abhijith M S

Published

February 19, 2025

Descriptive statistics deals with methods to describe and summarize data.
Data are described effectively through tables and graphs, which often reveal important features such as the range, the degree of concentration, and the symmetry of the data.
The summary of data is expressed through numerical quantities (summary statistics) whose values are determined by the data.

1.0.1 Describing Data sets

1.0.2 Frequency Tables and Graphs

A data set having a relatively small number of distinct values can be conveniently presented in a frequency table.
Data from a frequency table can be graphically represented by:
- Line Graph
- Bar Graph
- Frequency Polygon

import numpy as np 
import matplotlib.pyplot as plt 

# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])

# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)

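# Frequency polygon: plot each (value, frequency) point and join the points with straight lines
plt.plot(values, frequencies, marker='o')
plt.title('Frequency Polygon')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Line graph: frequencies drawn as thin vertical lines (a larger width gives a bar graph)
plt.bar(values, frequencies, width=0.02)
plt.title('Line Graph')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()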
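
The frequency table behind these graphs can also be printed directly. The following is a minimal sketch using only the standard library; the name `frequency_table` is an illustrative choice, and the data are the same sample values used in the plotting example.

```python
from collections import Counter

# Same sample data as in the plotting example
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

# Count how often each distinct value occurs, ordered by value
frequency_table = dict(sorted(Counter(data).items()))

# Print one row per distinct value
print("Value  Frequency")
for value, frequency in frequency_table.items():
    print(f"{value:<6} {frequency}")
```

The frequencies always add up to the number of data values, here 15.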

1.0.3 Relative Frequency Tables and Graphs

Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency.
That is, the relative frequency of a data value is the proportion of the data that have that value.
The relative frequencies can be represented graphically by:
- relative frequency line graph
- relative frequency bar graph
- relative frequency polygon
- pie chart: A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors with areas proportional to the respective relative frequencies.

import numpy as np 
import matplotlib.pyplot as plt 

# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])

# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)

relative_frequencies = frequencies/len(data)
print(relative_frequencies)


plt.pie(relative_frequencies, labels = values, autopct='%1.1f%%')
plt.show()

[0.2        0.46666667 0.33333333]
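
Because each relative frequency is a ratio f/n, the relative frequencies of all distinct values must add up to 1. The sketch below checks this with exact fractions from the standard library, which avoids floating-point rounding; the variable names are illustrative, and the data are the same sample values as above.

```python
from collections import Counter
from fractions import Fraction

# Same sample data as in the example above
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

n = len(data)
frequencies = Counter(data)

# Relative frequency f/n of each distinct value, kept as an exact fraction
relative_frequencies = {
    value: Fraction(f, n) for value, f in sorted(frequencies.items())
}

for value, rel in relative_frequencies.items():
    print(f"{value}: {rel} = {float(rel):.4f}")

# The relative frequencies of all distinct values sum to exactly 1
total = sum(relative_frequencies.values())
print(total)
```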

1.0.4 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots

For some data sets the number of distinct values is too large to utilize frequency tables.
Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval.
The number of class intervals chosen should be a trade-off between:
- choosing too few classes at a cost of losing too much information about the actual data values in a class.
- choosing too many classes, which will result in the frequencies of each class being too small.
It is common, although not essential, to choose class intervals of equal length.
The endpoints of a class interval are called the class boundaries.
We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point.
Thus, for instance, the class interval 20-30 contains all values that are both greater than or equal to 20 and less than 30.
A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram.
The vertical axis of a histogram can represent either the class frequency or the relative class frequency.
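
The left-end inclusion convention can be illustrated with a small sketch. The values in `scores` and the class boundaries below are hypothetical, chosen so that several values fall exactly on a boundary; only the standard library is used.

```python
from bisect import bisect_right

# Hypothetical data, with several values sitting exactly on a class boundary
scores = [20, 25, 29, 30, 30, 39, 40]

# Class boundaries for the intervals 20-30, 30-40 and 40-50
boundaries = [20, 30, 40, 50]

class_frequencies = [0] * (len(boundaries) - 1)
for x in scores:
    # bisect_right counts the boundaries that are <= x, so subtracting 1
    # gives the class with left boundary <= x < right boundary
    i = bisect_right(boundaries, x) - 1
    if 0 <= i < len(class_frequencies):  # ignore values outside all classes
        class_frequencies[i] += 1

print(class_frequencies)  # [3, 3, 1]
```

The value 30 is counted in the class 30-40, not in 20-30. Note that `np.histogram` and `plt.hist` follow the same convention for every class except the last one, which also includes its right-end boundary.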

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(50)

# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)

# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)

# Plot histogram
plt.hist(random_integers, bins=bins, edgecolor='black')
plt.title('Histogram of Random Integers')
plt.xlabel('Class Interval')
plt.ylabel('Frequency')
plt.show()

[ 100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400
 1500]

We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph.
A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it.
A cumulative frequency plot is called an ogive.

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(50)

# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)

# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)

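# Count the values falling in each class interval
histograms = np.histogram(random_integers, bins=bins)[0]
print(histograms)

# Running total of the class frequencies
cumulativeSum = np.cumsum(histograms)
print(cumulativeSum)

# Plot each cumulative count against the upper boundary of its class,
# since it counts the values lying below that boundary
plt.plot(bins[1:], cumulativeSum, marker='o', linestyle='-')
plt.title('Ogive (Cumulative Frequency Graph)')
plt.xlabel('Upper Class Boundary')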
plt.ylabel('Cumulative Frequency')
#plt.grid(True)
plt.show()

[ 100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400
 1500]
[22 25 15 24 14 21 20 24 19 21 23 21 28 23]
[ 22  47  62  86 100 121 141 165 184 205 228 249 277 300]
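
The cumulative relative frequencies follow by dividing each cumulative frequency by the total number of values. The sketch below reuses the class frequencies printed above and pairs each running total with the upper boundary of its class; the variable names are illustrative and only the standard library is used.

```python
from itertools import accumulate

# Class frequencies printed above for the intervals 100-200, ..., 1400-1500
class_frequencies = [22, 25, 15, 24, 14, 21, 20, 24, 19, 21, 23, 21, 28, 23]

# Upper boundary of each class: 200, 300, ..., 1500
upper_boundaries = list(range(200, 1600, 100))

n = sum(class_frequencies)
cumulative = list(accumulate(class_frequencies))
cumulative_relative = [c / n for c in cumulative]

# One row per class: upper boundary, cumulative count, cumulative proportion
for boundary, c, r in zip(upper_boundaries, cumulative, cumulative_relative):
    print(f"up to {boundary:4d}: {c:3d}  {r:.3f}")
```

The last cumulative frequency equals the total number of values (300), so the last cumulative relative frequency is 1.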

An efficient way of organizing a small- to moderate-sized data set is to utilize a stem and leaf plot.
Such a plot is obtained by first dividing each data value into two parts: its stem and its leaf.
For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit.
Thus, for instance, the value 62 is expressed as stem 6 and leaf 2, written 6 | 2.

import numpy as np

# Sample data
data = np.array([62, 67, 63, 68, 69, 61, 64, 65, 66, 60, 75, 74, 76, 78, 90, 92, 34, 36, 56, 57, 45, 53, 52, 59, 73, 74, 79, 80, 81,20, 34, 36, 85])

# Create stem and leaf plot
stem_leaf = {}

for number in data:
    stem = number // 10
    leaf = number % 10
    if stem in stem_leaf:
        stem_leaf[stem].append(leaf)
    else:
        stem_leaf[stem] = [leaf]

# Print stem and leaf plot
for stem, leaves in sorted(stem_leaf.items()):
    print(f"{stem} | {' '.join(map(str, sorted(leaves)))}")

2 | 0
3 | 4 4 6 6
4 | 5
5 | 2 3 6 7 9
6 | 0 1 2 3 4 5 6 7 8 9
7 | 3 4 4 5 6 8 9
8 | 0 1 5
9 | 0 2