1  Descriptive statistics

Author

Abhijith M S

Published

February 19, 2025

1.0.1 Describing Data sets

1.0.2 Frequency Tables and Graphs

  • A data set having a relatively small number of distinct values can be conveniently presented in a frequency table.

  • Data from a frequency table can be graphically represented by:

    • Line Graph

    • Bar Graph

    • Frequency Polygon

import numpy as np 
import matplotlib.pyplot as plt 

# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])

# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)

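# Frequency polygon: plot each value against its frequency and join the points
plt.plot(values, frequencies, marker='o')
plt.title('Frequency Polygon')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Line graph: a thin vertical line at each value, with height equal to its frequency
plt.bar(values, frequencies, width=0.02)
plt.title('Line Graph')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()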
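
The frequency table behind these plots can also be printed directly. The following is a minimal sketch using only the standard library; the helper name `frequency_table` is an assumption introduced here for illustration, and the data are the same sample values used above.

```python
from collections import Counter

def frequency_table(data):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(data)
    return sorted(counts.items())

# Same sample data as in the plots above
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

table = frequency_table(data)

# Print one row per distinct value
print("Value  Frequency")
for value, frequency in table:
    print(f"{value:<6} {frequency}")
```

The frequencies in the table add up to the number of data values, which is a quick check that no value was missed.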

1.0.3 Relative Frequency Tables and Graphs

  • Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency.

  • That is, the relative frequency of a data value is the proportion of the data that have that value.

  • The relative frequencies can be represented graphically by:

    • relative frequency line
    • relative frequency bar graph
    • relative frequency polygon
    • pie chart: A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors with areas proportional to the respective relative frequencies.

import numpy as np
import matplotlib.pyplot as plt 

# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])

# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)

relative_frequencies = frequencies/len(data)
print(relative_frequencies)


plt.pie(relative_frequencies, labels = values, autopct='%1.1f%%')
plt.show()
[0.2        0.46666667 0.33333333]
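
Because every data value is counted exactly once, the relative frequencies of all distinct values sum to 1. A small sketch of this check, using exact fractions from the standard library so that no rounding is involved (the function name `relative_frequencies` is hypothetical, not part of any library):

```python
from collections import Counter
from fractions import Fraction

def relative_frequencies(data):
    """Map each distinct value to its relative frequency f/n as an exact fraction."""
    n = len(data)
    return {value: Fraction(f, n) for value, f in sorted(Counter(data).items())}

# Same sample data as above
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

rel = relative_frequencies(data)
for value, r in rel.items():
    print(f"{value}: {r} = {float(r):.4f}")

# The relative frequencies of all distinct values add up to exactly 1
total = sum(rel.values())
print("Total:", total)
```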

1.0.4 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots

  • For some data sets the number of distinct values is too large to utilize frequency tables.

  • Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval.

  • The number of class intervals chosen should be a trade-off between:

    • choosing too few classes at a cost of losing too much information about the actual data values in a class.

    • choosing too many classes, which will result in the frequencies of each class being too small.

  • It is common, although not essential, to choose class intervals of equal length.

  • The endpoints of a class interval are called the class boundaries.

  • We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point.

  • Thus, for instance, the class interval 20-30 contains all values that are both greater than or equal to 20 and less than 30.

  • A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram.

  • The vertical axis of a histogram can represent either the class frequency or the relative class frequency.
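
The left-end inclusion convention can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `class_interval` and the sample values are hypothetical, chosen to show what happens exactly at a class boundary.

```python
from collections import Counter

def class_interval(value, start, width):
    """Return (left, right) of the class containing value, where left <= value < right."""
    index = (value - start) // width
    left = start + index * width
    return (left, left + width)

# Hypothetical sample values, chosen only to illustrate the boundary cases
sample = [20, 25, 29, 30, 31, 39, 40]

# Count how many values fall in each class of width 10 starting at 20
grouped = Counter(class_interval(v, start=20, width=10) for v in sample)
for (left, right), frequency in sorted(grouped.items()):
    print(f"{left}-{right}: {frequency}")
```

Here 30 is counted in the class 30-40 rather than 20-30. NumPy's `np.histogram` uses the same half-open rule for every bin except the last, which also includes its right edge.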

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(50)

# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)

# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)

# Plot histogram
plt.hist(random_integers, bins=bins, edgecolor='black')
plt.title('Histogram of Random Integers')
plt.xlabel('Class Interval')
plt.ylabel('Frequency')
plt.show()
[ 100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400
 1500]

  • We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph.

  • A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it.

  • A cumulative frequency plot is called an ogive.

import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(50)

# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)

# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)

histograms = np.histogram(random_integers, bins=bins)[0]
print(histograms)
cumulativeSum = np.cumsum(histograms) 
print(cumulativeSum)


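# Plot each cumulative count at the upper boundary of its class, starting from 0 at the lowest boundary
plt.plot(bins, np.concatenate(([0], cumulativeSum)), marker='o', linestyle='-')
plt.title('Ogive (Cumulative Frequency Graph)')
plt.xlabel('Class Boundary')
plt.ylabel('Cumulative Frequency')
plt.show()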
[ 100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400
 1500]
[22 25 15 24 14 21 20 24 19 21 23 21 28 23]
[ 22  47  62  86 100 121 141 165 184 205 228 249 277 300]
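
The cumulative relative frequencies follow by dividing each cumulative count by the total number of values. A minimal sketch using only the standard library, with the class frequencies copied from the output printed above (the variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies as printed above for the classes 100-200, ..., 1400-1500
class_frequencies = [22, 25, 15, 24, 14, 21, 20, 24, 19, 21, 23, 21, 28, 23]

n = sum(class_frequencies)                         # total number of data values
cumulative = list(accumulate(class_frequencies))   # running totals per class
cumulative_relative = [c / n for c in cumulative]  # proportions of the data

# Each cumulative value belongs to the upper boundary of its class
upper_boundaries = range(200, 1600, 100)
for upper, c, r in zip(upper_boundaries, cumulative, cumulative_relative):
    print(f"upper boundary {upper}: {c:3d}  {r:.3f}")
```

The last cumulative relative frequency is 1, since every data value lies at or below the highest class boundary.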

  • An efficient way of organizing a small- to moderate-sized data set is to utilize a stem and leaf plot.

  • Such a plot is obtained by first dividing each data value into two parts - its stem and its leaf.

  • For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit.

  • Thus, for instance, the value 62 is expressed as stem 6 and leaf 2, written 6 | 2.

import numpy as np

# Sample data
data = np.array([62, 67, 63, 68, 69, 61, 64, 65, 66, 60, 75, 74, 76, 78, 90, 92, 34, 36, 56, 57, 45, 53, 52, 59, 73, 74, 79, 80, 81,20, 34, 36, 85])

# Create stem and leaf plot
stem_leaf = {}

for number in data:
    stem = number // 10
    leaf = number % 10
    if stem in stem_leaf:
        stem_leaf[stem].append(leaf)
    else:
        stem_leaf[stem] = [leaf]

# Print stem and leaf plot
for stem, leaves in sorted(stem_leaf.items()):
    print(f"{stem} | {' '.join(map(str, sorted(leaves)))}")
2 | 0
3 | 4 4 6 6
4 | 5
5 | 2 3 6 7 9
6 | 0 1 2 3 4 5 6 7 8 9
7 | 3 4 4 5 6 8 9
8 | 0 1 5
9 | 0 2