import numpy as np
import matplotlib.pyplot as plt
# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])
# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)
# Line Graph
plt.plot(values, frequencies, marker='o')
plt.title('Frequency Polygon')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
# Bar Graph
plt.bar(values, frequencies, width=0.02)
plt.show()
1 Descriptive statistics
Descriptive statistics deals with methods to describe and summarize data.
Data are described effectively through tables or graphs, which often reveal important features such as the range, the degree of concentration, and the symmetry of the data.
Data are summarized through numerical quantities (summary statistics) whose values are determined by the data.
1.0.1 Describing Data sets
1.0.2 Frequency Tables and Graphs
A data set having a relatively small number of distinct values can be conveniently presented in a frequency table.
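As an illustration, a frequency table can be built by counting how often each distinct value occurs. The following is a minimal sketch using only the Python standard library, applied to the 15-observation sample used in this section's plotting code; the variable name `frequency_table` is chosen here for illustration.

```python
from collections import Counter

# Sample data: 15 observations with 3 distinct values
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

# Count how often each distinct value occurs
frequency_table = Counter(data)

# Print one row per distinct value, in sorted order
print(f"{'Value':<8}{'Frequency':>9}")
for value in sorted(frequency_table):
    print(f"{value:<8}{frequency_table[value]:>9}")
```

The frequencies add up to the number of observations, which is a quick consistency check on any frequency table.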
Data from a frequency table can be graphically represented by:
- line graph
- bar graph
- frequency polygon
1.0.3 Relative Frequency Tables and Graphs
Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency.
That is, the relative frequency of a data value is the proportion of the data that have that value.
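The definition f/n can be checked directly. The sketch below assumes the same 15-observation sample used elsewhere in this section and keeps each relative frequency as an exact fraction (via the standard-library `fractions` module) so that the sum can be verified without rounding error; the name `relative_frequencies` is illustrative.

```python
from collections import Counter
from fractions import Fraction

# Sample data: n = 15 observations with 3 distinct values
data = ['A', 'B', 'C', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
n = len(data)

# Relative frequency of each value: f / n, kept as an exact fraction
relative_frequencies = {
    value: Fraction(f, n) for value, f in sorted(Counter(data).items())
}

for value, rel in relative_frequencies.items():
    print(f"{value}: {rel} = {float(rel):.4f}")

# Relative frequencies are proportions of the whole data set, so they sum to 1
print(sum(relative_frequencies.values()) == 1)
```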
The relative frequencies can be represented graphically by:
- relative frequency line graph
- relative frequency bar graph
- relative frequency polygon
- pie chart: A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors with areas proportional to the respective relative frequencies.
import numpy as np
import matplotlib.pyplot as plt
# Sample data
data = np.array(['A', 'B', 'C', 'A', 'A', 'B','B','B','B','B','B','C','C','C','C'])
# Calculate frequencies
values, frequencies = np.unique(data, return_counts=True)
relative_frequencies = frequencies/len(data)
print(relative_frequencies)
plt.pie(relative_frequencies, labels=values, autopct='%1.1f%%')
plt.show()
[0.2 0.46666667 0.33333333]
1.0.4 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots
For some data sets the number of distinct values is too large to utilize frequency tables.
Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval.
The number of class intervals chosen should be a trade-off between:
- choosing too few classes, at the cost of losing too much information about the actual data values in a class, and
- choosing too many classes, which will result in the frequencies of each class being too small.
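The trade-off can be seen on a small example. The sketch below uses a hypothetical set of 20 scores, made up purely for illustration, and a hypothetical helper `class_frequencies` that counts values per class interval for a given class width.

```python
from collections import Counter

# Hypothetical data set of 20 scores, made up for illustration
scores = [52, 55, 58, 61, 63, 64, 66, 67, 68, 70,
          71, 72, 73, 75, 77, 79, 82, 85, 88, 94]

def class_frequencies(values, width):
    """Count values per class interval [k*width, (k+1)*width), keyed by left boundary."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Too few classes: a single class holds every value
print(class_frequencies(scores, 50))

# Too many classes: no class holds more than 2 values
print(class_frequencies(scores, 2))

# A middle ground: five classes of width 10
print(class_frequencies(scores, 10))
```

With width 50 nothing can be said about where the scores lie within the class; with width 2 the counts are too small to show any shape; width 10 shows that most scores fall between 60 and 80.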
It is common, although not essential, to choose class intervals of equal length.
The endpoints of a class interval are called the class boundaries.
We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point.
Thus, for instance, the class interval 20-30 contains all values that are both greater than or equal to 20 and less than 30.
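The left-end inclusion convention can be sketched with the standard-library `bisect` module. The boundaries and the helper name `class_interval` below are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Class boundaries for the intervals 10-20, 20-30, 30-40
boundaries = [10, 20, 30, 40]

def class_interval(value):
    """Return the (left, right) class containing value, with left <= value < right."""
    i = bisect_right(boundaries, value)
    if i == 0 or i == len(boundaries):
        raise ValueError(f"{value} lies outside the class intervals")
    return boundaries[i - 1], boundaries[i]

print(class_interval(20))    # (20, 30): the left end is included
print(class_interval(29.9))  # (20, 30)
print(class_interval(30))    # (30, 40): the right end of 20-30 is excluded
```

NumPy's `np.histogram` follows the same convention for every bin except the last, whose right edge is included.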
A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram.
The vertical axis of a histogram can represent either the class frequency or the relative class frequency.
import numpy as np
import matplotlib.pyplot as plt
# Set seed for reproducibility
np.random.seed(50)
# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)
# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)
# Plot histogram
plt.hist(random_integers, bins=bins, edgecolor='black')
plt.title('Histogram of Random Integers')
plt.xlabel('Class Interval')
plt.ylabel('Frequency')
plt.show()
[ 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400
1500]
We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph.
A point on the horizontal axis of such a graph represents a possible data value; the corresponding vertical coordinate gives the number (or proportion) of the data whose values are less than or equal to it.
A cumulative frequency plot is called an ogive.
import numpy as np
import matplotlib.pyplot as plt
# Set seed for reproducibility
np.random.seed(50)
# Generate 300 random integers between 100 and 1500
random_integers = np.random.randint(100, 1501, size=300)
# Create bins for the class intervals
bins = np.arange(100, 1600, 100)
print(bins)
# Class frequencies
histograms = np.histogram(random_integers, bins=bins)[0]
print(histograms)
# Cumulative class frequencies
cumulativeSum = np.cumsum(histograms)
print(cumulativeSum)
# Plot each cumulative count at the upper boundary of its class,
# where all values in that class have been counted
plt.plot(bins[1:], cumulativeSum, marker='o', linestyle='-')
plt.title('Ogive (Cumulative Frequency Graph)')
plt.xlabel('Class Interval')
plt.ylabel('Cumulative Frequency')
#plt.grid(True)
plt.show()
[ 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400
1500]
[22 25 15 24 14 21 20 24 19 21 23 21 28 23]
[ 22 47 62 86 100 121 141 165 184 205 228 249 277 300]
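The same class frequencies also give the cumulative relative frequencies mentioned in the text. The sketch below is a standard-library version that starts from the class frequencies printed above rather than regenerating the random data; the variable names are illustrative. Each cumulative value is paired with the upper boundary of its class, since that is the point by which every value in the class has been counted.

```python
from itertools import accumulate

# Class frequencies for the 14 class intervals 100-200, ..., 1400-1500
class_frequencies = [22, 25, 15, 24, 14, 21, 20, 24, 19, 21, 23, 21, 28, 23]
upper_boundaries = list(range(200, 1600, 100))

n = sum(class_frequencies)                      # total number of data values
cumulative = list(accumulate(class_frequencies))
cumulative_relative = [c / n for c in cumulative]

for upper, c, r in zip(upper_boundaries, cumulative, cumulative_relative):
    print(f"class ending at {upper:>4}: {c:>3} values ({r:.3f})")
```

The final cumulative frequency equals the number of data values, so the final cumulative relative frequency is 1.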
An efficient way of organizing a small- to moderate-sized data set is to use a stem and leaf plot.
Such a plot is obtained by first dividing each data value into two parts: its stem and its leaf.
For instance, if the data are all two-digit numbers, then we could let the stem of a data value be its tens digit and the leaf be its ones digit.
Thus, for instance, the value 62 is expressed as stem 6 and leaf 2, written 6 | 2.
import numpy as np
# Sample data
data = np.array([62, 67, 63, 68, 69, 61, 64, 65, 66, 60, 75, 74, 76, 78, 90, 92, 34, 36, 56, 57, 45, 53, 52, 59, 73, 74, 79, 80, 81, 20, 34, 36, 85])
# Create stem and leaf plot
stem_leaf = {}
for number in data:
    stem = number // 10
    leaf = number % 10
    if stem in stem_leaf:
        stem_leaf[stem].append(leaf)
    else:
        stem_leaf[stem] = [leaf]
# Print stem and leaf plot
for stem, leaves in sorted(stem_leaf.items()):
    print(f"{stem} | {' '.join(map(str, sorted(leaves)))}")
2 | 0
3 | 4 4 6 6
4 | 5
5 | 2 3 6 7 9
6 | 0 1 2 3 4 5 6 7 8 9
7 | 3 4 4 5 6 8 9
8 | 0 1 5
9 | 0 2