Business Analytics: exploration of data
Business Analytics
Chapter 5
Numerical descriptive measures (continued)
Chapter outline
Lecture 04
5.3 Measures of relative standing and box plots
5.4 Approximating descriptive measures for grouped data
5.5 Measures of association
5.6 General guidelines on the exploration of data
Learning objectives
LO1 to LO3 were covered last week
This week
LO4 Explain the concepts of percentiles, deciles, quartiles and interquartile range, and show their usefulness through the application of a box plot
LO5 Calculate the mean and variance when the data are already in grouped form
LO6 Obtain numerical measures to calculate the direction and strength of the linear relationship between two variables
LO7 Understand the use of graphical methods and numerical measures to present summary information about a data set.
5.3 Measures of Relative Standing and Box Plots
Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.
Percentile: the pth percentile is the value for which p percent of the observations are less than that value and (100 – p) percent are greater than that value.
Suppose you scored in the 60th percentile on your final exam. That means 60% of the other students’ scores were below yours, while 40% were above yours.
Percentiles
The pth percentile of a set of measurements is the value for which
at most p% of the measurements are less than that value
at most (100 – p)% of all the measurements are greater than that value.
For example, suppose 77 is the 68th percentile of a set of statistics exam scores. Then at most 68% of the scores are less than 77 and at most 32% of the scores are greater than 77.
Quartiles
We have special names for the 25th, 50th and the 75th percentiles, namely quartiles.
• First (lower) quartile, Q1 = 25th percentile (p25)
• Second (middle) quartile, Q2 = 50th percentile (p50), which is also the median
• Third (upper) quartile, Q3 = 75th percentile (p75)
We can also convert percentiles into quintiles (fifths) and deciles (tenths).
Commonly Used Percentiles…
First (lower) decile = 10th percentile
First (lower) quartile, Q1 = 25th percentile
Second (middle) quartile, Q2 = 50th percentile
Third (upper) quartile, Q3 = 75th percentile
Ninth (upper) decile = 90th percentile
For example, if your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam – it means that 80% of your peers scored lower than you and 20% scored higher than you in the exam. It is about your position relative to others, not the actual mark.
Example 11
Find the quartiles of the following set of measurements
7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8
Example 11 - Solution
First sort the measurements:
2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30
Location of Percentiles
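Find the location of any percentile using the formula
$$L_p = (n + 1)\frac{p}{100}$$
where L_p is the location of the pth percentile in the sorted data and n is the number of observations.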
Example 12
Example 12 - Solution
After sorting the data we have
0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
Example 12 – Solution…
The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5. That is,
p50 = 8 + (.5)(9 – 8) = 8.5
Example 12 – Solution…
The 75th percentile is one quarter of the distance between the eighth and ninth observations (between 14 and 22). That is,
p75 = 14 + (.25)(22 – 14) = 16
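The interpolation used in Example 12 can be sketched in a few lines of Python. This is a minimal illustration, assuming the location rule L_p = (n + 1)p/100 implied by the worked answers above. The function name `percentile` and the clamping of locations that fall outside the data are assumptions made for this sketch, not part of the lecture.

```python
def percentile(data, p):
    """Return the p-th percentile using the location L_p = (n + 1) * p / 100."""
    values = sorted(data)
    n = len(values)
    location = (n + 1) * p / 100          # 1-based position, may be fractional
    # Assumption: clamp to the smallest/largest value if the location
    # falls outside the range 1..n.
    if location <= 1:
        return float(values[0])
    if location >= n:
        return float(values[-1])
    lower = int(location)                 # position of the observation just below
    fraction = location - lower           # how far to move towards the next one
    below = values[lower - 1]
    above = values[lower]
    return below + fraction * (above - below)


example_12 = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
for p in (25, 50, 75):
    print(p, percentile(example_12, p))   # 3.75, 8.5 and 16.0

example_11 = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
print([percentile(example_11, p) for p in (25, 50, 75)])  # 5, 12 and 21
```

For the Example 11 data the locations are whole numbers (4, 8 and 12), so under this rule the quartiles are simply the 4th, 8th and 12th sorted observations.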
Location of Percentiles…
Please remember…
Quartiles and Variability
Quartiles can provide an idea about the shape of a histogram.
Interquartile Range…
The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:
Interquartile Range (IQR) = Q3 – Q1
The interquartile range measures the spread of the middle 50% of the observations.
Large values of this statistic mean that the 1st and 3rd quartiles are far apart, indicating a high level of variability.
Box Plots
A box plot is a pictorial display of five main descriptive measures of a set of measurements:
L – The largest measurement
Q3 – The upper quartile
Q2 – The median
Q1 – The lower quartile
S – The smallest measurement
Box Plots…
The box plot is a technique that graphs five statistics:
• the minimum and maximum observations, and
• the first, second, and third quartiles.
Box Plots…
The lines extending to the left and right of the box are called whiskers.
Each whisker extends outward from the box by the smaller of 1.5 times the interquartile range or the distance to the most extreme point that is not an outlier.
Any points that lie outside the whiskers are called outliers.
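The quantities behind a box plot can be computed directly. The sketch below is an illustration only: it assumes the location rule L_p = (n + 1)p/100 for the quartiles and the common convention that a point is an outlier when it lies more than 1.5 × IQR beyond Q1 or Q3. The names `percentile` and `box_plot_summary` are hypothetical, and the value 60 appended at the end is made up to show an outlier.

```python
def percentile(data, p):
    """p-th percentile via the location L_p = (n + 1) * p / 100 (assumed rule)."""
    values = sorted(data)
    n = len(values)
    location = min(max((n + 1) * p / 100, 1), n)   # clamp to 1..n (assumption)
    lower = int(location)
    fraction = location - lower
    if fraction == 0:
        return values[lower - 1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])


def box_plot_summary(data):
    """Five-number summary, IQR, whisker ends and outliers for a box plot."""
    values = sorted(data)
    q1, q2, q3 = (percentile(values, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr                   # assumed outlier cut-offs
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "S": values[0], "Q1": q1, "Q2": q2, "Q3": q3, "L": values[-1],
        "IQR": iqr,
        # Whiskers stop at the most extreme points that are not outliers.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


measurements = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
summary = box_plot_summary(measurements)
print(summary)

# Adding one made-up extreme value shows how an outlier is flagged.
with_extreme = box_plot_summary(measurements + [60])
print(with_extreme["outliers"])
```

With the Example 11 measurements the interquartile range is 21 – 5 = 16, so no observation is far enough from the box to count as an outlier under this convention.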
Example 13
Create a box plot of the number of customers who purchased petrol at an independent petrol station on each of the last 200 days.
Interpreting the box plot results
The number of customers ranges from 410 to 700.
On about half the days, the number of customers is less than 560, and on about half it is greater than 560.
On about half the days, the number of customers lies between 530 and 590.
On about a quarter of the days the number is below 530, and on about a quarter it is above 590.
5.4 Approximating Descriptive Measures for Grouped Data
Descriptive measures may need to be approximated from grouped data when approximate values are sufficient, or when only secondary data in grouped form are available.
Approximate the mean and standard deviation of the telephone call durations problem, represented by the frequency distribution.
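A minimal sketch of the grouped-data approximation follows. It assumes the usual approach of treating every observation in a class as if it sat at the class midpoint, and it uses the shortcut form of the sample variance. The class limits and frequencies below are made up for illustration; they are not the telephone call durations data, and the function name `grouped_mean_variance` is hypothetical.

```python
from math import sqrt


def grouped_mean_variance(classes, frequencies):
    """Approximate the sample mean and variance from a frequency distribution.

    classes     -- list of (lower_limit, upper_limit) pairs
    frequencies -- number of observations in each class
    """
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(frequencies)
    sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
    sum_fm2 = sum(f * m * m for f, m in zip(frequencies, midpoints))
    mean = sum_fm / n
    # Shortcut form: [sum(f*m^2) - (sum(f*m))^2 / n] / (n - 1)
    variance = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
    return mean, variance


# Hypothetical frequency distribution (illustrative values only).
classes = [(0, 3), (3, 6), (6, 9)]
frequencies = [2, 5, 3]

mean, variance = grouped_mean_variance(classes, frequencies)
std_dev = sqrt(variance)
print(mean, variance, std_dev)   # mean is about 4.8, variance about 4.9
```

Because the midpoints stand in for the actual observations, the results are approximations; they would match the ungrouped values only if every observation happened to equal its class midpoint.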
5.5 Measures of Association
Two numerical measures are presented to describe the linear relationship between two variables depicted in a scatter diagram:
Covariance (is there any pattern to the way two variables move together?)
Correlation coefficient (how strong is the linear relationship between two variables?)
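Covariance…
Population covariance:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{N}$$
Sample covariance:
$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Just as there was a ‘shortcut’ for calculating the sample variance without first having to calculate the sample mean, there is a shortcut for calculating the sample covariance without first having to calculate the means:
$$s_{xy} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$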
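To see that the definition and the shortcut give the same answer, both can be coded side by side. This is a small sketch using the standard sample covariance (divisor n – 1). The three data pairs are made up for illustration and the function names are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance from the definition (deviations from the means)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_covariance_shortcut(x, y):
    """Sample covariance from the shortcut (no means needed)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


# Illustrative data: y tends to rise as x rises.
x = [2, 6, 7]
y = [13, 20, 27]

print(sample_covariance(x, y))            # 17.5
print(sample_covariance_shortcut(x, y))   # 17.5
```

The positive sign says the two variables tend to move in the same direction; the size of 17.5 depends on the units of x and y, which is why the covariance on its own says little about strength.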
Coefficient of Correlation…
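The coefficient of correlation is defined as the covariance divided by the product of the standard deviations of the two variables:
Population coefficient of correlation:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Sample coefficient of correlation:
$$r = \frac{s_{xy}}{s_x s_y}$$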
Coefficient of Correlation…
The coefficient of correlation can take positive or negative values.
It can take only values between –1 and +1.
Coefficient of Correlation…
Strong positive linear relationship
If the two variables have a very strong positive linear relationship, the coefficient value is close to +1.
Strong negative linear relationship
If the two variables have a very strong negative linear relationship, the coefficient value is close to –1.
No linear relationship
No linear (straight line) relationship is indicated by a coefficient value close to zero.
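The three cases above can be checked numerically. The sketch below computes the sample coefficient of correlation as the sample covariance divided by the product of the sample standard deviations. All data are made up for illustration and the function name `correlation` is hypothetical.

```python
from math import sqrt


def correlation(x, y):
    """Sample coefficient of correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


x = [1, 2, 3, 4, 5]
r_positive = correlation(x, [2 * v + 1 for v in x])    # points on a rising line
r_negative = correlation(x, [10 - 3 * v for v in x])   # points on a falling line

# y = x squared on a symmetric range: a perfect relationship, but not a linear one.
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_curved)   # close to +1, close to -1, and 0
```

The last case is a reminder that a coefficient near zero rules out a straight-line relationship only; the variables may still be related in some other way.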
Coefficient of Correlation…
Compute the covariance and the coefficient of correlation between advertising expenditure and sales level and discuss the strength and direction of the relationship between them. Base your calculation on the data (in millions) provided below.
Excel output
Interpretation
The covariance (10.2679) indicates that advertising expenditure and sales level are positively related.
The coefficient of correlation (0.797) indicates that there is a strong positive linear relationship between advertising expenditure and sales level.
The Least Squares Method
The objective of the scatter diagram is to measure the strength and direction of the linear relationship.
Both can be more easily judged by drawing a straight line through the data.
We need an objective method of producing a straight line.
Such a method has been developed; it is called the least squares method.
The Least Squares Method…
Recall, the slope-intercept equation for a line is expressed in these terms:
y = mx + b
where:
m is the slope of the line
b is the y-intercept.
If we’ve determined that there is a linear relationship between two variables using the covariance and the coefficient of correlation, can we determine a linear function of the relationship?
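The Least Squares Method
The least squares method produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimised. This line is represented by the equation:
$$\hat{y} = b_0 + b_1 x$$
where b_0 is the y-intercept and b_1 is the slope.
The coefficients b_0 and b_1 are given by:
$$b_1 = \frac{s_{xy}}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$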
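As a sketch of the calculation, the slope can be computed as the sample covariance divided by the sample variance of x, and the intercept as the mean of y minus the slope times the mean of x. The data below are made up (they are not the XM05-18 file), and the function name `least_squares` is hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)   # sample variance of x
    b1 = s_xy / s_xx                                     # slope
    b0 = mean_y - b1 * mean_x                            # y-intercept
    return b0, b1


# Hypothetical data: units produced per day and a daily cost in dollars.
units = [1, 2, 3, 4, 5]
cost = [12, 14, 17, 18, 21]

b0, b1 = least_squares(units, cost)
print(b0, b1)   # approximately 9.8 and 2.2
```

Read as costs, these made-up numbers would mean an estimated fixed cost of about $9.80 per day and a variable cost of about $2.20 for each additional unit.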
Fixed and Variable Costs
Fixed costs are costs that must be paid whether or not any units are produced.
These costs are ‘fixed’ over a specified period of time or range of production.
Variable costs are costs that vary directly with the number of products produced.
Fixed and Variable Costs
There are some expenses that are mixed.
There are several ways to break a mixed cost into its fixed and variable components. One such method is the least squares line. That is, we express the total cost of some component as
y = b0 + b1x
where y = total mixed cost, b0 = fixed cost, b1 = variable cost per unit, and x is the number of units.
XM05-18 A tool and die maker operates out of a small shop making specialised tools. He is considering increasing the size of his business and needs to know more about his costs.
One such cost is electricity, which he needs to operate his machines and lights. (Some jobs require that he turn on extra bright lights to illuminate his work.) He keeps track of his daily electricity costs and the number of tools that he made that day. Determine the fixed and variable electricity costs.
The y-intercept is 9.587.
That is, the regression line strikes the y-axis at 9.587. This is simply the value of ŷ when x = 0.
When x = 0, no tools are produced, so the estimated fixed cost of electricity is $9.59 per day.
When we introduced the coefficient of correlation we pointed out that except for −1, 0, and +1 we cannot precisely interpret its meaning.
We can judge the coefficient of correlation in relation to its proximity to −1, 0, and +1 only.
Fortunately, we have another measure that can be precisely interpreted. It is the coefficient of determination, which is calculated by squaring the coefficient of correlation. For this reason we denote it R².
The coefficient of determination is
R² = 0.758
This tells us that 75.8% of the variation in electrical costs is explained by the number of tools. The remaining 24.2% is unexplained.
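The "explained variation" reading can be illustrated by computing the coefficient of determination two ways: by squaring the coefficient of correlation, and as one minus the ratio of unexplained variation (squared deviations from the least squares line) to total variation. The sketch below uses made-up data, not the electricity cost file, and the function name `determination` is hypothetical.

```python
def determination(x, y):
    """Return (r squared, 1 - SSE/SST) for a simple least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)        # total variation in y (SST)

    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)         # square of the correlation

    b1 = ss_xy / ss_xx                               # least squares slope
    b0 = mean_y - b1 * mean_x                        # least squares intercept
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))   # unexplained
    explained_share = 1 - sse / ss_yy
    return r_squared, explained_share


# Hypothetical data: units produced per day and a daily cost in dollars.
units = [1, 2, 3, 4, 5]
cost = [12, 14, 17, 18, 21]

r_squared, explained_share = determination(units, cost)
print(r_squared, explained_share)   # both approximately 0.984
```

For these made-up numbers about 98.4% of the variation in cost is explained by the number of units, and the two routes to R² agree, which is what makes the squared correlation interpretable as a proportion.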
Interpreting Correlation
Because of its importance we remind you about the correct interpretation of the analysis of the relationship between two numerical variables. That is, if two variables are linearly related, it does not mean that X is causing Y. It may mean that another variable is causing both X and Y or that Y is causing X. Remember
‘Correlation is not Causation’
Parameters and Sample Statistics