The goal of data analysis is to gain information from the data

Uploaded by: xx****x | Document ID: 242870231 | Uploaded: 2024-09-10 | Format: PPT | Pages: 28 | Size: 445.50 KB
Exploratory Data Analysis

The goal of data analysis is to gain information from the data.
Exploratory data analysis: a set of methods to display and summarize the data.
Data on just one variable: the distribution of the observations is analyzed by
- displaying the data in a graph that shows overall patterns and unusual observations (bar chart, histogram, density curve);
- computing descriptive statistics that summarize specific aspects of the data (center and spread).

Review of Histograms

- A histogram represents percent by area.
- The height of each block represents the frequency or percentage of the observations falling in the interval.
- The total area under a histogram is _ if the height is in frequencies.
- The total area under a histogram is _ if the height is in percentages.
- There is no fixed choice for the number of classes in a histogram:
  - if the class intervals are too small, the histogram will have spikes;
  - if the class intervals are too large, some information will be missed.
  Use your judgment!
- Typically, statistical software will choose the class intervals for you, but you can modify them.

Center and Spread

Measuring Centers

The most common measures are the mean (or average) and the median.

The Mean or Average
To calculate the average of a set of observations, add their values and divide by the number of observations.
Data: number of home runs hit by Babe Ruth as a Yankee:
54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
The mean number of home runs hit in a year is:
(54 + 59 + ... + 22) / 15 = 659 / 15 ≈ 43.9

The Median

The median M is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger.
To find the median:
- Sort all the observations in order of size, from smallest to largest.
- If the number of observations n is odd, the median M is the center observation in the ordered list, i.e. the ((n+1)/2)-th observation.
- If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
Example 1: ordered list of home runs hit by Babe Ruth:
22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
n = 15, median = 46 (the 8th observation)
Example 2: ordered list of home runs hit by Roger Maris:
8 13 14 16 23 26 28 33 39 61
n = 10, median = (23 + 26)/2 = 24.5

Mean versus Median

The mean and median of a symmetric distribution are close together.
In skewed distributions, the mean is farther out in the long tail than the median is. The mean is more sensitive to extreme values.
[Figure: symmetric, right-skewed, and left-skewed distributions with the mean and median marked; the median splits the area 50% / 50%]

Mean or Median?

- The mean is a good measure of the center of a symmetric distribution.
- The median is a resistant measure and should be used for skewed distributions. Its value is only slightly affected by the presence of extreme observations, no matter how large these observations are.

The Mode

The mode is the observation value with the highest frequency.
On average, the cars under study get 18.9 miles per gallon, and 50% of the cars under study get at least 18 miles per gallon.

Spread of a Distribution

Two measures of spread:
1. The Quartiles:
- The first quartile Q1 is the value such that 25% of the observations fall at or below it (Q1 is often called the 25th percentile).
- The third quartile Q3 is the value such that 75% of the observations fall at or below it (Q3 is often called the 75th percentile).
Typically used if the distribution of the observations is skewed.
[Figure: distribution with Q1, M, and Q3 marked; 25% label]

First quartile (Q1) = 16, third quartile (Q3) = 21.
What does this mean in terms of the data?

Percentiles (also called Quantiles)

In general, the n-th percentile is a value such that n% of the observations fall at or below it.
In the example before:
5th percentile = 10.35, 95th percentile = 24.1
10th percentile = 11, 90th percentile = 22
Hence about 80% of the cars get between 11 and 22 miles per gallon.
[Figure: n-th percentile, with n% of the area at or below it]

Descriptive measures for skewed distributions

If the histogram of the data is skewed, use the following descriptive statistics to describe the distribution of the observed variable:
Min, Q1, Median, Q3, Max
In our example:
Min = 8, Q1 = 16, Median = 18, Q3 = 21, Max = 61

The Standard Deviation

If a distribution is symmetric, use the average to measure the center and the standard deviation to measure the spread.
The standard deviation s (or SD) measures how far the observations are from the average.
Example: a person's metabolic rate is the rate at which the body consumes energy. Rates of 7 men in a study on dieting: 1792, 1666, 1614, 1460, 1867, 1439, 1362.
The mean is 1600 and the s.d. is s = 189.24.
[Figure: metabolic rates on an axis from 1300 to 1900, with two deviations from the mean marked: 1867 − 1600 = 267 and 1600 − 1439 = 161]

Formula for the SD

In symbols, the standard deviation s of n observations x_1, ..., x_n with mean \bar{x} is
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
(the divisor n − 1 reproduces s = 189.24 for the metabolic rate data).
The variance of an observed variable is defined as the square of the standard deviation:
Variance = s^2

Properties of the SD

- It measures the spread about the mean.
- It is only used in association with the mean. It is a good descriptive measure for symmetric distributions.
- If s = 0, all the observations have the same value.
- It is never negative; the larger s is, the more spread out the observations are around the mean.
- It is NOT a resistant measure: a few extreme observations may affect its value (make it very large).
- The variance is the square of the s.d.

Interpreting the SD

For many lists of observations, especially if their histogram is bell-shaped:
- roughly 68% of the observations in the list lie within 1 standard deviation of the average;
- roughly 95% of the observations lie within 2 standard deviations of the average.
[Figure: bell-shaped curve with 68% between Ave − s.d. and Ave + s.d., and 95% between Ave − 2 s.d. and Ave + 2 s.d.]

Example

In a large university, data were collected to study the academic achievements of computer science majors. We'll consider the SAT math (SATM) scores of 224 first-year CS students.
The average SATM score is 595.28, with s.d. s = 86.40.
[Figure: histogram of the SATM scores]
Are the average and s.d. good descriptions of the SATM score distribution?
- Roughly 68% of the students have scores between 510 and 680.
- Roughly 95% of the students have scores between 422 and 768.

CS students example: Descriptive statistics

Mean = 595.28, Std Deviation = 86.40, Max = 800, Min = 300
Q1 = 540, Median = 600.00, Q3 = 650, IQR = 110, 1.5 × IQR = 165
5th percentile = 460, 95th percentile = 750
[Figure: histogram of the SATM scores, with 95% of scores between 422 and 768]

Analysis of the scores for male and female students

[Figures: SATM scores for men; SATM scores for women]

Exploratory Data Analysis:

- Always plot your data.
- Look for overall patterns and striking deviations such as outliers.
- Calculate a numerical summary to describe the center and the spread.
NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve.

Computing descriptive statistics in Excel

There are two ways:
- use the formula palette (click on the fx button), OR
- use the Data Analysis ToolPak and select Descriptive Statistics.

The descriptive statistics tool

- Input range: sequence of cells containing the data
- Labels in first row
- Output range: tells Excel where to put the output
- Summary statistics: to be checked

Formulas for 5-number summary

Select an empty cell and type the name of the function you want to compute, or use the function palette for the list of available functions.
For instance, to compute the minimum of the city fuel consumption data, type
=min(b2:b31)

Normal distributions

Normal curves provide a simple, compact way to describe symmetric, bell-shaped distributions.
[Figure: SAT math scores for CS students, with a normal curve]

Money spent in a supermarket

Is the normal curve a good approximation?

The area under the histogram, i.e. the percentage of the observations, can be approximated by the corresponding area under the normal curve.
If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed).
We need to know only the average and the standard deviation of the observations!
[Figure: SAT math scores for CS students]
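
As a quick check of the mean and median rules on the home-run data from the slides, here is a minimal Python sketch using the standard-library statistics module. The variable names (ruth, maris, maris_trimmed) are illustrative and not part of the original deck. Dropping Maris's 61-home-run season is an added illustration, not something the slides do; it shows the mean reacting more strongly to one extreme value than the median.

```python
from statistics import mean, median, mode

# Home runs per season, as listed in the slides.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

ruth_mean = mean(ruth)        # sum of the values divided by n = 15
ruth_median = median(ruth)    # odd n: the 8th of the 15 ordered values
ruth_mode = mode(ruth)        # the most frequent value
maris_mean = mean(maris)      # sum of the values divided by n = 10
maris_median = median(maris)  # even n: mean of the two center values

print(f"Ruth:  mean = {ruth_mean:.2f}, median = {ruth_median}, mode = {ruth_mode}")
print(f"Maris: mean = {maris_mean:.2f}, median = {maris_median}")

# Illustration (an assumption of this sketch): remove the extreme season
# and compare how far each measure of center moves.
maris_trimmed = [x for x in maris if x != 61]
print(f"Maris without 61: mean = {mean(maris_trimmed):.2f}, "
      f"median = {median(maris_trimmed)}")
```

Both functions sort internally where needed, so the data can be passed in the original season order.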
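
The five-number summary (Min, Q1, Median, Q3, Max) described in the slides can be sketched in Python as follows. The fuel consumption data are not reproduced in the deck, so this sketch reuses Babe Ruth's home runs instead. The slides do not say which quartile convention they use; method="inclusive" is an assumption here, and other software (Excel included) may return slightly different quartiles on small samples. The helper names five_number_summary and share_at_or_below are hypothetical.

```python
from statistics import quantiles

# Babe Ruth's home runs per season, as listed in the slides.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    # The "inclusive" interpolation rule is an assumption of this sketch.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def share_at_or_below(values, cutoff):
    """Fraction of the observations that fall at or below the cutoff."""
    return sum(v <= cutoff for v in values) / len(values)

low, q1, med, q3, high = five_number_summary(ruth)
iqr = q3 - q1  # interquartile range, the IQR quoted in the SATM summary

print(f"Min = {low}, Q1 = {q1}, Median = {med}, Q3 = {q3}, Max = {high}")
print(f"IQR = {iqr}, 1.5 x IQR = {1.5 * iqr}")
print(f"Share at or below Q1: {share_at_or_below(ruth, q1):.0%}")
print(f"Share at or below Q3: {share_at_or_below(ruth, q3):.0%}")
```

With only 15 observations the shares at or below Q1 and Q3 will not be exactly 25% and 75%; the percentile definition is approximate for small lists.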
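
The standard deviation and the 68% / 95% intervals can also be checked numerically. The sketch below recomputes the SD of the metabolic rates with the n − 1 divisor, then builds the 1-SD and 2-SD intervals from the SATM mean and SD quoted in the slides. The last step, which evaluates areas under a normal curve with the same mean and SD, is an illustration under the assumption that the normal curve fits the scores well; the helper names sample_sd and interval are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Metabolic rates of the 7 men in the dieting study (from the slides).
rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

def sample_sd(values):
    """Standard deviation with the n - 1 divisor."""
    n = len(values)
    avg = sum(values) / n
    squared_deviations = [(x - avg) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))

avg_rate = sum(rates) / len(rates)
s_rate = sample_sd(rates)  # the slides quote s = 189.24
print(f"mean = {avg_rate:.0f}, s = {s_rate:.2f}, variance = {s_rate ** 2:.1f}")

# SATM mean and SD as quoted in the slides.
satm_mean, satm_sd = 595.28, 86.40

def interval(center, spread, k):
    """Interval reaching k standard deviations on each side of the center."""
    return center - k * spread, center + k * spread

one_sd = interval(satm_mean, satm_sd, 1)
two_sd = interval(satm_mean, satm_sd, 2)
print(f"within 1 s.d.: {one_sd[0]:.0f} to {one_sd[1]:.0f}")
print(f"within 2 s.d.: {two_sd[0]:.0f} to {two_sd[1]:.0f}")

# Assumption: approximate the score distribution by a normal curve with the
# same mean and SD, and read percentages off as areas under that curve.
curve = NormalDist(mu=satm_mean, sigma=satm_sd)
share_1sd = curve.cdf(one_sd[1]) - curve.cdf(one_sd[0])
share_2sd = curve.cdf(two_sd[1]) - curve.cdf(two_sd[0])
print(f"normal-curve area within 1 s.d.: {share_1sd:.1%}")
print(f"normal-curve area within 2 s.d.: {share_2sd:.1%}")
```

The intervals computed this way are slightly wider than the rounded 510 to 680 range on the slide, which is consistent with the slide's "roughly".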