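Unit code: 01
Student ID: 040101086
Classification number:
Security level: ___

Literature Translation

Overview of a Database Management System

School (department): School of Information Engineering
Major: Computer Science and Technology
Student name:
Supervisor:
April 15, 2008

Translation of the English text

Overview of a Database Management System
Hector Garcia-Molina, Jeff Ullman, Jennifer Widom

1.2 Overview of a Database Management System

Fig. 1.1 shows an outline of a complete database management system (DBMS). Single boxes represent system components, while double boxes represent in-memory data structures. Solid lines indicate control and data flow, while dashed lines indicate data flow only. Since the diagram is complicated, we shall consider the details in several stages. First, at the top, there are two distinct sources of commands to the DBMS:

(1) Conventional users and application programs that ask for data or modify data.
(2) A database administrator: a person or persons responsible for the structure or schema of the database.

1.2.1 Data-Definition Language Commands

The second kind of command is the simpler to process, and its path begins at the upper right side of Fig. 1.1. For example, the database administrator, or DBA, for a university registrar's database might decide that there should be a table or relation with columns for a student, a course the student has taken, and the grade for that student in that course. The DBA might also decide that the only allowable grades are A, B, C, D, and F. This structure and constraint information is all part of the schema of the database. Fig. 1.1 shows it as entered by the DBA, who needs special authority to execute schema-altering commands, since these can have profound effects on the database. These schema-altering DDL commands ("DDL" stands for "data-definition language") are parsed by a DDL processor and passed to the execution engine, which then goes through the index/file/record manager to alter the metadata, that is, the schema information for the database.

1.2.2 Overview of Query Processing

The great majority of interactions with the DBMS follow the path on the left side of Fig. 1.1. A user or an application program initiates some action that does not affect the schema of the database, but may affect the content of the database (if the action is a modification command) or will extract data from the database (if the action is a query). As noted in Section 1.1, the language in which these commands are expressed is called a data-manipulation language (DML) or, more colloquially, a query language. There are many data-manipulation languages available, but SQL, which was mentioned in Example 1.1, is by far the most commonly used. DML statements are handled by two separate subsystems, as follows.

Answering the query

The query is parsed and optimized by a query compiler. The resulting query plan, or sequence of actions the DBMS will perform to answer the query, is passed to the execution engine. The execution engine issues a sequence of requests for small pieces of data, typically records or tuples of a relation, to a resource manager that knows about data files (holding relations), the format and size of records in those files, and index files, which help find elements of data files quickly.

The requests for data are translated into pages, and these requests are passed to the buffer manager. We shall discuss the role of the buffer manager in Section 1.2.3, but briefly, its task is to bring appropriate portions of the data from secondary storage (normally disk), where it is kept permanently, to main-memory buffers. Normally, the page or "disk block" is the unit of transfer between buffers and disk.

The buffer manager communicates with a storage manager to get data from disk. The storage manager might involve operating-system commands, but more typically the DBMS issues commands directly to the disk controller.

Transaction processing

Queries and other DML actions are grouped into transactions, which are units that must be executed atomically and in isolation from one another. Often each query or modification action is a transaction by itself. In addition, the execution of transactions must be durable, meaning that the effect of any completed transaction must be preserved even if the system fails right after the transaction completes. We divide the transaction processor into two major parts:

(1) A concurrency-control manager, or scheduler, responsible for assuring atomicity and isolation of transactions.
(2) A logging and recovery manager, responsible for the durability of transactions.

We shall consider these components further in Section 1.2.4.

1.2.3 Storage and Buffer Management

The data of a database normally resides in secondary storage; in today's computer systems "secondary storage" generally means magnetic disk. However, to perform any useful operation on data, that data must be in main memory. It is the job of the storage manager to control the placement of data on disk and its movement between disk and main memory.

In a simple database system, the storage manager might be nothing more than the file system of the underlying operating system. However, for efficiency, a DBMS normally controls storage on the disk directly, at least under some circumstances. The storage manager keeps track of the location of files on the disk and obtains the block or blocks containing a file on request from the buffer manager. Recall that disks are generally divided into disk blocks, which are regions of contiguous storage containing a large number of bytes, perhaps 2^12 or 2^14 (about 4000 to 16,000 bytes).

The buffer manager is responsible for partitioning the available main memory into buffers, which are page-sized regions into which disk blocks can be transferred. Thus, all DBMS components that need information from the disk interact with the buffers and the buffer manager, either directly or through the execution engine. The kinds of information that various components may need include:

(1) Data: the contents of the database itself.
(2) Metadata: the database schema that describes the structure of, and constraints on, the database.
(3) Statistics: information gathered and stored by the DBMS about data properties such as the sizes of, and values in, various relations or other components of the database.
(4) Indexes: data structures that support efficient access to the data.

A more complete discussion of the buffer manager and its role appears in Section 15.7.

1.2.4 Transaction Processing

It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions. In addition, a DBMS offers the guarantee of durability: the work of a completed transaction will never be lost. The transaction manager therefore accepts transaction commands from an application, which tell the transaction manager when transactions begin and end, as well as information about the expectations of the application (some may not wish to require atomicity, for example). The transaction processor performs the following tasks:

(1) Logging: In order to assure durability, every change in the database is logged separately on disk. The log manager follows one of several policies designed to assure that no matter when a system failure or "crash" occurs, a recovery manager will be able to examine the log of changes and restore the database to some consistent state. The log manager initially writes the log in buffers and negotiates with the buffer manager to make sure that buffers are written to disk (where data can survive a crash) at appropriate times.
(2) Concurrency control: Transactions must appear to execute in isolation. But in most systems, there will in truth be many transactions executing at once. Thus, the scheduler (concurrency-control manager) must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if the transactions had executed in their entirety, one at a time. A typical scheduler does its work by maintaining locks on certain pieces of the database. These locks prevent two transactions from accessing the same piece of data in ways that interact badly. Locks are generally stored in a main-memory lock table, as suggested by Fig. 1.1. The scheduler affects the execution of queries and other database operations by forbidding the execution engine from accessing locked parts of the database.
(3) Deadlock resolution: As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something another transaction has. The transaction manager has the responsibility to intervene and cancel one or more transactions to let the others proceed.

1.2.5 The Query Processor

The portion of the DBMS that most affects the performance that the user sees is the query processor. In Fig. 1.1 the query processor is represented by two components:

1. The query compiler, which translates the query into an internal form called a query plan. The latter is a sequence of operations to be performed on the data. Often the operations in a query plan are implementations of "relational algebra" operations, which are discussed in Section 5.2. The query compiler consists of three major units:
(1) A query parser, which builds a tree structure from the textual form of the query.
(2) A query preprocessor, which performs semantic checks on the query (for example, making sure all relations mentioned by the query actually exist) and turns the parse tree into a tree of algebraic operators representing the initial query plan.
(3) A query optimizer, which transforms the initial query plan into the best available sequence of operations on the actual data.
The query compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest. For example, the existence of an index, which is a specialized data structure that facilitates access to data given values for one or more components of that data, can make one plan much faster than another.
2. The execution engine, which has the responsibility for executing each of the steps in the chosen query plan. The execution engine interacts with most of the other components of the DBMS, either directly or through the buffers. It must get the data from the database into buffers in order to manipulate that data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.

1.3 Outline of Database-System Studies

Ideas related to database systems can be divided into three broad categories:

(1) Design of databases. How does one develop a useful database? What kinds of information go into the database? How is the information structured? What assumptions are made about types or values of data items? How do data items connect?
(2) Database programming. How does one express queries and other operations on the database? How does one use other capabilities of a DBMS, such as transactions or constraints, in an application? How is database programming combined with conventional programming?
(3) Database system implementation. How does one build a DBMS, including such matters as query processing, transaction processing, and organizing storage for efficient access?

1.3.1 Database Design

Chapter 2 begins with a high-level notation for expressing database designs, called the entity-relationship model. In Chapter 3 we introduce the relational model, which is the model used by the most widely adopted DBMSs, and which we touched upon in Section 1.1.2. We show how to translate entity-relationship designs into relational designs, or "relational database schemas". Later, in Section 6.6, we show how to render relational database schemas formally in the data-definition portion of the SQL language.

Chapter 3 also introduces the reader to the notion of "dependencies", which are formally stated assumptions about relationships among tuples in a relation. Dependencies allow us to improve relational database designs through a process known as "normalization" of relations.

In Chapter 4 we look at object-oriented approaches to database design. There, we cover the language ODL, which allows one to describe databases in a high-level, object-oriented fashion. We also look at ways in which object-oriented design has been combined with relational modeling to yield the so-called "object-relational" model. Finally, Chapter 4 also introduces "semistructured data" as an especially flexible database model, and we see its modern embodiment in the document language XML.

1.3.2 Database Programming

Chapters 5 through 10 cover database programming. We start in Chapter 5 with an abstract treatment of queries in the relational model, introducing the family of operators that form "relational algebra". Chapter 6 introduces basic ideas regarding queries in SQL and the expression of database schemas in SQL. Chapter 7 covers aspects of SQL concerning constraints and triggers on the data.

Chapter 8 covers certain advanced aspects of SQL programming. First, while the simplest model of SQL programming is a stand-alone, generic query interface, in practice most SQL programming is embedded in a larger program that is written in a conventional language, such as C. In Chapter 8 we learn how to connect SQL statements with a surrounding program and how to pass data from the database to the program's variables and vice versa. This chapter also covers how one uses SQL features that control transactions, connect clients to servers, and authorize access to the database.

In Chapter 9 we turn our attention to standards for object-oriented database programming. Here, we consider two directions. The first, OQL (Object Query Language), can be seen as an attempt to make C++, or other object-oriented programming languages, compatible with the demands of high-level database programming. The second, the object-oriented features recently adopted in the SQL standard, can be viewed as an attempt to make relational databases and SQL compatible with object-oriented programming.

Finally, in Chapter 10, we return to the study of abstract query languages that we began in Chapter 5. We study logic-based languages and see how they have been used to extend the capabilities of modern SQL.

1.3.3 Database System Implementation

The third part of the book concerns how one can implement a DBMS. The subject of database system implementation can be divided roughly into three parts:

(1) Storage management: how secondary storage is used effectively to hold data and allow it to be accessed quickly.
(2) Query processing: how queries expressed in a very high-level language such as SQL can be executed efficiently.
(3) Transaction management: how to support transactions with the ACID properties mentioned in Section 1.2.4.

Each of these topics is covered by several chapters of the book.

Storage-management overview

Chapter 11 introduces the memory hierarchy. However, since secondary storage, especially disk, is central to the way a DBMS manages data, we examine in detail the way data is stored and accessed on disk. We introduce the "block model" for disk-based data, which influences almost everything that is done in a database system.

Chapter 12 relates the storage of data elements (relations, tuples, attribute values, and their equivalents in other data models) to the requirements of the block model of data. We then look at the important data structures used to construct indexes. An index is a data structure that supports efficient access to data. Chapter 13 covers the important one-dimensional index structures: indexed-sequential files, B-trees, and hash tables. These indexes are commonly used in a DBMS to support queries in which a value for an attribute is given and the tuples with that value are desired. B-trees are also used for access to a relation sorted by a given attribute. Chapter 14 discusses multidimensional indexes, which are data structures for specialized applications such as geographic databases, where queries typically ask for the contents of some region. These index structures can also support complex SQL queries that limit the values of two or more attributes, and some of these structures are beginning to appear in commercial DBMSs.

Query-processing overview

Chapter 15 covers the basics of query execution. We learn a number of algorithms for implementing the operations of relational algebra efficiently. These algorithms are designed to be efficient when data is stored on disk, and in some cases they differ considerably from the analogous main-memory algorithms.

In Chapter 16 we consider the architecture of the query compiler and optimizer. We begin with the parsing of queries and their semantic checking. Next, we consider the conversion of queries from SQL to relational algebra and the selection of a logical query plan, that is, an algebraic expression that represents the particular operations to be performed and the necessary constraints on the order of operations. Finally, we explore the selection of a physical query plan, in which the particular order of operations and the algorithm used to implement each operation are specified.

Transaction-processing overview

In Chapter 17 we see how a DBMS supports durability of transactions. The central idea is to keep a log of all changes to the database. Anything that is in main memory but not on disk can be lost in a crash (say, if the power supply is interrupted). Therefore we must be careful to move data from buffers to disk in the proper order, both the changes to the database itself and the changes to the log. Several logging strategies are available, but each limits our freedom of action in some way.

Then, in Chapter 18, we take up concurrency control, which assures isolation and atomicity. We view transactions as sequences of operations that read or write database elements. The major topic of the chapter is how to manage locks on database elements: the different types of locks that may be used, and the ways in which transactions acquire and release locks. The chapter also studies a number of ways to assure atomicity and isolation without using locks.

Chapter 19 concludes our study of transaction processing. We consider the interaction between the requirements of logging, as discussed in Chapter 17, and the requirements of concurrency, as discussed in Chapter 18. Handling of deadlocks, another important function of the transaction manager, is covered here as well. The extension of concurrency control to a distributed environment is also considered in Chapter 19. Finally, we consider the possibility that transactions are "long", taking hours or days rather than milliseconds. A long transaction cannot lock data without causing chaos among other potential users of that data, which forces us to rethink concurrency control for applications that involve long transactions.

1.3.4 Information Integration Overview

Much of the recent evolution of database systems has been toward capabilities that allow different data sources, which may be databases or information resources not managed by a DBMS, to work together in a larger whole. We introduced these issues briefly in Section 1.1.7. We discuss the principal modes of integration, including translated and integrated copies of sources, called a "data warehouse", and virtual "views" of a collection of sources, provided through what is called a "mediator".

Source: Hector Garcia-Molina, Jeff Ullman, Jennifer Widom. The Worlds of Database Systems.

Appendix: English original

Overview of a Database Management System
Hector Garcia-Molina, Jeff Ullman, Jennifer Widom

1.2 Overview of a Database Management System

In Fig. 1.1 we see an outline of a complete DBMS. Single boxes represent system components, while double boxes represent in-memory data structures.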
The solid lines indicate control and data flow, while dashed lines indicate data flow only. Since the diagram is complicated, we shall consider the details in several stages. First, at the top, we suggest that there are two distinct sources of commands to the DBMS:

1. Conventional users and application programs that ask for data or modify data.
2. A database administrator: a person or persons responsible for the structure or schema of the database.

1.2.1 Data-Definition Language Commands

The second kind of command is the simpler to process, and we show its trail beginning at the upper right side of Fig. 1.1. For example, the database administrator, or DBA, for a university registrar's database might decide that there should be a table or relation with columns for a student, a course the student has taken, and a grade for that student in that course. The DBA might also decide that the only allowable grades are A, B, C, D, and F. This structure and constraint information is all part of the schema of the database. It is shown in Fig. 1.1 as entered by the DBA, who needs special authority to execute schema-altering commands, since these can have profound effects on the database. These schema-altering DDL commands ("DDL" stands for "data-definition language") are parsed by a DDL processor and passed to the execution engine, which then goes through the index/file/record manager to alter the metadata, that is, the schema information for the database.

1.2.2 Overview of Query Processing

The great majority of interactions with the DBMS follow the path on the left side of Fig. 1.1. A user or an application program initiates some action that does not affect the schema of the database, but may affect the content of the database (if the action is a modification command) or will extract data from the database. Remember from Section 1.1 that the language in which these commands are expressed is called a data-manipulation language (DML) or somewhat colloquially a query language.
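
The distinction between schema-altering DDL commands and DML commands can be made concrete with a small sketch. The example below uses SQL through Python's built-in sqlite3 module purely as a convenient stand-in for a DBMS; the table name Enrollment, its column names, and the sample rows are illustrative assumptions and do not come from the text. It defines the registrar relation with the grade constraint described in Section 1.2.1, then issues a modification, a modification that violates the constraint, and a query.

```python
import sqlite3

# An in-memory SQLite database stands in for the DBMS (an assumption of
# this sketch; the text does not tie the discussion to any product).
conn = sqlite3.connect(":memory:")

# DDL: a schema-altering command. It records structure (three columns)
# and a constraint (only grades A, B, C, D, F) in the schema.
# The names Enrollment, student, course, grade are hypothetical.
conn.execute(
    """
    CREATE TABLE Enrollment (
        student TEXT NOT NULL,
        course  TEXT NOT NULL,
        grade   TEXT CHECK (grade IN ('A', 'B', 'C', 'D', 'F')),
        PRIMARY KEY (student, course)
    )
    """
)

# DML (modification): changes the content of the database, not its schema.
conn.execute("INSERT INTO Enrollment VALUES (?, ?, ?)", ("Sally", "CS101", "A"))
conn.commit()

# DML (modification) that violates the schema's grade constraint is refused.
try:
    conn.execute("INSERT INTO Enrollment VALUES (?, ?, ?)", ("Joe", "CS101", "E"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# DML (query): extracts data without changing schema or content.
rows = conn.execute(
    "SELECT student, grade FROM Enrollment WHERE course = ?", ("CS101",)
).fetchall()
print(rows, rejected)
```

In this sketch only the CREATE TABLE statement touches the schema; the INSERT and SELECT statements follow the query-processing path and leave the schema as the DBA defined it.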
There are many data-manipulation languages available, but SQL, which was mentioned in Example 1.1, is by far the most commonly used. DML statements are handled by two separate subsystems, as follows.

Answering the query

The query is parsed and optimized by a query compiler. The resulting query plan, or sequence of actions the DBMS will perform to answer the query, is passed to the execution engine. The execution engine issues a sequence of requests for small pieces of data, typically records or tuples of a relation, to a resource manager that knows about data files (holding relations), the format and size of records in those files, and index files, which help find elements of data files quickly.

The requests for data are translated into pages and these requests are passed to the buffer manager. We shall discuss the role of the buffer manager in Section 1.2.3, but briefly, its task is to bring appropriate portions of the data from secondary storage (disk, normally) where it is kept permanently, to main memory buffers. Normally, the page or "disk block" is the unit of transfer between buffers and disk.

The buffer manager communicates with a storage manager to get data from disk. The storage manager might involve operating-system commands, but more typically, the DBMS issues commands directly to the disk controller.

Transaction processing

Queries and other DML actions are grouped into transactions, which are units that must be executed atomically and in isolation from one another. Often each query or modification action is a transaction by itself. In addition, the execution of transactions must be durable, meaning that the effect of any completed transaction must be preserved even if the system fails in some way right after completion of the transaction. We divide the transaction processor into two major parts:

1. A concurrency-control manager, or scheduler, responsible for assuring atomicity and isolation of transactions, and
2. A logging and recovery manager, responsible for the durability of transactions.

We shall consider these components further in Section 1.2.4.

1.2.3 Storage and Buffer Management

The data of a database normally resides in secondary storage; in today's computer systems "secondary storage" generally means magnetic disk. However, to perform any useful operation on data, that data must be in main memory. It is the job of the storage manager to control the placement of data on disk and its movement between disk and main memory.

In a simple database system, the storage manager might be nothing more than the file system of the underlying operating system. However, for efficiency purposes, DBMSs normally control storage on the disk directly, at least under some circumstances. The storage manager keeps track of the location of files on the disk and obtains the block or blocks containing a file on request from the buffer manager. Recall that disks are generally divided into disk blocks, which are regions of contiguous storage containing a large number of bytes, perhaps 2^12 or 2^14 (about 4000 to 16,000 bytes).

The buffer manager is responsible for partitioning the available main memory into buffers, which are page-sized regions into which disk blocks can be transferred. Thus, all DBMS components that need information from the disk will interact with the buffers and the buffer manager, either directly or through the execution engine. The kinds of information that various components may need include:

1. Data: the contents of the database itself.
2. Metadata: the database schema that describes the structure of the database.
3. Statistics: information gathered and stored by the DBMS about data properties such as the sizes of, and values in, various relations or other components of the database.
4. Indexes: data structures that support efficient access to the data.

A more complete discussion of the buffer manager and its role appears in Section 15.7.

1.2.4 Transaction Processing

It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions. In addition, a DBMS offers the guarantee of durability: that the work of a completed transaction will never be lost. The transaction manager therefore accepts transaction commands from an application, which tell the transaction manager when transactions begin and end, as well as information about the expectations of the application (some may not wish to require atomicity, for example). The transaction processor performs the following tasks:

1. Logging: In order to assure durability, every change in the database is logged separately on disk. The log manager follows one of several policies designed to assure that no matter when a system failure or "crash" occurs, a recovery manager will be able to examine the log of changes and restore the database to some consistent state. The log manager initially writes the log in buffers and negotiates with the buffer manager to make sure that buffers are written to disk (where data can survive a crash) at appropriate times.
2. Concurrency control: Transactions must appear to execute in isolation. But in most systems, there will in truth be many transactions executing at once. Thus, the scheduler (concurrency-control manager) must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if the transactions had in fact executed in their entirety, one-at-a-time. A typical scheduler does its work by maintaining locks on certain pieces of the database. These locks prevent two transactions from accessing the same piece of data in ways that interact badly.
Locks are generally stored in a main-memory lock table, as suggested by Fig. 1.1. The scheduler affects the execution of queries and other database operations by forbidding the execution engine from accessing locked parts of the database.
3. Deadlock resolution: As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something another transaction has. The transaction manager has the responsibility to intervene and cancel ("roll-back" or "abort") one or more transactions to let the others proceed.

1.2.5 The Query Processor

The portion of the DBMS that most affects the performance that the user sees is the query processor. In Fig. 1.1 the query processor is represented by two components:

1. The query compiler, which translates the query into an internal form called a query plan. The latter is a sequence of operations to be performed on the data. Often the operations in a query plan are implementations of "relational algebra" operations, which are discussed in Section 5.2. The query compiler consists of three major units:
(a) A query parser, which builds a tree structure from the textual form of the query.
(b) A query preprocessor, which performs semantic checks on the query (e.g., making sure all relations mentioned by the query actually exist), and performing some tree transformations to turn the parse tree into a tree of algebraic operators representing the initial query plan.
(c) A query optimizer, which transforms the initial query plan into the best available sequence of operations on the actual data.
The query compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest. For example, the existence of an index, which is a specialized data structure that facilitates access to data, given values for one or more components of that data, can make one plan much faster than another.
2. The execution engine, which has the responsibility for executing each of the steps in the chosen query plan. The execution engine interacts with most of the other components of the DBMS, either directly or through the buffers. It must get the data from the database into buffers in order to manipulate that data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.

1.3 Outline of Database-System Studies

Ideas related to database systems can be divided into three broad categories:

1. Design of databases. How does one develop a useful database? What kinds of information go into the database? How is the information structured? What assumptions are made about types or values of data items? How do data items connect?
2. Database programming. How does one express queries and other operations on the database? How does one use other capabilities of a DBMS, such as transactions or constraints, in an application? How is database programming combined with conventional programming?
3. Database system implementation. How does one build a DBMS, including such matters as query processing, transaction processing and organizing storage for efficient access?

1.3.1 Database Design

Chapter 2 begins with a high-level notation for expressing database designs, called the entity-relationship model. We introduce in Chapter 3 the relational model, which is the model used by the most widely adopted DBMSs, and which we touched upon briefly in Section 1.1.2. We show how to translate entity-relationship designs into relational designs, or "relational database schemas". Later, in Section 6.6, we show how to render relational database schemas formally in the data-definition portion of the SQL language.

Chapter 3 also introduces the reader to the notion of "dependencies", which are formally stated assumptions about relationships among tuples in a relation.
Dependencies allow us to improve relational database designs, through a process known as "normalization" of relations.

In Chapter 4 we look at object-oriented approaches to database design. There, we cover the language ODL, which allows one to describe databases in a high-level, object-oriented fashion. We also look at ways in which object-oriented design has been combined with relational modeling, to yield the so-called "object-relational" model.

Finally, Chapter 4 also introduces "semistructured data" as an especially flexible database model, and we see its modern embodiment in the document language XML.

1.3.2 Database Programming

Chapters 5 through 10 cover database programming. We start in Chapter 5 with an abstract treatment of queries in the relational model, introducing the family of operators on relations that form "relational algebra".

Chapters 6 through 8 are devoted to SQL programming. As we mentioned, SQL is the dominant query language of the day. Chapter 6 introduces basic ideas regarding queries in SQL and the expression of database schemas in SQL.

Chapter 7 covers aspects of SQL concerning constraints and triggers on the data.

Chapter 8 covers certain advanced aspects of SQL programming. First, while the simplest model of SQL programming is a stand-alone, generic query interface, in practice most SQL programming is embedded in a larger program that is written in a conventional language, such as C. In Chapter 8 we learn how to connect SQL statements with a surrounding program and to pass data from the database to the program's variables and vice versa. This chapter also covers how one uses SQL features th