Foreign-Language Translation: Software Fault-Tolerance in SQL Servers

Graduation Project (Thesis) English Translation
检测情况,并介入杀害的线程持有的所有锁,但仍然在“睡眠”状态(即是在封锁链的根线程)。然而,由管理员手动除是昂贵的,如果所有的客户端能正确处理这种情况可能仍然允许前正在开展的大延误。明确处理在客户端应用程序的问题仅仅是的SATIS工厂。我们已经找到了另一种完全自动化的解决方案,这种方案,可以改变旧版客户端的情况下并包装在一起。它采用了PA的MSSQL,LOCK_TIMEOUT,可以明确设置为每个查询的具体。它的默认值是0,即阻塞的线程将等待永远为所需要的锁。设置锁超时过期时将其设置为非零值(10秒),使服务器提高异常“锁请求超时时间超过”。客户端不用等待永远会得到异常,可以回滚事务,并锁传递给其他客户。在一些回滚的交易成本中,该解决方案是不足以解决“坏堵”现象。如果我们在包装包括为LOCK_TIMEOUTs异常处理程序,将LOCK_TIMEOUT期间将逐步增加,或只是重复轧后回交易,从而使“坏堵”条件的决议完全透明的,它可以改善客户端。当然,我们简单的解决方案的成本,回滚多个交易:不一定是成本高。另一种方法杀线程块顶部?链也有其成本。被杀害,如果一个服务器线程,客户端和服务器之间的连接丢失和将要建立一个新的连接,这是一个比回滚几个交易更昂贵的操作。值得指出,我们让包装操作MSSQL特定的的锁定超时不干扰查询超时设置(不同的机制,可与任何SQL服务器的客户端应用程序)。一个复杂的查询可能需要很长的轻负载下(例如执行复杂的查询与子查询)完成,因此,所有查询一个大的查询超时设置是合理的。查询执行期间,多个锁可以交换服务器之间的线程访问共享资源的竞争。没有BUG56013,查询超时长可以共同存在很短,没有任何中止LOCK_TIMEOUT。这个在包装实施的变通方法,用户组织可以提供容错错误发现有损其装置的可靠性,无需等待供应商认识到这个问题,并发出了一个补丁,在任何情况下可能无法完全消除不受欢迎的行为。当已知问题留由供应商开放,系统集成商引入保护包装,或在客户端应用程序建立一个解决的唯一选择。选择后者可能是更有效的,但它是管理和正确实施更加修复和任何后续的升级,它必须在所有客户端应用程序的复制。请注意,在我们的例子中,如果某些客户端没有正确使用将LOCK_TIMEOUT的防守,他们可以阻止访问共享资源,另一方面,“忠诚”客户。即永远的客户,也是唯一的,收到多个“锁请求超时超出”不知何故被删除,直到非碰巧的客户端阻塞链,如超时各自的查询,这可能需要很长时间。总之,在包装实施修复,减少对供应商的修复的依赖,似乎总是比它在客户端应用程序实现更好的选项,只要可行的,并在运行时所产生的性能损失是能接受的。外文资料Software Fault-Tolerance with Off-the-Shelf SQL ServersP. Popov1, L. Strigini1, A. Kostov2, V. Mollov2, and D. Selensky2 1 Centre for Software Reliability, City University, London, UK ptp,striginicsr.city.ac.uk 2 Department of Computing, Technical University, Plovdiv, Bulgaria alexobs.bg,vmollov,selensky Abstract:With off-the-shelf software, software fault tolerance is almost the only means available for assuring better dependability than the off-the-shelf software offers, without the much higher costs of bespoke development or extra V&V. We report our experience with an experimental setup we have developed with off-the-shelf SQL database servers. First, we describe the use of a protective wrapper to mask the effects of a bug in one of the servers, without depending on an adequate fix from the vendors. 
We then discuss how to combine the diverse off-the-shelf servers into a diverse modular redundant configuration (N-version software or N-self-checking software). A wrapper guarantees the consistency between the diverse replicas of the database, serving multiple clients, by restricting the concurrency between the client transactions. We thus show that diverse modular redundancy with protective wrapping is a viable way of achieving fault tolerance with even complex off-the-shelf components, like database servers.

1 Introduction

The audience of this conference is well aware of the pros and cons of using off-the-shelf (OTS) software components¹. In this paper we focus on the dependability problems that OTS components pose to system integrators: their documentation is usually limited to well-defined interfaces, and simple example applications demonstrating how the components can be integrated in a system. Component vendors rarely provide information about the quality and V&V procedures used. This creates problems for any integrator with stringent dependability requirements. At least in non-safety-critical industry sectors, vendors often treat queries about the quality of the off-the-shelf components as unacceptable or even offensive [1]. System integrators are thus faced with the task of building systems out of components which cannot be trusted to be sufficiently dependable for the system's needs, and often are not.

As we argued elsewhere [2], fault tolerance is often the only viable way of obtaining one's required dependability at the system level, given the use of OTS components. In this common scenario, the alternatives (improving the OTS components, performing additional V&V activities) are either impossible or infeasible without costs comparable to those of bespoke development. This situation may well change in the future, if customers with serious dependability requirements achieve more clout in their dealings with OTS component developers, but this possibility does not help system integrators who are in this kind of situation now.

Fault tolerance may take multiple forms, e.g., additional (possibly purpose-built but relatively simple) components performing protective wrapping, watchdog, monitoring, or auditing functions, to detect undesired behaviour of the OTS components, prevent their producing serious consequences, and possibly effect recovery of the components' states; or even full-fledged replication with diverse versions of the components. Such "diverse modular redundancy" seems desirable because it offers end-to-end protection via a fairly simple architecture, and protection against the identical faults that would be present in replicas within a non-diverse modular-redundant system. The cost of procuring two or even more OTS components (some of which may be free) would still be far less than that of developing one's own. All these design solutions are well known.

¹ We use the term "components" in the generic engineering meaning of "pieces that are assembled to form a system, and are systems in their own right". "Components" may be anything ranging from software libraries, used to assemble applications, to complete applications that can be used as stand-alone systems. We consider together commercial-off-the-shelf (COTS) and non-commercial off-the-shelf, e.g. open-source, components: the difference is not significant in our discussion. Even when the source code is available, it may be impossible to make use of it: its size and complexity (and often poor documentation) may deny the system integrator the advantages usually taken for granted when the source code is available.

[R. Kazman and D. Port (Eds.): ICCBSS 2004, LNCS 2959, pp. 117–126, 2004. © Springer-Verlag Berlin Heidelberg 2004]
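The "diverse modular redundancy" mentioned above amounts to running functionally equivalent components in parallel and adjudicating their outputs. A minimal sketch of majority voting, with plain Python values standing in for the replicated servers' results (all names and values are invented for illustration):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas,
    or None if no value has a majority (an undetected-disagreement case)."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three simulated diverse replicas answer the same query; one fails
# and returns a wrong result, which the vote masks.
assert majority_vote([42, 42, 41]) == 42
# Three-way disagreement: no majority, so the adjudicator cannot decide.
assert majority_vote([1, 2, 3]) is None
```

A real adjudicator over SQL servers would have to compare whole result sets, and (as the paper discusses below) first guarantee that the replicas' database states are consistent; this sketch only shows the voting step itself.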
The questions, for the developers of a system using OTS components, are about the dependability gains, implementation difficulties and extra costs that they would bring for that specific system. To study these issues, we have selected a category of widely used, fairly complex OTS components: SQL database servers. Faults in the currently available SQL servers are common. For evidence one can just look at the long list of bug fixes supplied by the vendors with every new release of their products. Further reliability improvement of SQL servers seems only possible if fault tolerance through design diversity is employed [3]. Given the many available OTS SQL servers and the growing standardisation of their functionality (SQL 92, SQL 99), it seems reasonable to build a fault-tolerant SQL server from available OTS servers. We have developed an experimental testbed which implements a diverse-redundant SQL server by wrapping a redundant set of SQL servers, so that multiple users run their transactions concurrently on the wrapped SQL servers. We are running experiments to determine the dependability gains achieved through fault tolerance [4]. In this paper, we report on experience gained about the design aspects of building fault tolerance with these specific OTS components:

- regarding diverse modular redundancy, we consider N-version programming (NVP) and N-version self-checking programming (NSCP), to use the terminology of [5]. In NVP, the system's output is formed by a vote on the replicated outputs. In NSCP, each diverse "version" is supposed to fail cleanly, so that any one of the replicated outputs can be used as the system's output. Both solutions depend on guaranteed consistency between the states of the diverse replicas of the database. This problem of replica consistency, despite having been under scrutiny for a long time, is still far from being solved in general for database servers [6, 7];

- regarding protective wrapping, we have outlined elsewhere [8] the idea of protective wrapping for OTS components. Wrappers intercept both incorrect and potentially dangerous communications between OTS components and the rest of the system, thus protecting them against each other's faults. For an OTS SQL server, the protective wrapper protects the clients against faults of the server, the server against faults of the clients, and also each client against the indirect effects of faults of the other clients. In our design approach we assume no changes to the OTS SQL servers, since we do not have access to their internals. By necessity, therefore, our solutions are based on restricting the interaction between the clients and the SQL server(s).

2 The Experimental Environment for OTS SQL Servers

The testbed has been built in collaboration between the Centre for Software Reliability at City University, London, and the Technical University in Plovdiv, Bulgaria. It allows one to run various client applications concurrently against diverse SQL servers which use a significant sub-set of the entry-level SQL-92 language. The testbed contains a wrapper for the SQL servers, implemented as a DCOM component, accessed by the client applications. Fig. 1 shows its architecture. The testbed was created to allow experiments with 3 functionally comparable OTS SQL servers: Oracle 8.0.5, MS SQL 7.0, Interbase 6.0. The servers can run under any operating system for which there are versions of the products used; we used Windows 2000 Professional edition for experiments with the three servers, and several operating systems (Win2k, Win98 and RedHat Linux 6.0) for experiments with Interbase.

Fig. 1. Architecture of the testbed

We have experimented with between 1 and 100 clients and varying numbers of transactions per client, which include queries (SELECT) or modifications of the databases (INSERT, UPDATE, DELETE), "triggers" and "stored procedures".
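A client's mix of statement types of this kind can be sketched as a weighted random draw over query types. This is only an illustration of demand-profile-driven workload generation; the profile weights below are invented, not the paper's actual distributions:

```python
import random

def make_workload(n, profile, seed=0):
    """Draw n statement types according to a probability distribution
    (a 'demand profile' mapping statement type -> weight)."""
    rng = random.Random(seed)          # seeded for reproducible experiments
    types, weights = zip(*profile.items())
    return rng.choices(types, weights=weights, k=n)

# Hypothetical demand profile: mostly reads, some modifications.
profile = {"SELECT": 0.7, "INSERT": 0.1, "UPDATE": 0.1, "DELETE": 0.1}
workload = make_workload(100, profile, seed=42)
assert len(workload) == 100
assert set(workload) <= set(profile)
```

The testbed described in this section additionally draws parameter values for each statement and supports a "Template Editor" for extending the transaction set; this sketch covers only the type-selection step.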
We have used two applications: (i) a variation of a real-life warehouse application; (ii) a simplified banking application, in which funds are being transferred between accounts under the invariant condition that the total amount of funds remains constant. This invariant allows a simple correctness check ("oracle") for whether the overall series of transactions is processed correctly by the server(s). With the "warehouse" client application, a comparison of the tables in the databases checks at predefined intervals whether the databases remain consistent, but no oracle exists to detect which of the servers has failed in case of disagreement. A third application is under development, based on the TPC-C benchmark [9]. The testbed allows different configuration parameters to be changed, such as:

- the number of clients and of queries submitted by each in an experiment;

- the "demand profiles" of the clients: a probability distribution defined on the set of queries used. The query types and parameter values are chosen by the testbed, according to user-set probability distributions. A "Template Editor" tool exists for extending the set of transactions and setting the probability distributions, so that one can experiment with a wide range of loads on the servers;

- various modes of concurrency control between the clients:
  Free mode, i.e. unrestricted access to the servers by all clients. The level of isolation between the transactions provided by the servers is set to "serialisable", but no mechanism (e.g. atomic broadcast) is implemented in the testbed to control the order in which the queries are delivered to the individual servers and hence executed by the servers. The clients are multithreaded, with a separate thread talking to each of the servers.
  Bottleneck mode, which imposes a very restrictive total order on the access by the clients to the servers, with no concurrency between the clients. The threads representing the clients are synchronised (using critical sections) and the servers are supplied with only one transaction at a time. The next transaction (coming from any of the competing clients) is only initiated after the previous transaction is either committed or rolled back.
  WriteBottleneck mode, in which the wrapper allows an arbitrary number of concurrent observing (i.e. read-only) transactions to be sent to the servers, but no concurrency between the modifying transactions (those which contain at least one INSERT, DELETE or UPDATE statement). A modifying transaction can only be started after the previous modifying transaction is completed (committed or rolled back);

- the intervals for comparison of the tables in the databases and for "ping"-ing the servers to check whether they are still functioning.

For each experiment, a detailed log of events is recorded, including, e.g., all queries as sent, all exceptions raised, ping responses, and results of database comparison, with timestamps for queries and responses.

3 Wrapping against Known Faults in a Server

A form of fault tolerance is to deal explicitly with known faults. We give one example here. With the Microsoft SQL v7.0 server, we observed that when the number of clients exceeded 20, the sharing of locks between the competing threads created by the SQL server to serve its clients could cease to work properly. A peculiar situation could arise in which some clients acquired the locks they needed but remained in the "waiting" state, thus keeping all the other clients (trying to acquire the same locks) from continuing. (This abnormal situation is not a deadlock, which the server would detect and handle by rolling back all competing transactions but one.) The problem only occurs when the number of concurrent clients is large, and becomes more frequent as this number increases. As we later found out, Microsoft reported the problem as due to a fault of the SQL server (Bug #56013) [10].
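The abnormal state just described can be pictured as a chain of blocked threads whose root holds the locks but sleeps. A toy model of locating that root thread, with invented thread IDs (a real diagnosis would query the server's lock tables):

```python
def blocking_root(waits_for, thread):
    """Follow the waits-for chain from `thread` to the thread at its root:
    the one that blocks others but is itself waiting for nobody.
    The `seen` set guards against cycles (a genuine deadlock)."""
    seen = set()
    while thread in waits_for and thread not in seen:
        seen.add(thread)
        thread = waits_for[thread]
    return thread

# waits_for[t] = the thread t is blocked on.
# Thread 7 holds the locks but sleeps, so 1, 2, 3 and 4 all hang behind it.
waits_for = {1: 2, 2: 7, 3: 7, 4: 1}
assert blocking_root(waits_for, 4) == 7
```

Note the chain has no cycle: that is exactly why the server's deadlock detector never fires, and why the "bad blocking" state persists until something removes the root.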
A work-around is for the administrator (or an application) to detect the situation and intervene by killing the thread that holds all the locks but remains in a "sleeping" state (i.e. is at the root of the chain of blocked threads). However, manual intervention by the administrator is costly and may still allow large delays before being undertaken. Handling the problem explicitly in the client applications is only satisfactory if all clients handle the situation properly. We have found another, fully automated solution, which is relatively painless and can be incorporated in a wrapper, without changes to the legacy clients. It utilises a parameter specific to MSSQL, LOCK_TIMEOUT, which can be explicitly set for each query. Its default value is 0, i.e. a blocked thread would wait for the needed lock forever. Setting it to a non-zero value (we used 10 seconds) makes the server raise the exception "Lock request timeout period exceeded" when the set lock timeout expires. Now the client, instead of waiting forever, will get the exception and can roll the transaction back, while the locks are passed on to other clients. This solution is sufficient to resolve the occurrences of "bad blocking", at the cost of some number of transactions being rolled back. It can be improved if we include in the wrapper an exception handler for LOCK_TIMEOUTs, which would gradually increase the LOCK_TIMEOUT period, or just repeat the transaction after rolling it back, and thus make the resolution of the "bad blocking" condition completely transparent to the client. The cost of our simple solution, of course, is rolling back multiple transactions: not necessarily a high cost. The alternative (killing the thread at the top of the blocking chain) also has its cost. If a server thread is killed, the connection between the client and the server is lost and a new connection will have to be established, which is a more expensive operation than rolling back a few transactions.
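The improved wrapper behaviour described above (catch the lock-timeout exception, roll back, retry with a longer timeout) might be sketched as follows. A simulated connection stands in for a real MSSQL client; the class and method names are invented for this sketch, not any real driver's API:

```python
class LockTimeoutError(Exception):
    """Stands in for MSSQL's 'Lock request timeout period exceeded' exception."""

class FakeConnection:
    """Simulated server connection: the contended lock frees up after a while."""
    def __init__(self, lock_free_after=25):
        self.lock_free_after = lock_free_after  # seconds until the lock is released

    def execute(self, query, lock_timeout):
        # The lock wait exceeds the timeout -> the server raises the exception.
        if lock_timeout < self.lock_free_after:
            raise LockTimeoutError(query)
        return "ok"

    def rollback(self):
        pass

def run_with_retry(conn, query, timeout=10, max_attempts=5):
    """Wrapper policy: on lock timeout, roll back and retry with a doubled
    timeout, making the 'bad blocking' resolution transparent to the client."""
    for _ in range(max_attempts):
        try:
            return conn.execute(query, lock_timeout=timeout)
        except LockTimeoutError:
            conn.rollback()
            timeout *= 2  # gradually increase LOCK_TIMEOUT, as suggested above

    raise LockTimeoutError(query)  # give up after max_attempts

assert run_with_retry(FakeConnection(), "UPDATE stock SET qty = qty - 1") == "ok"
```

The client sees only the final outcome; the intermediate rollbacks and retries happen inside the wrapper, which is the point of the scheme.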
It is worth pointing out that our letting the wrapper manipulate the MSSQL-specific lock timeout does not interfere with the setting of the query timeout (a different mechanism, available to the client applications with any SQL server). A complex query may take very long to complete even under light load (e.g. executing a complex query with sub-queries) and, therefore, setting a large query timeout for all queries is reasonable. During the execution of a query, multiple locks can be exchanged between the server threads which compete for access to a shared resource. Without bug #56013, long query timeouts can co-exist with a very short LOCK_TIMEOUT without any aborts. With this approach of implementing work-arounds in wrappers, a user organisation can provide fault tolerance for bugs it discovers to be detrimental to the dependability of its installations, without waiting for the vendor to recognise the problem and issue a patch, which in any case may not completely eliminate the undesired behaviour. When known problems are left open by the vendor, the only options for the system integrator are to introduce a protective wrapper or to build a work-around into the client applications. The latter may be more efficient, but it is harder to manage and implement correctly: the fix, and any subsequent update to it, must be replicated in all the client applications. Note that, in our example, if some clients did not properly use the LOCK_TIMEOUT defence, they could still block the access of the other, "well-behaved" clients to the shared resources: the well-behaved clients would receive repeated "Lock request timeout period exceeded" exceptions until the blocking client happened to be removed from the blocking chain, e.g. by the timeout on its own query, which may take a long time. In summary, implementing a fix in the wrapper, which reduces the dependence on a fix from the vendor, seems always a better option than implementing it in the client applications, whenever it is feasible and the run-time performance penalty it produces is acceptable.
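The earlier point about a generous query timeout coexisting with a short LOCK_TIMEOUT can be illustrated numerically: a query that waits for several locks in turn stays well under the query timeout as long as each individual wait is bounded. The numbers below are invented for illustration:

```python
def query_completes(lock_waits, lock_timeout, query_timeout):
    """A query succeeds iff no single lock wait exceeds lock_timeout
    and the total time stays within query_timeout (a simplified model
    that counts only lock-wait time)."""
    return (all(w <= lock_timeout for w in lock_waits)
            and sum(lock_waits) <= query_timeout)

# Five locks acquired in turn, each within the short 10 s LOCK_TIMEOUT;
# the whole query still fits a 60 s query timeout.
assert query_completes([2, 8, 5, 9, 3], lock_timeout=10, query_timeout=60)

# A single 'bad blocking' wait trips the short lock timeout long before
# the query timeout would fire, which is exactly the desired behaviour.
assert not query_completes([2, 30], lock_timeout=10, query_timeout=60)
```

This is why the two mechanisms do not interfere: the lock timeout bounds each wait, while the query timeout bounds the whole execution.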