Data Storage Schemes


Introduction

The source is a paper by Rick Cattell that discusses scalable data storage schemes for structured and unstructured data, including key-value, document-based, and column-oriented stores. (Note: NoSQL is a key enabler of big-data applications. Rendering NoSQL as "unstructured" is not quite accurate, because the more common reading is "Not Only SQL". In other words, NoSQL does not stand in opposition to structured SQL; it can cover both structured and unstructured data.)

Paper information

Scalable SQL and NoSQL Data Stores
Rick Cattell
Originally published in 2010, last revised December 2011

ABSTRACT

In this paper, we examine a number of SQL and so-called “NoSQL” data stores designed to scale simple OLTP-style application loads over many servers. Originally motivated by Web 2.0 applications, these systems are designed to scale to thousands or millions of users doing updates as well as reads, in contrast to traditional DBMSs and data warehouses.

We contrast the new systems on their data model, consistency mechanisms, storage mechanisms, durability guarantees, availability, query support, and other dimensions. These systems typically sacrifice some of these dimensions, e.g. database-wide transaction consistency, in order to achieve others, e.g. higher availability and scalability.

Note: Bibliographic references for systems are not listed, but URLs for more information can be found in the System References table at the end of this paper.

Caveat: Statements in this paper are based on sources and documentation that may not be reliable, and the systems described are “moving targets,” so some statements may be incorrect. Verify through other sources before depending on information here. Nevertheless, we hope this comprehensive survey is useful!
Check for future corrections on the author's web site.

Disclosure: The author is on the technical advisory board of Schooner Technologies and has a consulting business advising on scalable databases.

1. OVERVIEW

In recent years a number of new systems have been designed to provide good horizontal scalability for simple read/write database operations distributed over many servers. In contrast, traditional database products have comparatively little or no ability to scale horizontally on these applications. This paper examines and compares the various new systems. (Translator's note: in practice this means comparing the various NoSQL designs, for example how MongoDB and HBase are classified, with a brief introduction to each.)

Many of the new systems are referred to as “NoSQL” data stores. The definition of NoSQL, which stands for “Not Only SQL” or “Not Relational”, is not entirely agreed upon. For the purposes of this paper, NoSQL systems generally have six key features:

1. the ability to horizontally scale “simple operation” throughput over many servers,
2. the ability to replicate and to distribute (partition) data over many servers,
3. a simple call level interface or protocol (in contrast to a SQL binding),
4. a weaker concurrency model than the ACID transactions of most relational (SQL) database systems,
5. efficient use of distributed indexes and RAM for data storage, and
6. the ability to dynamically add new attributes to data records.

The systems differ in other ways, and in this paper we contrast those differences. They range in functionality from the simplest distributed hashing, as supported by the popular memcached open source cache, to highly scalable partitioned tables, as supported by Google's BigTable [1].
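As a rough illustration of the "simplest distributed hashing" mentioned above, the following sketch hashes each key to choose one of several servers. It is a simplified assumption for illustration, not memcached's actual client algorithm: the names `server_for` and `ShardedCache` are hypothetical, each "server" is just an in-process dictionary, and plain modulo hashing is used (real clients often prefer consistent hashing so that adding a server moves fewer keys).

```python
import hashlib


def server_for(key, servers):
    """Pick the server responsible for `key` by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # onto the list of servers.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


class ShardedCache:
    """Toy client-side partitioned cache; each 'server' is a local dict."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.stores = {name: {} for name in self.servers}

    def set(self, key, value):
        self.stores[server_for(key, self.servers)][key] = value

    def get(self, key, default=None):
        return self.stores[server_for(key, self.servers)].get(key, default)


cache = ShardedCache(["server-a", "server-b", "server-c"])
cache.set("user:1", {"name": "Ada"})
profile = cache.get("user:1")  # read goes to the same server as the write
```

Because the mapping depends only on the key and the server list, every client computes the same placement without a central directory, which is why no RAM or disk needs to be shared among the servers.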
In fact, BigTable, memcached, and Amazon's Dynamo [2] provided a “proof of concept” that inspired many of the data stores we describe here:

- Memcached demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes.
- Dynamo pioneered the idea of eventual consistency as a way to achieve higher availability and scalability: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually.
- BigTable demonstrated that persistent record storage could be scaled to thousands of nodes, a feat that most of the other systems aspire to.

A key feature of NoSQL systems is “shared nothing” horizontal scaling: replicating and partitioning data over many servers. This allows them to support a large number of simple read/write operations per second. This simple operation load is traditionally called OLTP (online transaction processing), but it is also common in modern web applications.

The NoSQL systems described here generally do not provide ACID transactional properties: updates are eventually propagated, but there are limited guarantees on the consistency of reads.
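The behavior described above, where updates are eventually propagated but a read may not see the latest value, can be sketched with a toy replicated store. This is a minimal illustration under stated assumptions, not how Dynamo or any system in this paper is implemented: the class `EventuallyConsistentStore` and its methods are hypothetical, propagation is triggered by hand, and conflicting concurrent writes are ignored.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy replicated store: a write is applied to one replica at once
    and queued for the others; propagate() delivers the queued updates."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # (replica_index, key, value) still to deliver

    def write(self, key, value, replica=0):
        self.replicas[replica][key] = value
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, value))

    def read(self, key, replica=0):
        # May return stale data (or None) if the update has not arrived yet.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Deliver every queued update in the order it was written.
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42", replica=1)  # update not yet propagated
store.propagate()
fresh = store.read("cart:42", replica=1)  # now visible on every replica
```

The write returns without waiting for the other replicas, which is the availability and latency gain; the cost is the window in which `read` on another replica returns an older value.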
Some authors suggest a “BASE” acronym in contrast to the “ACID” acronym:

- BASE = Basically Available, Soft state, Eventually consistent
- ACID = Atomicity, Consistency, Isolation, and Durability

The idea is that by giving up ACID constraints, one can achieve much higher performance and scalability. However, the systems differ in how much they give up. For example, most of the systems call themselves “eventually consistent”, meaning that updates are eventually propagated to all nodes, but many of them provide mechanisms for some degree of consistency, such as multi-version concurrency control (MVCC).

Proponents of NoSQL often cite Eric Brewer's CAP theorem [4], which states that a system can have only two out of three of the following properties: consistency, availability, and partition-tolerance. The NoSQL systems generally give up consistency. However, the trade-offs are complex, as we will see.

New relational DBMSs have also been introduced to provide better horizontal scaling for OLTP, when compared to traditional RDBMSs. After examining the NoSQL systems, we will look at these SQL systems and compare the strengths of the approaches. The SQL systems strive to provide horizontal scalability without abandoning SQL and ACID transactions. We will discuss the trade-offs here.

In this paper, we will refer to both the new SQL and NoSQL systems as data stores, since the term “database system” is widely used to refer to traditional DBMSs.
However, we will still use the term “database” to refer to the stored data in these systems. All of the data stores have some administrative unit that you would call a database: data may be stored in one file, or in a directory, or via some other mechanism that defines the scope of data used by a group of applications. Each database is an island unto itself, even if the database is partitioned and distributed over multiple machines: there is no “federated database” concept in these systems (as there is with some relational and object-oriented databases) that would allow multiple separately-administered databases to appear as one. Most of the systems allow horizontal partitioning of data, storing records on different servers according to some key; this is called “sharding”. Some of the systems also allow vertical partitioning, where parts of a single record are stored on different servers.

1.1 Scope of this Paper

Before proceeding, some clarification is needed in defining “horizontal scalability” and “simple operations”. These define the focus of this paper.

By “simple operations”, we refer to key lookups, reads and writes of one record or a small number of records. This is in contrast to complex queries or joins, read-mostly access, or other application loads. With the advent of the web, especially Web 2.0 sites where millions of users may both read and write data, scalability for simple database operations has become more important.
For example, applications may search and update multi-server databases of electronic mail, personal profiles, web postings, wikis, customer records, online dating records, classified ads, and many other kinds of data. These all generally fit the definition of “simple operation” applications: reading or writing a small number of related records in each operation.

The term “horizontal scalability” means the ability to distribute both the data and the load of these simple operations over many servers, with no RAM or disk shared among the servers. Horizontal scaling differs from “vertical” scaling, where a database system utilizes many cores and/or CPUs that share RAM and disks. Some of the systems we describe provide both vertical and horizontal scalability, and the effective use of multiple cores is important, but our main focus is on horizontal scalability, because the number of cores that can share memory is limited, and horizontal scaling generally proves less expensive, using commodity servers. Note that horizontal and vertical partitioning are not related to horizontal and vertical scaling, except that they are both useful for horizontal scaling.

1.2 Systems Beyond our Scope

Some authors have used a broad definition of NoSQL, including any database system that is not relational.
Specifically, they include:

- Graph database systems: Neo4j and OrientDB provide efficient distributed storage and queries of a graph of nodes with references among them.
- Object-oriented database systems: Object-oriented DBMSs (e.g., Versant) also provide efficient distributed storage of a graph of objects, and materialize these objects as programming language objects.
- Distributed object-oriented stores: Very similar to object-oriented DBMSs, systems such as GemFire distribute object graphs in-memory on multiple servers.

These systems are a good choice for applications that must do fast and extensive reference-following, especially where data fits in memory. Programming language integration is also valuable. Unlike the NoSQL systems, these systems generally provide ACID transactions. Many of them provide horizontal scaling for reference-following and distributed query decomposition, as well. Due to space limitations, however, we have omitted these systems from our comparisons. The applications and the necessary optimizations for scaling for these systems differ from the systems we cover here, where key lookups and simple operations predominate over reference-following and complex object behavior. It is possible these systems can scale on simple operations as well, but that is a topic for a future paper, and proof through benchmarks.

Data warehousing database systems provide horizontal scaling, but are also beyond the scope of this paper.
Data warehousing applications are different in important ways:

- They perform complex queries that collect and join information from many different tables.
- The ratio of reads to writes is high: that is, the database is read-only or read-mostly.

There are existing systems for data warehousing that scale well horizontally. Because the data is infrequently updated, it is possible to organize or replicate the database in ways that make scaling possible.

1.3 Data Model Terminology

Unlike relational (SQL) DBMSs, the terminology used by NoSQL data stores is often inconsistent. For the purposes of this paper, we need a consistent way to compare the data models and functionality.

All of the systems described here provide a way to store scalar values, like numbers and strings, as well as BLOBs. Some of them also provide a way to store more complex nested or reference values. The systems all store sets of attribute-value pairs, but use different data structures, specifically:

- A “tuple” is a row in a relational table, where attribute names are pre-defined in a schema, and the values must be scalar. The values are referenced by attribute name, as opposed to an array or list, where they are referenced by ordinal position.
- A “document” allows values to be nested documents or lists as well as scalar values, and the attribute names are dynamically defined for each document at runtime. A document differs from a tuple in that the attributes are not defined in a global schema, and a wider range of values is permitted.
- An “extensible record” is a hybrid between a tuple and a document, where families of attributes are defined in a schema, but new attributes can be added (within an attribute family) on a per-record basis. Attributes may be list-valued.
- An “object” is analogous to an object in programming languages, but without the procedural methods. Values may be references or nested objects.

1.4 Data Store Categories

In this paper, the data stores are grouped according to their data model:

- Key-value Stores: These systems store values and an index to find them, based on a programmer-defined key.
- Document Stores: These systems store documents, as just defined. The documents are indexed and a simple query mechanism is provided.
- Extensible Record Stores: These systems store extensible records that can be partitioned vertically and horizontally across nodes. Some papers call these “wide column stores”.
- Relational Databases: These systems store (and index and query) tuples. The new RDBMSs that provide horizontal scaling are covered in this paper.

Data stores in these four categories are covered in the next four sections, respectively. We will then summarize and compare the systems.
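To make the data model terms from Section 1.3 concrete, the sketch below models a tuple, a document, and an extensible record with plain Python structures. It is an illustration under assumptions, not the API of any data store covered here: `PERSON_SCHEMA`, `FAMILIES`, `make_tuple`, and `set_attribute` are hypothetical names, and the “object” model is omitted.

```python
# Tuple: attribute names are fixed by a schema and values must be scalar.
PERSON_SCHEMA = ("id", "name", "age")
SCALARS = (int, float, str, bytes, bool, type(None))


def make_tuple(schema, **values):
    if set(values) != set(schema):
        raise ValueError("attributes must match the schema exactly")
    if not all(isinstance(v, SCALARS) for v in values.values()):
        raise TypeError("tuple values must be scalar")
    return {name: values[name] for name in schema}


# Document: attribute names are chosen per document, and values may be
# lists or nested documents as well as scalars.
document = {
    "id": 1,
    "name": "Ada",
    "emails": ["ada@example.org"],   # list value
    "address": {"city": "London"},   # nested document
}

# Extensible record: attribute families come from a schema, but attributes
# inside a family can be added on a per-record basis.
FAMILIES = {"profile", "activity"}


def set_attribute(record, family, attribute, value):
    if family not in FAMILIES:
        raise KeyError(f"unknown attribute family: {family}")
    record.setdefault(family, {})[attribute] = value
    return record


row = make_tuple(PERSON_SCHEMA, id=1, name="Ada", age=36)
record = set_attribute({}, "profile", "nickname", "ada")
record = set_attribute(record, "activity", "logins", [3, 5])  # list-valued
```

The difference between the three lies in what is checked up front: the tuple validates both names and value types against a schema, the extensible record validates only the family, and the document validates nothing.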
