Monday, June 3, 2019
NoSQL Databases | Research Paper
In the world of enterprise computing, we have seen many changes in platforms, languages, processes, and architectures. But throughout that entire time one thing has remained unchanged: relational databases. For almost as long as we have been in the software profession, relational databases have been the default choice for serious data storage, especially in the world of enterprise applications. There have been times when a database technology threatened to take a piece of the action, such as object databases in the 1990s, but these alternatives never got anywhere.

In this research paper, a new challenger on the block is explored under the name of NoSQL. It came into existence because there was a need to handle large volumes of data, which forced a shift to building bigger hardware platforms through large clusters of commodity servers. The term NoSQL applies to a number of recent non-relational databases such as Cassandra, MongoDB, Neo4j, and Azure Table Storage. NoSQL databases offer the advantage of building systems that perform better, scale much better, and are easier to program with.

The paper considers that we are now in a world of Polyglot Persistence where different technologies are used by enterprises for the management of data. For this reason, architects should know what these technologies are and should be able to decide which ones to use for various purposes. It provides information to decide whether NoSQL databases can be seriously considered for future projects.
The attempt is to provide enough background information on NoSQL databases, on how they work and what advantages they bring to the table.

Table of Contents
Introduction
Literature
Technical Aspects
  Document Oriented
    Merits
    Demerits
    Case Study: MongoDB
  Key Value
    Merits
    Demerits
    Case Study: Azure Table Storage
  Column Stores
    Merits
    Demerits
    Case Study: Cassandra
  Graphs
    Merits
    Demerits
    Case Study: Neo4j
Conclusion
References

Introduction
NoSQL is usually interpreted as "not only SQL". It is a class of database management systems that does not adhere to the traditional RDBMS model. NoSQL databases handle a large variety of data, including structured, unstructured and semi-structured data. NoSQL database systems are highly optimized for retrieval and append operations and offer little functionality other than record storage. Run-time flexibility is reduced compared to full SQL systems, but there are marked gains in scalability and performance for some data models [3].

NoSQL databases prove to be beneficial when a huge amount of data is to be processed and a relational model does not suit the nature of the data. What truly matters is the ability to store and retrieve huge amounts of data, not the relationships between them. This is especially useful for real-time or statistical analysis of growing amounts of data.

The NoSQL community is experiencing rapid change. It is transitioning from community-driven platform development to an application-driven market. Facebook, Digg and Twitter have been successful in using NoSQL and scaling up their web infrastructure. Many successful attempts have been made in developing NoSQL applications in the fields of image/signal processing, biotechnology, and defense.
Traditional relational database system vendors are also assessing the strategy of developing NoSQL solutions and integrating them into their existing offerings.

Literature
In recent years, with the expansion of cloud computing, problems of data-intensive services have become prominent. Cloud computing seems to be the future architecture to support large-scale and data-intensive applications, although there are certain application requirements that cloud computing does not fulfill sufficiently [7]. For years, the development of information systems has relied on vertical scaling, but this approach requires a high level of skill and is not reliable in some cases. Partitioning the database across multiple cheap machines that are added dynamically, known as horizontal scaling or scaling out, can ensure scalability in a more effective and cheaper way. Today's NoSQL databases, designed for cheap hardware and using a shared-nothing architecture, can be a better solution.

The term NoSQL was coined by Carlo Strozzi in 1998 for his open-source, lightweight database, which had no SQL interface. Later, in 2009, Eric Evans, a Rackspace employee, reused the term for databases which are non-relational, distributed and do not conform to atomicity, consistency, isolation and durability. In the same year, NoSQL was discussed extensively at the nosql(east) conference held in Atlanta, USA. Eventually NoSQL saw unprecedented growth [1].

Scalable and distributed data management has been the vision of the database research community for more than three decades. Much research has focused on designing scalable systems for both update-intensive workloads and ad hoc analysis workloads [5]. Initial designs included distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Parallel databases grew to become large commercial systems, but distributed database systems were not very successful.
Changes in the data access patterns of applications and the need to scale out to thousands of commodity machines led to the birth of a new class of systems referred to as NoSQL databases, which are now being widely adopted by various enterprises.

Data processing has been viewed as a constant battle between parallelism and concurrency [4]. A database acts as a data store with an additional protective software layer which is constantly being bombarded by transactions. To handle all the transactions, databases have two choices at each stage in computation: parallelism, where two transactions are processed at the same time, and concurrency, where a processor switches rapidly between the two transactions in the middle of a transaction. Parallelism is faster, but to avoid inconsistencies in the results of the transactions, coordinating software is required, which is hard to operate in parallel as it involves frequent communication between the parallel threads of the two transactions. At a global level, it becomes a choice between distributed and scale-up single-system processing.

In certain instances, relational databases designed for scale-up systems and structured data did not work well. For indexing and serving massive amounts of rich text, for semi-structured or unstructured data, and for streaming media, a relational database would require consistency between data copies in a distributed environment and would not be able to parallelize the transactions. And so, to minimize costs and to maximize the parallelism of these types of transactions, we turned to NoSQL and other non-relational approaches.

These efforts combined open-source software, large numbers of small servers and loose consistency constraints on the distributed transactions (eventual consistency).
The basic idea was to minimize coordination by identifying types of transactions where it didn't matter if some users got old data rather than the latest data, or if some users got an answer while others didn't.

Technical Aspects
NoSQL is a non-relational database management system which differs from traditional relational database management systems in significant ways. NoSQL systems are designed for distributed data stores which require large-scale data storage, are schema-less and scale horizontally. Relational databases rely upon very structured rules to govern transactions. These rules are encoded in the ACID model, which requires that the database must always preserve atomicity, consistency, isolation and durability in each database transaction. NoSQL databases follow the BASE model, which provides three loose guidelines: basic availability, soft state and eventual consistency.

Two primary reasons to consider NoSQL are to handle data access with sizes and performance that demand a cluster, and to improve the productivity of application development by using a more convenient data interaction style [6]. The common characteristics of NoSQL are:
- Not using the relational model
- Running well on clusters
- Open-source
- Built for 21st century web estates
- Schema-less

Each NoSQL solution uses a different data model, which can be put in four widely used categories in the NoSQL ecosystem: key-value, document, column-family and graph. Of these, the first three share a common characteristic of their data models called aggregate orientation. Next we briefly describe each of these data models.

3.1 Document Oriented
The main concept of a document-oriented database is the notion of a document [3]. The database stores and retrieves documents which encapsulate and encode data in some standard formats or encodings like XML, JSON, BSON, and so on.
These documents are self-describing, hierarchical tree data structures. Document databases can offer different ways of organizing and grouping documents:
- Collections
- Tags
- Non-visible metadata
- Directory hierarchies

Documents are addressed with a unique key which represents the document. Also, beyond a simple key-document lookup, the database offers an API or query language that allows retrieval of documents based on their content.

Fig 1: Comparison of terminology between Oracle and MongoDB

3.1.1 Merits
- Intuitive data structure.
- Simple, natural modeling of requests with flexible query functions [2].
- Can act as a central data store for event storage, especially when the data captured by the events keeps changing.
- With no predefined schemas, they work well in content management systems or blogging platforms.
- Can store data for real-time analytics; since parts of the document can be updated, it is easy to store page views, and new metrics can be added without schema changes.
- Provides E-commerce applications with a flexible schema and the ability to evolve data models without expensive database refactoring or data migration [6].

3.1.2 Demerits
- Higher hardware demands because of more dynamic DB queries, in part without data preparation.
- Redundant storage of data (denormalization) in favor of higher performance [2].
- Not suitable for atomic cross-document operations.
- Since the data is saved as an aggregate, if the design of an aggregate is constantly changing, aggregates have to be saved at the lowest level of granularity. In this case, document databases may not work [6].

3.1.3 Case Study: MongoDB
MongoDB is an open-source document-oriented database system developed by 10gen. It stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. The language support includes Java, JavaScript, Python, PHP and Ruby, and it also supports sharding via configurable data fields.
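To make the document model of Section 3.1 more concrete, the following is a minimal sketch in plain Python of a collection of schema-free documents that can be fetched by key or searched by content. It is a toy under stated assumptions and does not use or imitate the MongoDB API: the names `collection`, `get_path` and `find`, and the sample order data, are all hypothetical.

```python
# Toy in-memory "collection": each document is a nested dict addressed by a unique key.
# Hypothetical data and helper names; this is not the MongoDB API.
collection = {
    "order-1": {"customer": {"name": "Ann", "city": "Oslo"}, "total": 40},
    "order-2": {"customer": {"name": "Bob", "city": "Rome"}, "total": 75},
    # Documents in one collection need not share the same fields.
    "order-3": {"customer": {"name": "Cy", "city": "Oslo"}, "coupon": "X1"},
}


def get_path(doc, path):
    """Follow a dotted path such as 'customer.city'; return None if it is absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(coll, path, value):
    """Return the sorted keys of all documents whose field at `path` equals `value`."""
    return sorted(key for key, doc in coll.items() if get_path(doc, path) == value)


# Simple key-document lookup.
by_key = collection["order-2"]

# Content-based retrieval: look inside the documents instead of fetching each by key.
oslo_orders = find(collection, "customer.city", "Oslo")
```

The point of the sketch is only the access pattern: a lookup by key returns the whole aggregate, while `find` inspects fields inside each document, which is the capability that separates a document store from a plain key-value store.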
Each MongoDB instance has multiple databases, and each database can have multiple collections [2, 6]. When a document is stored, we have to choose which database and collection this document belongs in. Consistency in a MongoDB database is configured by using replica sets and choosing to wait for the writes to be replicated to a given number of slaves. Transactions at the single-document level are atomic: a write either succeeds or fails. Transactions involving more than one operation are not possible, although there are a few exceptions. MongoDB implements replication, providing high availability using replica sets. In a replica set, there are two or more nodes participating in asynchronous master-slave replication. MongoDB has a query language which is expressed via JSON and has a variety of constructs that can be combined to create a MongoDB query. With MongoDB, we can query the data inside the document without having to retrieve the whole document by its key and then introspect the document. Scaling in MongoDB is achieved through sharding. In sharding, the data is split by a certain field and then moved to different Mongo nodes. The data is dynamically moved between nodes to ensure that shards are always balanced. We can add more nodes to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes [6, 9].

3.2 Key-value
A key-value store is a simple hash table, primarily used when all access to the database is via primary key. It allows schema-less storage of data for an application. The data could be stored in a data type of a programming language or an object. The following types exist: hierarchical key-value stores, eventually consistent key-value stores, hosted services, key-value caches in RAM, ordered key-value stores, multivalue databases, tuple stores and so on.

Key-value stores are the simplest NoSQL data stores to use from an API perspective.
The client can get or put the value for a key, or delete a key from the data store. The value is a blob that is just stored without knowing what is inside; it is the responsibility of the application to understand what is stored [3, 6].

3.2.1 Merits
- High and predictable performance.
- Simple data model.
- Clear separation of storage from application logic (because of the lack of a query language).
- Suitable for storing session information.
- User profiles, product profiles and preferences can be easily stored.
- Well suited for shopping cart data and other E-commerce applications.
- Can be scaled easily since they always use primary-key access.

3.2.2 Demerits
- Limited range of functions.
- High development effort for more complex applications.
- Not the best solution when relationships between different sets of data are required.
- Not suited for multi-operation transactions.
- There is no way to inspect the value on the database side.
- Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time.

3.2.3 Case Study: Azure Table Storage
For structured forms of storage, Windows Azure provides structured key-value pairs stored in entities known as Tables. Table storage uses a NoSQL model based on key-value pairs for querying structured data that is not in a typical database. A table is a bag of typed properties that represents an entity in the application domain. Data stored in Azure tables is partitioned horizontally and distributed across storage nodes for optimized access.

Every table has a property called the Partition Key, which defines how data in the table is partitioned across storage nodes; rows that have the same partition key are stored in a partition. In addition, tables can also define Row Keys, which are unique within a partition and optimize access to a row within a partition. When present, the pair (partition key, row key) uniquely identifies a row in a table.
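The get, put and delete operations and the partition key and row key addressing described above can be sketched in a few lines of Python. This is an illustration only, not the Azure Table service: the class `PartitionedTable`, its methods, and the rule of hashing the partition key to choose a storage node are assumptions made for the example.

```python
import hashlib


class PartitionedTable:
    """Toy table: rows live in partitions and are addressed by (partition key, row key)."""

    def __init__(self, node_count):
        # Each simulated storage node maps partition key -> {row key -> properties}.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, partition_key):
        # Assumed placement rule: hash the partition key to pick a node, so all
        # rows that share a partition key end up on the same node.
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def put(self, partition_key, row_key, properties):
        partition = self._node_for(partition_key).setdefault(partition_key, {})
        partition[row_key] = dict(properties)

    def get(self, partition_key, row_key):
        return self._node_for(partition_key).get(partition_key, {}).get(row_key)

    def delete(self, partition_key, row_key):
        self._node_for(partition_key).get(partition_key, {}).pop(row_key, None)

    def partition(self, partition_key):
        """Return a copy of every row stored under one partition key."""
        return dict(self._node_for(partition_key).get(partition_key, {}))


table = PartitionedTable(node_count=3)
table.put("customers-NO", "ann", {"city": "Oslo"})
table.put("customers-NO", "cy", {"city": "Bergen"})
table.put("customers-IT", "ann", {"city": "Rome"})  # same row key, different partition
```

Because the row key only has to be unique within its partition, `"ann"` can appear under both partition keys, and every operation touches exactly one node, which is what keeps primary-key access easy to scale.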
Access to the Table service is through REST APIs [6].

3.3 Column Store
Column-family databases store data in column families as rows that have many columns associated with a row key. These stores allow storing data with keys mapped to values, and values grouped into multiple column families, each column family being a map of data. Column families are groups of related data that is often accessed together.

The column-family model can be seen as a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. As well as accessing the row as a whole, operations also allow picking out a particular column [6].

3.3.1 Merits
- Designed for performance.
- Native support for persistent views towards key-value store.
- Sharding: distribution of data to various servers through hashing.
- More efficient than row-oriented systems during aggregation of a few columns from many rows.
- Column-family databases, with their ability to store any data structures, are great for storing event information.
- Allows storing blog entries with tags, categories, links, and trackbacks in different columns.
- Can be used to count and categorize visitors of a page in a web application to calculate analytics.
- Provides expiring columns: columns which, after a given time, are deleted automatically.
  This can be useful in providing demo access to users or showing ad banners on a website for a specific time.

3.3.2 Demerits
- Limited query options for data.
- High maintenance effort during changing of existing data because of updating all lists.
- Less efficient than row-oriented systems during access to many columns of a row.
- Not suitable for systems that require ACID transactions for reads and writes.
- Not good for early prototypes or initial tech spikes as the schema change required is very expensive.

3.3.3 Case Study: Cassandra
A column is the basic unit of storage in Cassandra. A Cassandra column consists of a name-value pair where the name behaves as the key. Each of these key-value pairs is a single column and is stored with a timestamp value which is used to expire data, resolve write conflicts, deal with stale data, and other things. A row is a collection of columns attached or linked to a key; a collection of similar rows makes a column family. Each column family can be compared to a container of rows in an RDBMS table where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not need to have the same columns, and columns can be added to any row at any time without having to add them to other rows.

By design Cassandra is highly available, since there is no master in the cluster and every node is a peer in the cluster. A write operation in Cassandra is considered successful once it is written to the commit log and an in-memory structure known as a memtable. While a node is down, the data that was supposed to be stored by that node is handed off to other nodes. As the node comes back online, the changes made to the data are handed back to the node. This technique, known as hinted handoff, allows for faster restore of failed nodes. In Cassandra, a write is atomic at the row level, which means inserting or updating columns for a given row key will be treated as a single write and will either succeed or fail.
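The two-level structure described above (row key, then column name, with a timestamp on every column) can be sketched as nested dictionaries. The sketch below is not Cassandra code: the class `ColumnFamily`, its methods and the newest-timestamp-wins rule used to resolve conflicting writes are simplifying assumptions, and the commit log, memtable, column expiry and hinted handoff are not modeled.

```python
class ColumnFamily:
    """Toy column family: row key -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, columns, timestamp):
        """Apply all columns of one write to a single row.

        A column is replaced only when the incoming timestamp is newer,
        so a stale write cannot overwrite fresher data.
        """
        row = self.rows.setdefault(row_key, {})
        for name, value in columns.items():
            current = row.get(name)
            if current is None or timestamp > current[1]:
                row[name] = (value, timestamp)

    def read(self, row_key):
        """Return the current column values of a row, without timestamps."""
        row = self.rows.get(row_key, {})
        return {name: value for name, (value, _) in row.items()}


users = ColumnFamily()
users.write("ann", {"city": "Oslo", "email": "ann@example.com"}, timestamp=1)
users.write("bob", {"city": "Rome"}, timestamp=1)    # rows need not share columns
users.write("ann", {"city": "Bergen"}, timestamp=3)  # newer write replaces "city"
users.write("ann", {"city": "Paris"}, timestamp=2)   # stale write is ignored
```

After these writes the row for `"ann"` keeps the value written at timestamp 3 for `city` and its original `email` column, while `"bob"` has only a `city` column, which mirrors the point that rows in one column family do not need identical columns.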
Cassandra has a query language that supports SQL-like commands, known as Cassandra Query Language (CQL) [2, 6]. We can use CQL commands to create a column family. Scaling in Cassandra is done by adding more nodes. As no single node is a master, when we add nodes to the cluster we are improving the capacity of the cluster to support more writes and reads. This allows for maximum uptime as the cluster keeps serving requests from the clients while new nodes are being added to the cluster.

3.4 Graph
Graph databases allow storing entities and relationships between these entities. Entities are also known as nodes, which have properties. Relations are known as edges, which can have properties. Edges have directional significance; nodes are organized by relationships, which allow finding interesting patterns between the nodes. The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.

Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don't only have a type, a start node, and an end node, but can have properties of their own. Using these properties on the relationships, we can add intelligence to the relationship: for example, since when did they become friends, what is the distance between the nodes, or what aspects are shared between the nodes.
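As a rough illustration of nodes and directed relationships that both carry properties, here is a small Python sketch. It is not Neo4j and does not use any graph query language: the class `Graph`, its methods and the `FRIEND` and `since` names are hypothetical and chosen only to match the friendship example in the text.

```python
class Graph:
    """Toy property graph: nodes and directed relationships both carry properties."""

    def __init__(self):
        self.nodes = {}          # node id -> properties
        self.relationships = []  # (start id, type, end id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, start, rel_type, end, **properties):
        # A relationship needs a start node and an end node that already exist.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both nodes must exist before they can be related")
        self.relationships.append((start, rel_type, end, properties))

    def outgoing(self, start, rel_type, predicate=lambda props: True):
        """End nodes reached from `start` via `rel_type` whose properties pass `predicate`."""
        return [end for s, t, end, props in self.relationships
                if s == start and t == rel_type and predicate(props)]


g = Graph()
g.add_node("ann", name="Ann")
g.add_node("bob", name="Bob")
g.add_node("cy", name="Cy")
g.relate("ann", "FRIEND", "bob", since=2005)
g.relate("ann", "FRIEND", "cy", since=2012)
g.relate("bob", "FRIEND", "ann", since=2005)  # direction matters: a separate edge

# Filter on a relationship property: Ann's friends since before 2010.
old_friends = g.outgoing("ann", "FRIEND", lambda props: props["since"] < 2010)
```

The filter runs on the relationship's own `since` property rather than on either node, which is the sense in which relationships are first-class citizens in this model.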
These properties on the relationships can be used to query the graph [2, 6].

3.4.1 Merits
- Very compact modeling of networked data.
- High performance efficiency.
- Can be deployed and used very effectively in social networking.
- Excellent choice for routing, dispatch and location-based services.
- As nodes and relationships are created in the system, they can be used to make recommendation engines.
- They can be used to search for patterns in relationships to detect fraud in transactions.

3.4.2 Demerits
- Not appropriate when an update is required on all or a subset of entities.
- Some databases may be unable to handle lots of data, especially in global graph operations (those involving the whole graph).
- Sharding is difficult as graph databases are not aggregate-oriented.

3.4.3 Case Study: Neo4j
Neo4j is an open-source graph database, implemented in Java. It is described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j is ACID compliant and easily embedded in individual applications.

In Neo4j, a graph is created by making two nodes and then establishing a relationship. Graph databases ensure consistency through transactions. They do not allow dangling relationships: the start node and end node always have to exist, and nodes can only be deleted if they don't have any relationships attached to them. Neo4j achieves high availability by providing for replicated slaves. Neo4j is supported by query languages such as Gremlin (a Groovy-based traversal language) and Cypher (a declarative graph query language) [6]. There are three ways to scale graph databases:
- Adding enough RAM to the server so that the working set of nodes and relationships is held entirely in memory.
- Improving the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master.
- Sharding the data from the application side using domain-specific knowledge.

Conclusions
NoSQL databases are still evolving, and a growing number of enterprises are moving from traditional relational database technology to non-relational databases. But given their limitations, they will never completely replace relational databases. The future of NoSQL lies in the use of various database tools in an application-oriented way and in their broader acceptance in specialized projects involving large, unstructured, distributed data with high requirements on scaling. On the other hand, NoSQL data stores will hardly compete with relational databases, which represent reliability and mature technology.

NoSQL databases leave a lot of work to the application designer. Application design is an important part of non-relational databases, which enables database designers to provide certain functionalities to the users. Hence a good understanding of the architecture of NoSQL systems is required. The need of the hour is to take advantage of the new trends emerging in the world of databases: the non-relational databases. An effective solution would be to combine the power of different database technologies to meet the requirements and maximize the performance.