Sunday, September 23, 2012

Database

A database is an organized collection of data. The data is typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies).
The term database is correctly applied to the data and their supporting data structures, and not to the database management system (DBMS). The database data collection with DBMS is called a database system.
The term database system implies that the data is managed to some level of quality (measured in terms of accuracy, availability, usability, and resilience) and this in turn often implies the use of a general-purpose database management system (DBMS).[1] A general-purpose DBMS is typically a complex software system that meets many usage requirements to properly maintain its databases which are often large and complex. The utilization of databases is now so widespread that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or even may have DBMS software embedded in it. Also, organizations and companies, from small to large, depend heavily on databases for their operations.


Well known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, Microsoft Access, PostgreSQL, MySQL, and SQLite. A database is not generally portable across different DBMS, but different DBMSs can inter-operate to some degree by using standards like SQL and ODBC together to support a single application built over more than one database. A DBMS also needs to provide effective run-time execution to properly support (e.g., in terms of performance, availability, and security) as many database end-users as needed.
A way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance.
The term database may be narrowed to specify particular aspects of organized collection of data and may refer to the logical database, to the physical database as data content in computer data storage or to many other database sub-definitions.

Contents

History

Database concept

The database concept has evolved since the 1960s to ease increasing difficulties in designing, building, and maintaining complex information systems (typically with many concurrent end-users, and with a large amount of diverse data). It has evolved together with database management systems which enable the effective handling of databases. Though the terms database and DBMS define different entities, they are inseparable: a database's properties are determined by its supporting DBMS. The Oxford English dictionary cites[citation needed] a 1962 technical report as the first to use the term "data-base." With the progress in technology in the areas of processors, computer memory, computer storage and computer networks, the sizes, capabilities, and performance of databases and their respective DBMSs have grown in orders of magnitudes. For decades it has been unlikely that a complex information system can be built effectively without a proper database supported by a DBMS. The utilization of databases is now spread to such a wide degree that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or even may have such embedded in it. Also, organizations and companies, from small to large, heavily depend on databases for their operations.
No widely accepted exact definition exists for DBMS. However, a system needs to provide considerable functionality to qualify as a DBMS. Accordingly its supported data collection needs to meet respective usability requirements (broadly defined by the requirements below) to qualify as a database. Thus, a database and its supporting DBMS are defined here by a set of general requirements listed below. Virtually all existing mature DBMS products meet these requirements to a great extent, while less mature either meet them or converge to meet them.

Evolution of database and DBMS technology

See also Database management system#History
The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing.
In the earliest database systems, efficiency was perhaps the primary concern, but it was already recognized that there were other important objectives. One of the key aims was to make the data independent of the logic of application programs, so that the same data could be made available to different applications.
The first generation of database systems were navigational,[2] applications typically accessed data by following pointers from one record to another. The two main data models at this time were the hierarchical model, epitomized by IBM's IMS system, and the Codasyl model (Network model), implemented in a number of products such as IDMS.
The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content, rather than by following links. This was considered necessary to allow the content of the database to evolve without constant rewriting of applications. Relational systems placed heavy demands on processing resources, and it was not until the mid 1980s that computing hardware became powerful enough to allow them to be widely deployed. By the early 1990s, however, relational systems were dominant for all large-scale data processing applications, and they remain dominant today (2012) except in niche areas. The dominant database language is the standard SQL for the Relational model, which has influenced database languages also for other data models.
Because the relational model emphasizes search rather than navigation, it does not make relationships between different entities explicit in the form of pointers, but represents them rather using primary keys and foreign keys. While this is a good basis for a query language, it is less well suited as a modeling language. For this reason a different model, the entity-relationship model which emerged shortly later (1976), gained popularity for database design.
In the period since the 1970s database technology has kept pace with the increasing resources becoming available from the computing platform: notably the rapid increase in the capacity and speed (and reduction in price) of disk storage, and the increasing capacity of main memory. This has enabled ever larger databases and higher throughputs to be achieved.
The rigidity of the relational model, in which all data is held in tables with a fixed structure of rows and columns, has increasingly been seen as a limitation when handling information that is richer or more varied in structure than the traditional 'ledger-book' data of corporate information systems: for example, document databases, engineering databases, multimedia databases, or databases used in the molecular sciences. Various attempts have been made to address this problem, many of them gathering under banners such as post-relational or NoSQL. Two developments of note are the object database and the XML database. The vendors of relational databases have fought off competition from these newer models by extending the capabilities of their own products to support a wider variety of data types.

General-purpose DBMS

A DBMS has evolved into a complex software system and its development typically requires thousands of person-years of development effort.[citation needed] Some general-purpose DBMSs, like Oracle, Microsoft SQL Server, and IBM DB2, have been undergoing upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many applications as possible, which typically makes them even more complex than special-purpose databases. However, the fact that they can be used "off the shelf", as well as their amortized cost over many applications and instances, makes them an attractive alternative (Vs. one-time development) whenever they meet an application's requirements.
Though attractive in many cases, a general-purpose DBMS is not always the optimal solution: When certain applications are pervasive with many operating instances, each with many users, a general-purpose DBMS may introduce unnecessary overhead and too large "footprint" (too large amount of unnecessary, unutilized software code). Such applications usually justify dedicated development. Typical examples are email systems, though they need to possess certain DBMS properties: email systems are built in a way that optimizes email messages handling and managing, and do not need significant portions of a general-purpose DBMS functionality.
Types of people involved
Three types of people are involved with a general-purpose DBMS:
  1. DBMS developers - These are the people that design and build the DBMS product, and the only ones who touch its code. They are typically the employees of a DBMS vendor (e.g., Oracle, IBM, Microsoft, Sybase), or, in the case of Open source DBMSs (e.g., MySQL), volunteers or people supported by interested companies and organizations. They are typically skilled systems programmers. DBMS development is a complicated task, and some of the popular DBMSs have been under development and enhancement (also to follow progress in technology) for decades.
  2. Application developers and database administrators - These are the people that design and build a database-based application that uses the DBMS. The latter group members design the needed database and maintain it. The first group members write the needed application programs which the application comprises. Both are well familiar with the DBMS product and use its user interfaces (as well as usually other tools) for their work. Sometimes the application itself is packaged and sold as a separate product, which may include the DBMS inside (see embedded database; subject to proper DBMS licensing), or sold separately as an add-on to the DBMS.
  3. Application's end-users (e.g., accountants, insurance people, medical doctors, etc.) - These people know the application and its end-user interfaces, but need not know nor understand the underlying DBMS. Thus, though they are the intended and main beneficiaries of a DBMS, they are only indirectly involved with it.

Database machines and appliances

In the 1970s and 1980s attempts were made to build database systems with integrated hardware and software. The underlying philosophy was that such integration would provide higher performance at lower cost. Examples were IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine. Another approach to hardware support for database management was ICL's CAFS accelerator, a hardware disk controller with programmable search capabilities. In the long term these efforts were generally unsuccessful because specialized database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus most database systems nowadays are software systems running on general-purpose hardware, using general-purpose computer data storage. However this idea is still pursued for certain applications by some companies like Netezza and Oracle (Exadata).

Database research

Database research has been an active and diverse area, with many specializations, carried out since the early days of dealing with the database concept in the 1960s. It has strong ties with database technology and DBMS products. Database research has taken place at research and development groups of companies (e.g., notably at IBM Research, who contributed technologies and ideas virtually to any DBMS existing today), research institutes, and academia. Research has been done both through theory and prototypes. The interaction between research and database related product development has been very productive to the database area, and many related key concepts and technologies emerged from it. Notable are the Relational and the Entity-relationship models, the atomic transaction concept and related Concurrency control techniques, Query languages and Query optimization methods, RAID, and more. Research has provided deep insight to virtually all aspects of databases, though not always has been pragmatic, effective (and cannot and should not always be: research is exploratory in nature, and not always leads to accepted or useful ideas). Ultimately market forces and real needs determine the selection of problem solutions and related technologies, also among those proposed by research. However, occasionally, not the best and most elegant solution wins (e.g., SQL). Along their history DBMSs and respective databases, to a great extent, have been the outcome of such research, while real product requirements and challenges triggered database research directions and sub-areas.
The database research area has several notable dedicated academic journals (e.g., ACM Transactions on Database Systems-TODS, Data and Knowledge Engineering-DKE, and more) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE, and more), as well as an active and quite heterogeneous (subject-wise) research community all over the world.

Database type examples

The following are examples of various database types. Some of them are not main-stream types, but most of them have received special attention (e.g., in research) due to end-user requirements. Some exist as specialized DBMS products, and some have their functionality types incorporated in existing general-purpose DBMSs. Though may differ in nature and functionality, these various types typically have to comply with the usability requirements below to comply as databases.
  • Active database
An active database is a database that includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering and authorization.
Most modern relational databases include active database features in the form of database trigger.
  • Cloud database
A Cloud database is a database that relies on cloud technology. Both the database and most of its DBMS reside remotely, "in the cloud," while its applications are both developed by programmers and later maintained and utilized by (application's) end-users through a web browser and Open APIs. More and more such database products are emerging, both of new vendors and by virtually all established database vendors.
  • Data warehouse
Data warehouses archive data from operational databases and often from external sources such as market research firms. Often operational data undergoes transformation on its way into the warehouse, getting summarized, anonymized, reclassified, etc. The warehouse becomes the central source of data for use by managers and other end-users who may not have access to operational data. For example, sales data might be aggregated to weekly totals and converted from internal product codes to use UPCs so that it can be compared with ACNielsen data. Some basic and essential components of data warehousing include retrieving, analyzing, and mining data, transforming,loading and managing data so as to make it available for further use.
Operations in a data warehouse are typically concerned with bulk data manipulation, and as such, it is unusual and inefficient to target individual rows for update, insert or delete. Bulk native loaders for input data and bulk SQL passes for aggregation are the norm.
  • Distributed database
The definition of a distributed database is broad, and may be utilized in different meanings. In general it typically refers to a modular DBMS architecture that allows distinct DBMS instances to cooperate as a single DBMS over processes, computers, and sites, while managing a single database distributed itself over multiple computers, and different sites.
Examples are databases of local work-groups and departments at regional offices, branch offices, manufacturing plants and other work sites. These databases can include both segments shared by multiple sites, and segments specific to one site and used only locally in that site.
  • Document-oriented database
A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented, or semi structured data, information. Document-oriented databases are one of the main categories of so-called NoSQL databases and the popularity of the term "document-oriented database" (or "document store") has grown with the use of the term NoSQL itself.
Utilized to conveniently store, manage, edit and retrieve documents.
  • Embedded database
An embedded database system is a DBMS which is tightly integrated with an application software that requires access to stored data in a way that the DBMS is “hidden” from the application’s end-user and requires little or no ongoing maintenance. It is actually a broad technology category that includes DBMSs with differing properties and target markets. The term "embedded database" can be confusing because only a small subset of embedded database products is used in real-time embedded systems such as telecommunications switches and consumer electronics devices.[3]
  • End-user database
These databases consist of data developed by individual end-users. Examples of these are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support such databases. Some of them are much simpler than full fledged DBMSs, with more elementary DBMS functionality (e.g., not supporting multiple concurrent end-users on a same database), with basic programming interfaces, and a relatively small "foot-print" (not much code to run as in "regular" general-purpose databases). However, also available general-purpose DBMSs can often be used for such purpose, if they provide basic user-interfaces for straightforward database applications (limited query and data display; no real programming needed), while still enjoying the database qualities and protections that these DBMSs can provide.
  • Federated database and multi-database
A federated database is an integrated database that comprises several distinct databases, each with its own DBMS. It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (which makes it a heterogeneous database), and provides them with an integrated conceptual view. The constituent databases are interconnected via computer network, and may be geographically decentralized.
Sometime the term multi-database is used as a synonym to federated database, though it may refer to a less integrated (e.g., without an FDBMS and a managed integrated schema) group of databases that cooperate in a single application. In this case typically middleware for distribution is used which typically includes an atomic commit protocol (ACP), e.g., the two-phase commit protocol, to allow distributed (global) transactions (vs. local transactions confined to a single DBMS) across the participating databases.
  • Graph database
A graph database is a kind of NoSQL database that uses graph structures with nodes, edges, and properties to represent and store information. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.
  • Hypermedia databases
The World Wide Web can be thought of as a database, albeit one spread across millions of independent computing systems. Web browsers "process" this data one page at a time, while web crawlers and other software provide the equivalent of database indexes to support search and other activities.
  • Hypertext database
In a Hypertext database, any word or a piece of text representing an object, e.g., another piece of text, an article, a picture, or a film, can be linked to that object. Hypertext databases are particularly useful for organizing large amounts of disparate information. For example they are useful for organizing online encyclopedias, where users can conveniently jump in the texts, in a controlled way, by using hyperlinks.
  • In-memory database
An in-memory database (IMDB; also main memory database or MMDB) is a database that primarily resides in main memory, but typically backed-up by non-volatile computer data storage. Main memory databases are faster than disk databases. Accessing data in memory reduces the I/O reading activity when, for example, querying the data. In applications where response time is critical, such as telecommunications network equipment, main memory databases are often used.[4]
  • Knowledge base
A knowledge base (abbreviated KB, kb or Δ[5][6]) is a special kind of database for knowledge management, providing the means for the computerized collection, organization, and retrieval of knowledge. Also a collection of data representing problems with their solutions and related experiences.
  • Operational database
These databases store detailed data about the operations of an organization. They are typically organized by subject matter, process relatively high volumes of updates using transactions. Essentially every major organization on earth uses such databases. Examples include customer databases that record contact, credit, and demographic information about a business' customers, personnel databases that hold information such as salary, benefits, skills data about employees, Enterprise resource planning that record details about product components, parts inventory, and financial databases that keep track of the organization's money, accounting and financial dealings.
  • Parallel database
A parallel database, run by a parallel DBMS, seeks to improve performance through parallelization for tasks such as loading data, building indexes and evaluating queries. Parallel databases improve processing and input/output speeds by using multiple central processing units (CPUs) (including multi-core processors) and storage in parallel. In parallel processing, many operations are performed simultaneously, as opposed to serial, sequential processing, where operations are performed with no time overlap.
The major parallel DBMS architectures (which are induced by the underlying hardware architecture are:
  • Shared memory architecture, where multiple processors share the main memory space, as well as other data storage.
  • Shared disk architecture, where each processing unit (typically consisting of multiple processors) has its own main memory, but all units share the other storage.
  • Shared nothing architecture, where each processing unit has its own main memory and other storage.
  • Real-time database
If a DBMS system responses users' request in a given time period, it can be regarded as a real time database.
  • Spatial database
A spatial database can store the data with multidimensional features. The queries on such data include location based queries, like "where is the closest hotel in my area".
  • Temporal database
A temporal database is a database with built-in time aspects, for example a temporal data model and a temporal version of Structured Query Language (SQL). More specifically the temporal aspects usually include valid-time and transaction-time.
  • Unstructured-data database
An unstructured-data database is intended to store in a manageable and protected way diverse objects that do not fit naturally and conveniently in common databases. It may include email messages, documents, journals, multimedia objects etc. The name may be misleading since some objects can be highly structured. However, the entire possible object collection does not fit into a predefined structured framework. Most established DBMSs now support unstructured data in various ways, and new dedicated DBMSs are emerging.

Major database usage requirements

The major purpose of a database is to provide the information system (in its broadest sense) that utilizes it with the information the system needs according to its own requirements. A certain broad set of requirements refines this general goal. These database requirements translate to requirements for the respective DBMS, to allow conveniently building a proper database for the given application. If this goal is met by a DBMS, then the designers and builders of the specific database can concentrate on the application's aspects, and not deal with building and maintaining the underlying DBMS. Also, since a DBMS is complex and expensive to build and maintain, it is not economical to build such a new tool (DBMS) for every application. Rather it is desired to provide a flexible tool for handling databases for as many as possible given applications, i.e., a general-purpose DBMS.

Functional requirements

Certain general functional requirements need to be met in conjunction with a database. They describe what is needed to be defined in a database for any specific application.

Defining the structure of data: Data modeling and Data definition languages

The database needs to be based on a data model that is sufficiently rich to describe in the database all the needed respective application's aspects. A data definition language exists to describe the databases within the data model. Such language is typically data model specific.

Manipulating the data: Data manipulation languages and Query languages

A database data model needs support by a sufficiently rich data manipulation language to allow all database manipulations and information generation (from the data) as needed by the respective application. Such language is typically data model specific.

Protecting the data: Setting database security types and levels

The DB needs built-in security means to protect its content (and users) from dangers of unauthorized users (either humans or programs). Protection is also provided from types of unintentional breach. Security types and levels should be defined by the database owners.

Describing processes that use the data: Workflow and Business process modeling

Manipulating database data often involves processes of several interdependent steps, at different times (e.g., when different people's interactions are involved; e.g., generating an insurance policy). Data manipulation languages are typically intended to describe what is needed in a single such step. Dealing with multiple steps typically requires writing quite complex programs. Most applications are programmed using common programming languages and software development tools. However the area of process description has evolved in the frameworks of workflow and business processes with supporting languages and software packages which considerably simplify the tasks. Traditionally these frameworks have been out of the scope of common DBMSs, but utilization of them has become common-place, and often they are provided as add-on's to DBMSs.

Operational requirements

Operational requirements are needed to be met by a database in order to effectively support an application when operational. Though it typically may be expected that operational requirements are automatically met by a DBMS, in fact it is not so in most of the cases: To be met substantial work of design and tuning is typically needed by database administrators. This is typically done by specific instructions/operations through special database user interfaces and tools, and thus may be viewed as secondary functional requirements (which are not less important than the primary).

Availability

A DB should maintain needed levels of availability, i.e., the DB needs to be available in a way that a user's action does not need to wait beyond a certain time range before starting executing upon the DB. Availability also relates to failure and recovery from it (see Recovery from failure and disaster below): Upon failure and during recovery normal availability changes, and special measures are needed to satisfy availability requirements.

Performance

Users' actions upon the DB should be executed within needed time ranges.

Isolation between users

When multiple users access the database concurrently the actions of a user should be uninterrupted and unaffected by actions of other users. These concurrent actions should maintain the DB's consistency (i.e., keep the DB from corruption).

Recovery from failure and disaster

All computer systems, including DBMSs, are prone to failures for many reasons (both software and hardware related). Failures typically corrupt the DB, typically to the extent that it is impossible to repair it without special measures. The DBMS should provide automatic recovery from failure procedures that repair the DB and return it to a well defined state.

Backup and restore

Sometimes it is desired to bring a database back to a previous state (for many reasons, e.g., cases when the database is found corrupted due to a software error, or if it has been updated with erroneous data). To achieve this a backup operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their embedding in database's data structures) is kept within dedicated backup files (many techniques exist to do this effectively). When this state is needed, i.e., when it is decided by a database administrator to bring the database back to this state (e.g., by specifying this state by a desired point in time when the database was in this state), these files are utilized to restore that state.

Data independence

Data independence pertains to a database's life cycle (see Database building, maintaining, and tuning below). It strongly impacts the convenience and cost of maintaining an application and its database, and has been the major motivation for the emergence and success of the Relational model, as well as the convergence to a common database architecture. In general the term "data independence" means that changes in the database's structure do not require changes in its application's computer programs, and that changes in the database at a certain architectural level (see below) do not affect the database's levels above. Data independence is achieved to a great extent in contemporary DBMS, but it is not completely attainable, and achieved at different degrees for different types of database structural changes.

Major database functional areas

The functional areas are domains and subjects that have evolved in order to provide proper answers and solutions to the functional requirements above.

Data models

A data model is an abstract structure that provides the means to effectively describe specific data structures needed to model an application. As such a data model needs sufficient expressive power to capture the needed aspects of applications. These applications are often typical to commercial companies and other organizations (like manufacturing, human-resources, stock, banking, etc.). For effective utilization and handling it is desired that a data model is relatively simple and intuitive. This may be in conflict with high expressive power needed to deal with certain complex applications. Thus any popular general-purpose data model usually well balances between being intuitive and relatively simple, and very complex with high expressive power. The application's semantics is usually not explicitly expressed in the model, but rather implicit (and detailed by documentation external to the model) and hinted to by data item types' names (e.g., "part-number") and their connections (as expressed by generic data structure types provided by each specific model).

Early data models

These models were popular in the 1960s, 1970s, but nowadays can be found primarily in old legacy systems. They are characterized primarily by being navigational with strong connections between their logical and physical representations, and deficiencies in data independence.
Hierarchical model
In the Hierarchical model different record types (representing real-world entities) are embedded in a predefined hierarchical (tree-like) structure. This hierarchy is used as the physical order of records in storage. Record access is done by navigating through the data structure using pointers combined with sequential accessing.
This model has been supported primarily by the IBM IMS DBMS, one of the earliest DBMSs. Various limitations of the model have been compensated at later IMS versions by additional logical hierarchies imposed on the base physical hierarchy.
Network model
In this model a hierarchical relationship between two record types (representing real-world entities) is established by the set construct. A set consists of circular linked lists where one record type, the set owner or parent, appears once in each circle, and a second record type, the subordinate or child, may appear multiple times in each circle. In this way a hierarchy may be established between any two record types, e.g., type A is the owner of B. At the same time another set may be defined where B is the owner of A. Thus all the sets comprise a general directed graph (ownership defines a direction), or network construct. Access to records is either sequential (usually in each record type) or by navigation in the circular linked lists.
This model is more general and powerful than the hierarchical, and has been the most popular before being replaced by the Relational model. It has been standardized by CODASYL. Popular DBMS products that utilized it were Cincom Systems' Total and Cullinet's IDMS. IDMS gained a considerable customer base and exists and supported until today. In the 1980s it has adopted the Relational model and SQL in addition to its original tools and languages.
Inverted file model
An inverted file or inverted index of a first file, by a field in this file (the inversion field), is a second file in which this field is the key. A record in the second file includes a key and pointers to records in the first file where the inversion field has the value of the key. This is also the logical structure of contemporary database indexes. The related Inverted file data model utilizes inverted files of primary database files to efficiently directly access needed records in these files.
Notable for using this data model is the ADABAS DBMS of Software AG, introduced in 1970. ADABAS has gained considerable customer base and exists and supported until today. In the 1980s it has adopted the Relational model and SQL in addition to its original tools and languages.

Relational model

The relational model is a simple model that provides flexibility. It organizes data based on two-dimensional arrays known as relations, or tables as related to databases. These relations consist of a heading and a set of zero or more tuples in arbitrary order. The heading is an unordered set of zero or more attributes, or columns of the table. The tuples are a set of unique attributes mapped to values, or the rows of data in the table. Data can be associated across multiple tables with a key. A key is a single, or set of multiple, attribute(s) that is common to both tables. The most common language associated with the relational model is the Structured Query Language (SQL), though it differs in some places.

0 comments:

Post a Comment

FB friends

 
Design by Free WordPress Themes | Bloggerized by Lasantha - Premium Blogger Themes | cheap international calls