Databases and SQL

Relational databases

A database can be defined as a structured collection of records or data that is stored in a computer so that a program can consult it to answer queries. The computer program used to manage and query a database is known as a database management system (DBMS). The central concept of a database is that of a collection of records, or pieces of data. Typically, for a given database, there is a structural description of the type of facts held in that database: this description is known as a schema. The schema describes the objects that are represented in the database, and the relationships among them. There are a number of different ways of organizing a schema, that is, of modeling the database structure: these are known as database models (or data models). The model in most common use today is the relational model, which in layman's terms represents all information in the form of multiple related tables each consisting of rows and columns. This model represents relationships by the use of values common to more than one table.

Three key terms are used extensively in relational database models: entities, attributes, and domains. A entity is a table with columns and rows. The named columns of the relation are called attributes, and the domain is the set of values the attributes are allowed to take.

There are two commonly used systems of diagramming to aid in the visual representation of the relational model: the entity-relationship diagram (ERD), and the related IDEF diagram used in the IDEF1X method. The end-product of the Entity Relational Modelling (ERM) process is an entity-relationship diagram or ERD. Data modeling requires a graphical notation for representing such data models. An ERD is a type of conceptual data model or semantic data model.

Figure 6.14. Entity-relationship Model

Entity-relationship Model

The relationship between entities are classified as one-to-one, one-to-many or many-to-many. The one-to-many is relationship is the ideal for building a well structured database. Many-to-many relationships should be avoided since they lead to data redundancies. Many-to-many relationships should be avoided by introducing a third entity and converting the many-to-many relationship to two one-to-many relationships.

The basic data structure of the relational model is the table, where information about a particular entity (say, an employee) is represented in columns and rows (also called tuples). Thus, the "relation" in "relational database" refers to the various tables in the database; a relation is a set of tuples. The columns enumerate the various attributes of the entity (the employee's name, address or phone number, for example), and a row is an actual instance of the entity (a specific employee) that is represented by the relation. As a result, each tuple of the employee table represents various attributes of a single employee.

All relations (and, thus, tables) in a relational database have to adhere to some basic rules to qualify as relations. First, the ordering of columns is immaterial in a table. Second, there can't be identical tuples or rows in a table. And third, each tuple will contain a single value for each of its attributes.

Tables also have a designated single attribute or a set of attributes that can act as a "key", which can be used to uniquely identify each tuple in the table. A key that can be used to uniquely identify a row in a table is called a primary key. Keys are commonly used to join or combine data from two or more tables. For example, an Employee table may contain a column named Location which contains a value that matches the key of a Location table. Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. Any column can be a key, or multiple columns can be grouped together into a compound key.

Users (or programs) request data from a relational database by sending it a query that is written in a special language, usually a dialect of SQL. Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface. In response to a query, the database returns a result set, which is just a list of rows containing the answers. The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. Often, data from multiple tables are combined into one, by doing a join. There are a number of relational operations in addition to join.

Figure 6.15. Example database

Example database
Example database

In the data set depicted in Figure 6.14, “Entity-relationship Model” there are six entities: Customer, Order, Order Line, Invoice, Invoice Line and Product. The bold, underlined attributes are primary keys. The non-bold, underlined attributes are foreign keys. If we attempted to insert a new customer with the ID 1234567890, this would violate the relational integrity since Customer ID is a primary key and we already have a customer 1234567890. The DBMS must reject a transaction such as this that would render the database inconsistent by a violation of an integrity constraint. Foreign keys are integrity constraints enforcing that the value of the attribute set is drawn from a primary key in another relation, for example in the Order relation the attribute Customer ID is a foreign key. A join is the operation that draws on information from several relations at once. By joining tables from the example above we could query the database for all of the Customers, Orders, and Invoices. If we only wanted the tuples for a specific customer, we would specify this using a restriction condition. If we wanted to retrieve all of the Orders for Customer 1234567890, we could query the database to return every row in the Order table with Customer ID 1234567890 and join the Order table to the Order Line table based on Order No.

There is a design flaw in the database design above. The Invoice entity contains an Order No attribute. Each tuple in the Invoice table will have one Order No, which implies that there is precisely one Order for each Invoice. But in reality an invoice can be created against many orders, or indeed for no particular order. Additionally the Order entity contains an Invoice No attribute, implying that each Order has a corresponding Invoice. An order is sometimes paid through several invoices, and sometimes paid without an invoice. There can be many Invoices per Order and many Orders per Invoice. This is a many-to-many relationship between Order and Invoice. Many-to-many relationships in database design should be avoided because they cause data redundancies. The design flaw can be rectified by representing this relationship in the database as a new table whose role is to specify the correspondence between Orders and Invoices:

Now, the Order entity has a one-to-many relationship to the OrderInvoice entity, as does the Invoice entity. If we want to retrieve every Invoice for a particular Order, we can query for all orders where Order No in the Order relation equals the Order No in OrderInvoice, and where Invoice No in OrderInvoice equals the Invoice No in Invoice.

Database normalization is usually performed when designing a relational database, to improve the logical consistency of the database design and the transactional performance. Database normalization is a design technique for structuring relational database tables. Tables can be normalized to a greater or lesser degree. Database theory describes a table's degree of normalization in terms of normal forms. The most common normal forms, from least normalized to most normalized, are as follows:


* First normal form (1NF)
* Second normal form (2NF)
* Third normal form (3NF)

A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies involving data operations. Most 3NF tables are free of update, insertion, and deletion anomalies.

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.

A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.

Database Management Systems

A database management system (DBMS) is computer software designed for the purpose of managing databases. Typical examples of DBMSs include Oracle, DB2, Microsoft SQL Server, Postgres, MySQL, 4th Dimension, FileMaker, Microsoft Access and SQLite. DBMSs are typically used by Database administrators in the creation of Database systems.

A DBMS is a complex set of software programs that controls the organization, storage and retrieval of data in a database. A DBMS includes:

  • A modeling language to define the schema (relational model) of each database hosted in the DBMS, according to the DBMS data model.

  • Data structures (fields, records and files) optimized to deal with very large amounts of data stored on a permanent data storage device (which implies slow access compared to volatile main memory).

  • A database query language to allow users to interactively interrogate the database, analyse its data and update it.

  • A transaction mechanism, that ideally would guarantee the ACID properties, in order to ensure data integrity, despite concurrent user accesses (concurrency control), and faults (fault tolerance). ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one account to another, even though it might consist of multiple individual operations. The DBMS can maintain the integrity of the database by not allowing more than one user to update the same record at the same time.

  • Advanced database management services including automated backups, replication (database mirroring), and redundancy mechanisms such as failover clustering.

Database servers are specially designed computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with RAID disk arrays used for stable storage. DBMS's are found at the heart of most database applications. Sometimes DBMSs are built around a private multitasking kernel with built-in networking support although nowadays these functions are left to the operating system.

Structured Query Language

SQL (Structured Query Language) is a computer language used to create, retrieve, update and delete data from relational database management systems. SQL has been standardized by both ANSI and ISO. Major database products (or projects)implementing SQL include Oracle DBMS, DB2, MySQL, PostgreSQL and Microsoft SQL Server.

SQL was adopted as a standard by ANSI (American National Standards Institute) in 1986 and ISO (International Organization for Standardization) in 1987. However, since the dissolution of the NIST data management standards program in 1996 there has been no certification for compliance with the SQL standard so vendors must be relied on to self-certify. The most recent revision of the standard SQL:2006 also introduces ISO/IEC 9075-14:2006 which defines ways in which SQL can be used in conjunction with XML. It defines ways of importing and storing XML data in an SQL database, manipulating it within the database and publishing both XML and conventional SQL-data in XML form. In addition, it provides facilities that permit applications to integrate into their SQL code the use of XQuery, the XML Query Language published by the World Wide Web Consortium (W3C), to concurrently access ordinary SQL-data and XML documents.

SQL is designed for a specific purpose: to query data contained in a relational database. SQL is a set-based, declarative programming language, not an imperative language such as C or BASIC. Language extensions such as Oracle Corporation's PL/SQL bridge this gap to some extent by adding procedural elements, such as flow-of-control constructs. Another approach is to allow programming language code to be embedded in and interact with the database. For example, Oracle and others include Java in the database, and SQL Server 2005 allows any .NET language to be hosted within the database server process, while PostgreSQL allows functions to be written in a wide variety of languages, including Perl, Tcl, and C. Extensions to and variations of the standards exist.

Commercial implementations commonly omit support for basic features of the standard, such as the DATE or TIME data types. SQL code can rarely be ported between database systems without major modifications. PL/SQL, IBM's SQL PL (SQL Procedural Language) and Sybase / Microsoft's Transact-SQL are of a proprietary nature because the procedural programming language they present are non-standardized.

The most common operation in SQL databases is the query, denoted with the SELECT keyword. SQL SELECT queries are declarative. SELECT retrieves data from tables in a database often grouped with data manipulation Language statements. A SQL query includes a list of columns to be included in the final result immediately following the SELECT keyword. An asterisk ("*") can also be used as a "wildcard" indicator to specify that all available columns of a table (or multiple tables) are to be returned. SELECT is the most complex statement in SQL, with several optional keywords and clauses:

  • FROM clause

    The FROM clause indicates the source tables from which the data is to be drawn. The FROM clause can include optional JOIN clauses to join related tables to one another.

  • WHERE clause

    The WHERE clause includes a comparison predicate, which is used to narrow the result set. The WHERE clause eliminates all rows from the result set for which the comparison predicate does not evaluate to True.

  • GROUP BY clause

    The GROUP BY clause is used to combine rows with related values into elements of a smaller set of rows.

  • HAVING clause

    The HAVING clause is used to identify which of the "combined rows" (combined rows are produced when the query has a GROUP BY clause or when the SELECT part contains aggregates), are to be retrieved. HAVING acts much like a WHERE, but it operates on the results of the GROUP BYand can include aggregate functions.

  • ORDER BY clause

    The ORDER BY clause is used to identify which columns are used to sort the resulting data. Unless an ORDER BY clause is included, the order of rows returned by SELECT is never guaranteed.

Figure 6.16. SQL data query statements

SQL data query statements

Standard Data Manipulation Language (DML) elements is the subset of the language used to add, update and delete data:

  • INSERT

    INSERT is used to add rows of data to an existing table.

  • UPDATE

    UPDATE is used to modify the values of a set of existing table rows.

  • MERGE

    MERGE is used to combine the data of multiple tables. It is something of a combination of the INSERT and UPDATE elements.

  • DELETE

    DELETE removes zero or more existing rows from a table.

Figure 6.17. SQL data modification statements

SQL data modification statements

Transactions (not available in all DBMSs) can be used to wrap around the data management operations:

  • BEGIN

    BEGIN WORK (or START TRANSACTION, depending on SQL dialect) can be used to mark the start of a database transaction, which either completes completely or not at all.

  • COMMIT

    COMMIT causes all data changes in a transaction to be made permanent.

  • ROLLBACK

    ROLLBACK causes all data changes since the last COMMIT or ROLLBACK to be discarded, so that the state of the data is "rolled back" to the way it was prior to those changes being requested.

COMMIT and ROLLBACK interact with areas such as transaction control and locking. Strictly, both terminate any open transaction and release any locks held on data.

Figure 6.18. SQL transaction statements

SQL transaction statements

The second group of keywords is the Data Definition Language (DDL). DDL allows the user to define new tables and associated elements. Most commercial SQL databases have proprietary extensions in their DDL, which allow control over nonstandard features of the database system. The most basic items of DDL are the CREATE,ALTER,RENAME,TRUNCATE and DROP commands:

  • CREATE

    CREATE causes an object (a table, for example) to be created within the database.

  • DROP

    DROP causes an existing object within the database to be deleted, usually irretrievably.

  • TRUNCATE

    TRUNCATE deletes all data from a table (non-standard, but common SQL command).

  • ALTER

    ALTER command permits the user to modify an existing object in various ways -- for example, adding a column to an existing table.

Figure 6.19. SQL table definition statements

SQL table definition statements

Open Database Connectivity (ODBC) provides a standard software API method for using database management systems (DBMS). The designers of ODBC aimed to make it independent of programming languages, database systems, and operating systems. The ODBC specification offers a procedural API for using SQL queries to access data. An implementation of ODBC will contain one or more applications, a core ODBC library, and one or more "database drivers". The core library, independent of the applications and DBMS systems, acts as an "interpreter" between the applications and the database drivers, whereas the database drivers contain the DBMS-specific details. Thus a programmer can write applications that use standard types and features without concern for the specifics of each DBMS that the applications may encounter. Likewise, database driver implementors need only know how to attach to the core library. This makes ODBC modular.

ODBC provides a standard of ubiquitous data access because hundreds of ODBC drivers exist for a large variety of data sources. ODBC operates with a variety of operating systems and drivers exist for non-relational data such as spreadsheets, text and XML files. Because ODBC dates back more than ten years, it offers connectivity to a wider variety of data sources than other data-access APIs. More drivers exist for ODBC than drivers or providers exist for newer APIs such as OLE DB, JDBC, and ADO.NET.

Despite the benefits of ubiquitous connectivity and platform-independence, ODBC has certain drawbacks. Administering a large number of client machines can involve a diversity of drivers and DLLs. This complexity can increase system administration overhead. Large organizations with thousands of PCs have often turned to ODBC server technology to simplify the administration problem. Differences between drivers and driver maturity can also raise important issues. Newer ODBC drivers do not always have the stability of drivers already deployed for years. Years of testing and deployment mean a driver may contain fewer bugs.

Developers needing features or types not accessible with ODBC can use other SQL APIs. When not aiming for platform-independence, developers can use proprietary APIs. If developers need to produce portable, platform-independent, albeit language specific code, they can use the JDBC API (Java API).

Microsoft SQL Server uses a variant of SQL called T-SQL, or Transact-SQL, an implementation of SQL-92 (the ISO standard for SQL, certified in 1992) with many extensions. T-SQL mainly adds additional syntax for use in stored procedures, and affects the syntax of transaction support. (Note that SQL standards require Atomic, Consistent, Isolated, Durable or "ACID" transactions.) Microsoft SQL Server communicate over networks using an application-level protocol called Tabular Data Stream (TDS). The TDS protocol has also been implemented by the FreeTDS project in order to allow more kinds of client applications to communicate with Microsoft SQL Server and Sybase databases. Microsoft SQL Server also supports Open Database Connectivity (ODBC). The latest release SQL Server 2005 also supports the ability to deliver client connectivity via the Web Services SOAP protocol. This allows non-Windows Clients to communicate cross platform with SQL Server. Microsoft has also released a certified JDBC driver to let Java Applications like BEA and IBM WebSphere communicate with Microsoft SQL Server 2000 and 2005.

Microsoft and various other vendors provide a number of software development tools designed to allow business applications to be developed using the data stored by Microsoft SQL Server. Microsoft SQL Server 2005 now includes the common language runtime (CLR) component for Microsoft .NET. Applications developed with .NET languages such as Visual Basic can implement stored procedures and other functions. Older versions of Microsoft development tools typically use APIs to access Microsoft SQL Server functionality. Rapid application development tools incorporate native database gateways for high speed database access and automatic table drill-down for the creation of quick prototype applications for viewing, editing and adding data to any table in the database.