---[REF]---"style":"heading"---[REF]--- ---[SIDEBAR]--- "type":"objectives", "title":"Databases and SQL", "text":"After completing this lesson, the successful student will be able to:
  1. Describe the insertion, update, and deletion anomalies.
  2. Explain basic normalization including first, second, and third normal forms.
  3. Describe how normalizing a database reduces the likelihood of insertion, update, and deletion anomalies.
" ---[SIDEBAR]--- ---[REF]---"style":"link","pageId":"lesson-designing-tables"---[REF]--- discussed the process of designing database tables. This lesson extends that conversation to describe the process of normalization. Normalization means to design database tables in such a way to reduce or eliminate errors in the data stored in the tables. First, the lesson will describe three common types of errors. It will then discuss normalization and how normalization can reduce errors in data. ---[SIDEBAR]--- "type":"term", "title":"Normalization", "text":"Organizing database tables using standard rules to reduce or eliminate errors in the data stored in the database." ---[SIDEBAR]--- ## Insertion, Update, and Deletion Anomalies You learned in ---[REF]---"style":"link","pageId":"lesson-designing-tables"---[REF]--- that data are stored in separate tables according to the theme of the data being stored. In the museum database used throughout this resource data able artists, works of art, and museums are all stored on separate tables. This is actually the first step in organizing data to reduce errors. Suppose that all of the data known about artists, works of art, and museums were stored in a single table instead. ---[museum-data-one-table]--- shows what this table might look like with a sample of the museum data currently stored in the database. At first, this may seem like a reasonable and efficient way to store the data. At a minimum, it is easy to see how the artist, works of art, and museums are related. On a single row, you can easily identify the artist and the exhibiting museum for each work of art. ---[FIGURE]--- "id":"museum-data-one-table", "title":"Sample Museum Data in a Single Table", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi64J3uK2QUaweD4W7W9JdSNINgGhyphenhyphensGmoLpV73zft-QaNm5cnJMA-kBERmD-Nt9w55YCot4VsJSIFt3BZbImLgxXPvkjmeSBQzpZGEMIR2H4sirgIFPf8jwnv85trNIU9N7YIQaWTGLFMsx2ZwWMIFYuVDk1n7el5bbAFBhjiTTSF95ey0fc0HUgfqUJUi/s16000/non_normal_data.png", "altText": "A sample of the museum data stored in a single table instead of separate tables for artists, works of art, and museums.", "imgHandling": "scale" ---[FIGURE]--- One of the main problems with the single table approach is that it often leads to errors or inconsistencies in the data. Three common types of errors, or anomalies, that occur when a database is not designed correctly are insertion anomalies, update anomalies, and deletion anomalies. We will describe each of them. ---[SIDEBAR]--- "type":"term", "title":"Anomaly", "text":"An error or inconsistency that occurs in the data stored in a database." ---[SIDEBAR]--- ### Insertion Anomalies An insertion anomaly refers to an error or inconsistency in data that happens when new data is added to a database. There are at least two ways that these anomalies happen. 1. The inserted data is incomplete. 2. The inserted data is incorrect. ---[SIDEBAR]--- "type":"term", "title":"Insertion Anomaly", "text":"An error or inconsistency that occurs in data when it is added to a database." ---[SIDEBAR]--- Incomplete data can be problematic, particularly if the missing data is never added to the database. So problematic, in fact, that the database will likely be set to not allow rows to be added to a table if values for important columns are missing. This creates a dilemma as well if a needed row cannot be added because of missing data. In either case, storing all data in a single table increases the likelihood of this type of insertion anomaly. Consider that you need to add the artist Leonardo Da Vinci to the data, but you don't have details about his works of art or the museums that exhibit his works of art. Adding just the data able Da Vinci to the table creates a number of missing or NULL values (see ---[museum-data-insert-anomaly]---). If the database allows these NULL values, there will be a lot of missing values on the table. If, on the other hand, important columns (such as work_name or museum_name) as set to not allow NULL values, the user will not be able to add Da Vinci to the database. Either result is not optimal. ---[FIGURE]--- "id":"museum-data-insert-anomaly", "title":"Example Insertion Anomaly", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisEhArzux7A8I_R9HZL15niSSmzJrLVR1kI28wY3NQz5kISK7Xj59Ogy1hmYBwas4eKXFY9Hd822bvNVCNuJXDpedENuztiPYSODylZNI2mKVIDk4HP6k3d6JMXvUUme4Zv-NQYPaH1_XZisuCzm5yJxrfvK9sgsWH_L9wxc_KaKzVzvoBEvkL1XpZ6vUO/s1057/InsertionAnomaly.png", "altText": "An example of an insertion anomaly that results in a number of NULL values.", "imgHandling": "scale" ---[FIGURE]--- Insertion anomalies also occur when data is inserted incorrectly. A common example of this is a misspelled word. There is always the possiblity that a user can incorrectly add a value to a database. However, storing all data in a single table increases the likelihood of this type of insertion anomaly as well. Consider the number of times that the names of the individual museums appear in this sample data. The Metropolitan Museum of Art appears in six of the 25 rows in the table (see ---[museum-data-insert-anomaly-the-met]---). ---[FIGURE]--- "id":"museum-data-insert-anomaly-the-met", "title":"Repeated Values", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0BtNuTMicrC-RdbBPAoYXQIF4gTNA38aFr2RUoO8jJEP20_FAxXYkDbM8HSW4KEoMW8CHzchtxToqjttpneqIaMR_HdpqJdmDVc8Ai59Nh5Zy-Te_lm4jZMWbPDdpFQ6_Zd4FlsD8xUXJ9bsRGW-KVdNvZzCNjtnrZz4QDB0DkkiE9_ACEMWlahl6yJJb/s425/UpdateAnomaly.png", "altText": "An example of a value repeatedly appearing in a database table." ---[FIGURE]--- Each time a value is added to the table, there is a small possiblity that the value will be inserted incorrectly. The more times that a value is added, the more likely it is that the value will be inserted incorrectly, particularly for a value that contains a lot of words or numbers. Inserted the same value repeatedly in a database unecessarily increases the risk that it will be entered incorrectly at least once. Ideally, a value will be entered only once in the database. This will minimize the possiblity that it will be entered incorrectly. ### Update Anomalies Update anomalies occur when errors in data result from changing the existing values of a table. This most likely happens when data are incorrectly changed. As with insertion anomalies, it is impossible to eliminate the risk that a values will be incorrectly changed, but storing all data in a single table does increase the likelihood of update anomalies. ---[SIDEBAR]--- "type":"term", "title":"Update Anomaly", "text":"An error or inconsistency that occurs in data when it is changed in a database." ---[SIDEBAR]--- Recall that The Metropolitan Museum of Art appears in six of the 25 rows in the data in ---[museum-data-insert-anomaly-the-met]---. Suppose you would like to update the name of The Metropolitan Museum of Art to the shorter and more commonly used name "The Met". Given that it appears six times in this data, it must be changed correctly in all six places. As the number of places that a value must be changed increases, the likelihood that it will be changed incorrectly at least once also increases. ### Deletion Anomalies The last error discussed in this lesson is a deletion anomaly. A deletion anomaly is an error in the data that occurs when a row is deleted from the database. This results in data being removed from the database that was not intended. ---[SIDEBAR]--- "type":"term", "title":"Deletion Anomaly", "text":"An error or inconsistency that occurs in data when data is removed from a database." ---[SIDEBAR]--- As with the other anomalies discussed in this lesson, deletion anomalies are much more likely to occur if the database is not organized correctly. If all of the data are stored on a single table, deletion anomalies are very likely to occur. Assume you want to remove work named "Women on a Cafe Terrace in the Evening" (see ---[museum-data-delete-anomaly]---). ---[FIGURE]--- "id":"museum-data-delete-anomaly", "title":"Deletion Anomaly", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnMibOD210OOL17CQhy-8GVemyA35M5fFWepSrKdrCO1krB57c9ie_W_ciBBTx5JC9LmGHeGT0pxZ5PqUMr5sEpfWWBScy_T1s_6PjZZvaCKmoovYcNQLPZ5iOJinhpXV78g4KJKtyJaRozWfwk3yNELsINptKhyBxRPvBNxZhS3Q_w96epdtjASQi-87_/s2108/DeletionAnomaly.png", "altText": "An example of a deletion anomaly.", "imgHandling": "scale" ---[FIGURE]--- Notice that this particular work of art is the only work exhibited in Musee d'Orsay. It is also the only work of art by Edgar Degas. If you were to delete the "Women on a Cafe Terrace in the Evening" all of the information about Musee d'Orsay and Edgar Degas will also be deleted. If later you want to add a different work of art by Degas or exhibited in Musee d'Orsay, you would have lost the information about Degas and Musee d'Orsay. ### Organizing Data to Reduce Anomalies Insertion, update, and delete anomalies can be reduced or even eliminated if the data are properly organized. The first step of this process is to store data in separate tables according to themes. In this case, you would store the data about artists, works of art, and museums on separate tables. These tables are connected using primary and foreign keys. ---[museum-data-normalized]--- shows this data organized in separate tables. ---[FIGURE]--- "id":"museum-data-normalized", "title":"Data Organized to Reduce Anomalies", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSRuLkqIfXC9B8p_xL7WpqhQYuFwigIK9PNbnog1pFl_VelfxMSbWChGsqsFi7VaUjSV3ZG7ebXRHgzJBHm6LEoLW-vzuuzryKZWEiaI1TEILZRlYhfu8QnMkY6dJHqi0IGZr3T9ummVTUtwo8bdyb_PHB7Jnsj4DAk4qZ_G_W9SyqEiiPXWyTToBBGuBw/s16000/NormalizedData.png", "altText": "Data about artists, works or art, and museums organized to reduce anomalies.", "imgHandling": "scale" ---[FIGURE]--- With the data organized, it is time to try our example tasks again. 1. Add the artist Leonardo Da Vinci to the database. When the data was stored on a single table, inserting Da Vinci without work or museum data resulted in a number of NULL values. Notice that adding Da Vinci to the artist table does not create NULL values on the other two tables. If one of Da Vinci's works of art is added to the database, it will be added to the work table and linked to Da Vinci by adding the work of art with Da Vinci's artist_id as a foreign key. 2. Update the name of The Metropolitan Museum of Art to "The Met". When the data were stored in a single table, the name must be updated six times. With the museum data stored on a separate table, The Metropolitan Museum of Art would only need to be updated once. 3. Delete the artwork named, "Women on a Cafe Terrace in the Evening". With the data organized on a single table, deleting this work of art resulted in the deletion of all of the data about Edgar Degas and Musee d'Orsay at the same time. With the data organized into separate tables, deleting "Women on a Cafe Terrace in the Evening" would leave the data on Musee d'Orsay and Edgar Degas untouched in the museum and artist tables. ## Database Normalization Organizing data properly involves following specific rules. The process of organizing data according to these rules is called normalization. When a database follows these rules, it is said to be normalized. ---[SIDEBAR]--- "type":"term", "title":"Normalization", "text":"Organizing database tables using standard rules to reduce or eliminate errors in the data stored in the database." ---[SIDEBAR]--- Normalization is a staged process. This means that data are organized to comply with the normalization rules one at a time. When a database complies with the first rule, it is said to be in first normal form. Once it conforms with the second normalization rule, it is second normal form. Since this is a staged process, you would not apply the second normalization rule without first ensuring that the data are in first normal form. Because of this, when a database is in second normal form, it is also in first normal form. There are quite a few normalization rules. The more rules applied to a database, the less likely it is that the data in the database will have anomalies. However, as a database becomes more moralized, it becomes harder to use. Additionally, each additional rule that is followed brings marginally less reduction in the possibliity of anomalies. For both of these reasons, there is general consensus that third normal form (or complying with the first three rules of normalization) is the most practical. The remainder of this lesson will describe the first three normal forms. ### First Normal Form The first normalization rule prescribes that data stored in a database table should be in its simplest form. In practice this means that there can't be a table of data within a table. A table can violate first normal form in different ways. Here are two examples. ---[SIDEBAR]--- "type":"term", "title":"First Normal Form", "text":"Each piece of information in a table must be in its simplest form (a single value in a single column for each row)." ---[SIDEBAR]--- Suppose you want to list the works of art created by each artist on the artist table. This would be easy if each artist has created only one work of art. However, most artists in the database have created multiple works of art. One way to store these multiple values would be to place them all in the same column. ---[artist-with-work-data-single-column]--- shows this approach. ---[FIGURE]--- "id":"artist-with-work-data-single-column", "title":"Works of Art in a Single Column", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTvsYpoFNPVb7qcaCc3xyhO2vXzdUpnOcqx1qBUjchE90PKfnFWyhytt7d6jGw3WkIcEH2fNZfT3og5_72SqSn14eWAJc1bxEDNOPFdicTctG5yh-AnQAbNMB_B3cU4cPUQud8WN5AQEz7zwqsrxHtDhCXy0Tbo6rs4C_IK4awHuCRagLcZUsYka1bG11I/s16000/ArtistWorksSingleColumn.png", "altText": "Data about artists' works or art organized in a single column.", "imgHandling": "scale" ---[FIGURE]--- Notice the four works of art for Renior. Organizing them into a single column is problematic. First, it would be difficult to sort the column base on the works of art. Which of the many values should be used in the sorting. Second, it would be similarly difficult to use the values in a WHERE clause. Which of the values should be used in the logical test. Finally, it would be difficult to use the values to JOIN the artist table to the work table. Which of the values would be used in the JOIN? For these reasons, multiple values should not be stored in the same column on a table. The second example involves separating the values into multiple columns. ---[artist-with-work-data-multiple-columns]--- shows this approach. ---[FIGURE]--- "id":"artist-with-work-data-multiple-columns", "title":"Works of Art in Multiple Columns", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8zmMIXQdJiQAQEPWz9oo3YYK2FaiUK-L8agsD-RAvZvFYrqb3t_Hmdu-6aMlWQmAGF6zd7t8R7Txs8l9kqtB9S1bE-n3lqD7EMMkGleeBi1HYAmzDAdv1HS9Sym3RJL7tm3wvzBR6mKuPk9nH2Vy8o15gHPKyDDBlx_poHm2xVc2ynZScIXhWE4Bh9FDq/s16000/ArtistWorksMultipleColumns.png", "altText": "Data about artists' works or art organized in multiple columns.", "imgHandling": "scale" ---[FIGURE]--- Although this approach does store a single piece of information in each cell of the table, it would still be difficult to use. For example, which of the columns should be used to sort the data of the table, which column would be used in the criteria of a WHERE clause, and which column would be used to JOIN the artist and work tables. Additionally, it would be difficult to know how many columns to include on the table. This approach is also unworkable. For the data in a table to follow the first rule of normalization, it must be atomic or in it's simplest form. This means that each element of the table must be stored as a single value for each row in a single column of the table. If the data cannot be fit into an atomic value, it must be stored in a separate table. In this case, since many artist have created multiple works of art, it would be impossible to store the works of art as a single value in a single column of the artist table. This is one reason that the works of art are stored on separate table (work) in the database. ### Second Normal Form To define what it means to be in second normal form, you must first understand the concept of functional dependency. Functional dependency involves how the columns in a database table are related to each other. A column is said to be functionally dependent on another column if the values for one column determine values for the second column. For example, artist names are functionally dependent on artist_ids. If a specific artist_id is known, you can unambiguously determine the name of that artist. However, the birth year of artists is not functionally dependent on nationality. If you know a specific nationality, you cannot determine the value for birth year. Artists with the different nationalities are born in the same year. ---[SIDEBAR]--- "type":"term", "title":"Functional Dependency", "text":"Occurs when the values for a column determine the values for a second column." ---[SIDEBAR]--- The second rule of normalization specifies that all of the non-key attributes of a table must be functionally dependent on the primary key of the table. Consider ---[museum-with-artist-name]---. Notice that all of the columns on the table are functionally dependent on the primary key museum_id, except artist_name. If you know the museum_id, you can unabiguously determine the values for each of the other columns (expect artist_name). ---[SIDEBAR]--- "type":"term", "title":"Second Normal Form", "text":"Each column on a table must be functionally dependent on the primary key of the table." ---[SIDEBAR]--- ---[FIGURE]--- "id":"museum-with-artist-name", "title":"Museums with Artist Names", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcqdXnrCRzFMZivFxZFYU2wt0NWkORX-nkMnz2pivgwH4h8OogBessoHaOWrm-uyGhB9Q-fZSpDlcvHbJhDGniS2Zb4hoWBBR_xQI3wmUDXYpRyH9XuurlogSPO89ANrybFnKNkX1YxTD5xa-KY2sExvEAK6C4QSZNp2nShEHxJ2c5L-Dbb6LkzfmMASJj/s16000/MuseumWithArtistName.png", "altText": "Data about museums with the names of artists.", "imgHandling": "scale" ---[FIGURE]--- With artist_name, multiple artists' work are exhibited in each museum. This means that knowing the museum_id does tell you precisely which artist to include on the table. Therefore the artist names do not belong on the museum table. Since the name of the artist is functionally determined by the artist_id, it makes the most sense to store artist names on the artist table. ### Third Normal Form Once you have determined that all database tables are in second normal form (they comply with the rules for first and second normal form), it is time to ensure that they are in third normal form. To be in third normal form, all of the non-key columns of a table must be functionally dependent on only the primary key (or a column that could be the primary key). ---[SIDEBAR]--- "type":"term", "title":"Third Normal Form", "text":"Each non-key column on a table must only be functionally dependent on the primary key or a column that could be the primary key." ---[SIDEBAR]--- Examine the table depicted in ---[work-with-museum-name]---. All non-key columns on the table are functionally dependent on the primary key, museum_id. This includes the museum_name. Since a work art can only be exhibited at a single museum, if you know the museum_id you can determine which museum is exhibiting that work of art. This table is in second normal form. ---[FIGURE]--- "id":"work-with-museum-name", "title":"Works of Art with Museum Names", "url":"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUlj8nM2ZSh_DVel_u9e37IoiNvRMN6cTmrO6yaxYLBrjwcxnbQtYPO_sdma8oErXQ_NOvj93wah3nEftSps5Bu1HzTou1I_mDEY2blGvIfKZKVq6xCytQ6ctCPqMEiYD2KMHpwsd1kN4HTplABLgAVRYVrowKchiTjG94nno6BfSAT9gPdH_qJ9V7cMbb/s16000/WorksWithMuseumNames.png", "altText": "Data about works of art that includes the names of museums.", "imgHandling": "scale" ---[FIGURE]--- Notice that museum_name is also functionally dependent on the museum_id. If you know the museum_id, you can determine exactly the name of the museum. Although museum_name is functionally dependent on the primary key (work_id), it is also functionally dependent on a different column of the table (museum_id). This does not necessarily mean that museum_name does not comply with the third rule of normalization. For a column to comply with this rule it must be functionally dependent on either the primary key *or another column that could be the primary key*. Since you know that museum_name is functionally dependent on museum_id, you must determine if museum_id could be the primary key of the work table. For a column to be the primary key, all values for that column on the table must be unique. Notice that some of the museum_id values repeat on the table. For this reason, museum_id cannot be the primary key of the table. This means that museum_name does not comply with the third normalization rule. To fix this, the column that breaks the rule should be stored on a different table. In this case, museum_name should be stored on the museum table. ## Summary This lesson described how to organize a database to reduce or eliminate the likelihood of errors. Three types of errors (insertion anomalies, update anomalies, and deletion anomalies) were presented and ways to reduce the likelihood of these errors. The specific rules for organizing or normalizing data were also presented. Specifically, the lesson described rules for organizing a database to third normal form.