Wednesday, June 6, 2018 | Justin Tribuna
The term Big Data continues to gain prominence, and the ability to actually pull meaningful information and analysis from data becomes more real every day. One thing we always need to be concerned with as we help our partners make business decisions based on analytical models is the quality of the data. To ensure the integrity of the data itself and eventually the models we use it in, and most importantly to safeguard against error and the "garbage in, garbage out" problem, we start with the humble database. It is critical to focus on its design from the beginning to support your business, your customers and everyone's safety and success.
A database does exactly what its name implies—it holds data. The most common kind are relational databases, which utilize a language called SQL (Structured Query Language) to add, modify, and retrieve data, as well as to modify the structure of the database itself. There are non-relational databases that work a little differently, usually called NoSQL databases, as they aren’t bound to any particular language like SQL. (Interestingly enough, NoSQL databases can make use of SQL—they simply don’t necessarily have to.) There are advantages and disadvantages to both relational and NoSQL databases. One of the advantages of using a relational database is enforced data consistency, which I’ll be discussing in this post.
A relational database primarily consists of tables, which hold different kinds of data. For example, a medical database might have separate tables for patients, doctors, and prescriptions. Each table consists of columns and rows. A row is a single entry in a table, so each row in our patients table would represent an individual patient. Columns are data attributes. For a patient, those attributes might be name, birthdate, address, etc . I’ll be explaining exactly what makes a database relational later in this post.
Some sample tables for a medical database.
To keep data consistent and usable, several constraints are made available to the database designer. Constraints are sets of rules imposed and enforced on columns so that the data can’t stray from certain standards, thus allowing data integrity to be maintained. Here, we’ll be reviewing the most commonly used constraints in SQL databases.
The patient table, with PatientId as the primary key.
Arguably the most important constraint is the primary key constraint, often shortened to PK. Most tables have a primary key. A primary key is a column that serves as the main identifier for data in that table. A table can only have one primary key, and it must be unique for all rows—so no two patients can have the same ID, no two doctors can have the same ID, and so on. The general naming convention is that the name of the primary key column should be the table’s name (in singular) plus “Id.” So, if your table is called “patients,” then the primary key should be “PatientId.” For the doctors table, it would be “DoctorId.” This helps anyone using the database immediately identify which column in any given table is the primary key, since that’s the naming convention most users will be looking for, and even if the user isn’t familiar with databases, it will be fairly obvious what “PatientId” means.
While having a primary key is optional in any table, it’s best practice to include them. If your table can contain multiple identical objects, for example a books table in a library database, your table may be prone to duplicate data. Having an individual ID for each book would allow you to avoid this issue. Primary keys also allow tables to easily reference one another, as we’ll see in the next section.
The patients table, with DoctorId as a foreign key referencing the doctors table.
Foreign keys, or FKs, are how data is linked across tables, essentially putting the “relational” in “relational database.” Foreign keys point to other tables, usually by referencing the other table’s primary key. For example, while a doctor may have many patients, a patient only has one primary care provider. For that reason, it would be appropriate for our patients table to have a column called “DoctorId,” which would be a foreign key to the doctors table.
Foreign key constraint rules would enforce that the DoctorId column in the patients table would have to match up with an existing DoctorId in the doctors table. If we were to add a new patient to the patients table, the data would be rejected if the DoctorId column were blank or otherwise did not reference an existing doctor. With these rules enforced, it would be a simple matter to see all the patients a specific doctor has—we would simply write a query that retrieves every patient with the relevant value in the DoctorId column. Similarly, a doctor may write many prescriptions over the course of their career, and a patient may have multiple prescriptions at various times. Because of this, it wouldn’t make sense for our doctors table to have a PrescriptionId column, nor would it make sense for our patients table to have one. We could, however, have a DoctorId column and a PatientId column, both constrained by foreign key rules, in our prescriptions table.
The prescriptions table with foreign keys referencing both patients and doctors.
While other constraints are less vital to the structure of a database, they are no less vital to your data integrity. All these constraints, as well as the foreign and primary key restraints discussed above, can be added to a table when it’s created, or later on if you feel that constraints need to be added or removed from an existing table.
Fergus can’t be added to the doctors table because he doesn’t have a license to practice medicine.
A column with the not null restraint cannot be empty. This is important for any vital data in a table that shouldn’t be left blank. For example, the Name columns in our patients and doctors tables should be subject to not null constraints. If our doctors table has a column for each doctor’s license to practice medicine, it should also non-nullable. Foreign and primary key columns are also implicitly not null.
Fergus still can’t be added to the doctors table because he’s trying to use someone else’s license.
The unique constraint ensures that all values in a given column are unique. This is for data that needs to be unique for each row in a table. We might want to use the unique constraint on the License column from the not null example, or if our patients table has an insurance ID column.
We can’t add Gavin as a patient because his birthdate is invalid.
Check is used to make sure a value in a column fulfills a given requirement. In our patients table, we might add a check to the Birthdate column to ensure that we don’t enter a date later than the current date, since that would imply a patient hasn’t been born yet.
If we try to add a new prescription without a PrescriptionDate, it’ll default to the current date.
The default constraint uses a default value if a column would otherwise be empty. This could be used in our prescriptions table if we had a PrescriptionDate column, which could default to the current date if a new row were to be entered without an explicit PrescriptionDate value.
Indexing a column in a table allows for faster retrieval of information from that table. As database tables may have millions of rows, indexing may be useful in, say, our prescriptions table for each prescription’s drug code. This would allow for faster queries when trying to retrieve all patients who have been prescribed a certain drug.
The whole purpose of a database is to store data so that it can be retrieved at a later date. If the database is to be of any use, the data stored in it must be accurate, and constraints help enforce that accuracy. Without the use of constraints, vital data can be incorrect or left blank. Over time, incorrect or missing data can build up and make large portions of your database useless. Take, for example, using a HospitalName string column in your prescriptions table instead of a HospitalId foreign key column. Any typo or variation in spelling or punctuation (Mass General, Mass. General, Massachusetts General) could mean that a patient doesn’t get their medication.
As discussed earlier, all constraints are optional and must be chosen by the database designer. Thus, accuracy can only be enforced if these constraints are used properly. I’ve given one or two examples for cases in which you might want to use each of the constraints discussed above, but they’re just examples. At the end of the day, it’s the database designer who makes the decision where to include foreign keys, what data is too vital to be nullable, upper and lower bounds for certain data, and so forth. A medical database shouldn’t have patients without doctors, prescriptions without drug codes, multiple doctors having the same license to practice, and so on. On the flip side, not every column needs a constraint placed on it—it’s completely fine if some patients don’t have a middle initial.
By using constraints properly, you can maintain data integrity by ensuring that your information is complete , and nothing is left a mystery. One of the advantages of relational databases over NoSQL databases is that data integrity is more easily maintained, but that integrity only exists in the first place if data constraints are placed where they should be. As recommended by the name, a relational database should be an interconnected web of related information.
As organizations become more sophisticated and realize the value of data it becomes increasingly important to ensure the accuracy at all stages. Increasingly our partners are relying on multiple data sources to deliver services, identify opportunities to create efficiency, and derive business intelligence at all levels of business. In our experience, the value of data starts and ends with its reliability and the best way to protect it is from the beginning.