1
CMPSC-132: Programming and Computation II
Department of Computer Science & Engineering
The Pennsylvania State University
1. Intro to Database Programming
File systems in your computer keep your data are kept in big unstructured named clumps called files.
But file systems seriously lack some of the critical features necessary for managing data such as transaction
support and fast indexing. A database is a collection of data, which is organized into files called tables.
These tables provide a systematic way of accessing, managing, and updating data. Databases allow
indexing based on any attribute or data property, and this helps with fast retrieval of data based on the
indexed attribute. Each value stored in a database is called a record. Records with similar structure are
grouped into tables. Records are retrieved and transformed using queries, which are statements in a query
language. Nowadays, one to the most used query languages is called SQL (Structured Query Language).
SQL is an example of a declarative programming language, this means, statements do not describe
computations directly (unlike Python, where we directly describe the computational processes), but instead
they describe the desired result of some computation, for example ‘Give me all the names of the files
created in this network after 8 am’. It is the role of the query interpreter of the database system to design
and perform a computational process to produce such result.
1.1 Database and Database Management Systems (DBMS)
A database is a collection of related data that can be processed to produce information. For example,
Canvas Gradebook data can be used to obtain the class average, lowest and highest scores, etc. A database
management system (DBMS for the rest of the notes) is a software package designed to define, manipulate,
retrieve, and manage data in a database. DBMS were developed to handle the difficulties of typical file-
processing systems supported by conventional operating systems:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Data isolation – multiple files and formats
• Integrity problems
• Atomicity of updates
• Concurrent access by multiple users – data in parallel
• Security problems
DBMS are equipped with a query language, which makes it more efficient to retrieve and manipulate data.
A user can apply as many and as different filtering options as required to retrieve a set of data. All of the
major DBMS include features that ensure that data maintains consistent throughout software and hardware
crashes, as well as any failed transactions, adhering to the ACID principles:
• Atomicity: Guarantees that either all of the transaction succeeds or none of it does. If one part of
the transaction fails, the whole transaction fails, in other words, it’s either “all or nothing”.
• Consistency: Guarantees that all data will be consistent. All data will be valid according to all
defined rules, including any constraints, cascades, and triggers that have been applied on the
database.
2
• Isolation: Guarantees that all transactions will occur in isolation. No transaction will be affected by
any other transaction, this means, a transaction cannot read data from any other transaction that has
not yet completed.
• Durability: Once a transaction is committed, it will remain in the system – even if there’s a system
crash immediately following the transaction. Any changes from the transaction must be stored
permanently. If the system tells the user that the transaction has succeeded, the transaction must
have, in fact, succeeded.
1.1.1 The Relational Model
In this course, we only provide a brief introduction to the relational model. The central data
description construct in this model is a relation, which can be thought of as a set of records. A description
of data in terms of a data model is called a schema. In the relational model, the schema for a relation
specifies its name, the name of each field (or attribute or column), and the type of each field, so it is based
on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
• Data is stored in tables called relations.
• Relations can be normalized.
o In normalized relations, values saved are atomic values.
• Each row in a relation contains a unique value.
• Each column in a relation contains values from a same domain.
For example, student information in a university database may be stored in a relation with the following
schema:
Students (sid: string, name: string, email: string, gpa: real, major: string)
The preceding schema says that each record in the Students relation has five fields, with field names and
types as indicated. The relation could look like:
sid name email gpa major
12345 Jane Doe jdx589 3.8 PREMJR
67890 John Doe jdx741 3.4 CS
24679 Tyler Smith tsc781 1.8 EE
13578 Liz Taylor lty663 3.2 ME
24895 James Jones jjt442 2.7 EE
Each row in the Students relation is a record that describes a student. We can make the description of a
collection of students more precise by specifying integrity constraints, which are conditions that the records
in a relation must satisfy. For example, we could specify that every student has a unique sid value. Note
that we cannot capture this information by simply adding another field to the Students schema. Thus, the
ability to specify uniqueness of the values in a field increases the accuracy with which we can describe our
data. The expressiveness of the constructs available for specifying integrity constraints is an important
aspect of the relational data model.
3
1.2 Structured Query Language (SQL)
SQL is a computer language for storing, manipulating and retrieving data stored in relational
database and it is the standard language for most DBMS like MySQL, MS Access, Oracle, PostgreSQL
and SQL Server. Although the SQL language is standardized, most DBMS implement some custom variant
of the language that is endowed with proprietary features. In this document, we will describe SQL statement
in their general form.
1.2.1 Keys
Recall that a table, also called a relation, has a fixed number of named and typed columns. Each
row of a table represents a data record and has one value for each column. A table typically has a column
or combination of columns that contain values that uniquely identify each row in the table. This column,
or columns, is called the primary key (PK) of the table and enforces the entity integrity of the table. If a
primary key constraint is defined on more than one column, values may be duplicated within one column,
but each combination of values from all the columns in the primary key constraint definition must be
unique.
In this example, sid is the perfect candidate for primary key since it can uniquely identify each
student in the table:
sid (PK) name email gpa major
12345 Jane Doe jdx589 3.8 PREMJR
67890 John Doe jdx741 3.4 CS
24679 Tyler Smith tsc781 1.8 EE
13578 Liz Taylor lty663 3.2 ME
24895 James Jones jjt442 2.7 EE
A foreign key (FK) is a column or combination of columns that is used to establish and enforce a link
between the data in two tables to control the data that can be stored in the foreign key table. In a foreign
key reference, a link is created between two tables when the column or columns that hold the primary key
value for one table are referenced by the column or columns in another table. This column becomes a
foreign key in the second table.
sid (PK) name email gpa major(FK) majorid(PK) descr
12345 Jane Doe jdx589 3.8 PREMJR PREMJR Pre-major
67890 John Doe jdx741 3.4 CS CS Computer Science
24679 Tyler Smith tsc781 1.8 EE EE Electrical Engineering
13578 Liz Taylor lty663 3.2 ME DS Data Science
24895 James Jones jjt442 2.7 EE CENG Computer Engineering
57841 Lulu Sam lsw702 1.5 PREMJR
44895 Alex Dar adc610 3.9 CS
Student Table Major Table
4
For example, the Student table has a foreign key link (major) to the Major table (majorid) because there is
a logical relationship between student’s major and the majors available at the university. The major column
in the Student table matches the primary key column of the Major table. By creating this foreign key
relationship, a value for major cannot be inserted into the Student table if it does not already exist in the
Major table.
1.2.2 Basic DDL commands
Data Description Language (DDL) is a syntax for creating and modifying database objects such as
tables, indexes, and users. Common examples of DDL statements include CREATE, ALTER, and DROP.
1.2.2.1 CREATE
The CREATE statement creates a new table, a view of a table, or other object in database. For each
column, we can optionally include the UNIQUE keyword, which indicates that the column can only contain
unique values, or the NOT NULL keyword, which indicates that the column should not contain NULL
values. The general for creating a new table looks like this:
CREATE TABLE TableName(Column1 datatype
Column2 datatype,
…,
PRIMARY KEY(column(s)))
Including the optional if not exists clause will prevent an error if we attempt to create duplicate tables.
CREATE TABLE IF NOT EXISTS TableName(Column1 datatype
Column2 datatype,
…,
PRIMARY KEY(column(s)))
The SQL statement for creating the Student and Major tables can be written as follows:
CREATE TABLE Student(sid INT NOT NULL,
name VARCHAR(30) NOT NULL,
email VARCHAR(20) NOT NULL,
gpa DECIMAL(3,2) NOT NULL,
major VARCHAR(10) NOT NULL,
PRIMARY KEY(sid),
FOREIGN KEY(major) REFERENCES(Major,majorid))
CREATE TABLE Major(majorid INT NOT NULL,
descr VARCHAR(30) NOT NULL,
PRIMARY KEY(majorid))
5
1.2.2.2 DROP
The DROP table statement deletes a table from a database. Including the optional if exists clause will
prevent an error if we attempt to drop a non-existing table. All data, indexes, triggers, constraints, and
permission specifications for that table will also be deleted. In its general format:
DROP TABLE IF NOT EXISTS TableName
We have to be careful when using this command because once a table is deleted then all the information
available in the table would also be lost forever. The SQL statement for deleting the Student table can be
written as follows:
DROP TABLE Student
1.2.3 Basic DML commands
Data Manipulation Language (DML) is a syntax used for adding (inserting), deleting, and
modifying (updating) data in a database. In SQL, DML statements include SELECT, INSERT, UPDATE
and DELETE.
1.2.3.1 SELECT
The SELECT statement is used to fetch the data from a database table which returns data in the
form of a result table. In its general form:
SELECT Column1, Column2
FROM Table Name
The columns of the resulting table are described by a comma-separated list of expressions that are each
evaluated for each row of the existing input table. If we want to fetch all the columns available in the table,
then we can use the following syntax:
SELECT *
FROM Table Name
For example, from the Student table, we can create a two-column table that describes each student by GPA
SELECT name, gpa
FROM Student
name gpa
Jane Doe 3.8
John Doe 3.4
Tyler Smith 1.8
Liz Taylor 3.2
James Jones 2.7
Lulu Sam 1.5
Alex Dar 3.9
6
The SELECT statement has many optional clauses:
▪ WHERE specifies which rows to retrieve.
▪ GROUP BY groups rows sharing a property so that an aggregate function can be applied to each
group.
▪ HAVING selects among the groups defined by the GROUP BY clause.
▪ ORDER BY specifies how to order the returned rows.
▪ AS provides an alias which can be used to temporarily rename tables or columns.
WHERE Clause
A SELECT statement can also include a where clause with a filtering expression. This expression
filters the rows that are projected. Only a row for which the filtering expression evaluates to a true value
will be used to produce a row in the resulting table. In its general form:
SELECT Column1, Column2
FROM Table Name
WHERE Criteria
For example, from the Student table, retrieve student with GPA > 3
SELECT name, gpa
FROM Student
WHERE gpa > 3
For example, from the Student table, retrieve student with GPA > 3
SELECT name, gpa
FROM Student
WHERE gpa > 3 AND
major = ‘CS’
SELECT Student.sid AS PSU_ID, Student.gpa, Major.descr
FROM Student, Major
WHERE gpa <= 3 AND major = 'PREMJR' AND Student.major = Major.majorid name gpa Jane Doe 3.8 John Doe 3.4 Liz Taylor 3.2 Alex Dar 3.9 name gpa John Doe 3.4 Alex Dar 3.9 PSU_ID gpa descr 57841 1.5 Pre-major 7 GROUP BY Clause and Aggregated Functions The SELECT statement joins and manipulates individual rows. However, it is possible to perform aggregation operations over multiple rows. The aggregate functions MAX, MIN, COUNT, SUM and AVG return the maximum, minimum, number, sum and average of the values in a column. Multiple aggregate functions can be applied to the same set of rows by defining more than one column. Only columns that are included by the where clause are considered in the aggregation. SELECT major, gpa FROM Student WHERE gpa>1
SELECT AVG(gpa) AS SchoolAVG
FROM Student
WHERE gpa>1
SELECT major, AVG(gpa) AS MajorAVG
FROM Student
WHERE gpa>1
GROUP BY major
Notice that the second SELECT statement has produced a table with a single row, the average of the entire
data in the column gpa. The GROUP BY clause of a SELECT statement is used to partition rows into
groups and select only a subset of the groups. Any aggregate functions in the column description will apply
to each group independently, rather than the entire set of rows in the table as shown in the third example,
where we group together the GPA scores by major and then compute the average in each group separately.
The combination of these additional clauses allows a SELECT statement to express a wide range of
projections of an input table into a related output table.
1.2.3.2 INSERT
The INSERT statement allows us to add rows to a table in a database. In particular, we can insert
values into all columns of our table, or we can add to one specific column, which will set the other columns
to their default values (if any). INSERT statements have the following form:
INSERT INTO TableName (column1, column2, …) VALUES (value1, value2, …)
8
The SQL statements for inserting rows into the Student can be written as follows:
INSERT INTO Student(sid, name, email, gpa, major) VALUES (12345,’Jane Doe’,’jdx589′,3.8,’PREMJR’)
INSERT INTO Student(sid, name, email, gpa, major) VALUES (67890,’John Doe’,’jdx741′,3.4,’CS’)
INSERT INTO Student(sid, name, email, gpa, major) VALUES
(24679,’Tyler Smith’,’tsc781′,1.8,’EE’),
(13578,’Liz Taylor’,’lty663′,3.2,’ME’)
1.2.3.3 UPDATE
The UPDATE statement sets all entries in certain columns of a table to new values for a subset of
rows as indicated by an optional WHERE clause. We can update all rows by omitting the optional where
clause. The UPDATE statement has the following form:
UPDATE Table Name
SET column1 = value, column2 = value , …
WHERE condition
For the UPDATE to be successful, the updated value must not conflict with all the applicable constraints
such as primary keys, unique indexes, CHECK constraints, and NOT NULL constraints.
UPDATE Student
SET major=’CS’
WHERE gpa>2.5
9
1.2.3.4 DELETE
The DELETE statement deletes a subset of rows of a table as indicated by an optional where clause.
If we do not include a where clause, then we will delete all rows, but an empty table would remain in our
database. ). DELETE statements have the following form:
DELETE FROM TableName
WHERE condition
UPDATE Student
SET major=’EE’
WHERE gpa>3 AND
name LIKE ‘J%’
UPDATE Student
SET major=’PREMJR’
10
DELETE FROM Student
WHERE gpa<2
DELETE FROM Student
WHERE sid=44895
DELETE FROM Student
11
Appendix: List of Frequently SQL Commands
Reference: https://www.codecademy.com/articles/sql-commands
ALTER TABLE lets you add columns to a table in a database.
ALTER TABLE table_name
ADD column_name datatype;
AND is an operator that combines two conditions. Both conditions must be true for the row to be
included in the result set.
SELECT column_name(s)
FROM table_name
WHERE column_1 = value_1
AND column_2 = value_2;
AS is a keyword in SQL that allows you to rename a column or table using an alias.
SELECT column_name AS 'Alias'
FROM table_name;
AVG() is an aggregate function that returns the average value for a numeric column.
SELECT AVG(column_name)
FROM table_name;
The BETWEEN operator is used to filter the result set within a certain range. The values can be
numbers, text or dates.
SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN value_1 AND value_2;
CASE statements are used to create different outputs (usually in the SELECT statement). It is SQL’s
way of handling if-then logic.
SELECT column_name,
CASE
WHEN condition THEN 'Result_1'
WHEN condition THEN 'Result_2'
ELSE 'Result_3'
END
FROM table_name;
https://www.codecademy.com/articles/sql-commands
12
COUNT() is a function that takes the name of a column as an argument and counts the number of
rows where the column is not NULL.
SELECT COUNT(column_name)
FROM table_name;
CREATE TABLE creates a new table in the database. It allows you to specify the name of the table
and the name of each column in the table.
CREATE TABLE table_name (
column_1 datatype,
column_2 datatype,
column_3 datatype
);
DELETE statements are used to remove rows from a table.
DELETE FROM table_name
WHERE some_column = some_value;
GROUP BY is a clause in SQL that is only used with aggregate functions. It is used in collaboration
with the SELECT statement to arrange identical data into groups.
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
HAVING was added to SQL because the WHERE keyword could not be used with aggregate
functions.
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > value;
An inner JOIN will combine rows from different tables if the join condition is true.
SELECT column_name(s)
FROM table_1
JOIN table_2
ON table_1.column_name = table_2.column_name;
13
INSERT statements are used to add a new row to a table.
INSERT INTO table_name (column_1, column_2, column_3)
VALUES (value_1, ‘value_2’, value_3);
IS NULL and IS NOT NULL are operators used with the WHERE clause to test for empty values.
SELECT column_name(s)
FROM table_name
WHERE column_name IS NULL;
LIKE is a special operator used with the WHERE clause to search for a specific pattern in a column.
SELECT column_name(s)
FROM table_name
WHERE column_name LIKE pattern;
LIMIT is a clause that lets you specify the maximum number of rows the result set will have.
SELECT column_name(s)
FROM table_name
LIMIT number;
MAX() is a function that takes the name of a column as an argument and returns the largest
value in that column.
SELECT MAX(column_name)
FROM table_name;
MIN() is a function that takes the name of a column as an argument and returns the smallest
value in that column.
SELECT MIN(column_name)
FROM table_name;
OR is an operator that filters the result set to only include rows where either condition is true.
SELECT column_name
FROM table_name
WHERE column_name = value_1
OR column_name = value_2;
14
ORDER BY is a clause that indicates you want to sort the result set by a particular column either
alphabetically or numerically.
SELECT column_name
FROM table_name
ORDER BY column_name ASC | DESC;
An outer join will combine rows from different tables even if the join condition is not met. Every
row in the left table is returned in the result set, and if the join condition is not met,
then NULL values are used to fill in the columns from the right table.
SELECT column_name(s)
FROM table_1
LEFT JOIN table_2
ON table_1.column_name = table_2.column_name;
ROUND() is a function that takes a column name and an integer as arguments. It rounds the
values in the column to the number of decimal places specified by the integer.
SELECT ROUND(column_name, integer)
FROM table_name;
SELECT statements are used to fetch data from a database. Every query will begin with SELECT.
SELECT column_name
FROM table_name;
SELECT DISTINCT specifies that the statement is going to be a query that returns unique values in
the specified column(s).
SELECT DISTINCT column_name
FROM table_name;
SUM() is a function that takes the name of a column as an argument and returns the sum of all
the values in that column.
SELECT SUM(column_name)
FROM table_name;
UPDATE statements allow you to edit rows in a table.
UPDATE table_name
SET some_column = some_value
WHERE some_column = some_value;
15
WHERE is a clause that indicates you want to filter the result set to include only rows where the
following condition is true.
SELECT column_name(s)
FROM table_name
WHERE column_name operator value;
WITH clause lets you store the result of a query in a temporary table using an alias. You can also
define multiple temporary tables using a comma and with one instance of the WITH keyword.
WITH temporary_name AS (
SELECT *
FROM table_name)
SELECT *
FROM temporary_name
WHERE column_name operator value;