In this project, we'll build a database for crime reports using Postgres. We'll begin with a CSV file, boston.csv, which contains information about crimes occurring in Boston. By the end, we'll have a database crime_db hosted on a Postgres server, along with a table boston_crimes. We'll also create groups with specific privileges, readonly and readwrite, corresponding to the typical data roles of data analyst and data scientist, respectively. A diagram of the end product is shown below.
To begin, we'll create an empty database.
Before doing so, we run the following code cell, which drops the database, groups, and users created later in this notebook, so the notebook can be rerun from scratch.
import psycopg2
conn = psycopg2.connect(dbname="postgres", user="postgres", password="6288")
conn.autocommit = True
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS crime_db;")
cur.execute("DROP GROUP IF EXISTS readonly;")
cur.execute("DROP GROUP IF EXISTS readwrite;")
cur.execute("DROP USER IF EXISTS data_analyst;")
cur.execute("DROP USER IF EXISTS data_scientist;")
conn.close()
Now we can use the psycopg2 library to connect to the PostgreSQL server and create the database.
conn = psycopg2.connect(dbname="postgres", user="postgres", password="6288")
conn.autocommit = True  # CREATE DATABASE can't run inside a transaction block
cur = conn.cursor()
cur.execute("CREATE DATABASE crime_db;")
conn.close()
Let's verify the database was initialized properly by connecting to it, then create the "crimes" schema.
conn = psycopg2.connect(dbname="crime_db", user="postgres", password="6288")
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")
This new connection will remain open until the end of the notebook. Note that we don't enable autocommit here, so everything we execute from this point runs in a single transaction that we'll commit explicitly at the end.
Let's load in the data we'll be using for this notebook.
import csv

with open('data/boston.csv') as f:
    reader = csv.reader(f)
    columns = next(reader)    # header row
    first_row = next(reader)  # first data row
print(f"column names: {columns}\n row example: {first_row}")
column names: ['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
 row example: ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']
Here are some initial observations:

- incident_number looks like a sequential unique identifier for each crime.
- date is already in ISO 8601 format (YYYY-MM-DD).
- lat and long are latitude and longitude coordinates with many decimal places.
To determine an appropriate datatype for each column, it will be useful to know how many unique values each column contains. We compute this in the following code cell.
def get_col_set(file_path, col_index):
    col_set = set()  # empty set for holding unique values in the column
    with open(file_path) as f:
        next(f)  # skip the header row
        reader = csv.reader(f)
        for row in reader:
            col_set.add(row[col_index])  # add item to set
    return col_set
# Displaying the result of the above function for each column
for i in range(7):
    print(f"{columns[i]}: {len(get_col_set('data/boston.csv', i))}")
incident_number: 298329
offense_code: 219
description: 239
date: 1177
day_of_the_week: 7
lat: 18177
long: 18177
We see that the day_of_the_week column has only 7 unique values. This is expected, and suggests we should use an enumerated datatype for this column.
For the description column, we suspect that VARCHAR(n) is an appropriate datatype. To determine a reasonable choice for n, let's check the current length of the longest description.
# index of the description column is 2
descriptions = get_col_set('data/boston.csv', 2)
max_length = 0
for description in descriptions:
    if len(description) > max_length:
        max_length = len(description)
print(f"maximum length of description: {max_length}")
maximum length of description: 58
This suggests choosing a datatype of VARCHAR(n) with n comfortably larger than 58, to allow for potentially longer future descriptions while still keeping a sensible upper bound on the column. We'll double the observed maximum and use n = 116.
Now, let's take a closer look at the offense_code column. We need to determine the minimum and maximum codes in order to get an idea of how many bytes we'll need to represent each code as an integer.
codes = get_col_set('data/boston.csv', 1)  # set of unique offense codes
min_code = 10000
max_code = 0
for code in codes:
    code = int(code)
    if code > max_code:
        max_code = code  # update largest code seen so far
    if code < min_code:
        min_code = code  # update smallest code seen so far
print(f"maximum code is: {max_code}")
print(f"minimum code is: {min_code}")
maximum code is: 3831
minimum code is: 111
This suggests we can use the smallint datatype for this column, which requires only 2 bytes and covers the range -32768 to 32767.
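As a quick sanity check (a minimal sketch; the bounds come from the Postgres documentation for smallint), we can assert that every observed code fits in that range:

# smallint is a 2-byte integer covering -32768 to 32767
assert -32768 <= min_code and max_code <= 32767, "codes do not fit in smallint"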
Using our findings from the previous section, we can make an informed decision regarding the datatypes for each column. We summarize our final choices in the following table.
column name | datatype |
---|---|
incident_number | serial |
offense_code | smallint |
description | varchar(116) |
date | date (ISO 8601) |
day_of_the_week | enumerated |
lat | decimal |
long | decimal |
Here are a couple of notes about our choices above:

- incident_number is a sequential unique identifier, so serial is a natural fit, and it will serve as the primary key.
- varchar(116) doubles the longest observed description (58 characters), leaving headroom for longer future entries.
- The dates are already in ISO 8601 format, so they can be parsed directly into the date type.
- decimal stores the latitude and longitude values exactly, without floating-point rounding.
Now, let's create the enumerated datatype.
# double check format of column
weekdays = get_col_set('data/boston.csv', 4)
weekdays
{'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday'}
# create weekday_enum datatype with days in proper order
query = "CREATE TYPE weekday_enum AS ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday');"
cur.execute(query)
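To confirm the labels were registered in the intended order, we can query the new type with Postgres's enum_range function:

# enum_range(NULL::weekday_enum) returns all labels of the type, in order
cur.execute("SELECT enum_range(NULL::weekday_enum);")
print(cur.fetchone())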
With this datatype created, we're ready to create the table with proper datatypes specified for each column.
query = """
CREATE TABLE crimes.boston_crimes (
incident_number SERIAL PRIMARY KEY,
offense_code SMALLINT,
description VARCHAR(116),
date DATE,
day_of_the_week weekday_enum,
lat DECIMAL,
long DECIMAL
);
"""
cur.execute(query)
Now we can use psycopg2's copy_expert cursor method to copy the contents of the CSV file boston.csv into the boston_crimes table we created above.
with open("data/boston.csv") as f:
cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f)
Let's take a look at the first few rows to ensure everything above worked properly.
cur.execute("SELECT * FROM crimes.boston_crimes LIMIT 3;")
results = cur.fetchall()
for result in results:
    print(result)
(1, 619, 'LARCENY ALL OTHERS', datetime.date(2018, 9, 2), 'Sunday', Decimal('42.35779134'), Decimal('-71.13937053'))
(2, 1402, 'VANDALISM', datetime.date(2018, 8, 21), 'Tuesday', Decimal('42.30682138'), Decimal('-71.06030035'))
(3, 3410, 'TOWED MOTOR VEHICLE', datetime.date(2018, 9, 3), 'Monday', Decimal('42.34658879'), Decimal('-71.07242943'))
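As a further check, we can compare the table's row count against the 298,329 unique incident_number values we counted earlier; a minimal sketch:

cur.execute("SELECT count(*) FROM crimes.boston_crimes;")
print(cur.fetchone())  # should be (298329,), matching the unique incident numbers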
Now, we're going to create some groups. To begin, following the principle of least privilege, we revoke all privileges on the public schema and on the crime_db database from the public group.
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")
Now let's create two groups: readonly and readwrite. readonly will only be able to select data from tables in the crimes schema, while readwrite will be able to select, insert, delete, and update. We also specify the NOLOGIN option when creating each group, so that logins can only be performed by users, not groups.
# create readonly and readwrite groups
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("CREATE GROUP readwrite NOLOGIN;")
# grant database connection privileges to both
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
# grant usage privileges on the crimes schema to both
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")
# grant group-specific privileges to the corresponding group
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
cur.execute("GRANT SELECT, INSERT, DELETE, UPDATE ON ALL TABLES IN SCHEMA crimes TO readwrite;")
Now that we have groups, let's create a user for each group. We'll create a data analyst user assigned to the readonly group, and a data scientist user assigned to the readwrite group.
# create data_analyst user and assign to readonly group
cur.execute("CREATE USER data_analyst WITH PASSWORD 'abc';")
cur.execute("GRANT readonly TO data_analyst;")
# create data_scientist user and assign to readwrite group
cur.execute("CREATE USER data_scientist WITH PASSWORD '123';")
cur.execute("GRANT readwrite TO data_scientist;")
With that, our desired database is complete. Let's "save" our work by committing all executed queries and closing the connection.
conn.commit()
conn.close()
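Before inspecting the internal tables, we can see the new roles in action by briefly connecting as data_analyst and confirming that reading succeeds while writing is rejected. This is a sketch: it assumes the local authentication setup (pg_hba.conf) allows password logins, and psycopg2 >= 2.8 for the errors module.

test_conn = psycopg2.connect(dbname="crime_db", user="data_analyst", password="abc")
test_cur = test_conn.cursor()
test_cur.execute("SELECT count(*) FROM crimes.boston_crimes;")  # allowed: readonly has SELECT
print(test_cur.fetchone())
try:
    test_cur.execute("DELETE FROM crimes.boston_crimes;")  # should fail: readonly lacks DELETE
except psycopg2.errors.InsufficientPrivilege as e:
    print(e)
test_conn.close()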
In the following section, we'll use PostgreSQL's internal tables (via the information_schema views) to take a closer look at the database and ensure everything is set up properly.
Let's begin by looking at the schemas present in our database.
conn = psycopg2.connect(dbname="crime_db", user="postgres", password="6288")
cur = conn.cursor()
cur.execute("""
SELECT schema_name FROM information_schema.schemata;
""")
results = cur.fetchall()
for result in results:
print(result)
('pg_toast',)
('pg_temp_1',)
('pg_toast_temp_1',)
('pg_catalog',)
('information_schema',)
('public',)
('crimes',)
We see that there are a number of internal schemas (prefixed with "pg_"), as well as the public and information_schema schemas. We also have the crimes schema, which we created.
Now, let's look at all tables within the crimes schema.
cur.execute("SELECT * FROM information_schema.tables WHERE table_schema = 'crimes';")
results = cur.fetchall()
for result in results:
    print(result)
('crime_db', 'crimes', 'boston_crimes', 'BASE TABLE', None, None, None, None, None, 'YES', 'NO', None)
There is one table named "boston_crimes" that lives within the crimes schema of the crime_db database. This is the desired structure.
Now, let's take a look at the users assigned to each group which we created.
cur.execute("""
SELECT * FROM information_schema.applicable_roles
WHERE role_name = 'readonly' OR role_name = 'readwrite';
""")
results = cur.fetchmany(5)
for result in results:
print(result)
('data_analyst', 'readonly', 'NO')
('data_scientist', 'readwrite', 'NO')
We see that the users we created are assigned to the proper group. Finally, let's conclude our checking by verifying the privileges of each group.
cur.execute("""
SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE grantee = 'readwrite' OR grantee = 'readonly';
""")
results = cur.fetchall()
for result in results:
print(result)
('readonly', 'SELECT')
('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')
We see that the readonly group has SELECT privileges, while the readwrite group has SELECT, INSERT, UPDATE, and DELETE privileges. From these checks, we can conclude that the crime_db database has the desired structure, as shown in the initial diagram.
Overall, we were able to:

- create the crime_db database, along with a crimes schema;
- inspect the raw data to choose an appropriate datatype for each column;
- create the boston_crimes table and load boston.csv into it;
- create readonly and readwrite groups with least-privilege access, plus a user for each;
- verify the resulting structure using PostgreSQL's internal tables.
Now that the data is stored efficiently in a PostgreSQL database, relevant data can be efficiently retrieved for various analysis and ML tasks. Moreover, the data is protected from individuals who don't have data modification privileges, such as data_analyst.