Building a Crime Report Database with Postgres¶

In this project, we'll be building a database for crime reports using Postgres. We'll begin with a csv file boston.csv, which contains information about crimes occurring in Boston. By the end, we'll have a database crime_db hosted on a Postgres server, along with a table boston_crimes. We'll also create groups with certain privileges, such as readonly and readwrite, corresponding to the typical data roles of data analyst and data scientist, respectively. A diagram showing the end product is shown below.

Creating the Initial Database¶

To begin, we'll create an empty database.

Before doing so, we run the following code cell, which drops any databases, groups, or users left over from a previous run of this notebook, so that everything below can be re-created from scratch.

In [ ]:
import psycopg2
conn = psycopg2.connect(dbname="postgres", user="postgres", password="6288")
conn.autocommit = True
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS crime_db;")
cur.execute("DROP GROUP IF EXISTS readonly;")
cur.execute("DROP GROUP IF EXISTS readwrite;")
cur.execute("DROP USER IF EXISTS data_analyst;")
cur.execute("DROP USER IF EXISTS data_scientist;")
conn.close()

Now we can use the psycopg2 library to connect to PostgreSQL and create a database.

In [ ]:
conn = psycopg2.connect(dbname="postgres", user="postgres", password="6288")
conn.autocommit = True      # allows database to be created.
cur = conn.cursor()
cur.execute("CREATE DATABASE crime_db;")
conn.close()

Let's ensure the database was initialized properly by connecting to it and creating the "crimes" schema.

In [ ]:
conn = psycopg2.connect(dbname="crime_db", user="postgres", password="6288")
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")

This new connection will remain open until the end of the notebook. Note that we don't enable autocommit here, so none of our changes become permanent until we call conn.commit() at the end.

Data Exploration¶

Let's load in the data we'll be using for this notebook.

In [ ]:
import csv 

with open('data/boston.csv') as f:
    reader = csv.reader(f)
    columns = next(reader)
    first_row = next(reader)

print(f"column names:    {columns}\n row example:   {first_row}")
column names:    ['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
 row example:   ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']

Here are some initial observations:

  • The dataset has 7 columns
  • incident_number and offense_code are integers
  • lat and long are floats giving the latitude and longitude of where the crime was committed
  • the description column contains a short textual description of the crime

To determine appropriate datatypes for each column, it will be useful to know how many unique values each column contains. We compute this in the following code cell.

In [ ]:
def get_col_set(file_path, col_index):
    col_set = set()     #empty set for holding unique values in column
    with open(file_path) as f:
        next(f)
        reader = csv.reader(f)
        for row in reader:
            col_set.add(row[col_index])     #add item to set
    
    return col_set 

# Displaying result of above function for each column
for i in range(7):
    print(f"{columns[i]}:  {len(get_col_set('data/boston.csv',i))}")
incident_number:  298329
offense_code:  219
description:  239
date:  1177
day_of_the_week:  7
lat:  18177
long:  18177

We see that the day_of_the_week column has only 7 unique values. This is expected, and suggests we should use an enumerated datatype for this column.

For the description column, we suspect that VARCHAR(n) is an appropriate datatype. To determine a reasonable choice for n, let's check the current length of the longest description.

In [ ]:
#index of description column is 2

descriptions = get_col_set('data/boston.csv',2)

max_length = 0
for description in descriptions:
    if len(description) > max_length:
        max_length = len(description)

print(f"maximum length of description:  {max_length}")
maximum length of description:  58

This suggests choosing a datatype of VARCHAR(n) with n > 58, allowing room for potentially longer descriptions without reserving far more space than the data needs.

Now, let's take a closer look at the offense_code column. We need to determine the minimum and maximum codes in order to get an idea of how many bytes we'll need to represent each code as an integer.

In [ ]:
codes = get_col_set('data/boston.csv',1)        # set of unique codes

min_code = 10000    # larger than any code in the data
max_code = 0

for code in codes:
    code = int(code)    
    if code > max_code:
        max_code = code                 #update largest code 
    if code < min_code:
        min_code = code                 #update smallest code

print(f"maximum code is:    {max_code}")
print(f"minimum code is:    {min_code}")
maximum code is:    3831
minimum code is:    111

This suggests we can use the smallint datatype for this column, which requires only 2 bytes and covers the range -32768 to 32767.

Creating Table¶

Using our findings from the previous section, we can make an informed decision regarding the datatypes for each column. We summarize our final choices in the following table.

column name        datatype
incident_number    serial
offense_code       smallint
description        varchar(116)  (2 × 58)
date               date (ISO 8601)
day_of_the_week    enumerated
lat                decimal
long               decimal

Here are a couple of notes about our choices above:

  • Since there are only 7 unique values for day of the week, we can save some storage space by using an enumerated datatype for this column
  • Given the minimum and maximum offense_code values, the smallint datatype is sufficient for this column
  • The incident_number column counts from 1 to 298329, which is the number of rows. This suggests a serial datatype, which allows future crimes to be entered into the database without specifying a value for this column, as sketched below.
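
Here's a minimal sketch of what the serial datatype buys us (hypothetical values, runnable only once the table is created and loaded later in this notebook):

In [ ]:
# Hypothetical insert: incident_number is omitted, so the SERIAL column
# assigns the next value in its sequence automatically.
cur.execute("""
    INSERT INTO crimes.boston_crimes
        (offense_code, description, date, day_of_the_week, lat, long)
    VALUES (619, 'LARCENY ALL OTHERS', '2018-09-10', 'Monday', 42.35, -71.13);
""")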

Now, let's create the enumerated datatype.

In [ ]:
# double check format of column
weekdays = get_col_set('data/boston.csv', 4)
weekdays
Out[ ]:
{'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday'}
In [ ]:
# create weekday_enum datatype with days in proper order
query = "CREATE TYPE weekday_enum AS ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday');"
cur.execute(query)
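
Since the enum values were declared in calendar order, ORDER BY on this column will sort Sunday through Saturday rather than alphabetically. As a sketch (runnable once the table below is populated), counting crimes per weekday in calendar order looks like:

In [ ]:
# Enum ordering follows declaration order, so rows come back Sunday..Saturday.
cur.execute("""
    SELECT day_of_the_week, COUNT(*)
    FROM crimes.boston_crimes
    GROUP BY day_of_the_week
    ORDER BY day_of_the_week;
""")
for day, n in cur.fetchall():
    print(day, n)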

With this datatype created, we're ready to create the table with proper datatypes specified for each column.

In [ ]:
query = """
CREATE TABLE crimes.boston_crimes (
    incident_number SERIAL PRIMARY KEY,
    offense_code SMALLINT,
    description VARCHAR(116),
    date DATE,
    day_of_the_week weekday_enum,
    lat DECIMAL,
    long DECIMAL
);
"""
cur.execute(query)

Now we can use the copy_expert method of psycopg2 to copy the contents of the csv file boston.csv to the boston_crimes table we created above.

In [ ]:
with open("data/boston.csv") as f:
    cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f)
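
One subtlety worth flagging: because the CSV supplies explicit incident_number values, COPY does not advance the sequence behind the SERIAL column, so a later insert that relies on the default would collide with existing keys. A small sketch to realign the sequence:

In [ ]:
# Re-sync the SERIAL sequence with the loaded data, so future inserts that
# omit incident_number continue from the current maximum.
cur.execute("""
    SELECT setval(pg_get_serial_sequence('crimes.boston_crimes', 'incident_number'),
                  (SELECT MAX(incident_number) FROM crimes.boston_crimes));
""")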

Let's take a look at the first few rows to ensure everything above worked properly.

In [ ]:
cur.execute("SELECT * FROM crimes.boston_crimes LIMIT 3;")
results = cur.fetchall()
for result in results:
    print(result)
(1, 619, 'LARCENY ALL OTHERS', datetime.date(2018, 9, 2), 'Sunday', Decimal('42.35779134'), Decimal('-71.13937053'))
(2, 1402, 'VANDALISM', datetime.date(2018, 8, 21), 'Tuesday', Decimal('42.30682138'), Decimal('-71.06030035'))
(3, 3410, 'TOWED MOTOR VEHICLE', datetime.date(2018, 9, 3), 'Monday', Decimal('42.34658879'), Decimal('-71.07242943'))
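
Each row matches the CSV. As one more sanity check, we could confirm that the row count matches the 298329 unique incident numbers we counted earlier:

In [ ]:
# Expect (298329,) -- one row per unique incident_number in the CSV.
cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(cur.fetchone())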

Creating Groups¶

Now we're going to create some groups. To begin, we revoke all privileges on the public schema and on the crime_db database from the public role, so that privileges must be granted explicitly.

In [ ]:
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")

Now let's create two groups: readonly and readwrite. readonly will only be able to select data from tables in the crimes schema, while readwrite will be able to select, insert, delete, and update. We'll also specify the NOLOGIN option when creating each group to ensure that logins can only be performed by users, not groups.

In [ ]:
# create readonly and readwrite groups
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("CREATE GROUP readwrite NOLOGIN;")

# grant database connection privileges to both
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")

# grant usage privileges on crimes schema
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")

# grant group-specific privileges to the corresponding group
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
cur.execute("GRANT SELECT, INSERT, DELETE, UPDATE ON ALL TABLES IN SCHEMA crimes TO readwrite;")

Creating Users¶

Now that we have groups, let's create a user for each group. We'll create a data analyst user assigned to the readonly group, and a data scientist user assigned to the readwrite group.

In [ ]:
# create data_analyst user and assign to readonly group
cur.execute("CREATE USER data_analyst WITH PASSWORD 'abc';")
cur.execute("GRANT readonly TO data_analyst;")

# create data_scientist user and assign to readwrite group
cur.execute("CREATE USER data_scientist WITH PASSWORD '123';")
cur.execute("GRANT readwrite TO data_scientist;")

Our desired database is now complete. Let's save our work by committing all of the executed statements and closing the connection.

In [ ]:
conn.commit()
conn.close()

In the following section, we'll use PostgreSQL's internal catalog views to take a closer look at the database and ensure everything is working properly.

Testing¶

Let's begin by looking at the schemas present in our database.

In [ ]:
conn = psycopg2.connect(dbname="crime_db", user="postgres", password="6288")
cur = conn.cursor()
cur.execute("""
            SELECT schema_name FROM information_schema.schemata;
            """)

results = cur.fetchall()
for result in results:
    print(result)
('pg_toast',)
('pg_temp_1',)
('pg_toast_temp_1',)
('pg_catalog',)
('information_schema',)
('public',)
('crimes',)

We see that there are a number of internal schemas (prefixed with "pg_"), as well as the public and information_schema schemas. We also see the crimes schema we created.

Now, let's look at all tables within the crimes schema.

In [ ]:
cur.execute("SELECT * FROM information_schema.tables WHERE table_schema = 'crimes';")

results = cur.fetchall()
for result in results:
    print(result)
('crime_db', 'crimes', 'boston_crimes', 'BASE TABLE', None, None, None, None, None, 'YES', 'NO', None)

There is one table named "boston_crimes" that lives within the crimes schema of the crime_db database. This is the desired structure.

Now, let's take a look at the users assigned to each group which we created.

In [ ]:
cur.execute("""
            SELECT * FROM information_schema.applicable_roles 
            WHERE role_name = 'readonly' OR role_name = 'readwrite';
            """)

results = cur.fetchmany(5)
for result in results:
    print(result)
('data_analyst', 'readonly', 'NO')
('data_scientist', 'readwrite', 'NO')

We see that the users we created are assigned to the proper groups. Finally, let's verify the privileges of each group.

In [ ]:
cur.execute("""
    SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee = 'readwrite' OR grantee = 'readonly';
    """)
results = cur.fetchall()
for result in results:
    print(result)
('readonly', 'SELECT')
('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')

We see that the readonly group has select privileges, while the readwrite group has select, insert, update, and delete privileges. From these checks, we see that the crime_db database has the desired structure, as shown in the initial diagram.
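
As a final end-to-end check, we could connect as data_analyst and confirm that reads succeed while writes are rejected; a minimal sketch, assuming psycopg2 2.8+ (for the errors module) and a local server:

In [ ]:
test_conn = psycopg2.connect(dbname="crime_db", user="data_analyst", password="abc")
test_cur = test_conn.cursor()

# SELECT should succeed for a member of the readonly group...
test_cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(test_cur.fetchone())

# ...while a write should be rejected with an insufficient-privilege error.
try:
    test_cur.execute("DELETE FROM crimes.boston_crimes;")
except psycopg2.errors.InsufficientPrivilege as e:
    print(f"write rejected as expected: {e}")

test_conn.close()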

Conclusion¶

Overall, we were able to:

  • Create an empty database crime_db with a single schema crimes
  • Analyze the boston.csv file to determine proper datatypes for each column
  • Use psycopg2's copy_expert method to copy the contents of boston.csv into a table in the crimes schema
  • Create two groups with specific privileges, and assign a user to each group
  • Utilize PostgreSQL's internal catalog views to verify the structure of our database

Now that the data is stored in a PostgreSQL database, relevant data can be retrieved efficiently for various analysis / ML tasks. Moreover, the data is protected from users who don't have data-modification privileges, such as data_analyst.